LOO for Multivariate Probit

fergusjchadwick · February 14, 2022, 3:50pm

Continuing the discussion from Feedback request: Multivariate Probit Regression with GP:

Brief problem summary: @bgoodri has developed this really handy implementation of the multivariate probit (code below). Unfortunately, it universally gets terrible k-hats in LOO. @martinmodrak made an excellent point in another thread, and I wanted to continue the discussion here to see whether his suggestion is a viable way to get a robust LOO estimate (or whether this model is simply a poor candidate for LOO):

Pareto-K is almost guaranteed to be bad if you just add up the contributions to target from this parametrization. In this case, you don’t compute the log likelihood of the observed values given the linear predictors. You actually compute the log likelihood given the linear predictors AND the nuisance parameters. Since the observed values have huge influence on the associated nuisance parameters, loo correctly treats them as having large influence on the model and thus having high k-hat.

Martin goes on to suggest:

It could also make some kind of weird sense to compute the multivariate normal log likelihood of the nuisance parameters given the linear predictors and feed that to loo, but I can’t think completely clearly if that would correspond to a meaningful quantity or not.

I can very much see the logic behind this, but I don’t understand either LOO or the MVProbit implementation sufficiently to determine whether such a version of LOO would make sense (maybe @avehtari could comment?).

Thanks in advance!

Here is @bgoodri’s code (taken from https://github.com/stan-dev/example-models/blob/master/misc/multivariate-probit/probit-multi-good.stan):

data {
  int<lower=1> K;
  int<lower=1> D;
  int<lower=0> N;
  array[N, D] int<lower=0, upper=1> y;
  array[N] vector[K] x;
}
parameters {
  matrix[D, K] beta;
  cholesky_factor_corr[D] L_Omega;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
}
model {
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(beta) ~ normal(0, 5);
  // implicit: u is iid standard uniform a priori
  {
    // likelihood
    for (n in 1 : N) {
      vector[D] mu;
      vector[D] z;
      real prev;
      mu = beta * x[n];
      prev = 0;
      for (d in 1 : D) {
        // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real bound; // threshold at which utility = 0
        bound = Phi(-(mu[d] + prev) / L_Omega[d, d]);
        if (y[n, d] == 1) {
          real t;
          t = bound + (1 - bound) * u[n, d];
          z[d] = inv_Phi(t); // implies utility is positive
          target += log1m(bound); // Jacobian adjustment
        } else {
          real t;
          t = bound * u[n, d];
          z[d] = inv_Phi(t); // implies utility is negative
          target += log(bound); // Jacobian adjustment
        }
        if (d < D) {
          prev = L_Omega[d + 1, 1 : d] * head(z, d);
        }
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
      }
    }
  }
}
generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
}

avehtari · February 16, 2022, 5:38pm

Yep, @martinmodrak’s point is accurate. You can see the same behavior when implementing an overdispersed Poisson with n nuisance parameters; see the Roaches cross-validation demo.

I recommend cross-validating the old-fashioned way, that is, doing K-fold CV by refitting the model K times. There is a vignette in the loo documentation that might help: “Holdout validation and K-fold cross-validation of Stan programs with the loo package”.
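
As a rough illustration of the refitting loop, here is a minimal Python sketch (standard library only). `kfold_split_random` below is a stand-in written for this example that only mirrors the idea of balanced random folds, not the loo package's implementation, and `fit_and_score` is a hypothetical callback the reader supplies: it fits the model on the training data and returns one log predictive density per held-out observation.

```python
import random

def kfold_split_random(k, n, seed=0):
    """Assign each of n observations to one of k folds of near-equal size.

    Illustrative helper for this sketch; returns a list of fold ids in 1..k.
    """
    ids = [(i % k) + 1 for i in range(n)]
    random.Random(seed).shuffle(ids)
    return ids

def kfold_elpd(y, k, fit_and_score, seed=0):
    """Refit k times and collect the held-out pointwise log predictive densities."""
    fold = kfold_split_random(k, len(y), seed)
    pointwise = [None] * len(y)
    for f in range(1, k + 1):
        train = [y[i] for i in range(len(y)) if fold[i] != f]
        test_idx = [i for i in range(len(y)) if fold[i] == f]
        # fit_and_score(train, test) must return one log density per test point
        scores = fit_and_score(train, [y[i] for i in test_idx])
        for i, s in zip(test_idx, scores):
            pointwise[i] = s
    return pointwise
```

Each observation is scored exactly once, by a fit that never saw it, which is what keeps the held-out observation from influencing its own nuisance parameter.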

martinmodrak · February 17, 2022, 12:11pm

I tried reading how that’s done in the roaches example, and I admit I am at a loss as to how the kfold method is able to compute the likelihood for the held-out observations. At some point the nuisance parameters (i.e. the per-row random effect) for the held-out observations have to be used, but it is quite unclear to me how that happens… Are those sampled as new levels from the hyperprior? Is a single sample then used to compute the log-likelihood? And should this be usable generally for all types of nuisance parameters?

avehtari · February 18, 2022, 5:42pm

Yes. It’s part of the rstanarm magic. rstanarm generates posterior draws of the random effect for a new group (this can be done in generated quantities by generating random draws from the population prior). rstanarm knows that when it is predicting for new data with a group factor level that is not part of the original data, it should use the random draws for the new group. The ability to use rstanarm’s prediction functions for new groups was there before kfold.

If by “nuisance” you mean useful group level parameters (don’t hurt their feelings by calling them nuisance), then yes.

martinmodrak · February 20, 2022, 9:03pm

Thanks for the answer! So, to check my understanding, I’ll try to frame this in more general terms:

I have a model for data y_1,...,y_k, and two sets of parameters \theta and \nu_1, ..., \nu_k (all of which are potentially vectors). The likelihood for the full model decomposes as:

p(y_1,...,y_k | \theta) = \prod_i p(y_i | \theta) \\ p(y_i | \theta) = \int p(y_i | \theta, \nu_i) p(\nu_i | \theta) \mathrm{d}\nu_i

Here, the \nu_i are the definitely non-nuisance and very respectable parameters.

When I am trying to perform k-fold CV, then after fitting without the i-th part of the data I have posterior samples \theta_{(-i),s}. I am interested in the quantity:

\log p(y_i | y_{-i}) = \log \int p(y_i | \theta_{-i}) p(\theta_{-i} | y_{-i}) \mathrm{d}\theta_{-i} \approx \\ \approx \log \frac{1}{S} \sum_s p(y_i | \theta_{(-i),s})

Where y_{-i} is the data without the i-th element and p(\theta_{-i} | y_{-i}) is the posterior from that fit.

If I understand Aki correctly, then I have two options for computing p(y_i | \theta_{(-i),s}):

Variant 1 - Explicit integration:

p(y_i | \theta_{(-i),s}) = \int p(y_i | \theta_{(-i),s}, \nu_i) p(\nu_i | \theta_{(-i),s}) \mathrm{d}\nu_i

Variant 2 - Sampling:

For each s, draw a single new sample \nu_{i,s} according to p(\nu_i | \theta_{(-i),s})
p(y_i | \theta_{(-i),s}) \approx p(y_i | \theta_{(-i),s}, \nu_{i,s})
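
To make the two variants concrete, here is a minimal Python sketch (standard library only) for one assumed special case: a Poisson observation whose rate is mu times a Gamma(phi, phi)-distributed \nu_i, so that Variant 1 has a closed form (the negative binomial). The function names, the fixed values of mu and phi, and the number of draws are all illustrative assumptions. \theta is held fixed here for simplicity, whereas in a real K-fold run each draw s would carry its own \theta_{(-i),s}.

```python
import math
import random

def nb2_pmf(y, mu, phi):
    """Variant 1: negative binomial pmf (mean mu, dispersion phi), i.e. the
    Poisson likelihood with the gamma-distributed multiplier integrated out."""
    log_p = (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
             + y * math.log(mu / (mu + phi)) + phi * math.log(phi / (mu + phi)))
    return math.exp(log_p)

def poisson_pmf(y, rate):
    """Poisson pmf, with a guard for a degenerate zero rate."""
    if rate <= 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(y * math.log(rate) - rate - math.lgamma(y + 1))

def sampled_pmf(y, mu, phi, rng):
    """Variant 2: draw one nu ~ Gamma(shape=phi, rate=phi), then condition on it."""
    nu = rng.gammavariate(phi, 1.0 / phi)  # gammavariate takes (shape, scale)
    return poisson_pmf(y, mu * nu)

rng = random.Random(1)
y_i, mu, phi, S = 3, 5.0, 1.0, 100_000
explicit = nb2_pmf(y_i, mu, phi)
# Average the densities over draws first; the log is taken afterwards.
sampled = sum(sampled_pmf(y_i, mu, phi, rng) for _ in range(S)) / S
print(explicit, sampled)  # should agree up to Monte Carlo error
```

Note that the average is taken over densities, before the log. Averaging log densities instead would not target the same quantity, by Jensen's inequality.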

Obviously, when feasible, Variant 1 should always be 100% fine. If I understand Aki right, then either

a) Variant 1 and Variant 2 are different estimators of the same elpd, possibly differing only in variance of the estimates or
b) (weaker variant) Variant 1 and Variant 2 will compute different values, but on average will lead to the same model comparison results.

Does that sound right? Am I missing something?

(I don’t need to see proofs or anything just trying to see if my understanding is correct)

Thanks very much!

A quick computational check supporting the answer a)

We’ll use the fact that the negative binomial can be rewritten as a Poisson with a gamma-distributed mean. Here is the Stan model nb.stan:

data {
  int<lower=0> N;
  array[N] int y;
}

parameters {
  real<lower=0> mu;
  real<lower=0> phi;
}

model {
  y ~ neg_binomial_2(mu, phi);
}

generated quantities {
  vector[N] log_lik_explicit;
  vector[N] log_lik_sample;
  for(n in 1:N) {
    real gamma_var = gamma_rng(phi, phi);
    log_lik_sample[n] = poisson_lpmf(y[n] | mu * gamma_var);
    log_lik_explicit[n] = neg_binomial_2_lpmf(y[n] | mu, phi);
  }
}
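
For readers who want the identity behind `log_lik_sample` and `log_lik_explicit` spelled out, here is a short derivation. It assumes the shape/rate parameterization of the gamma distribution, as in `gamma_rng(phi, phi)`, and writes g for the gamma-distributed multiplier (called `gamma_var` in the model).

```latex
\begin{aligned}
\int_0^\infty \mathrm{Poisson}(y \mid \mu g)\,\mathrm{Gamma}(g \mid \phi, \phi)\,\mathrm{d}g
&= \int_0^\infty \frac{(\mu g)^y e^{-\mu g}}{y!}\,\frac{\phi^\phi}{\Gamma(\phi)}\, g^{\phi-1} e^{-\phi g}\,\mathrm{d}g \\
&= \frac{\mu^y \phi^\phi}{y!\,\Gamma(\phi)} \int_0^\infty g^{y+\phi-1} e^{-(\mu+\phi) g}\,\mathrm{d}g \\
&= \frac{\Gamma(y+\phi)}{y!\,\Gamma(\phi)} \left(\frac{\mu}{\mu+\phi}\right)^{y} \left(\frac{\phi}{\mu+\phi}\right)^{\phi}
= \mathrm{NegBinomial2}(y \mid \mu, \phi)
\end{aligned}
```

So `log_lik_explicit` evaluates the integral exactly (Variant 1), while `log_lik_sample` evaluates the integrand at a single random g per posterior draw (Variant 2).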

Then we’ll compute 10-fold cross validation as well as directly using loo for a single fit:

library(rstan)
library(loo)
mod_nb <- stan_model("nb.stan")

N <- 50
y <- rnbinom(N, mu = 5, size = 1)

fold <- kfold_split_random(K = 10, N = N)

log_pd_kfold_explicit <- matrix(nrow = 4000, ncol = N)
log_pd_kfold_sample <- matrix(nrow = 4000, ncol = N)

seed <- 1565233
for (k in 1:10) {
  # Fit on all folds except k
  data_train <- list(y = y[fold != k], N = sum(fold != k))
  data_test <- list(y = y[fold == k], N = sum(fold == k))
  fit <- sampling(mod_nb, data = data_train, seed = seed, refresh = 0)
  # Re-run generated quantities on the held-out fold using the training-fold draws
  gen_test <- gqs(mod_nb, draws = as.matrix(fit), data = data_test)
  log_pd_kfold_explicit[, fold == k] <- extract_log_lik(gen_test, parameter_name = "log_lik_explicit")
  log_pd_kfold_sample[, fold == k] <- extract_log_lik(gen_test, parameter_name = "log_lik_sample")
}

(elpd_kfold_explicit <- elpd(log_pd_kfold_explicit))
# Computed from 4000 by 50 log-likelihood matrix using the generic elpd function
#
#      Estimate   SE
# elpd   -124.1  7.6
# ic      248.1 15.2

(elpd_kfold_sample <- elpd(log_pd_kfold_sample))
# Computed from 4000 by 50 log-likelihood matrix using the generic elpd function
#
#      Estimate   SE
# elpd   -124.5  7.8
# ic      248.9 15.6


# Comparison shows almost no difference
(lc <- loo_compare(elpd_kfold_explicit, elpd_kfold_sample))
#        elpd_diff se_diff
# model1  0.0       0.0   
# model2 -0.4       0.3  

# Directly using loo
fit_all <- sampling(mod_nb, data  = list(y = y, N = N))
# No problems with explicit form
loo_explicit <- loo(fit_all, pars = "log_lik_explicit")
# All k-hats are crazy high when using samples
loo_sample <- loo(fit_all, pars = "log_lik_sample")
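
For reference, the K-fold estimates printed above are, as far as I understand the generic `elpd()` function, a log-mean-exp over draws for each observation, summed over observations. The Python sketch below (standard library only) shows that calculation; the function names and the standard-error formula (square root of N times the sample variance of the pointwise values) are my assumptions about what is computed, not a copy of the loo source.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def elpd_pointwise(log_lik):
    """log_lik[s][i] = log p(y_i | draw s from a fit that excluded y_i)."""
    n = len(log_lik[0])
    return [log_mean_exp([row[i] for row in log_lik]) for i in range(n)]

def elpd_summary(log_lik):
    """Return (elpd estimate, standard error) from an S-by-N log-likelihood matrix."""
    pw = elpd_pointwise(log_lik)
    n = len(pw)
    total = sum(pw)
    mean = total / n
    var = sum((p - mean) ** 2 for p in pw) / (n - 1)  # sample variance
    return total, math.sqrt(n * var)
```

Because the mean over draws is taken inside the log, a single sampled nuisance draw per posterior draw is enough here: the extra randomness averages out across the S draws for each held-out observation.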

avehtari · February 21, 2022, 5:03pm

Yes, see, e.g.

Aki Vehtari, Tommi Mononen, Ville Tolvanen, Tuomas Sivula and Ole Winther (2016). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. Journal of Machine Learning Research, 17(103):1-38.

Define fine?

Yes, when you do leave-one-out CV.

It’s good to distinguish this from the two variants discussed in

Merkel, Furr, and Rabe-Hesketh (2019). Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods. Psychometrika 84:802-829.

where the other variant corresponds to leave-one-group-out, which is different if each latent parameter has more than one observation.
