LOO for Multivariate Probit

Continuing the discussion from Feedback request: Multivariate Probit Regression with GP:

Brief problem summary: @bgoodri has developed this really handy implementation of the multivariate probit (code below). Unfortunately, it universally gets terrible k-hats in LOO. @martinmodrak made the excellent point in another thread and I wanted to continue discussion here to see if this is a viable way to get a robust LOO estimate (or whether this model is simply a poor candidate for LOO):

Pareto-K is almost guaranteed to be bad if you just add up the contributions to target from this parametrization. In this case, you don’t compute the log likelihood of the observed values given the linear predictors. You actually compute the log likelihood given the linear predictors AND the nuisance parameters. Since the observed values have huge influence on the associated nuisance parameters, loo correctly treats them as having large influence on the model and thus having high k-hat.

Martin goes on to suggest:

It could also make some kind of weird sense to compute the multivariate normal log likelihood of the nuisance parameters given the linear predictors and feed that to loo, but I can’t think completely clearly if that would correspond to a meaningful quantity or not.

I can very much see the logic behind this, but I don’t understand either LOO or the MVProbit implementation sufficiently to determine whether such a version of LOO would make sense (maybe @avehtari could comment?).

Thanks in advance!

Here is @bgoodri’s code (taken from example-models/probit-multi-good.stan at master · stan-dev/example-models · GitHub):

data {
  int<lower=1> K;
  int<lower=1> D;
  int<lower=0> N;
  array[N, D] int<lower=0, upper=1> y;
  array[N] vector[K] x;
}
parameters {
  matrix[D, K] beta;
  cholesky_factor_corr[D] L_Omega;
  array[N, D] real<lower=0, upper=1> u; // nuisance that absorbs inequality constraints
}
model {
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(beta) ~ normal(0, 5);
  // implicit: u is iid standard uniform a priori
  {
    // likelihood
    for (n in 1 : N) {
      vector[D] mu;
      vector[D] z;
      real prev;
      mu = beta * x[n];
      prev = 0;
      for (d in 1 : D) {
        // Phi and inv_Phi may overflow and / or be numerically inaccurate
        real bound; // threshold at which utility = 0
        bound = Phi(-(mu[d] + prev) / L_Omega[d, d]);
        if (y[n, d] == 1) {
          real t;
          t = bound + (1 - bound) * u[n, d];
          z[d] = inv_Phi(t); // implies utility is positive
          target += log1m(bound); // Jacobian adjustment
        } else {
          real t;
          t = bound * u[n, d];
          z[d] = inv_Phi(t); // implies utility is negative
          target += log(bound); // Jacobian adjustment
        }
        if (d < D) {
          prev = L_Omega[d + 1, 1 : d] * head(z, d);
        }
        // Jacobian adjustments imply z is truncated standard normal
        // thus utility --- mu + L_Omega * z --- is truncated multivariate normal
      }
    }
  }
}
generated quantities {
  corr_matrix[D] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);
}

Yep, @martinmodrak’s point is accurate. You can see the same behavior when implementig overdispersed Poisson with n nuisance parameters Bayesian data analysis - roaches cross-validation demo

I recommend cross-validating in the old fashioned way, that is, do K-fold-CV by re-fitting the model K times. There is vignette that might help Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo

2 Likes

I tried reading how that’s done in the roaches example and I admit I am at loss on how the kfold method is able to compute the likelihood for the held-out observations - at some point the nuisance parameters (i.e. the per-row random effect) for the held-out observations have to be used, but it seems quite unclear to me how that happens… Are those sampled as new levels from the hyperprior? Is then a single sample used to compute log-likelihood? And should this be usable generally for all types of nuisance parameters?

Yes. It’s part of rstanarm magic. rstanarm generates posterior draws for the random effect for a new group (this can be done in generated quantities by generating random draws from the population prior). rstanarm know that when it is predicting for new data with group factor that is not part of the original data, it uses the random draws for the new group. Able to use rstanarm predictions functions for new groups was there before kfold.

If by “nuisance” you mean useful group level parameters (don’t hurt their feelings by calling them nuisance), then yes.

2 Likes

Thanks for the answer! So, to check my understanding, I’ll try to frame this in more general terms:

I have a model for data y_1,...,y_k, and two sets of parameters \theta and \nu_1, ..., \nu_k (all of which are potentially vectors). The likelihood for the full model decomposes as:

p(y_1,...,y_k | \theta) = \prod_i p(y_i | \theta) \\ p(y_i | \theta) = \int p(y_i | \theta, \nu_i) p(\nu_i | \theta) \mathrm{d}\nu_i

Here, the \nu_i are the definitely non-nuisance and very respecatble parameters.

When I am trying to perform k-fold CV, than after fitting without i-th part of the data I have posterior samples \theta_{(-i),s}. I am interested in the quantity:

\log p(y_i | y_{-i}) = \log \int p(y_i | \theta_{-i}) p(y{-i} | \theta_{-i}) \mathrm{d}\theta_{-i} \approx \\ \approx \log \frac{1}{S} \sum_s p(y_i | \theta_{(-i),s})

Where y_{-i} is the data without i-th element.

If I understand Aki correctly, then I have two two options how to compute p(y_i | \theta_{(-i),s}):

Variant 1 - Explicit integration:

p(y_i | \theta_{(-i),s}) = \int p(y_i | \theta_{(-i),s}, \nu_i) p(\nu_i | \theta_{(-i),s}) \mathrm{d}\nu_i

Variant 2 - Sampling:

  1. For each s, draw a single new sample \nu_{i,s} according to p(\nu_i | \theta_{(-i),s})
  2. p(y_i | \theta_{(-i),s}) \approx p(y_i | \theta_{(-i),s}, \nu_{i,s})

Obviously, when feasible, Variant 1 should always be 100% fine. If I understand Aki right, then either

a) Variant 1 and Variant 2 are different estimators of the same elpd, possibly differing only in variance of the estimates or
b) (weaker variant) Variant 1 and Variant 2 will compute different values, but on average will lead to the same model comparison results.

Does that sound right? Am I missing something?

(I don’t need to see proofs or anything just trying to see if my understanding is correct)

Thanks very much!

A quick computational check supporting the answer a)

We’ll use the fact the neg. binomial can be rewritten as Poisson with gamma-distributed mean. Here is the Stan model nb.stan:

data {
  int<lower=0> N;
  int y[N];
}

parameters {
  real<lower=0> mu;
  real<lower=0> phi;
}

model {
  y ~ neg_binomial_2(mu, phi);
}

generated quantities {
  vector[N] log_lik_explicit;
  vector[N] log_lik_sample;
  for(n in 1:N) {
    real gamma_var = gamma_rng(phi, phi);
    log_lik_sample[n] = poisson_lpmf(y[n] | mu * gamma_var);
    log_lik_explicit[n] = neg_binomial_2_lpmf(y[n] | mu, phi);
  }
}

Then we’ll compute 10-fold cross validation as well as directly using loo for a single fit:

library(rstan)
library(loo)
mod_nb <- stan_model("nb.stan")

N <- 50
y <- rnbinom(N, mu = 5, size = 1)

fold <- kfold_split_random(K = 10, N = N)

log_pd_kfold_explicit <- matrix(nrow = 4000, ncol = N)
log_pd_kfold_sample <- matrix(nrow = 4000, ncol = N)

seed <- 1565233
for(k in 1:10){
  data_train <- list(y = y[fold != k],
                     N = sum(fold != k)
  )
  data_test <- list(y = y[fold == k], N = sum(fold == k))
  fit <- sampling(mod_nb, data = data_train, seed = seed, refresh = 0)
  gen_test <- gqs(mod_nb, draws = as.matrix(fit), data= data_test)
  log_pd_kfold_explicit[, fold == k] <- extract_log_lik(gen_test,parameter_name = "log_lik_explicit")
  log_pd_kfold_sample[, fold == k] <- extract_log_lik(gen_test,parameter_name = "log_lik_sample")
}

(elpd_kfold_explicit <- elpd(log_pd_kfold_explicit))
# Computed from 4000 by 50 log-likelihood matrix using the generic elpd function
#
#      Estimate   SE
# elpd   -124.1  7.6
# ic      248.1 15.2

(elpd_kfold_sample <- elpd(log_pd_kfold_sample))
# Computed from 4000 by 50 log-likelihood matrix using the generic elpd function
#
#      Estimate   SE
# elpd   -124.5  7.8
# ic      248.9 15.6


# Comparison shows almost no difference
(lc <- loo_compare(elpd_kfold_explicit, elpd_kfold_sample))
#        elpd_diff se_diff
# model1  0.0       0.0   
# model2 -0.4       0.3  

# Directly using loo
fit_all <- sampling(mod_nb, data  = list(y = y, N = N))
# No problems with explicit form
loo_explicit <- loo(fit_all, pars = "log_lik_explicit")
# All k-hats are crazy high when using samples
loo_sample <- loo(fit_all, pars = "log_lik_sample")

1 Like

Yes, see, e.g.

Define fine?

Yes, when you do leave-one-out CV.

It’s good to make difference to two variants discussed in

where the other variant corresponds to leave-one-group-out which is different if each latent parameter has more than one observation.

1 Like