Stacked elpd_loo estimation

Hello,

I’m working on a project with Bayesian stacking of stan_glm() linear models in rstanarm. To compare the performance of the stack to the individual member models (stored in fits_list), I want to calculate the elpd_loo of the stack, but I am unsure how to do so correctly. Here’s a code snippet of how I calculate the stacked predictions for the outcome “y”:

# Compute posterior predictive density 

ypredStack <- matrix(NA, nrow = n_draws, ncol = n_obs)
loglik_mat <- matrix(NA, nrow = n_draws, ncol = n_obs)

for (d in 1:n_draws) {
  k <- sample(seq_along(wtsStacking), size = 1, prob = wtsStacking)
  ypredStack[d, ] <- posterior_predict(fits_list[[k]], draws = 1)
  loglik_mat[d,] <- t(as.matrix(log_lik(fits_list[[k]])[d,])) #is this correct?
}
stacked_loo <- loo(loglik_mat) 

The above code takes 1 draw from the log-likelihood matrix of the chosen model for each iteration over all draws. However, the loo results from this matrix show that the stack performs notably worse than 6 of the 7 individual linear models in the ensemble, which is confusing to me.

My question is - is this method a valid way to extract a log likelihood matrix for my ensemble? If not, is there a better method that allows me to make elpd_loo (or some other metric) comparisons directly?

Thank you for any and all guidance!

With Bayesian model averaging, where each model gets a weight in a simplex, you can weight the likelihoods to form a mixture and then move to the log scale to get a log likelihood. Because stacking weights are also restricted to a simplex, you can define a mixture in the same way.

In practice, this means for each of the data items in your data set, you need to calculate a weighted average of the likelihoods on the log scale. With N data items and K models, it’s

log_lik[n] = log(SUM_k w[k] * likelihood[n, k])
           = LOG_SUM_EXP_k (log(w[k]) + log(likelihood[n, k]))

where w[k] is the weight of the k-th model with w forming a simplex (non-negative values that sum to 1), and likelihood[n, k] is the likelihood of the n-th data item in the k-th model.
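Here is a minimal sketch of that mixture computation in R. Toy random matrices stand in for what `log_lik(fits_list[[k]])` would return from real fits, and `wtsStacking` here is a made-up 3-model simplex; only the accumulation logic is the point.

```r
# Toy stand-ins: with real models these would come from log_lik(fits_list[[k]])
# and loo_model_weights(); here they are simulated for illustration only.
set.seed(1)
K <- 3; S <- 100; N <- 10                      # models, posterior draws, observations
loglik_list <- lapply(1:K, function(k) matrix(rnorm(S * N, mean = -1), S, N))
wtsStacking <- c(0.5, 0.3, 0.2)                # simplex: non-negative, sums to 1

# Numerically stable elementwise log-sum-exp of two matrices
log_sum_exp2 <- function(a, b) {
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Accumulate log(SUM_k w[k] * likelihood[n, k]) across models,
# staying on the log scale throughout
loglik_stack <- matrix(-Inf, S, N)
for (k in 1:K) {
  loglik_stack <- log_sum_exp2(loglik_stack,
                               log(wtsStacking[k]) + loglik_list[[k]])
}

# With real fits, loglik_stack would then be passed to loo::loo()
```

Note that the same S x N shape is kept throughout, so the resulting matrix can be passed to `loo()` exactly like a single model's log-likelihood matrix.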


@Bob_Carpenter did give the correct answer, but a further complication is that the weights have been optimized to maximize elpd, so the resulting weighted elpd is over-optimistic. To get a fair comparison against the individual models, the stacking procedure needs to be cross-validated, too (e.g., using K-fold-CV).

Cool. Thanks for the clarification.

Thank you both for the feedback and clarifications. I’ll keep this in mind going forward!