Question on model stacking -- averaging posteriors of parameters, not predictions

The situation is as follows:

  • We are running a large model that takes quite a long time to fit; too long to realistically use the approach in the “leave-future-out” vignette
  • It includes a GP whose kernel has two parameters. We supply them as data. One of them has a fairly substantial impact on the predictions and inferences. (Unfortunately this kernel is not one of those that can take advantage of the basis function approach)

Ideally we would model the kernel parameters directly. Unfortunately this is not realistic for the number of timesteps we have (even with the new ability to move the Cholesky decomposition to the GPU).

What we are thinking of doing is the following:

  • Run several models in parallel with different kernel parameters
  • For each model, run a second version with a 90 day holdout period
  • Once the models are done fitting, take the elpd of the 90 day holdout period (by averaging the log density across draws for each timestep)
  • Use loo::stacking_weights to obtain model averaging weights
  • Weight the draws of the non-heldout models by these weights
  • Resample to get a weighted set of draws
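The last two steps (weight the retained models’ draws, then resample) can be sketched as follows. This is a toy illustration in Python/NumPy, not the actual pipeline; the model means, draw counts, and weights are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 models, 1000 posterior draws each of some scalar parameter.
n_models, n_draws = 4, 1000
draws = [rng.normal(loc=m, size=n_draws) for m in range(n_models)]

# Stacking weights (in practice these would come from loo::stacking_weights).
weights = np.array([0.1, 0.2, 0.6, 0.1])

# Resample: pick a model for each draw according to its weight,
# then pick a draw uniformly from that model's posterior sample.
model_idx = rng.choice(n_models, size=n_draws, p=weights)
stacked = np.array([rng.choice(draws[k]) for k in model_idx])
```

The result is a single set of draws whose composition reflects the model weights, which is the mixture-of-posteriors interpretation of stacking.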

Is this… okay? Any watchouts with this approach? Any recommendations?


Tagging @avehtari to make sure he doesn’t miss this.


In general it’s OK; for example, Adrian Raftery did this kind of holdout-based model averaging for weather models.

Just note that with holdout elpd, you don’t need to pass a “list of pointwise log-likelihood matrices” to stacking_weights; instead you need to pass an object that looks like a loo object. There’s an example at the end of the help text; see the part after the comment “# calling stacking_weights or pseudobma_weights directly”.

See also relevant discussion in Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors.


Thank you @avehtari and @maxbiostat

Just note that with holdout elpd, you don’t need to pass a “list of pointwise log-likelihood matrices” to stacking_weights; instead you need to pass an object that looks like a loo object. There’s an example at the end of the help text; see the part after the comment “# calling stacking_weights or pseudobma_weights directly”.

The documentation was clear enough that figuring this out was no problem. The one detail I didn’t go into, and that I think is correct, is how I calculate the held-out elpd: I take the original [n_draws, n_timesteps] log_lik matrix and take the mean across draws, so that I end up with a vector of length n_timesteps (for each model). The rationale is that since these log likelihoods are on actually-held-out data, I don’t need to do importance weighting to get the expectation. I believe that’s right, but figured I’d ask.


You need to take the mean in linear space, e.g., compute log(colMeans(exp(log_lik))). See the vignette about helper functions for the holdout case: Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo


Well I’m super glad I asked! I was getting very sparse weights when there didn’t seem to be huge differences between models; this may be why.

Maybe I’m dense, but wouldn’t this be “log expected predictive density” rather than “expected log predictive density”? That is, we’re taking the log of the expectation of the linear density, not the expectation of the log of the density.


That’d be better done using log_sum_exp, right? If one can’t do this inside a Stan program, the R package matrixStats has a handy logSumExp function. I’m sure NumPy has a stable implementation as well.
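For reference, the stable version just subtracts the maximum before exponentiating, so nothing overflows or underflows. A minimal NumPy sketch of the same trick that matrixStats::logSumExp (and scipy.special.logsumexp) implement:

```python
import numpy as np

def log_sum_exp(x):
    # Stable log(sum(exp(x))): shift by the max so exp stays in range.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([-1000.0, -1001.0, -1002.0])
naive = np.log(np.sum(np.exp(x)))  # exp underflows to 0, giving -inf
stable = log_sum_exp(x)            # finite, about -999.59
```

Stan also provides a built-in log_sum_exp function, so the same computation can be done inside a Stan program as well.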


“Predictive density” implies expectation over the posterior, and the “expected” in elpd refers to the expectation over the unknown future data distribution. See Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC and A survey of Bayesian predictive methods for model assessment, selection and comparison for more details.

I showed that code to illustrate the order of the computation, not as the recommended way to do it. I referred to the loo vignette for the actual computation, as it has something easier and specifically designed for this case (which internally does the log-sum-exp trick).


Stacking tends to give sparse weights if the models are very similar. There is no need to average over the models if their predictive distributions are similar. @yuling demonstrates this in the appendix of Bayesian hierarchical stacking.
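One way to see this: if two models have (nearly) identical pointwise predictive densities, the stacking objective is flat in the weights, so the optimizer is free to land on a corner of the simplex. A toy check with hypothetical lpd values:

```python
import numpy as np

# Pointwise log predictive densities for two models with identical columns.
lpd = np.tile(np.array([[-1.2], [-0.7], [-2.1]]), (1, 2))  # [n_points, n_models]
dens = np.exp(lpd)

def stacking_objective(w):
    # Sum over points of the log of the weighted mixture density.
    return np.sum(np.log(dens @ w))

# Any point on the simplex gives the same objective value,
# so a sparse solution is just as good as an even split.
corner = stacking_objective(np.array([1.0, 0.0]))
middle = stacking_objective(np.array([0.5, 0.5]))
```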


Got it, this makes sense!

Interesting. So, our goal is not to make predictions about the future, but to obtain accurate inferences about unobserved parameters. The idea is to use holdout accuracy to obtain weights with which to stack the posteriors of these parameters from different models. It is often the case that multiple models will give similar predictions, but will have very different estimates of the parameters. Is using the stacking weights still the right approach here?


None of the common model averaging approaches consider how different the posteriors are. Stacking optimizes the weights to get the best predictions, while BMA with model probabilities or LOO-weights (in the stacking paper we called this Pseudo-BMA(+)) weights models by prior-predictive or LOO-predictive performance, respectively, so that models with similar performance get similar weights. Since you are doing the model averaging specifically as approximate integration over some parameters, the closest approximation would be BMA with model probabilities, but that is challenging to compute. The next closest would be LOO-weights. And stacking might give you better predictions. You could test with a smaller model and data, so that you could run SBC, how much difference there is in inferring the desired posterior with different approaches (sample everything, BMA with LOO-weights, and stacking).
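For reference, plain Pseudo-BMA weights (without the “+” Bayesian-bootstrap regularization described in the stacking paper) are just a softmax over the models’ total elpds. A sketch with made-up elpd values:

```python
import numpy as np

# Hypothetical total holdout elpd for each of four models.
elpd = np.array([-13.4, -5.7, 0.0, -9.8])

# Pseudo-BMA: w_k proportional to exp(elpd_k),
# computed stably by subtracting the max before exponentiating.
z = elpd - elpd.max()
w = np.exp(z) / np.exp(z).sum()
```

Because the weights are exponential in elpd differences, even modest gaps in total elpd concentrate nearly all the weight on the best model; the “+” variant regularizes this using the uncertainty in the pointwise elpds.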


@avehtari this is all very clear, thank you for the detail and responsiveness!!

I think for now we will go with Pseudo-BMA(+), since it’s very similar to what we were already planning, and it’ll give similar weights to models with similar performance (which is what we want).

The idea of making the stacked model more “robust” (in a sense) by giving higher weight to models that have different posteriors, even if they arrive at similar predictions, is interesting, at least to me. Intuitively, it’s like saying “these models all got to the same place, but in very different ways; based on predictive accuracy alone we have no way of knowing which ‘way’ is correct, so we should average across them”. But I don’t really have enough to sink my teeth into in terms of how to do that, so I will stick with something approximate for now.


Actually, I had another idea, after reading Bayesian Hierarchical Stacking.

The size of our input data relative to the number of holdout observations is probably too large to make the weights w depend on the inputs x, but we can use the idea of estimating the stacking weights with a Bayesian model rather than purely optimizing them. I wrote a simple Stan model (using the paper as a reference, but much simpler, since the only parameters are the weights) to do this:

data {
  int<lower=0> K;              // number of models
  int<lower=0> N;              // number of data points
  matrix[N, K] elpd_point;     // pointwise log predictive density
  real<lower=0> alpha;         // Dirichlet prior concentration
}
transformed data {
  vector[K] alpharep = rep_vector(alpha, K);
  matrix[N, K] exp_lpd_point = exp(elpd_point);
}
parameters {
  simplex[K] w;
}
model {
  w ~ dirichlet(alpharep);
  for (n in 1:N) {
    // equivalently (and more stably): target += log_sum_exp(elpd_point[n]' + log(w));
    target += log(exp_lpd_point[n] * w);
  }
}
The results go from:

   stacking_weight pseudo_bma_weight
m1      0.00000003            0.00008
m2      0.00000216            0.03071
m3      0.99999771            0.96889
m4      0.00000010            0.00032


to the posterior means of w from the Bayesian model:

w[1] w[2] w[3] w[4] 
0.09 0.18 0.61 0.11 

and for reference when using loo_compare, I get:

   elpd_diff se_diff
m3   0.0       0.0  
m2  -5.7       2.6  
m4  -9.8       2.3  
m1 -13.4       3.3  

The “Bayesian model stacking” approach gives results that make much more intuitive sense to me. This seems equivalent to the model in the paper, except for the omission of the dependence on x. The question I have is basically: can you do this?


Yes, using the posterior mean of w rather than the MAP makes sense; we refer to it as the Bayesian solution of stacking. Ideally you would test the different weighting methods on another independent holdout set.