Implementation of model stacking using exact leave-one-out cross-validation

Hi there! I would like to implement model stacking using exact leave-one-out cross-validation for a problem at hand. However, I have a pharmaceutical/medical background and I am lacking a bit of the statistical notation and terminology. So far I’ve read Vehtari et al., 2016 and Yao et al., 2018, and watched some of Aki Vehtari’s (@avehtari) videos on this topic. They are really helpful; however, I would like to double-check whether my proposed workflow is correct. For the sake of my understanding I plan to implement it from scratch. This is my proposed workflow, based on the two papers mentioned above:

We consider K linear candidate models M_k which we want to combine via Bayesian model stacking for a given dataset with N observations. For simplicity we assume each model has only one parameter \theta, the slope.

  1. For each M_k we write a representation in Stan.
  2. We split our data into folds for cross-validation (CV). If the number of folds is much smaller than N we can call it exact K-fold CV, and if every fold holds out a single observation (N folds) we can call it exact leave-one-out (LOO) CV. Let’s assume we go for exact LOO CV.
  3. For each fold we have N-1 data points y_{-i} on which each model M_k is fitted, resulting in S posterior draws \theta_{k,s} per model M_k.
  4. Additionally we have one held-out data point y_i per fold. For each \theta_{k,s} we can evaluate model M_k and calculate the likelihood of the held-out y_i, giving p(y_i|\theta_{k,s}). This likelihood is calculated from \theta_{k,s} and the specified error model.
  5. Per model M_k we can then calculate the overall likelihood for y_i across the S posterior draws:
p(y_i|y_{−i}, M_k) = \frac{1}{S} \sum_{s=1}^S p(y_i|\theta_{k,s})
  6. Choosing the logarithmic scoring rule, we can define the optimisation problem of finding the optimal weight vector w, constrained to the simplex S_1^K = \{w \in [0, 1]^K : \sum_{k=1}^K w_k = 1\}, as:
\max_{w \in S_1^K} \frac{1}{n} \sum_{i=1}^n \log \sum_{k=1}^K w_k \, p(y_i|y_{-i}, M_k)
  7. We can use these optimised w to linearly combine the posterior predictive draws of the models M_k (e.g., drawing from each model’s posterior predictive distribution in proportion to w_k) to obtain the final predictive distribution for a potential future data point. (A from-scratch sketch of steps 2-6 follows below.)
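
To make the workflow concrete, here is a minimal from-scratch sketch in R. The toy data, the single-program intercept switch, and all variable names are my own illustration (not from the papers), assuming rstan and matrixStats are installed:

```r
library(rstan)
library(matrixStats)  # for logSumExp()

set.seed(1)
N <- 20
x <- rnorm(N)
y <- 1.5 * x + rnorm(N, sd = 0.7)  # toy data

# Step 1: one Stan program covering both candidate models via a data switch:
# M_1 (use_intercept = 0): slope only, M_2 (use_intercept = 1): intercept + slope.
stan_code <- "
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  int<lower=0, upper=1> use_intercept;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
  y ~ normal(use_intercept * alpha + beta * x, sigma);
}
"
mod <- stan_model(model_code = stan_code)

K <- 2
lpd <- matrix(NA_real_, N, K)  # lpd[i, k] = log p(y_i | y_{-i}, M_k)

# Steps 2-5: exact LOO, i.e. N refits per model (slow but exact).
for (k in 1:K) {
  for (i in 1:N) {
    fit <- sampling(mod, refresh = 0,
                    data = list(N = N - 1, x = x[-i], y = y[-i],
                                use_intercept = k - 1))
    draws <- as.data.frame(fit)  # S posterior draws theta_{k,s}
    # log p(y_i | theta_{k,s}) from the normal error model, per draw
    ll_i <- dnorm(y[i],
                  mean = (k - 1) * draws$alpha + draws$beta * x[i],
                  sd = draws$sigma, log = TRUE)
    # Step 5 on the log scale: log( (1/S) * sum_s p(y_i | theta_{k,s}) )
    lpd[i, k] <- logSumExp(ll_i) - log(length(ll_i))
  }
}

# Step 6: maximise the stacking objective over the simplex; a softmax
# reparameterisation turns it into an unconstrained problem for optim().
neg_score <- function(eta) {
  w <- exp(c(0, eta)) / sum(exp(c(0, eta)))
  # log sum_k w_k p(y_i|y_{-i},M_k) = logSumExp over k of (log w_k + lpd[i,k])
  -mean(apply(sweep(lpd, 2, log(w), "+"), 1, logSumExp))
}
opt <- optim(rep(0, K - 1), neg_score, method = "BFGS")
w_hat <- exp(c(0, opt$par)) / sum(exp(c(0, opt$par)))  # step 7 weights
round(w_hat, 3)
```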

My questions:

  • Is this workflow in general correct?
  • I am a bit confused that code snippets often define log_lik[i] in the generated quantities { } block of Stan. To my understanding the generated quantities { } block is evaluated once per draw, and for the maximisation problem (step 6) we need to apply the weights to the likelihoods on the natural (non-log) scale before summing over the K models and only then taking the log. If I generate log_lik[i] per draw I would need to bring it back to the natural scale, presumably with the log-sum-exp trick to avoid underflow. Also the loo package seems to take an S \times n log-likelihood matrix as an argument to do its calculations. Is this correct or do I have a mistake in my thinking? Not unlikely that I am mixing something up. (A small numeric illustration of the underflow I am worried about follows after these questions.)
  • If we use the generated quantities { } block to get the (log-)likelihood, we get it for each draw. But in the end we are interested in the LOO predictive density for y_i, which involves both the variability across the S draws and the error model. Can we calculate this predictive density simply by taking the mean of the per-draw likelihoods across the S draws, as in step 5? I guess this is where I am lacking the statistical background.
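
To make the underflow concern concrete, here is a tiny numeric illustration with made-up log-likelihood values (not from any real fit):

```r
library(matrixStats)

ll <- c(-750, -751, -752)        # hypothetical log p(y_i | theta_s) for S = 3 draws
mean(exp(ll))                    # 0 -- exp() underflows in double precision
logSumExp(ll) - log(length(ll))  # approx. -750.69: stable log of the mean likelihood
```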

Thanks a lot in advance!

See Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo for instructions on how to do K-fold CV with Stan. Following the instructions, you get an elpd_kfold object which you can pass to the stacking_weights() function.
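
For example, a minimal sketch of the last step, where log_lik_1 and log_lik_2 are placeholder names for the S \times n held-out log-likelihood matrices you assemble fold by fold following the vignette:

```r
library(loo)
library(matrixStats)

# lpd_point[i, k] = log p(y_i | y_{-i}, M_k), one column per candidate model
lpd_point <- cbind(colLogSumExps(log_lik_1) - log(nrow(log_lik_1)),
                   colLogSumExps(log_lik_2) - log(nrow(log_lik_2)))
stacking_weights(lpd_point)  # optimal simplex weights w
```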


Thanks @avehtari for the quick reply! The vignette is helpful; however, I prefer to implement it once from scratch before relying on a package, just for the sake of my own understanding.

I guess my main confusion came from the use of log-likelihoods when the main equations (like the optimisation function)

\max_{w \in S_1^K} \frac{1}{n} \sum_{i=1}^n \log \sum_{k=1}^K w_k \, p(y_i|y_{-i}, M_k)

rely on plain likelihoods. I am not too familiar with the numerical over- and underflow issues that arise when switching between log-likelihood and likelihood, so I was not sure if I was overlooking something, as you could theoretically also calculate the plain likelihood directly in the generated quantities block, right? I guess Stan is just more stable with log-likelihoods? After digging a bit into the loo R code it makes more sense now. In the elpd function the elpd gets calculated by elpd <- colLogSumExps(ll) - log(nrow(ll)), so I guess the log-sum-exp trick is applied at this point to avoid any over- or underflow. Otherwise we are essentially calculating the mean likelihood for y_i across the S posterior draws and then taking the \log to get the elpd_i. Then in the optimisation function of loo_model_weights the weights are applied to the exponentiated elpds, which is in line with the equation from the original paper. (A quick numeric check of this equivalence follows below.)
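
As a quick sanity check (made-up numbers), applying the weights to the exponentiated elpds agrees with staying entirely on the log scale:

```r
library(matrixStats)

lpd_i <- c(-3.2, -2.9)      # elpd_i for K = 2 models (illustrative values)
w     <- c(0.3, 0.7)
log(sum(w * exp(lpd_i)))    # weights applied to exponentiated elpds
logSumExp(log(w) + lpd_i)   # same value, but safe when elpd_i is very negative
```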

So I guess I can give it a try with a simple toy example to see if I can implement it from scratch. But does the overall workflow look alright or am I missing something? Thanks a lot for your support!

To be precise, p(y_i|y_{-i}, M_k) is a predictive density and not a likelihood. The possible confusion arises because in the PSIS-LOO computation we compute pointwise (log-)likelihoods in order to be able to remove the corresponding contribution from the full-data posterior.

It’s not just Stan; this applies anywhere floating point representation is used. I’ve collected some links about floating point accuracy in my BDA course material that you may find useful.

Yes.

…calculating an estimate for posterior predictive density (as an empirical mean of predictive densities conditional on posterior draws)…

Yes.

Yes, and you can then compare to the results by code I pointed to.


Thanks a lot, your help is highly appreciated. And sorry for the terminology issues.

Perfect, I will give it a shot.

No need to be sorry. Our documentation is not perfect and confusion is likely.