Calculating elpd values for a Stan model (K-fold cross-validation)

Hi,

I’m trying to use K-fold cross-validation as defined in Aki et al.’s (2015) paper, ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, to estimate the elpd.

I have divided my dataset into 10 folds and updated my model to account for training and testing datasets. This gives me 10 different log_lik matrices (one for each model run). I’m confused about how to combine these 10 matrices into one to calculate the elpd values using log(mean(exp(combined_matrix))) (as per the paper).

I’d appreciate any and all help since I’m new to all this. Thanks!

Best,
Ruchi

Let’s say you have 10 observations and divide these into 5 blocks (I use small numbers for illustration)
{1,2} {3,4} {5,6} {7,8} {9,10}
First you leave out the block {1,2} and fit the model with all the other blocks, and compute the test log_lik’s for observations 1 and 2. When you repeat this for all the blocks, in the end you have computed a test log_lik for each observation, and you should have 5 log_lik matrices with Sx2 elements (where S is the number of draws). You then combine these into one matrix with Sx10 elements and then compute log(mean(exp(combined_matrix))). If you divided your data in randomly permuted order, construct your combined matrix so that the columns for the observations are in the original observation order.
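
For concreteness, here is a minimal numpy sketch of that combination step and the columnwise log(mean(exp(.))) (my own illustration, not code from the paper; the function and variable names are hypothetical, and I use scipy’s logsumexp for numerical stability):

import numpy as np
from scipy.special import logsumexp

def combine_folds(log_lik_folds, held_out_idx):
    # stack the per-fold S x n_k test log_lik matrices into one S x n matrix
    combined = np.hstack(log_lik_folds)
    # held_out_idx[k] holds the original indices of fold k's observations;
    # argsort puts the columns back into the original observation order
    order = np.concatenate(held_out_idx)
    return combined[:, np.argsort(order)]

def elpd_from_combined(combined):
    # log(mean(exp(.))) over the draws for each observation, computed stably
    S = combined.shape[0]
    elpd_i = logsumexp(combined, axis=0) - np.log(S)  # one value per observation
    return elpd_i.sum(), elpd_i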

Did this clarify the computation?


Yes, thank you so much for this!

@avehtari
Just curious: could K-fold cross-validation be done within a single model?
I.e., within one Stan file, have “5 models”, each with some N left out, then in generated quantities estimate log_lik for the respective left-out N?

I have seen someone do that, but I would not, because then you also need to modify the parameters and model blocks, and you may run into memory problems. Running the same model K times should not take more time (the same amount of likelihood computations, but less memory), and then you only need to modify the data and generated quantities blocks.
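
A minimal sketch of that workflow with CmdStanPy (assumptions: the splitting scheme, the data block field names N_train/N_test/y_train/y_test, and the dummy y are all hypothetical; the model is assumed to compute log_lik for the test data in generated quantities):

import numpy as np
from cmdstanpy import CmdStanModel

y = np.random.default_rng(1).normal(size=100)  # stand-in for your data
N, K = len(y), 10
model = CmdStanModel(stan_file='model.stan')   # the same model, reused for every fold

folds = np.array_split(np.random.default_rng(2).permutation(N), K)
log_lik_folds, held_out_idx = [], []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(N), test_idx)
    data = {'N_train': train_idx.size, 'N_test': test_idx.size,
            'y_train': y[train_idx], 'y_test': y[test_idx]}
    fit = model.sample(data=data)
    log_lik_folds.append(fit.stan_variable('log_lik'))  # draws x N_test
    held_out_idx.append(test_idx)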

Sorry for reviving this old thread. I’m a little confused about the way elpd is computed here. Why are we combining all the log_lik matrices into an Sx10 matrix and then running log(mean(exp(combined_matrix)))? Doesn’t this contradict equations (20) and (21) in said paper?

\widehat{elpd}_i = \log \bigg( \frac{1}{S} \sum_{s=1}^S p(y_i \mid \theta^{k,s}) \bigg)

\widehat{elpd}_{xval} = \sum_{i=1}^n \widehat{elpd}_i

I thought that one should do this instead:
For each partition in {{1,2},{3,4},…,{9,10}}:

  1. We have the Sx2 matrix (let’s call it log_lik_mat_sub)
  2. We exponentiate each element and take the columnwise mean to get a 1x2 vector
  3. Take the log of each element of the 1x2 vector and take the mean of this vector

We then have 5 mean values [m_1, m_2, ..., m_5] and we then take the mean of this vector (i.e. \widehat{elpd}_{xval} = \frac{1}{5} \sum_{i=1}^5 m_i).

Would be really helpful if someone could clarify!

With 10 observations i=1,…,10, we leave two observations out at a time, but in the above we are still assuming pointwise predictions, which can be collected into an Sx10 matrix.

I’m not able to parse this sentence.

The mean of the 5 means of pairs is the same as the mean of the 10 values, but there is a slight difference in the variance computation (the mean of squares is different from the square of the mean), and in this case we prefer to make the computation similar to LOO.
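
Reusing elpd_i from the sketch above (here a length-10 vector of pointwise values, with each left-out pair in consecutive positions), the two point estimates agree but the standard errors differ slightly:

pair_means = elpd_i.reshape(5, 2).mean(axis=1)   # one mean per left-out pair
elpd_i.mean()                                    # identical to pair_means.mean()
elpd_i.std(ddof=1) / np.sqrt(10)                 # pointwise SE (similar to LOO)
pair_means.std(ddof=1) / np.sqrt(5)              # pairwise SE, slightly different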

Thanks very much for the response!

Sorry about the lack of clarity in my previous post. Realised that the second sentence wasn’t legible at all.

To be clear, I’m mainly unsure about the command “log(mean(exp(combined_matrix)))”.

Formally, let X \in \mathbb{R}^{S \times 10} denote the matrix of log-likelihoods (i.e. X = combined_matrix).

Is the command doing:

\textbf{Version A}

  1. For every element x_{ij} \in X, let y_{ij} = \exp(x_{ij}).
  2. Compute the mean of all y_{ij}. That is, compute z = \frac{1}{S \times 10} \sum_{i=1}^S \sum_{j=1}^{10} y_{ij}.
  3. Then let \widehat{elpd}_{xval} = \log(z)

Or is it doing:

\textbf{Version B}

  1. For every element x_{ij} \in X, let y_{ij} = \exp(x_{ij}).
  2. Compute the mean for each column. That is, \forall j \in \{1,2,...,10\}, compute z_j = \frac{1}{S} \sum_{i=1}^S y_{ij}.
  3. Then let \widehat{elpd}_{xval} = \sum_{j=1}^{10} \log(z_j)

I was under the impression that Version B is the correct one (or am I mistaken?).

Another quick question on the variance computation: if I want to compute the standard error of Q, where Q = \frac{1}{10}\widehat{elpd}_{xval}, can I do this by simply taking the standard deviation of the set \{\widehat{elpd}_i\}_{i=1}^{10} and dividing by the square root of 10?

i.e. se(Q) = \frac{sd(\{\widehat{elpd}_i\}_{i=1}^{10})}{\sqrt{10}}?

Yes, Version B is the correct one. Predictive densities are computed in step 2, and the sum of the log predictive densities is computed in step 3.

You would divide if you were computing the mean of the log predictive densities, but since you compute the sum, you would multiply by the square root of 10 (since 10/\sqrt{10} = \sqrt{10}).
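
Putting the confirmed Version B and the standard-error computations together in a short numpy sketch (log_lik stands for the combined Sx10 matrix; the names are mine):

import numpy as np
from scipy.special import logsumexp

S, n = log_lik.shape                               # draws x observations
elpd_i = logsumexp(log_lik, axis=0) - np.log(S)    # Version B: columnwise log-mean-exp
elpd_xval = elpd_i.sum()                           # sum of log predictive densities
se_mean = elpd_i.std(ddof=1) / np.sqrt(n)          # SE of the mean Q: divide by sqrt(n)
se_sum = elpd_i.std(ddof=1) * np.sqrt(n)           # SE of the sum: multiply by sqrt(n)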

I have a question about averaging across observations within samples (and then ‘logsumexp’-ing) versus first ‘logsumexp’-ing across samples and then averaging across observations.
From the definition of expected utility (e.g. equation 2 in ‘Comparison of Bayesian predictive methods for model selection’, Piironen and Vehtari 2016), it makes sense to average across observations after calculating the expected log predictive density per observation.

Are there cases where the reverse makes sense? Also, in the Stan documentation (section 27.1: https://mc-stan.org/docs/2_24/stan-users-guide/evaluating-posterior-predictive-section.html) they actually first sum across observations within each MCMC sample and then advise to calculate the expectation over samples.

If this question is specifically about k-fold-cv with grouped data and you are interested in the joint predictive distribution for a group, you would sum the log predictive densities within each group (corresponding to a product of densities), and then average over the posterior draws.
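
In numpy terms, a sketch of that order of operations for one held-out group (log_lik_group is a hypothetical S x n_g matrix of test log densities for the group’s observations):

import numpy as np
from scipy.special import logsumexp

S = log_lik_group.shape[0]
joint_per_draw = log_lik_group.sum(axis=1)          # log of the product of densities, per draw
joint_elpd = logsumexp(joint_per_draw) - np.log(S)  # then average over draws on the density scale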

Did you post the wrong link? I see an average only over the M posterior draws.

The use case is model selection for i.i.d. observations.

In the Stan program of section 27.1.1, the block

generated quantities {
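  // with vector arguments, normal_lpdf returns a single real: the sum of
  // the pointwise log densities over all N_tilde test points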
  real log_p = normal_lpdf(y_tilde | alpha + beta * x_tilde, sigma);
}

will generate the sum of log probs across the observations of the N_tilde test data points (one log_p value for each of the M draws).
Taking log_sum_exp(log_p) - log(M) will then be what? The posterior predictive likelihood of the joint of the test y’s? That’s something different from the elpd, right?

Yes, joint. It can still be called elpd. elpd can be pointwise or joint, except that in the leave-one-out case it can only be computed pointwise. In our loo paper we also used the term expected log pointwise predictive density (elppd), but often “pointwise” and the extra p are dropped to save ink.
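
Given an S x n matrix of test log_lik values, the two quantities differ only in the order of the sum and the log-mean-exp (a sketch reusing the numpy/scipy imports from above):

elppd = (logsumexp(log_lik, axis=0) - np.log(S)).sum()   # pointwise: log-mean-exp per column, then sum
joint_elpd = logsumexp(log_lik.sum(axis=1)) - np.log(S)  # joint: sum per draw, then log-mean-exp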
