Using Leave-one-group-out (LOGO) CV seems attractive to me - especially with a well-resolved and well represented phylogeny (I’ve had a long day (in Australia) today so I can’t explain my reasoning all that well). Anyway, I would like to compare the results I get from loo which are problematic, with high Pareto k estimates. Maybe I missed something crucial in @avehtari’s blog plot, but it is possible to implement LOGO-CV using brms or brms in conjunction with loo.

Thanks in advance for any help or advance

Please also provide the following information in addition to your question:

Beyond @avehtari’s blog plot, can anyone direct me to a citable source regarding leave-one-group out CV so I can learn more about it. Thanks in advance!

I hope you don’t mind me hijacking the thread for a question about leave-one-group-out cross-validation that’s been bugging me.

Suppose that I want to model N observations nested in J groups (for example, N responses from J participants in a psychological study). Going back to @avehtari 's blog post (https://andrewgelman.com/2018/08/03/loo-cross-validation-approaches-valid/), this example would correspond to the school rather than state case, and one would want to estimate the model’s prediction accuracy for a new participant.

Assuming that observations within each group are independent, couldn’t I sum the log-likelihoods of all points within a group, and then estimate the model’s out-of-sample predictive accuracy by applying PSIS-LOO to the J joint log-likelihoods? I have read the relevant article and posts, but have to admit that some of the details are beyond me. Is there a reason why this wouldn’t be valid, or why k-fold stratified cross-validation would be preferable?

Yes, this is correct in theory. In practice leaving out several observations may change the posterior too much for importance sampling to work. You can try it, and Pareto-k diagnostic will tell if it works. If it doesn’t work there is also option to use quadrature integration as in https://arxiv.org/abs/1802.04452

Your prediction proved entirely accurate. I derived the log likelihood of N = 7210 observations nested in J = 302 groups, and summed the log likelihoods of all observations within a group to get the joint log likelihoods. I then applied PSIS-LOO to both sets of log likelihoods to calculate the expected log predictive density using both LOO and LOGO. Comparing the results, I first found that the estimates are in a similar ballpark: elpd_loo = -3074.5 (47.3) and elpd_logo = -3150.5 (90.2). I then looked at the Pareto-k diagnostics, which were unambiguous:

I take away that I will have to consider carefully what the relevant prediction task is, use stratified k-fold cross-validation, and/or look into the paper you’ve referenced. Again, thanks for the help.

To follow up, I have settled on stratified 10-fold cross-validation. It wasn’t that difficult to rewrite the Stan code to accommodate k-fold cross-validation. If anyone is trying to do something similar, please feel free to email me for the relevant R and Stan code.

I’m curious to hear your general approach for doing leave-one-group-out k-fold CV in Stan. I’ve been trying to do the same thing, and my approach seems to be an ugly hack. I roughly follow this approach: https://datascienceplus.com/k-fold-cross-validation-in-stan/#share-wrapper, by including a holdout vector as data. I also include an index vector for the start locations (jj_start) of where the group index changes (i.e. first row for a new group) as data. I then fix all the groups level parameters of the group I’m holding out in transformed parameters to 0 (we need to do this right?), and do some ugly chopping of my data to get group summed log_lik in generated quantities for J groups (people in my case with repeated measures).

generated quantities{
vector[J] log_lik;
for (j in 1:J){
{
int start;
int end;
start = jj_start[j];
if(j<J){
end = (jj_start[j+1]-1);
}else{
end = N;
}
log_lik[j] = normal_lpdf(y[start:end] | a_person[j] + block(x,start,1,end-start+1,K) * beta, sigma);
}
}}

I’m not even sure this is right, and I know it can’t be the best way to do leave-one-group-out CV. I tried to dig through brms code to get a better idea, but haven’t figured any of that out. Hopefully you’ve found a better way!

You can check out the R loo package. It doesn’t do proper leave-one-out, but approximates it. Full leave one out is usually prohibitively expensive.

If you really want to do it, then you need a way of specifying train vs. test instance from the outside.

data {
int<lower = 0> N;
vector[N] y;
int<lower = 1, upper = N> left_out;
model {
// train on everything but held out
append_row(y[ :left_out - 1], y[left_out + 1: ]) ~ foo(...);
generated quantities {
// test on held out
real log_lik = foo(y[left_out] | ...); // held out log likelihood

Then you run from the outside collecting the log density. But you have to be careful to use log_sum_exp on the outside to get the algebra right to average the log likelihoods into the log posterior likelihood ratehr than the posterior log likelihood. But you need to work on the log scale. That is on the outside, you need to do log_sum_exp(extract(fit)$log_lik) - log(N) (this is the log of the posterior mean of the held out likelihood rather than the posteiror mean of the log of the held out likelihood).

Thanks Bob. Your approach is much better, and thanks for the reminder to compute the log of posterior mean. Two questions for anyone…
(1) To use loo approximations, I would need to get log_lik terms at the group level. Anyone know a good way to sum the log_lik terms by group without storing the full pointwise log_lik?

(2) The loo approximations may not work for my group level prediction task (i.e. small number of groups (20) with high inter-group variability), so I may still need leave-one(or many)-group-out CV. I haven’t quite got the above @Bob_Carpenter approach to work when I hold out entire groups in a multilevel model because of the way I index the group parameters (e.g. jj[i])…has anyone done something like @Bob_Carpenter is suggesting while leaving groups out? I’ll report back if I figure it out.

For now I’ve been creating separate training and test data outside of Stan and redefining the Stan model with fewer group parameters (see linear regression example below with repeated measurements by person and one person held out). Then I calculate the log_lik without the person level parameters. Is this correct leave-one-group-out CV?

data{
// training set
int<lower=1> N; // num records
int<lower=1> J; // num participants (groups)
int<lower=1> K; // num predictors
int<lower=1,upper=J> jj[N]; // participant index
real y[N]; // outcome
// testing set
int<lower=1> N_test;
real y_test[N_test];
}
parameters{
vector[J] a_person; // varying ints by person
real a;
real<lower=0> sigma_person;
real<lower=0> sigma;
}
model{
sigma ~ student_t( 3 , 0 , 2 );
sigma_person ~ student_t( 3 , 0 , 2 );
a ~ normal( 0 , 4 );
a_person ~ normal( 0 , sigma_person );
for ( i in 1:N ) {
y[i] ~ normal( a + a_person[jj[i]], sigma);
}
}
generated quantities{
// log_lik for held out person
real log_lik = normal_lpdf(y_test | a , sigma ); // no a_person term
}

You can do the calculation within Stan and just store the sums.

This wasn’t my idea! I think it’s pretty standard. Are you having problems coding it? It will be a lot of index fiddling, which is sort of the bread and butter of coding.

I haven’t had any time to try out the index fiddling in Stan, but hope to later in the month. I’m still curious if the approach I used above would do the same thing…that is (1) create test/train data subsets outside Stan, (2) define a reduced form model with number of parameters to match the number of groups in the test data (in my simple example, just one less intercept than the full model), (3) fit with the test data, and (4) get the log density for the held out data ignoring group level parameters.

That last step is the one I’m most concerned about since I’ve never done anything like it before (ignoring parameters when collecting the log density).

I don’t see why it shouldn’t work. It’s the predictive log likelihood you’re looking at here, so all the parameters aren’t involved for generating a new held out person.