Calculating LOO-CV for a multinormal regression model

Hi LOO experts,

I’m trying to use the loo package for comparing between STAN models in which I use a multinormal regression (see one of the models attachedmnorm_noPE.stan (4.1 KB) ). These are hierarchical models, consisting of 50 subjects and 40 trials per subject. I currently followed the instructions for computing the pointwise log-likelihood, by implementing the multi_normal_cholesky_lpdf function in the generated quantities block.

At the moment loo seems to produce an unreliable\too optimistic estimation, as I have too many pareto-k values that cross the 0.7 threshold (this is not a huge amount though, around 6%-5% in each of the three models I checked; see attached PDFLOO-CV_mnorm_res.pdf (262.3 KB) ). I was wondering what would be the best way to proceed from here, since I encountered various options:

  1. This post - https://mc-stan.org/loo/articles/loo2-non-factorizable.html#approximate-loo-cv-using-integrated-importance-sampling - suggests a different approximation for implementing loo on multivariate normal models. Should I use this? It seems it requires defining a function within my STAN code, so are there any existing examples for how to do it in a similar model?
  2. It seems that another option would be to run K-fold cross-validation, as suggested in the loo paper and in several posts. Is this suitable for my model, given the relatively small sample? and what should be a good size of the partition in my case?
  3. I was also wondering whether the loo should be performed at the subject level rather than at the trial level (this is what I currently tried); anyway, not really sure how to do it.

Looking forward to your advice,

Best regards,

Ofir Shany

It depends if you are interested checking whether the phenomenon you are studying generalizes to new subjects or to new trials for the old subjects. You can do it by summing log_lik values for each subject (each subject has several trials and thus several log_lik values, and you just some these to get as many new sum_log_lik values as you have subjects) and then you can use loo function as usual.

Depends on how fast each MCMC sampling is. If it takes just couple seconds, you can use kfold with n folds. Or do the leave-one-subject-out (there are helper functions to make fold indices)

You could also ry new momenta matching loo available in github version of loo package.

But before any of these, I recommend to tell more about your model. p_loo is very high compared to the number of observations, and you have very flexible model. Maybe you are having such “random” effects that could be integrated out for easier inference?

Hi,
Thanks for your prompt response.
I’ve ran the loo at the participant level as you suggested - I think it looks better, but still not great (see pic.-

- is this a reasonable result?
Regarding your final comment - I was wondering what you meant by “a very flexible model” and “random effects that could be integrated out for easier inference”?
A few words about the model - it assesses the relation between a feedback participants got on performing a public speech (yJud in the model) and their own rating of their performance after they got the feedback (yPost-updated rating of previous speech, yPreB-rating of upcoming second speech they are about to perform, yRec-memory of judges’ feedback). The model also includes ratings collected before the feedback (yPreA & yPreFb - ratings provided before and after the speech was delivered, res[respectively). Rating were provided on a 0-10 scale. I know that ordinal models will suit this data better - but we tried those and it takes a tremendous amount of time.

Thanks again,

Ofir

Much better and with mm-loo it would be a breeze to get rid of the rest of high k’s. Or you could do leave-one-subject-out with kfold-approach.

Show me the model code and I can be more specific.

Sure, here is the code for one of the models -
mnorm_noPE.stan (4.1 KB)

Thanks again.

Hi,
At the meantime I’m trying to get the mm_loo running.
I found two different codes on github, so i’m a bit confused:

  1. The first one is “loo_moment_matching” which I found here - https://github.com/stan-dev/loo/tree/master/R. I assumed the sufficient input for this would be my stanFit object and the loo, but is asked for additional input such as “post_draws”.
  2. The second code is the “loo.r” code in here - https://github.com/topipa/rstan/tree/develop/rstan/rstan/R. Is this supposed to replace the original loo?

So, what would be the right way to install and run the mm_loo? and are there any vignettes?

Thanks and best regards,

Ofir

Those uniform priors are bad and probably causing problems. Having upper constraint for variance type parameters can be bad if the upper constraint is far from the where the most of the posterior mass is.

Nsubj is 50? And Ntotal is 2000? How large is k? Where are the predictors? I’m trying to count the number of parameters as in your first loo results p_loo was larger than 1000, which is a lot if Ntotal is 2000.

Was the model exactly same for the participant (subj?) level loo result but with summed log_liks?

Install from github using (this instruction can also be found in README.md which is visible when you go to https://github.com/stan-dev/loo)

devtools::install_github("stan-dev/loo")

See the vignette https://github.com/stan-dev/loo/blob/master/vignettes/loo2-moment-matching.Rmd

mm-loo update will eventually get to CRAN and vignette to mc-stan.org/loo, but we are happy if someone tests it before that.

Hi,
• I’ve installed the mm-loo and ran the vignette successfully. However, when I tried it on my data, I get this message at some point:
“Moment matching observation 4
Moment matching observation 10
Error in validate_ll(log_ratios) : NAs not allowed in input.”
It seems like it’s related to the relative effective size computation. The “quad” parameters in my model have NaN values in the diagonal – could this be the source of the error?
• Regarding the computation of log-likelihood at the subject level: There was an error in my code for extracting this. After correcting the error, this approach actually doesn’t yield the better results I uploaded previously, but rather very high/bad k-pareto values for all observations (all observations>2). Is there any other computation I can try to do on the trial-level log-lik to assess this? I was further confused about this subject-level log-lik since you wrote: “Was the model exactly same for the participant (subj?) level loo result but with summed log_liks?”, as I didn’t run a separate model for the participant level.
• Regarding the model specifics: Nsubj is 50, Ntotal is 2000 and k=6. I’m not sure about how to count the parameters in the context of p_loo. In the parameters block I have about 20-30 parameters. However, in the model we estimate a covariance matrix of 6*6 for each participant, of which we take only the lower diagonal. This alone results in 750 parameters, and if you add the 6 “tau” parameters per subject then it’s over 1000 parameters. This is the case for model 2, which is the example model I sent you and also the most simple model; in model 1 and 3, 3 and 2 covariance matrices are produced for each subject respectively based on a conditioning, so this doubles or triples the 750 parameters accordingly (and more hyperpriors are obviously added to these model as well – I’ve attached model 1 as well, which is the most complex model mnorm_withPE.stan (5.5 KB) ). Is the high p_loo reasonable given this number of parameters?
• To sum, the main questions are 1) how to switch subject-level loo comparison, 2) whether to proceed with mm_loo or rather to try k-fold, 3) possibly alter the model.

Many thanks again for your great help and advice,

Ofir

relative effect size computation check only log_lik, does it have NaN’s?

I did wonder those good results. Leaving out more data is likely to lead to bigger computational problems.

Asses what? Did you mean to write subject level?

I asked because the results looked to good to be true if you were using the same model.

Just count in the context of your model. How many parameters you are defining in parameters block? I couldn’t count it because I didn’t know what is k.

If you estimate them, then you have parameters defining those covariance matrices. How many parameters do you use to define one covariance matrix? Do you really have a separate covariance matrix for each participant? If you have, do you have a hierarchical prior?

Then your model is very flexible and it’s likely that PSIS fails. High p_loo is reasonable but PSIS fails because the model is so flexible. This is the second case in the list for " Interpreting p_loo when Pareto k is large" in LOO Glossary https://mc-stan.org/loo/reference/loo-glossary.html.

I recommend to use K-fold-CV which would be useful also if you do leave-one-subject-out cross-validation.

With the current models try k-fold. For altering the model, think if you could have just one covariance for all subjects or hierarchical prior for covariances.

Hello again,

I have managed to run the mm_loo properly, but as you suspected it didn’t improve the situation much. I had NaNs because of a couple of divergent transitions which probably coincided with the high k-pareto, so I ran the model again and without transitions the mm_loo worked fine – so just wanted to mention this.

Now I wish to run k-fold as you suggested and I‘m searching for some vignettes or examples for how to implement it in rstan (rather than rstanarm). There is the example in your loo paper, but it’s a bit brief so I couldn’t get a clear picture of the whole pipeline, and I was mostly unsure about how I can devise something that will suit my hierarchical model which contains subject-level parameters (I want to keep these covariance matrices at the subject-level, as I’m mostly interested in individual differences regarding their parameters). Suppose that a “_t” and “_h” suffixes denote the training holdout data, as in the example in the paper; is it correct to produce the log-lik as follows (this is a simplification of the model I previously sent):


data{

int <lower=1> Ntotal_t ; // number of trials - training

int <lower=1> s_t[Ntotal_t] ; // index of subject at a given question – training

int <lower=1> Nsubj_t ; // number of participants – training

int y1_t[Ntotal_t] ; // predictor 1 – training

int y2_t[Ntotal_t] ; // predictor 2 – training

int y3_t[Ntotal_t] ; // predictor 2 – training

--- same for holding data ----

}

...

generated quantities {

vector[Ntotal_t] log_lik_t;

vector[Ntotal_h] log_lik_h;

for (n in 1:Ntotal_t) {

log_lik_t[n] = multi_normal_cholesky_lpdf([y1_t[n], y2_t [n], y3_t [n]] | [mu1[s_t[n]], mu2[s_t [n]], mu3[s_t [n]], quad [s_t [n],,]);

}

for (n in 1:Ntotal_h) {

log_lik_h[n] = multi_normal_cholesky_lpdf([y1_h[n], y2_h [n], y3_h [n]] | [mu1[s_h[n]], mu2[s_h [n]], mu3[s_h [n]], quad [s_h [n],,]);

}

I also found this example - https://datascienceplus.com/k-fold-cross-validation-in-stan/

which seems to offer a somewhat different approach, but I saw you had some reservations about it here – Compare models with k-fold cv. I was wondering if some parts of it are valid, in particular the manner in which the log_lik is produced for the different folds.

Is there any existing vignette on this issue, covering the whole pipeline from fold creation, running the models n_fold times and up to model comparison?

In addition, regarding the number of folds - I have 50 subjects but my model runs for a pretty long time (~2.5-3 hr). What would you recommend as a reasonable amount of folds?

Many thanks again,

Ofir

There was error, but you found my explanation how to fix it.

I’m not aware anything better than that data science article. K-fold-CV is often model specific so it’s not easy to make very generic example. I’m happy to help if you want to share your case as a case example.

I would probably test first with 10 folds and then you would get the results in one day.

I suspect there is still a problem with the model as given the amount of data this is quite long time and observing longer than expected running time of often indicates model problems. I did list several problems you could fix, but I didn’t notice if you’ve tried fixing those.

Hi,
Following your recommendations I changed the uniform priors that were placed on variance parameters to cauchy. However, this did not change sampling duration or the results. I still have uniform priors on the group’s Mu’s though, if you think it’s worth changing as well. You also suggested hierarchical priors for the subject-level covariance matrices - these are included in the model, and I added comments where they appear. As I said in a previous post, I’m mostly interested in the subject-level parameters of the covariance matrix, so I don’t want to switch to a one covariance matrix for all subjects (the purpose of the model is to test the influence of external feedback on self-ratings of a speech performance); but perhaps this is why it takes it a long time (the duration I reported was for a sampling of 4 chains with 10,000 iterations each)? I’m attaching the most simple model I have below, and afterwards my understanding of how to implement the k-fold based on https://datascienceplus.com/k-fold-cross-validation-in-stan/ :

data{
int <lower=1> Nsubj ; // number of participants=50
int <lower=1> Ntotal ; // number of questions=40 per subj*50 subj=2000
int <lower=1> nYlevels ; // number of responses on a rating scale (0-10 scale, so 11 responses)
int <lower=1> s[Ntotal] ; // index of subject at a given question
int <lower=1> k ; // num. of predictors=6(all y's)
int  yPreA[Ntotal] ; // rating #1-before 1st speech 
int  yPreFb[Ntotal] ; // rating #2-after 1st speech
int  yJud[Ntotal] ;  // judges' feedback on 1st speech
int  yPost[Ntotal] ; // rating #3-after feedback
int  yPreB[Ntotal] ;  // rating #4-upcmoing 2nd speech
int  yRec[Ntotal] ;  // memory of feedback
}

parameters {
real <lower=0> mu_tau  ;
real <lower=0> sigma_tau ;
real <lower=0> omega_corr_pos ;
vector<lower=0>[k] tau[Nsubj] ; // for subj. cov. mtx.
cholesky_factor_corr[k] Omega_pos[Nsubj] ; // for subj. cov. mtx.

vector[Nsubj] muPreASub ; // mu for yPreA
real <lower=1,upper=nYlevels> muPreAGroup ;
real <lower=0> sigma2PreAGroup ;
vector[Nsubj] muPreFbSub ;  // mu for yPreFb
real <lower=1,upper=nYlevels> muPreFbGroup ;
real <lower=0> sigma2PreFbGroup ;
vector[Nsubj] muJudPosSub ;  // mu for yJud
real <lower=1,upper=nYlevels> muJudPosGroup ;
real <lower=0> sigma2JudPosGroup ;
vector[Nsubj] muPostSub ; // mu for yPost
real <lower=1,upper=nYlevels> muPostGroup ;
real <lower=0> sigma2PostGroup ;
vector[Nsubj] muPreBSub ; // mu for yPreB
real <lower=1,upper=nYlevels> muPreBGroup ;
real <lower=0> sigma2PreBGroup ;
vector[Nsubj] muJudRecSub ; // mu for yRec
real <lower=1,upper=nYlevels> muJudRecGroup ;
real <lower=0>  sigma2JudRecGroup ;
}

transformed parameters{
matrix[k, k] quad_pos[Nsubj];
for (i in 1:Nsubj) {
quad_pos[i,,] = diag_pre_multiply(tau[i,], Omega_pos[i,,]);
}
}

model {

// hierarchical  priors for subj. covariance matrix:
mu_tau ~ cauchy(0, 2.5);
sigma_tau ~ cauchy(0, 5);
omega_corr_pos ~ cauchy(0, 5);

// subj. covariance matrix params:
for (i in 1:Nsubj) {
tau[i,] ~ gamma((mu_tau/sigma_tau)^2, mu_tau/(sigma_tau^2));
Omega_pos[i,,] ~ lkj_corr_cholesky(omega_corr_pos);
}

// all y's params:
sigma2PreAGroup ~ cauchy(0, 5); 
muPreAGroup ~ uniform(1 , nYlevels);
muPreASub ~ normal(muPreAGroup, sqrt(sigma2PreAGroup)) ;
sigma2PreFbGroup ~ cauchy(0, 5); 
muPreFbGroup ~ uniform(1 , nYlevels);
muPreFbSub ~ normal(muPreFbGroup, sqrt(sigma2PreFbGroup)) ;
sigma2JudPosGroup ~ cauchy(0, 5); 
muJudPosGroup ~ uniform(1 , nYlevels);
muJudPosSub ~ normal(muJudPosGroup, sqrt(sigma2JudPosGroup)) ;
sigma2PostGroup ~ cauchy(0, 5); 
muPostGroup ~ uniform(1 , nYlevels);
muPostSub ~ normal(muPostGroup, sqrt(sigma2PostGroup)) ;
sigma2PreBGroup ~ cauchy(0, 5); 
muPreBGroup ~ uniform(1 , nYlevels); 
muPreBSub ~ normal(muPreBGroup, sqrt(sigma2PreBGroup)) ;
sigma2JudRecGroup ~ cauchy(0, 5); 
muJudRecGroup ~ uniform(1 , nYlevels); 
muJudRecSub ~ normal(muJudRecGroup, sqrt(sigma2JudRecGroup)) ;
for (i in 1:Ntotal) {
[yPreA[i], yPreFb[i], yJud[i], yPost[i], yPreB[i], yRec[i]] ~ multi_normal_cholesky([muPreASub[s[i]], 
muPreFbSub[s[i]], muJudPosSub[s[i]], muPostSub[s[i]], muPreBSub[s[i]], muJudRecSub[s[i]]], quad_pos[s[i],,]);
}
}

generated quantities {
vector[Ntotal] log_lik;
for (n in 1:Ntotal) {
log_lik[n] = multi_normal_cholesky_lpdf([yPreA[n], yPreFb[n], yJud[n], yPost[n], yPreB[n], yRec[n]] | 
[muPreASub[s[n]], muPreFbSub[s[n]], muJudPosSub[s[n]], muPostSub[s[n]], muPreBSub[s[n]], muJudRecSub[s[n]]],
quad_pos[s[n],,]);
}
}

And possible implementation of k-fold:

data{
....
int<lower=0,upper=1> holdout[Ntotal]; 
  //index whether the observation should be held out (1) or used (0) 
}

....

model {
// params....
// we leave this:
for (i in 1:Ntotal) {
[yPreA[i], yPreFb[i], yJud[i], yPost[i], yPreB[i], yRec[i]] ~ multi_normal_cholesky([muPreASub[s[i]], 
muPreFbSub[s[i]], muJudPosSub[s[i]], muPostSub[s[i]], muPreBSub[s[i]], muJudRecSub[s[i]]], quad_pos[s[i],,]);
}
  //likelihood holding out some data
    for(n in 1:Ntotal){
        if(holdout[n] == 0){
            target += multi_normal_cholesky_lpdf([yPreA[n], yPreFb[n], yJud[n], yPost[n], yPreB[n], yRec[n]] | 
[muPreASub[s[n]], muPreFbSub[s[n]], muJudPosSub[s[n]], muPostSub[s[n]], muPreBSub[s[n]], muJudRecSub[s[n]]],
quad_pos[s[n],,]);
        }
    }
}

generated quantities {
vector[Ntotal] log_lik;
for (n in 1:Ntotal) {
log_lik[n] = multi_normal_cholesky_lpdf([yPreA[n], yPreFb[n], yJud[n], yPost[n], yPreB[n], yRec[n]] | 
[muPreASub[s[n]], muPreFbSub[s[n]], muJudPosSub[s[n]], muPostSub[s[n]], muPreBSub[s[n]], muJudRecSub[s[n]]],
quad_pos[s[n],,]);
}
}

Does the k-fold seem right? and any more thoughts on the model?
Thanks for you advice,

Ofir

EDIT: escaped code and added syntax highlighting.

Cauchy can still be too weak, but at least the base geometry should be better as all cauchy’s seem to be for lower constrained parameters (see https://avehtari.github.io/rhat_ess/rhat_ess.html for illustration why lower constraint saves the day).

I would at least try or investigate the posterior for those parameters whether it looks bad. You can also check Bulk-ESS and Tail-ESS (or n_eff in older interfaces) for all parameters and check if there are some which have much lower value than others.

Sorry I missed that.

That makes sense.

To limit the possible causes for a long computation time, can you check the number of leapfrog steps n_leapfrog__? If there are large number of leapfrog steps per iteration then there can still be a problem with the posterior and better priors could help. If the number of leapfrog steps per iteration is not large, then it’s likely that multi_normal_cholesky and for loopin that is the most time consuming part and speeding that up would need GPUs or map_rect.

I don’t see why “we leave this” should be left?

I would at least try or investigate the posterior for those parameters whether it looks bad. You can also check Bulk-ESS and Tail-ESS (or n_eff in older interfaces) for all parameters and check if there are some which have much lower value than others.

I checked minimal Bulk- and Tail-ESS and they seem fine (4862 & 6768, respectively). However, I do use thin=4, and I saw that you wrote in the link you sent that thinning may “throw away useful information also for convergence diagnostic”. By the way, I was wondering if thinning could also interfere with loo performance?

can you check the number of leapfrog steps n_leapfrog__ ?

n_leapfrog is 63

I don’t see why “we leave this” should be left?

Right, that was a mistake. However, in the link I sent, why isn’t there a conditioning based on the holdout indices in the generated quantities as well? Also, should the distribution of the model parameters be written in a target+= format as well? does it matter?

Thanks,

Ofir

They are good.

Yes, compared to not thinning, but not probably an issue here. These ESS values are that big, that you probably could run shorter chains and drop thinning to get faster runs.

No problem there

Not bad for that many parameters. So it seems that it’s really the multi_normal which is taking lot of time and then I can’t help more. I’m pinging @bbbales2 just because this is related to other models we’ve been speeding up.

All values are stored and using the correct values is done outside the Stan code.

Doesn’t matter. += always computes the normalizing constant and ~ doesn’t, which may affect the speed (and Bayes factors), but not posterior sampling.

Hi,
I ran k-fold with 10 folds as you suggested. I first created folds with the kfold_split_random function (k=10, n=50 subjects). Then I repeated each index 40 times (this is the number of trials), so the fold indices will match the number of trials (2000).
I then ran the 10 fits, this time with 4 chains, 5000 iterations+1000 warmup and no thin. Next, I extracted the log-likelihood by using “loglik_model1 <- extract_log_lik_K([list of 10 fits here],[list of 10 holdout indexing here])” , and ran “kfold(loglik_model1)”. Currently I did this only for one model. I don’t know what would be considered as reasonable values here, but it seems the values I got are high (at least relative to those I got with the loo approximation):

elpd_kfold: -25844.18
se_elpd_kfold: 137.5749

With the loo approximation on the same model (but with 10,000 iterations) elpd_loo estimate was -20275 and its SE was 99.8.
So before I proceed to the additional models, are these values anywhere near ok? what may be considered as a good estimate and SE given my number of parameters? and should I try more folds (maybe run only 2 chains to save some time)?

Thanks again,

Ofir

Not higher but lower. Is this the same model 1 as in pdf in your first post?
The difference is bigger than I would have expected, but this might be due to multinormal itself having several parameters. Can you compute also pointwise lpd? Then it would be possible to look at pointwise differences lpd and elpd to check whether this due to usual multinormal behavior or bad model misspecification.

Is this the same model 1 as in pdf in your first post?

Yes, although in the PDF in that post it was before I switched the uniform priors with cauchy. For the same model with cauchy (again, with 10,000 iter rather than 5000 as I ran for the k-fold - maybe this matters) elpd-loo is -20277.28 and SE_elpd_loo is 100.149. However, I should note that my conclusion that switching to cauchy yielded similar results was premature; I only tested it on the most complex model (model #1, attached as a stan file in earlier posts), assuming that it will probably work for the less complex ones, but for some reason I get a lot of divergent transitions for the most simple model (the model I wrote here a few posts above). So i’m still checking my code, but maybe i’ll eventually return to the uniform priors.

Can you compute also pointwise lpd?

How do I do this?

Thanks again,
Ofir

Hi,
Regarding this -

Can you compute also pointwise lpd?

I guess this means to run loo on the log_lik matrix I extracted from the k-fold? I tried this, and I get highly extreme values:
elpd_loo = -140341219
se_elpd_loo = 5487112

I suspect that these values are driven by “oulier” log_lik values. I think the quantiles of the log_lik matrix, and a plot of its values show this:Rplot_logLik

       0%           25%           50%           75%          100% 

-1.706794e+06 -4.288897e+01 -2.170324e+01 -1.350202e+01 -3.280073e+00

Is this indeed the case? And how can it be handled?

Thanks’

Ofir