Bad Pareto k diagnostic with good chain diagnostics

nerpa · April 22, 2021, 8:02pm

I have a model with 88 participants, where each has ~15,000 data points. After fitting my model in Stan and getting flawless chain diagnostics, I try to calculate LOOIC values using the loo package in R, but I get the following:

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      0      0.0%  <NA>      
 (0.5, 0.7]   (ok)        0      0.0%  <NA>      
   (0.7, 1]   (bad)       0      0.0%  <NA>      
   (1, Inf)   (very bad) 88    100.0%  1

My predictive checks look great, but does it mean I cannot trust the loo stats in this case? What does it really mean?

jsocolar · April 22, 2021, 8:32pm

Perhaps this might help:

nerpa · April 22, 2021, 9:01pm

I read this, thanks! In my case, p_loo << the number of parameters p, but my PPC look great. I really have no idea how to deal with this.

jsocolar · April 22, 2021, 9:14pm

Are you certain that your PPCs are sensitive to the relevant forms of misspecification?

nerpa · April 22, 2021, 9:15pm

if this is helpful, here is the rest of the loo output:

Computed from 4000 by 88 log-likelihood matrix

          Estimate      SE
elpd_loo -249868.2 13582.2
p_loo       5197.1    69.5
looic     499736.4 27164.4
------
Monte Carlo SE of elpd_loo is NA.

I have a total of ~25k parameters.

nerpa · April 22, 2021, 9:19pm

I am not at all… Just to make sure - PPC stands for posterior predictive checks, right?

jsocolar · April 22, 2021, 9:31pm

Yup, posterior predictive check.

Just a bit of background on what I mean about sensitivity:
In general a posterior predictive check is based on some summary of the data (either graphical or a particular discrepancy measure that is a function of the dataset). This summary is computed for the real data as well as for replicate simulated datasets. But there are many different summaries that you could use (in fact they are infinite). And not all summaries will be sensitive to every form of misspecification–in fact most summaries will be sensitive to fairly restricted forms of misspecification. Here’s an example of a paper that carefully developed multiple PPCs tailored to be sensitive to violations of different specific model assumptions

nerpa · April 22, 2021, 9:36pm

Thank you.

avehtari · April 23, 2021, 7:30am

nerpa:

Computed from 4000 by 88 log-likelihood matrix

          Estimate      SE
elpd_loo -249868.2 13582.2
p_loo       5197.1    69.5
looic     499736.4 27164.4
------
Monte Carlo SE of elpd_loo is NA.

I have a total of ~25k parameters.

The loo output shows that you have 4000 by 88 log-likelihood matrix, which means you are doing leave-one-participant-out cross-validation, which then means that you are leaving out ~15,000 data points. You mention ~25k parameters, which means ~284 parameters per participant, and I would assume some of those parameters are participant specific. When you leave out all data points for one participant, the participant specific parameters get information only from the prior and thus the leave-one-participant-out posterior is very different compared to the full data posterior, and PSIS-LOO fails.

The model seems to be quite flexible and it is possible that PSIS-LOO fails just because of the high felxibility and not because of model misspecification, but on the other hand PPC will be optimistic with flexible models and can fail to diagnose misspecification.

If the leave-one-participant-out is what you want (and it can be), I recommend to run K-fold-CV instead with the same log_lik computation as now.

If you would like to do leave-one-data-point-out, you need to compute log_lik differently, but as you have that much data it would be better to do it functionally outside of Stan, see example in Using Leave-one-out cross-validation for large data • loo

nerpa · April 23, 2021, 9:10am

Thank you for the reply!

I have a longitudinal design where I test the same participants over multiple weeks with the same tasks, so the vast majority of parameters and participant- and week-specific.

What are my alternatives? I have a few competing models and I need to choose one based on some information criterion. I tried using WAIC, but I’ve got the warning 88 (100.0%) p_waic estimates greater than 0.4. We recommend trying loo instead so I moved to LOOIC.

Right now, I calculate the log_lik in the generated quantities block with the following general structure:

generated quantities {

real log_lik[N];

for (n in 1:N) { // loop over participants
real log_lik[n] = 0;
      for (w in 1:W) { // loop over weeks
            log_lik[n] += bernoulli_logit_lpmf(data1[n,w] | theta1[n,w]);
                .
                .
                .
            log_lik[n] += bernoulli_lpmf(data2[n,w] | theta2[n,w]);
       }
}

}

So I have one aggregated log_lik value per participant (N=88). Does it makes sense in my case?

Thanks for the pointer! I will report back.

avehtari · April 23, 2021, 11:54am

WAIC estimates the same thing as LOO, but uses computationally less stable approach and the self diagnostic is also less reliable. Thus, there is no need to consider WAIC at all. The LOOIC in loo package is just the LOO with log score multiplied by -2, but there is no benefit in multiplying the results with -2, so you can also just look at the LOO log score, ie, elpd.

See CV-FAQ for answers what are all your alternatives.

Only you can answer whether it makes sense that you estimate the predictive performance for predictions for a new participant (approximated with leave-one-participant-out) or for predictions for a new observation for an existing participant (approximated by leave-one-observation-out). So which one do you want to know? They will tell you slightly different information related to your hierarchical model (see more in CV-FAQ).

This indicates that you could consider also leave-one-week-out or leave-one-task-out cross-validations. All these different leave-something-out-patterns would tell how much information is shared between 1) patients, 2) weeks, 3) tasks, 4) individual observations, so you would target different aspects of your hierarchical model. If one of these patterns is more relevant for your research question, then choose that one. Please tell more about your research question if you want more advice. (One reason I don’t like how information criteria are usually presented is that it hides the connection to the actual research question).

Note that if you don’t aggregate at all (you could aggregate also over weeks or tasks), it’s possible that log_lik matrix will be very big and make things slow, and then you should look at the vignette Using Leave-one-out cross-validation for large data • loo

avehtari · April 23, 2021, 6:57pm

See also a case study Cross-validation for hierarchical models, which has also a link to a video discussing alternative data splits.

nerpa · April 26, 2021, 7:50am

Thank you @avehtari ! And sorry for the delay – I wanted to think about your reply.

I think you have a great point saying this and thank you for making me think about it. Since the research question is about temporal dynamics over weeks, I think it will make much more sense to do a leave-one-WEEK-out and not leave-one-SUBJECT-out that I currently have. I will just have to rerun the stand alone generated quantities calculations with a different way to calculate the total log_lik, but I hope to have the answer in a few days.

Thanks again!

Topic		Replies	Views
A quick note what I infer from p_loo and Pareto k values Modeling loo	35	16451	August 21, 2022
Some off pareto values, 99% ok - issue? General fitting-issues , loo , arviz	2	823	May 30, 2022
Posterior Predictives look good, but PSIS-LOO and WAIC are bad Modeling rstan , techniques , loo	6	1765	October 24, 2021
Good PP check and R square but large Pareto k values Modeling performance , loo	10	2243	September 2, 2020
Interpret pareto k diagnostic Modeling rstan , fitting-issues , loo	3	1955	August 3, 2023

Bad Pareto k diagnostic with good chain diagnostics

Related topics