Various questions about interpretation of loo results

@avehtari Looking at the vignette,

Computed from 4000 by 262 log-likelihood matrix

         Estimate     SE
elpd_loo  -6236.9  725.4
p_loo       284.9   69.1
looic     12473.8 1450.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     240   91.6%   206       
 (0.5, 0.7]   (ok)         7    2.7%   48        
   (0.7, 1]   (bad)        8    3.1%   7         
   (1, Inf)   (very bad)   7    2.7%   1         
See help('pareto-k-diagnostic') for details.

I didn’t notice the Min. n_eff column until now. The latest guideline is that n_eff should be at least 100 times the number of chains. Does this apply to loo also? What about Rhat? Also, does loo check these statistics itself, or is it recommended to use the usual procedure (rstan’s summary function) to evaluate the log_lik vector?

I understand the logic of looking at observations associated with large k values (outliers or unexpected given the posterior). Is there a useful interpretation of elpd_loo or the SE of this quantity? Or is elpd_loo only useful for model comparison?

There was a recent article about Bayesian Comparison of Latent Variable Models: Conditional Versus Marginal Likelihoods. Do I understand correctly that loo should not be used to compare latent variable models without integrating out the latent variables? Apparently, the blavaan package has some code to integrate out latent variables. Any idea if this code is specific to blavaan models or if it is generic? If it is generic, maybe it could be moved out of blavaan into some more generic package like latentStan (I made that up)? What about Pareto k values? Do k values still retain their useful interpretation in the context of latent variable models?

Thanks for posting the questions. It seems we should clarify the documentation a bit regarding convergence diagnostics and loo.

See also help('loo-glossary').

From the glossary:

  • If p_loo > p, then the model is likely to be badly misspecified. If the number of parameters p << N, then PPCs are also likely to detect the problem. See the case study at Roaches cross-validation demo for an example. If p is relatively large compared to the number of observations, say p > N/5 (more accurately we should count the number of observations influencing each parameter, as in hierarchical models some groups may have few observations and other groups many), it is possible that PPCs won’t detect the problem.

You have p_loo=285 > p=262.
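
If it helps, here is a minimal R sketch of making that comparison yourself (the names `loo_fit` and `n_par` are placeholders I’m assuming, not from your post):

```r
# Minimal sketch: compare p_loo to the number of parameters (p) and
# observations (N). `loo_fit` (a loo object) and `n_par` are placeholders.
library(loo)

p_loo_est <- loo_fit$estimates["p_loo", "Estimate"]
n_obs     <- nrow(loo_fit$pointwise)   # N, here 262
if (p_loo_est > n_par) {
  message("p_loo > p: the model is likely badly misspecified")
}
```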

From the glossary:

If k > 0.7, then importance sampling is not able to provide a useful estimate for that component/observation. Pareto k is also useful as a measure of the influence of an observation. Highly influential observations have high k values. Very high k values often indicate model misspecification, outliers or mistakes in data processing. See Section 6 of Gabry et al. (2019) for an example.

You have several k > 0.7; that is, importance sampling is failing because the full posterior and the leave-one-out posteriors are too different.
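
As a minimal sketch (assuming your loo object is named `loo_fit`, a placeholder), you can list the problematic observations with loo’s pareto_k_ids() and pareto_k_values() and then inspect them:

```r
# Minimal sketch: find the observations with Pareto k above 0.7.
# `loo_fit` is a placeholder name for your loo object.
library(loo)

high_idx <- pareto_k_ids(loo_fit, threshold = 0.7)   # indices of problematic obs
k_vals   <- pareto_k_values(loo_fit)                  # one k per observation
data.frame(obs = high_idx, k = k_vals[high_idx])
```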

It is likely that the problem is now mostly in importance sampling and not in MCMC.

That guideline applies mostly to MCMC, to make it more likely that the Rhat and n_eff computations themselves are reliable. You could compute Rhats and n_eff’s for exp(log_lik) (see relative_eff() in the loo package) if you think you have a problem with MCMC sampling.
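
For example, a minimal sketch (assuming a stanfit object called `fit` with a generated quantity named `log_lik`; both names are assumptions) of computing the relative efficiencies of exp(log_lik):

```r
# Minimal sketch: relative efficiencies of exp(log_lik) for a stanfit `fit`
# that stores the pointwise log-likelihood in a parameter called "log_lik".
library(rstan)
library(loo)

ll_array <- extract_log_lik(fit, parameter_name = "log_lik", merge_chains = FALSE)
# ll_array has dimensions iterations x chains x observations
r_eff <- relative_eff(exp(ll_array))
summary(r_eff)   # low values point to sampling problems for those observations
```

The same r_eff vector can also be passed to loo() via its r_eff argument.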

Before using loo, it is recommended that you check that sampling works, using Rhat, n_eff, divergences, E-BFMI, etc. loo itself checks only the combined n_eff and khat, but if the combined n_eff’s are large and the Pareto k’s are small, there is no need to check Rhat for each exp(log_lik) separately.
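
As a sketch of that pre-loo workflow with rstan (again assuming a stanfit object called `fit`):

```r
# Minimal sketch: the usual convergence checks before computing loo.
# `fit` is a placeholder for your stanfit object.
library(rstan)

check_hmc_diagnostics(fit)                   # divergences, treedepth, E-BFMI
fit_summary <- summary(fit)$summary
max(fit_summary[, "Rhat"], na.rm = TRUE)     # should be close to 1
min(fit_summary[, "n_eff"], na.rm = TRUE)    # smallest effective sample size
```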

For discrete models, elpd_loo can be interpreted as log probabilities. For continuous models, elpd_loo can be compared to a baseline model. A large SE indicates problems. If the Monte Carlo SE of elpd_loo is NA, then the result is very unreliable.
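
For the model comparison use, a minimal sketch (assuming two loo objects, `loo_baseline` and `loo_model`, computed from fits to the same data; the names are placeholders):

```r
# Minimal sketch: compare elpd_loo of a model against a baseline model.
library(loo)

comp <- loo_compare(loo_baseline, loo_model)
print(comp)   # look at elpd_diff relative to se_diff
```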

It can be used and it sometimes works, but if you have a latent variable model with n latent variables, it seems that in your case you would need to marginalize in order to get a reliable result. If you are using the latent variables just to add overdispersion, consider using an overdispersed observation model instead.
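
To spell out what marginalizing means here: the pointwise log-likelihood passed to loo would be the log marginal density of each observation, with its latent variable $z_i$ integrated out (here $\theta$ denotes the non-latent parameters),

$$
\log p(y_i \mid \theta) = \log \int p(y_i \mid z_i, \theta)\, p(z_i \mid \theta)\, dz_i ,
$$

rather than the conditional $\log p(y_i \mid z_i, \theta)$ evaluated at posterior draws of $z_i$.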

I’m not familiar with blavaan.

Yes.


FYI, Edgar Merkle pointed me to http://semtools.r-forge.r-project.org/ where he hosts the code that is described in his paper.