How to solve pareto k warnings for a multivariate normal model?

lionel68 · March 26, 2018, 1:23pm

We collected a set of ecosystem variables (tree biomass, insect diversity, soil carbon stocks …) and we would like to regress those against some predictors such as fragmentation level. The models used so far are multivariate normal models (see multivariate_v1.stan (1.6 KB)); all parameters converged and the effective sample sizes are large enough. Yet when we try to compare the different models via loo, we get a large number of bad and very bad Pareto k diagnostic values (90% of data points with k > 0.7). Looking at posterior predictive plots (standard deviation vs. mean; red dot: observed data, blue dots: posterior samples):

It seems that the models usually overestimate the standard deviation of the variables. Do you think that this is what is causing the bad Pareto k values? How can it be solved? I looked a bit at skewed multivariate distributions, but they look scary to me and I wanted to check whether it would make sense to go that way.

avehtari · March 26, 2018, 8:36pm

My quick guess is that since

vector[K] y[N];

each y[n] is a vector of length K, and if K is large, then it’s more likely that importance sampling fails. An alternative explanation would be really bad model misspecification, but that is less likely if you have 90% of k’s > 0.7.
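To illustrate the guess above, here is a minimal Python sketch (not from the original thread). It uses a toy setup with made-up numbers: independent unit-variance normals, a hypothetical observation `y`, and hypothetical posterior draws `theta`. It reports the relative effective sample size of the leave-one-out importance weights, which is only a rough proxy for the Pareto k diagnostic, not the diagnostic itself.

```python
import math
import random

def relative_ess(K, S=2000, post_sd=0.3, seed=1):
    """Relative effective sample size of LOO importance weights
    for a single K-dimensional observation (toy example)."""
    rng = random.Random(seed)
    y = [1.0] * K  # hypothetical observation, one value per dimension
    log_w = []
    for _ in range(S):
        # hypothetical posterior draw of the K-dimensional mean
        theta = [rng.gauss(0.0, post_sd) for _ in range(K)]
        # independent unit-variance normal log-likelihood (constants dropped)
        log_lik = sum(-0.5 * (yk - tk) ** 2 for yk, tk in zip(y, theta))
        # the LOO importance ratio is proportional to 1 / p(y_n | theta)
        log_w.append(-log_lik)
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    return sum(w) ** 2 / (S * sum(wi * wi for wi in w))

for K in (1, 5, 20, 50):
    print(K, round(relative_ess(K), 3))
```

In this toy setting the log-likelihood of one observation is a sum over K dimensions, so the variance of the log weights grows with K and the weights become dominated by a few draws.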

By the way, in the log_lik computation, I think you could write

log_lik[n] = multi_normal_cholesky_lpdf(y[n] | X[n,] * beta, L_Sigma);

I recommend testing K-fold-CV (where K is the number of folds, a different K from the K in your Stan code).
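As a rough sketch of the first step of K-fold-CV, the Python snippet below randomly assigns observations to folds of nearly equal size and builds one 0/1 holdout indicator per fold. The function name `assign_folds`, the sample size of 95, and the 10 folds are illustrative assumptions, not values from this thread.

```python
import random

def assign_folds(n_obs, n_folds=10, seed=42):
    """Return a list of fold ids (1..n_folds), one per observation."""
    # cycle through fold ids so fold sizes differ by at most one
    folds = [i % n_folds + 1 for i in range(n_obs)]
    # shuffle so fold membership does not depend on data order
    random.Random(seed).shuffle(folds)
    return folds

n_folds = 10
folds = assign_folds(n_obs=95, n_folds=n_folds)

# one 0/1 indicator vector per fold: 1 = held out in that fold
holdout_masks = [[1 if f == k else 0 for f in folds]
                 for k in range(1, n_folds + 1)]
```

The model is then refit once per fold on the observations with indicator 0, and the log-likelihood is evaluated on the observations with indicator 1.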

lionel68 · July 25, 2018, 5:45am

For the record, after searching a bit and finding this nice example code, I attach an example R-script with simulated data (stan_kfold_crossvalidation.r (3.7 KB)) and an example stan model doing k-fold cross-validation (normal_model_basic_cv.stan (945 Bytes)).

@avehtari: I am wondering if we can directly plug the log-likelihood matrix of the heldout data into loo (i.e. the output of the function extract_log_lik_K in the script)? I tried comparing the implementation given in the contributed talk with the output from loo and I get very similar results, but maybe this is not right …

avehtari · July 25, 2018, 11:56am

Great!

The log-likelihood matrix of the heldout data directly contains the values we want, so we don’t need the loo function for that. If you construct an object similar to the one the kfold function returns, then you can use the compare function from the loo package to compute the difference and SE.
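The computation described above can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the loo package's code: `log_lik` is assumed to be a draws-by-observations matrix of heldout log-likelihoods, and the names `elpd_kfold` and `compare` here are hypothetical helpers. The tiny matrices are made-up numbers.

```python
import math
import statistics

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def elpd_kfold(log_lik):
    """log_lik[s][n]: heldout log-likelihood of observation n under draw s."""
    n_obs = len(log_lik[0])
    # pointwise elpd: average the predictive density over posterior draws
    pointwise = [log_mean_exp([row[n] for row in log_lik])
                 for n in range(n_obs)]
    elpd = sum(pointwise)
    se = math.sqrt(n_obs * statistics.variance(pointwise))
    return elpd, se, pointwise

def compare(pointwise_a, pointwise_b):
    """elpd difference (b minus a) and its SE from paired pointwise values."""
    diff = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    return sum(diff), math.sqrt(len(diff) * statistics.variance(diff))

# made-up heldout log-likelihoods: 2 draws x 3 observations per model
log_lik_a = [[-1.0, -2.0, -1.5], [-1.0, -2.0, -1.5]]
log_lik_b = [[-0.5, -1.5, -1.0], [-0.5, -1.5, -1.0]]
elpd_a, se_a, pw_a = elpd_kfold(log_lik_a)
elpd_b, se_b, pw_b = elpd_kfold(log_lik_b)
elpd_diff, se_diff = compare(pw_a, pw_b)
```

The SE of the difference is computed from the paired pointwise differences rather than from the two separate SEs, since both models are evaluated on the same observations.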
