How to solve pareto k warnings for a multivariate normal model?

lionel68 · March 26, 2018, 1:23pm

We collected a set of ecosystem variables (tree biomass, insect diversity, soil carbon stocks …) and would like to regress them against predictors such as fragmentation level. The models used so far are multivariate normal models (see multivariate_v1.stan (1.6 KB)); all parameters converged and the effective sample sizes are large enough. Yet when we try to compare the different models via loo, we get large numbers of bad and very bad Pareto k diagnostic values (90% of data points with k > 0.7). Looking at posterior predictive plots (standard deviation vs mean; red dot: observed data, blue dots: posterior samples):

It seems that the models usually overestimate the standard deviation of the variables. Do you think this is what is causing the bad Pareto k values? How can it be solved? I looked a bit at skewed multivariate distributions, but they look scary to me, and I wanted to check whether it would make sense to go that way.

avehtari · March 26, 2018, 8:36pm

My quick guess is that since

vector[K] y[N];

each y[n] is a vector of length K, and if K is large, it is more likely that importance sampling fails. The alternative would be really bad model misspecification, but that alternative is less likely if you have 90% of k's > 0.7.

Btw, in the log_lik computation, I think you could write

log_lik[n] = multi_normal_cholesky_lpdf(y[n] | X[n,] * beta, L_Sigma);

I recommend testing K-fold-CV (where this K, the number of folds, is a different K from the K in your Stan code).
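As a minimal sketch of the first step of K-fold-CV, the observations can be split at random into folds of nearly equal size. The snippet below is in plain Python for illustration only. The function name `kfold_ids`, the argument `n_folds` (named this way to avoid clashing with the K of the Stan code), and the `holdout` indicator are hypothetical names, not taken from the attached model or script.

```python
import random

def kfold_ids(n_obs, n_folds, seed=0):
    """Return one fold id (1..n_folds) per observation, in random order."""
    rng = random.Random(seed)
    # Cycle through the fold ids so that fold sizes differ by at most one,
    # then shuffle so that fold membership does not depend on data order.
    ids = [(i % n_folds) + 1 for i in range(n_obs)]
    rng.shuffle(ids)
    return ids

# Made-up sizes, for illustration only.
folds = kfold_ids(n_obs=23, n_folds=10)

# For a given fold, a 0/1 indicator marks the observations to hold out.
# Assumption: the model is refit once per fold, with the held-out rows
# excluded from the likelihood but still evaluated for log_lik.
holdout = [1 if fold == 1 else 0 for fold in folds]
```

Each observation ends up held out exactly once across the `n_folds` refits, which is what makes the held-out log-likelihoods usable for model comparison.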

lionel68 · July 25, 2018, 5:45am

For the record, after searching a bit and finding this nice example code, I attach an example R script with simulated data (stan_kfold_crossvalidation.r (3.7 KB)) and an example Stan model doing k-fold cross-validation (normal_model_basic_cv.stan (945 Bytes)).

@avehtari: I am wondering whether we can directly plug the log-likelihood matrix of the held-out data into loo (i.e. the output of the function extract_log_lik_K in the script)? I tried comparing the implementation given in the contributed talk with the output from loo and I get very similar results, but maybe this is not right …

avehtari · July 25, 2018, 11:56am

Great!

The log-likelihood matrix of the held-out data directly contains the values we want; we don't need the loo function for that. If you construct an object similar to the one the kfold function returns, you can use the compare function from the loo package to compute the difference and SE.
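To make the arithmetic concrete, here is a minimal sketch of how the pointwise elpd, its sum, and the SE of a difference between two models can be computed from held-out log-likelihoods. It is plain Python for illustration and is not the loo package's code. It assumes a matrix with one row per posterior draw and one column per observation, each column coming from the fit in which that observation was held out. All function names and the toy numbers are made up.

```python
import math
import statistics

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def pointwise_elpd(log_lik):
    """log_lik: list of S draws, each a list of N held-out log-likelihoods."""
    n_obs = len(log_lik[0])
    # Average the held-out predictive density over draws, per observation.
    return [log_mean_exp([draw[i] for draw in log_lik]) for i in range(n_obs)]

def total_and_se(pointwise):
    """Sum of pointwise values and its standard error, sqrt(N * var)."""
    n = len(pointwise)
    return sum(pointwise), math.sqrt(n * statistics.variance(pointwise))

def compare_models(pointwise_a, pointwise_b):
    """elpd difference (B minus A) and its SE from paired pointwise values."""
    diff = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    return total_and_se(diff)

# Toy held-out log-likelihoods (2 draws x 3 observations), made up.
log_lik_a = [[-1.2, -0.8, -2.0], [-1.0, -0.9, -1.8]]
log_lik_b = [[-1.1, -0.7, -1.5], [-0.9, -1.0, -1.6]]

elpd_a, se_a = total_and_se(pointwise_elpd(log_lik_a))
elpd_diff, se_diff = compare_models(pointwise_elpd(log_lik_a),
                                    pointwise_elpd(log_lik_b))
```

The SE of the difference is computed from the paired pointwise differences rather than from the two separate SEs, because both models are evaluated on the same observations.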
