How to solve pareto k warnings for a multivariate normal model?

We collected a set of ecosystem variables (tree biomass, insect diversity, soil carbon stocks …) and we would like to regress those against some predictors such as fragmentation level. The models used so far are multivariate normal models (see multivariate_v1.stan (1.6 KB)), all parameters converged and effective sample size is large enough. Yet when trying to compare the different models via loo, we get large numbers of bad and very bad k diagnostic values (90% of data points with k > 0.7). Looking at posterior predictive plots (standard deviation vs mean, red dot observed data, blue dots posterior samples):

It seems that the models usually overestimate the standard deviation of the variables. Do you think that this is what is causing these bad Pareto-k behavior? How can it be solved? I looked a bit at skewed multivariate distributions but it does look scary to me and wanted to check if it would make sense to go that way.

My quick guess is that since

vector[K] y[N];

each y[n] is a vector of length K, and if K is large, then it’s more likely that importance sampling fails. Alternative would be really bad model misspecification, but it that alternative is less likely if you have 90% of k’s > 0.7.

btw. in log_lik computation, I think you could you write

log_lik[n] = multi_normal_cholesky_lpdf(y[n] | X[n,] * beta, L_Sigma)

I recommend testing K-fold-CV (where K is different K than your K in Stan code)

For the record, after searching a bit and finding this nice example code, I attach an example R-script with simulated data (stan_kfold_crossvalidation.r (3.7 KB)) and an example stan model doing k-fold cross-validation (normal_model_basic_cv.stan (945 Bytes)).

@avehtari: I am wondering if we can directly plug-in the log-likelihood matrix of the heldout data into loo (ie the output of function extract_log_lik_K in the script)? I tried comparing the implementation given in the contributed talk and the output from loo and I get very similar results, but maybe this is not right …

1 Like

Great!

the log-likelihood matrix of the heldout data is directly the values we want, we don’t need loo loo function for that. If you construct a similar object that kfold function returns, then you can use compare function from loo package to compute the difference and SE.