We collected a set of ecosystem variables (tree biomass, insect diversity, soil carbon stocks, …) that we would like to regress against predictors such as fragmentation level. The models used so far are multivariate normal models (see multivariate_v1.stan (1.6 KB)); all parameters converged and the effective sample sizes are large enough. Yet when trying to compare the different models via loo, we get large numbers of bad and very bad Pareto-k diagnostic values (90% of data points with k > 0.7). Looking at posterior predictive plots (standard deviation vs. mean; red dot: observed data, blue dots: posterior samples):

It seems that the models usually overestimate the standard deviation of the variables. Do you think this is what is causing the bad Pareto-k behavior? How can it be solved? I looked a bit at skewed multivariate distributions, but they look scary to me, and I wanted to check whether it would make sense to go that way.

My quick guess is that since

```
vector[K] y[N];
```

each y[n] is a vector of length K, and if K is large, then it's more likely that importance sampling fails: leaving out one observation removes an entire K-dimensional likelihood term at once, which makes the importance ratios highly variable. The alternative would be really bad model misspecification, but that alternative is less likely if you have 90% of k's > 0.7.

btw. in the log_lik computation, I think you could write

```
log_lik[n] = multi_normal_cholesky_lpdf(y[n] | X[n,] * beta, L_Sigma);
```
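For context, a minimal sketch of how that line could sit in a `generated quantities` block. The names `X`, `beta`, and `L_Sigma` are assumptions about the model (which I haven't seen in full); adjust to match multivariate_v1.stan:

```stan
generated quantities {
  // pointwise log-likelihood, one entry per K-dimensional observation,
  // for extraction with loo::extract_log_lik()
  vector[N] log_lik;
  for (n in 1:N)
    // reuse the Cholesky factor instead of reconstructing Sigma
    log_lik[n] = multi_normal_cholesky_lpdf(y[n] | X[n,] * beta, L_Sigma);
}
```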

I recommend testing K-fold CV (where this K is a different K than the K in your Stan code).
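To sketch the fold assignment in R (a minimal example, assuming `y` is the N-by-K response matrix; `kfold_split_random` is from the loo package):

```r
library(loo)

N <- nrow(y)       # number of observations
K_folds <- 10      # number of folds; unrelated to the K dimensions in the Stan model

# randomly assign each observation to one of K_folds folds
fold_id <- kfold_split_random(K = K_folds, N = N)

# then, for each fold k: refit the model on the rows with fold_id != k,
# and evaluate log_lik on the held-out rows with fold_id == k
```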

For the record, after searching a bit and finding this nice example code, I attach an example R script with simulated data (stan_kfold_crossvalidation.r (3.7 KB)) and an example Stan model doing K-fold cross-validation (normal_model_basic_cv.stan (945 Bytes)).

@avehtari: I am wondering if we can directly plug the log-likelihood matrix of the heldout data into `loo` (i.e. the output of the function extract_log_lik_K in the script)? I tried comparing the implementation given in the contributed talk with the output from loo, and I get very similar results, but maybe this is not right …


Great!

The log-likelihood matrix of the heldout data contains directly the values we want; we don't need the `loo` function for that. If you construct an object similar to what the `kfold` function returns, then you can use the `compare` function from the `loo` package to compute the difference and SE.
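A rough sketch of that computation in R, assuming `log_lik_heldout` is an S-by-N matrix (S posterior draws, N observations) as returned by extract_log_lik_K; note the exact structure of the object `kfold` returns may differ across loo versions, so this only shows the elpd arithmetic:

```r
library(matrixStats)  # for colLogSumExps

# each column n holds the posterior draws of log p(y_n | theta),
# taken from the fold where y_n was held out

# pointwise elpd: log of the posterior-mean likelihood for each observation
elpd_pointwise <- colLogSumExps(log_lik_heldout) - log(nrow(log_lik_heldout))

elpd_kfold <- sum(elpd_pointwise)
se_elpd    <- sqrt(length(elpd_pointwise) * var(elpd_pointwise))
```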