Recommendations for what to do when k exceeds 0.5 in the loo package?

This is the output for my “base model” using loo:

> print(loo_1)
Computed from 6000 by 1795 log-likelihood matrix

         Estimate    SE
elpd_loo  -9190.1  91.6
p_loo      1263.3  47.6
looic     18380.2 183.2

Pareto k diagnostic values:
                         Count  Pct 
(-Inf, 0.5]   (good)     925   51.5%
 (0.5, 0.7]   (ok)       537   29.9%
   (0.7, 1]   (bad)      265   14.8%
   (1, Inf)   (very bad)  68    3.8%
See help('pareto-k-diagnostic') for details.

The help file says:

If the estimated tail shape parameter k exceeds 0.5, the user should be warned, although in practice we have observed good performance for values of k up to 0.7. Even if the PSIS estimate has a finite variance, the user should consider sampling directly from p(θ^s | y_{-i}) for the problematic i, use k-fold cross-validation, or use a more robust model.

My goal is to compare this base model to more complex models using loo. But given that I’m getting this warning, I’m not exactly sure how I should proceed. Is there an example that discusses how to use k-fold cross-validation for the problematic i? Is that the right approach, or should I just drop this “base model” from the horse race?

The results indicate either a really bad model or a really flexible model. Can you tell a bit more about the model?

You have n=1795 and the baseline model has p_loo=1263, so either the base model is already complex or it is really bad. Maybe complex is not a good word here. Did you get better results with the more elaborate model?
If the more elaborate model does not have bad k values and has much higher elpd_loo, you can justify that your baseline model is bad.

The loo paper, [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC, also has a k-fold-CV example. Unfortunately we don’t yet have more examples, but they are on the todo list, and loo 2.0 (or 2.1) will have some additional support.
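A minimal k-fold CV sketch in R (the Stan file name and the holdout flag are placeholder assumptions: your Stan program would need to skip the likelihood contribution of the held-out rows while still computing log_lik for all rows in generated quantities):

library(rstan)

K <- 10
folds <- sample(rep(1:K, length.out = standata$N))   # random fold assignment
elpd_kfold <- numeric(standata$N)
log_mean_exp <- function(v) max(v) + log(mean(exp(v - max(v))))  # stable log-mean-exp

for (k in 1:K) {
  sd_k <- standata
  sd_k$holdout <- as.integer(folds == k)   # rows excluded from the likelihood
  fit_k <- stan(file = "model_kfold.stan", data = sd_k)
  ll_k <- extract(fit_k, pars = "log_lik")$log_lik   # draws x N matrix
  # log predictive density of the held-out rows, which the model was not fit to
  elpd_kfold[folds == k] <- apply(ll_k[, folds == k, drop = FALSE], 2, log_mean_exp)
}
sum(elpd_kfold)   # elpd_kfold estimate, comparable to elpd_loo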

To answer this, I would need to know more about these models and the prediction task.


My “base” model is the Poisson model I described in this thread.

Saying that my base model is already complex is a fair description; that is why I called it “base” instead of “simple.” The reasons why I started with that model are a bit involved and hopefully not relevant for this discussion. That said, I actually want to compare this model with both simpler and more complex versions.

A follow-up question: is this plot a good reason to think that this base model is not really bad?

If I understand this correctly, loo indicates that my second model is better than the base model. That said, I still have some problems with k.

> print(loo_2)
Computed from 6000 by 1795 log-likelihood matrix

         Estimate    SE
elpd_loo  -9079.3  84.5
p_loo      1346.5  48.3
looic     18158.7 169.0

Pareto k diagnostic values:
                         Count  Pct 
(-Inf, 0.5]   (good)     721   40.2%
 (0.5, 0.7]   (ok)       676   37.7%
   (0.7, 1]   (bad)      320   17.8%
   (1, Inf)   (very bad)  78    4.3%
See help('pareto-k-diagnostic') for details.
> print(compare(loo_1, loo_2), digits = 3)
elpd_diff        se 
  105.062    22.444 

Thanks, I will carefully read this Saturday morning, and probably come back with more questions after that.

Looking forward to loo 2.0. Do you know when you are going to release it?

Thanks a lot for your amazing work and all the help!

Oh that problematic one…

Here https://rawgit.com/avehtari/modelselection_tutorial/master/roaches.html is another example of a bad Poisson model with high k’s and p_loo > n, but where switching to negbin helped. Poisson is often problematic as the variance equals the mean. Negbin is not necessarily better if a common dispersion parameter is used, but it’s worth testing since it’s so simple to switch.

Can you tell the values of I and T? Based on the other thread T=5, but I couldn’t find the value of I.
This would help to understand whether the khats are big due to a weak prior or due to model misspecification.

The model in the thread you linked is simple, but it can be flexible if I is large and the priors are vague.

Posterior predictive checking can give very good fits even if the model is too flexible (too many parameters with weak priors). Could you also plot the LOO-PIT (implemented in bayesplot)?

It seems both of your models have weak priors and thus individual observations are highly influential. This can happen also with well-specified flexible models (I have experience with GPs and GMRFs), and it’s a weakness of importance sampling that it fails in this case (WAIC would fail too).

The comparison indicates that the second model is better, but having that many high k’s is problematic.

This might also be useful for you (just don’t use WAIC):
E. C. Merkle, D. Furr, and S. Rabe-Hesketh, Bayesian comparison of latent variable models: Conditional vs marginal likelihoods [arXiv:1802.04452]

Soonish…


T=5, and I=359

This is the first time I’ve tried to do this, so I’m not sure if I did it right:

library(rstan)
library(loo)        # loo 1.x: extract_log_lik(), psislw()
library(bayesplot)

# Extract pointwise log-likelihood and compute LOO
log_lik_1 <- extract_log_lik(fit)
loo_1 <- loo(log_lik_1)
print(loo_1)

# marginal predictive check using the LOO probability integral transform;
# y_sim holds the posterior predictive draws from generated quantities
psis <- psislw(-log_lik_1, cores = 32)
ppc_loo_pit(bsdata$y, as.matrix(y_sim),
            lw = psis$lw_smooth)

[Figure: LOO-PIT qq-plot]

Could you point me to an explanation for how to read this figure?

Thanks, I added to my reading pile for this weekend :-)

This is my first time trying to fit a negative binomial model with Stan. In my Poisson model the model and generated quantities blocks are:

model {
  sigma_a ~ normal(0, 1);
  sigma_b ~ normal(0, 1);
  a_std ~ normal(0, 1);
  b_std ~ normal(0, 1);
  gamma ~ normal(0, 1);
  beta ~ normal(0, 1);
  // likelihood
  y ~ poisson_log(log(n) + x_beta + gamma[index_time] + a[practice] + b[practice] .* time);
}


generated quantities {
  //vector[I * T] y_sim;
  vector[I * T] log_lik;
  vector[I * T] y_hat = log(n) + x_beta + gamma[index_time] + a[practice] + b[practice] .* time;
  //for (it in 1:(I * T)) {
  //  y_sim[it] = poisson_log_rng(y_hat[it]);
  //}
  for (it in 1:(I * T)) {
    log_lik[it] = poisson_log_lpmf(y[it] | log(n[it]) + x_beta[it] + gamma[index_time[it]] + a[practice[it]] + b[practice[it]] * time[it]);
  }
}

After reading this example, I changed these blocks to:

model {
  phi ~ cauchy(0, 3);
  sigma_a ~ normal(0, 1);
  sigma_b ~ normal(0, 1);
  a_std ~ normal(0, 1);
  b_std ~ normal(0, 1);
  gamma ~ normal(0, 1);
  // likelihood
  y ~ neg_binomial_2_log(log(n) + gamma[index_time] + a[practice] + b[practice] .* time, phi);
}


generated quantities {
  vector[I * T] y_sim;
  vector[I * T] log_lik;
  vector[I * T] y_hat = log(n) + gamma[index_time] + a[practice] + b[practice] .* time;
  for (it in 1:(I * T)) {
    y_sim[it] = neg_binomial_2_log_rng(y_hat[it], phi);
  }
  for (it in 1:(I * T)) {
    log_lik[it] = neg_binomial_2_log_lpmf(y[it] | log(n[it]) + gamma[index_time[it]] + a[practice[it]] + b[practice[it]] * time[it], phi);
  }
}

where phi is declared as real<lower=0> phi;

Alas, when I try to fit this model I get a bunch of errors:

[1] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"                                                                               
 [2] "Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be > 0!  (in 'model68685a6b6cb4_stan_68685cba1da' at line 36)"                                                         
 [3] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"                                                       
 [4] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                                                                                     
 [5] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"                                                                               
 [6] "Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be > 0!  (in 'model68685a6b6cb4_stan_68685cba1da' at line 36)"                                                         
 [7] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"                                                       
 [8] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                                                                                     
 [9] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"                                                                               
[10] "Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be > 0!  (in 'model68685a6b6cb4_stan_68685cba1da' at line 36)"                                                         
[11] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"                                                       
[12] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                                                                                     
[13] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"                                                                               
[14] "Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be > 0!  (in 'model68685a6b6cb4_stan_68685cba1da' at line 36)"                                                         
[15] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"                                                       
[16] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                                                                                     
[17] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"                                                                               
[18] "Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be > 0!  (in 'model68685a6b6cb4_stan_68685cba1da' at line 36)"                                                         
[19] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"                                                       
[20] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                                                                                     
[21] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"                                                                               
[22] "Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be > 0!  (in 'model68685a6b6cb4_stan_68685cba1da' at line 36)"                                                         
[23] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"                                                       
[24] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                                                                                     
[25] "Error in sampler$call_sampler(args_list[[i]]) : "                                                                                                                                             
[26] "  Exception: neg_binomial_2_log_rng: Random number that came from gamma distribution is 9.9818e+009, but must be less than 1.07374e+009  (in 'model68685a6b6cb4_stan_68685cba1da' at line 45)"
[1] "error occurred during calling the sampler; sampling not done"

Oh, it’s the old version. For the new, better version see Fig 9 in [1709.01449] Visualization in Bayesian workflow and the code in the GitHub repo jgabry/bayes-vis-paper (‘Visualization in Bayesian workflow’ by Gabry, Simpson, Vehtari, Betancourt, and Gelman; JRSS discussion paper and code). I will explain this old figure here. It’s a qq-plot comparing the uniform distribution to the cdf values (PIT values). There is an excess of LOO-PIT values close to zero and close to 1, which means that the left-out data is more dispersed than what you predict, which means you have a problem! Since your model has so many parameters, it’s possible that PPC does not reveal this same problem (and it seems your case is a great case-study example of this PPC vs. LOO-PIT behavior). Based on this figure I would try negbin instead of Poisson.

The variance of the negbin (with the neg_binomial_2 parameterization) is mu + mu^2/phi. The Cauchy prior for phi lets phi get too close to 0, and the constrained real<lower=0> phi is underflowing to zero. You need a zero-avoiding prior, or as a quick test you can try <lower=0.1> (but the zero-avoiding prior would be more elegant).
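For example (a minimal sketch; the inv_gamma values are illustrative placeholders, not a tuned recommendation):

// quick test: bound phi away from zero
parameters {
  real<lower=0.1> phi;
}

// more elegant: keep real<lower=0> phi; and give it a zero-avoiding prior
// whose density goes to 0 as phi -> 0, e.g. an inverse-gamma
model {
  phi ~ inv_gamma(2, 2);
}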

Thanks! Out of curiosity, will the newer version of the plot make it to CRAN anytime soon?

I did. If I make this change and I run 4 chains, things don’t work:

some chains had errors; consider specifying chains = 1 to debug
here are whatever error messages were returned
[[1]]
Stan model 'stan-72e478f44dfb' does not contain samples.

[[2]]
Stan model 'stan-72e478f44dfb' does not contain samples.

My guess is that this is the same problem we are discussing on this thread. @bbbales2 thoughts?

Things get even weirder. I tried to fit this with only one chain, and everything runs smoothly. My guess is that this has something to do with the starting values when you run 4 chains vs. only 1 chain. So, given that this works with only one chain, I thought I should share some output:

Computed from 1000 by 1795 log-likelihood matrix

         Estimate   SE
elpd_loo  -8716.7 46.2
p_loo       553.4 21.2
looic     17433.4 92.3

Pareto k diagnostic values:
                         Count  Pct 
(-Inf, 0.5]   (good)     1290  71.9%
 (0.5, 0.7]   (ok)        388  21.6%
   (0.7, 1]   (bad)       106   5.9%
   (1, Inf)   (very bad)   11   0.6%
See help('pareto-k-diagnostic') for details.

[Figure: LOO-PIT plot for the negative binomial model]

@avehtari if I understood you correctly, this figure shows that the negative binomial is doing a much better job than the Poisson model. Is that correct?

Could you show me the more elegant solution?

@jonah knows about that

Did you try different random seeds for the four chains? I think you need to initialize the chains more carefully, or do some scaling of the parameters. Right now some random initial values are way beyond anything sensible.

Yes! Also the Pareto k’s are much better, and elpd_loo is much better, so all diagnostics indicate that the negbin is much better than the Poisson. I recommend also plotting the marginal posterior of phi.
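For example (a minimal sketch, assuming fit is the stanfit object):

library(bayesplot)
# histogram of the posterior draws of phi; check it is not piling up near zero
mcmc_hist(as.matrix(fit, pars = "phi"))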

I have previously used an inverse-gamma prior, and you could try that, too, but giving a well-justified recommendation would require some thinking or googling. I’ll get back to this in a few days unless you are able to find something before that.


Sorry, I’m new to this… I have no clue how to initialize the chains more carefully, how to try different seeds for the four chains, or what you have in mind when you say I should do some scaling of the parameters. Could you point me in the right direction?

The code in that repo only uses the most recent CRAN version of bayesplot. In that case, the function is ppc_dens_overlay.
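A rough sketch of that approach, computing the LOO-PIT values by hand from the smoothed weights (psis and y_sim as in the earlier code; for discrete outcomes like counts a randomized PIT would be preferable):

# weighted empirical cdf of the predictive draws evaluated at the observations
pit <- sapply(seq_along(bsdata$y), function(i)
  sum(exp(psis$lw_smooth[, i]) * (y_sim[, i] <= bsdata$y[i])))
# overlay the density of the PIT values on densities of uniform(0,1) samples;
# close agreement indicates well-calibrated LOO predictive distributions
unifs <- matrix(runif(100 * length(pit)), nrow = 100)
ppc_dens_overlay(pit, unifs)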


See ?stan and the option seed. Try running stan(...., seed=1337) with different seed values.

See ?stan and the option init. The default is to randomly generate initial values between -2 and 2 on the unconstrained support. If you transform -2 and 2 from the unconstrained to the constrained space, are these initial values sensible?

If you are unlucky, the initial values are all 2, and if n=1 and x=1, then
log(n) + x*2 + 2 + exp(2)*2 + exp(2)*2*4
is about 79, which when exponentiated to a Poisson parameter is about 2e34, which is a lot. You probably need to initialize log(a_std) and log(b_std) with a much smaller range than [-2, 2], or scale them in the Stan code by dividing them by a big number. For example, try dividing ‘a’ and ‘b’ by 100. Then the maximum Poisson parameter value (with n=1 and x=1) is about 311, which should work fine.
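For example, something along these lines (a sketch; the parameter names follow the model above, and the scales are placeholders):

# I = 359, T = 5 as in the data
init_fun <- function() {
  list(sigma_a = 0.1, sigma_b = 0.1,
       a_std = rnorm(I, 0, 0.1),
       b_std = rnorm(I, 0, 0.1),
       gamma = rnorm(T, 0, 0.1),
       phi = 1)
}
fit <- stan(file = "model.stan", data = standata,
            chains = 4, seed = 1337, init = init_fun)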

This is something we should add to the wiki / manual (if it’s not already there). Sorry I didn’t realize earlier that this is the likely problem in the other thread.


I have been corresponding with avehtari, who suggested that I post to this site. I, too, have been getting the k>0.5 problem too often to ignore. So I began to look for alternatives. Here is my best shot so far.

  1. Noting that the LOO predictive density of each observation can be written as 1/mean(importance ratios), I plotted the log of the importance ratios and saw that the histogram looked very much like a translated gamma distribution.

  2. Using the script below, I fit a loggamma distribution to the importance ratios. Using the formula for the mean of the (translated) loggamma distribution, I then calculated elpd_tlg.

  3. The script also calculates elpd_loo and elpd_raw. I ran the script with my model on 200 datasets, comparing the three calculations. To assess the sampling variability, I ran the Stan model with 5 different seeds to get a rough estimate of the sample standard deviation. Here are the conclusions.

  4. The sample sds for elpd_loo and elpd_tlg tend to be noticeably smaller than the sd of elpd_raw.

  5. The sample sd for elpd_tlg tends to be marginally smaller than the sd for elpd_loo.

  6. The parameters that I get for the loggamma distribution are such that the mean exists for all 200 x 5 samples. There is no analog of the k>0.5 situation.

  7. The standard deviations (even for elpd_loo) I get are much smaller than the standard deviation output by the loo package.

I have a writeup and a spreadsheet with all the details. My next post will contain the script.

# Function to fit a loggamma distribution by the method of moments
# Reference: p. 227 of Loss Distributions, Hogg and Klugman (1984)
fit_loggamma <- function(x) {
  minlogx <- min(log(x))
  tlogx <- log(x) - minlogx   # translate to make all numbers positive
  m1 <- mean(tlogx)
  m2 <- mean(tlogx^2)
  alpha <- m1^2 / (m2 - m1^2)
  lambda <- m1 / (m2 - m1^2)
  # mean of the translated loggamma; exp(minlogx) undoes the translation
  mean_tlg <- 1 / (1 - 1/lambda)^alpha * exp(minlogx)
  output <- list(mean_tlg = mean_tlg,
                 alpha_tlg = alpha,
                 lambda_tlg = lambda)
  return(output)
}
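A quick sanity check on simulated data (the shape/rate values here are arbitrary; the fitted mean should match the sample mean whenever lambda > 1):

set.seed(1)
x <- exp(rgamma(1e5, shape = 2, rate = 4) - 1)   # translated loggamma draws
fit_loggamma(x)$mean_tlg   # close to exp(-1) * (1 - 1/4)^(-2), about 0.654
mean(x)                    # sample mean, should agree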

# Function to get elpd by three different formulas
elpd_calc <- function(loglik) {

  # calc #1: with the "loo" package
  loo_psis <- loo(loglik)

  # calculate importance ratios (1 / pointwise likelihood)
  impratio <- matrix(0, dim(loglik)[1], dim(loglik)[2])
  for (i in 1:dim(loglik)[2]) {
    impratio[, i] <- 1 / exp(loglik[, i])
  }

  # calculate raw elpd
  prloo_raw <- rep(0, dim(loglik)[2])
  for (i in 1:dim(loglik)[2]) {
    prloo_raw[i] <- 1 / mean(impratio[, i])
  }
  elpd_raw <- sum(log(prloo_raw))

  # calculate elpd_tlg
  prloo_tlg <- rep(0, dim(loglik)[2])
  lambda_tlg <- NULL
  for (i in 1:dim(loglik)[2]) {
    tlg <- fit_loggamma(impratio[, i])
    prloo_tlg[i] <- 1 / tlg$mean_tlg
    lambda_tlg <- c(lambda_tlg, tlg$lambda_tlg)
  }
  elpd_tlg <- sum(log(prloo_tlg))

  output <- list(elpd_psis = round(loo_psis$elpd_loo, 3),
                 elpd_raw = round(elpd_raw, 3),
                 elpd_tlg = round(elpd_tlg, 3),
                 se_elpd_psis = round(loo_psis$se_elpd_loo, 3),
                 max_k_psis = round(max(loo_psis$pareto_k), 3),
                 min_lam_tlg = round(min(lambda_tlg), 3))
  return(output)
}

This would have been good as a separate thread (now it might get lost in here)

This is OK when true, but it will be true only for certain models and likelihoods, and is thus useful only in special cases.
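To spell out when it is true: with importance ratios r_i = 1/p(y_i | θ), the identity p(y_i | y_{-i}) = 1 / E_{θ|y}[ 1/p(y_i | θ) ] follows from p(θ | y) = p(θ | y_{-i}) p(y_i | θ, y_{-i}) / p(y_i | y_{-i}), and it requires p(y_i | θ, y_{-i}) = p(y_i | θ), i.e. that the observations are conditionally independent given θ. With a likelihood that does not factorize this way, the raw-ratio formula no longer targets p(y_i | y_{-i}).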

The PSIS paper https://arxiv.org/abs/1507.02646 also has an example where a smaller variance is achieved when a distribution can be fit to all the importance ratios instead of just the M largest ones.

I didn’t see that in your script. Could you post that part, too, so that I can check where the difference is.


It would be nice to know when “this” is true. For example, in my immediate problem my model statement is of the form y ~ normal(mu, sigma), where mu and sigma are derived from complex combinations of other parameters.

I put “se_elpd_psis = round(loo_psis$se_elpd_loo, 3)” in the output of the elpd_calc function in my script above.

Here is a sample calculation that I am referring to. I ran my model with different seeds to get a ballpark estimate of the sample variability.

Stan seed     12345   23456   34567   45678   56789
elpd_psis    49.811  49.892  50.323  48.922  49.498

se_elpd_psis (from loo)                 = 5.824
Sample standard deviation of elpd_psis  = 0.520

I was expecting the values from the two calculations of the standard deviation to be similar. What am I missing?

I have a spreadsheet with 2000 such calculations that I would be happy to post if I could figure out how to do it. The above calculation is representative of what I got in those 2000 calculations.

Yes it would, and the problem is that it’s not trivial (or at least not trivial for me) for a general model.

My guess (before seeing your comment on the model) was that with a normal likelihood and certain models for mu and sigma, it can be close even if not exactly true, but the conditions on mu and sigma are non-trivial. You were lucky that it works in your problem.

These are two completely different things. The loo package returns an se describing the uncertainty about the future data. What you compute here is the Monte Carlo error estimate, i.e. the uncertainty due to the finite number of draws in MCMC. See Section 5.1 in [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC
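Concretely, the se reported by loo is computed from the variation of the pointwise elpd contributions across observations, roughly (using the loo 1.x object layout):

elpd_i <- loo_psis$pointwise[, "elpd_loo"]    # pointwise contributions
se_elpd <- sqrt(length(elpd_i)) * sd(elpd_i)  # sd scaled by sqrt(n), eq. (23)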

I would like to be a little better than “lucky.” To aid my understanding of this, could you point me to some examples with different distributions of the importance ratio?

I looked at the paper. Your pointing me to Section 5.1 is very helpful. One of the things you said there was “One can also compute Monte Carlo standard errors arising from the finite number of simulation draws using the formula from Gelman et al. (2013) which uses both between and within-chain information and is implemented in Stan.”

Could you point me to the formula in BDA3 (Chapter 7?) and also how I might capture it in Stan?

Sure: https://arxiv.org/abs/1507.02646

This Monte Carlo error is usually negligible compared to the se computed by equation (23) in that paper, and in most cases you can ignore it. loo 2.0 estimates it, too, by combining the N_eff computation described in the Stan manual (more up to date than BDA3; it’s annoying that we can’t update the already-published paper to cite the better reference) with the Monte Carlo error for importance sampling as described in https://arxiv.org/abs/1507.02646 and in Monte Carlo theory, methods and examples.

When will loo 2.0 be up on CRAN? The version that I can get is 1.1.

This was discussed earlier in this same thread… Before summer. Meanwhile you can get it from GitHub.