Problems in model comparison with the loo package for a self-written Stan model with explanatory variables and hierarchy

I wrote a Stan program to model count data with a negative-binomial distribution, with the log mean parameter depending on a group association ‘xHOUSE’ (with hierarchy, 12 levels) and an explanatory variable ‘xTREAT’ (a treatment, 4 levels), and the dispersion parameter depending on ‘xTREAT’ as well. The data points (n=675) can be seen as individuals examined at different time points under different treatments, though I am not interested in correlation over time at the moment. There are two different count measures (of different nature, ‘outdoor’ and ‘indoor’) per data point, so the model is essentially doubled, with independent parameters, by adding the log-likelihoods. The model code is below (tested); the further generated quantities used for posterior predictive simulation are excluded here. I wrote my own model to be more flexible, and because I am interested in a measure whose distribution is unknown (at least to me) and which I estimate by a function of forward-simulated data points. There are 44 parameters in total.
I want to analyse this model with the loo package and compare it to an alternative model that includes an additional hierarchical layer (data points grouped by weeks; a sketch of this extension follows the model code below). However, the Pareto-k values are so high (above 1 for all data points, see plot below) that I suspect a conceptual problem in the way I wrote the model. Thank you very much for any help.

There were no sampling problems, and my posterior predictive checks (forward simulation via ‘generated quantities’, for each group separately and for the general population) looked OK to me. I have read, or at least skimmed, the following resources, but I have little theoretical background in the corresponding methods:

- https://avehtari.github.io/modelselection/roaches.html
- https://avehtari.github.io/modelselection/CV-FAQ.html
- https://avehtari.github.io/modelselection/rats_kcv.html
- Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.

I can of course provide more background and explanation.


data{
  int<lower=0> n; // number of 'house-days' (experimental days * houses)
  int<lower=0> yHLC[n]; // data HLC
  int<lower=0> yCDC[n]; // data CDC
  int xHOUSE[n]; // feature HOUSE
  int xTREAT[n]; // feature TREATment
  int<lower=0> priorscale; // to control scale of prior
}
parameters{
  real logaIN; // mean of lograte indoors
  real logaOUT; // mean of lograte outdoors
  real<lower=0> sigma_alphaIN; // standard deviation of lograte indoors
  real<lower=0> sigma_alphaOUT; // standard deviation of lograte outdoors
  vector[12] thetaIN; // standardised deviation of indoor lograte from mean of lograte indoors
  vector[12] thetaOUT; // standardised deviation of outdoor lograte from mean of lograte outdoors
  vector[3] logbetaIN; // treatment effect indoor 
  vector[3] logbetaOUT; // treatment effect outdoor
  vector[4] logpsiIN; // one scale parameter per treatment group, indoor part; no hierarchy so far
  vector[4] logpsiOUT; // one scale parameter per treatment group, outdoor part; no hierarchy so far
}
transformed parameters{
  vector[n] loglambdaIN; // lograte indoor
  vector[n] loglambdaOUT; // lograte outdoor
  vector[n] phiIN; // scale parameter indoor 
  vector[n] phiOUT; // scale parameter outdoor
  {
    vector[4] logbetaINall = append_row(0, logbetaIN);
    vector[4] logbetaOUTall = append_row(0, logbetaOUT);
    loglambdaIN = logaIN + thetaIN[xHOUSE] * sigma_alphaIN + logbetaINall[xTREAT];
    loglambdaOUT = logaOUT + thetaOUT[xHOUSE] * sigma_alphaOUT + logbetaOUTall[xTREAT];
    phiIN = exp(logpsiIN[xTREAT]); 
    phiOUT = exp(logpsiOUT[xTREAT]); 
  }
}
model{
//priors and hyperpriors
  target += normal_lpdf(logaIN | 0, priorscale);
  target += normal_lpdf(logaOUT | 0, priorscale);
  target += 2*cauchy_lpdf(sigma_alphaIN | 0, priorscale); // note: this multiplies the log density by 2 (squaring the density); for a plain half-Cauchy prior, cauchy_lpdf alone suffices, since the <lower=0> constraint already handles the truncation
  target += 2*cauchy_lpdf(sigma_alphaOUT | 0, priorscale);
  target += normal_lpdf(logbetaIN | 0, priorscale);
  target += normal_lpdf(logbetaOUT | 0, priorscale);
  target += normal_lpdf(logpsiIN | 0, priorscale);
  target += normal_lpdf(logpsiOUT | 0, priorscale);
// hierarchical parameters
  target += normal_lpdf(thetaIN | 0, 1);
  target += normal_lpdf(thetaOUT | 0, 1);
// model
  target += neg_binomial_2_log_lpmf(yCDC | loglambdaIN, phiIN); 
  target += neg_binomial_2_log_lpmf(yHLC | loglambdaOUT, phiOUT); 
}
generated quantities{
    vector[n] log_lik;
    for (j in 1:n) {
      log_lik[j] = neg_binomial_2_log_lpmf(yCDC[j] | loglambdaIN, phiIN) + neg_binomial_2_log_lpmf(yHLC[j] | loglambdaOUT, phiOUT);
    }
}
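For reference, the alternative model mentioned above would add a week-level layer to the lograte along the following lines (a minimal sketch for the indoor half only; nWEEK, xWEEK, sigma_weekIN and thetaweekIN are hypothetical names, following the same non-centred pattern as for HOUSE):

data{
  int<lower=1> nWEEK; // hypothetical: number of weeks
  int xWEEK[n];       // hypothetical: week index per data point
}
parameters{
  real<lower=0> sigma_weekIN; // standard deviation of the week effect, indoor
  vector[nWEEK] thetaweekIN;  // standardised week deviations, indoor
}
transformed parameters{
  loglambdaIN = logaIN + thetaIN[xHOUSE] * sigma_alphaIN
                + thetaweekIN[xWEEK] * sigma_weekIN + logbetaINall[xTREAT];
}
model{
  target += normal_lpdf(thetaweekIN | 0, 1); // standardised week deviations
}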

[pareto_k: plot of the Pareto-k diagnostic, one value per data point, all above 1]

I’m curious to see what @avehtari or someone else says, since I’m not very familiar with this stuff, but I’ll take a guess at answering.

I do not think a high Pareto-k necessarily means your model is wrong; it just means that leaving a data point out changes the predictions a lot.

This section in the FAQ suggests that lots of model flexibility can lead to this: https://avehtari.github.io/modelselection/CV-FAQ.html#16_What_to_do_if_I_have_many_high_Pareto_(k)’s

But even though lots of high Pareto-k’s might technically be possible, it makes you wonder if the model is messed up in some way. Based on that FAQ, I’d try simplifying things.

As a first step, I would break the model into the outdoor and the indoor data separately and see if things behave differently there.

Dear bbbales2,

Thank you very much for your answer. Yes, I know the Pareto-k isn’t directly a model-assessment tool, but (if I understand correctly) a high Pareto-k for a given held-out data point means that this data point is very unlikely conditional on a big portion (I think 20%) of the posterior, or, equivalently, that leaving out this data point would change the posterior considerably. And in my case this seems to happen for all data points. Since 44 parameters vs. 675 data points does not look much in danger of over-fitting to me, I can only imagine a gross model misspecification. However, the graphical posterior predictive checks I performed looked OK to me.
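(To make the ‘20%’ precise: PSIS-LOO forms importance ratios $r_i^{(s)} = 1/p(y_i \mid \theta^{(s)})$ over the posterior draws $s$, fits a generalized Pareto distribution to the largest of these ratios (roughly the top 20% in Vehtari, Gelman, and Gabry, 2017), and reports the fitted shape parameter as $\hat{k}$; above about 0.7, the importance-sampling estimate of the leave-one-out predictive density $p(y_i \mid y_{-i})$ is considered unreliable for that point.)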

Following https://avehtari.github.io/modelselection/CV-FAQ.html#16_What_to_do_if_I_have_many_high_Pareto_(k)’s and https://mc-stan.org/loo/reference/loo-glossary.html, the random effects may make the model too flexible. However, in my model there are always about 16 data points per hierarchical group, so it does not seem to be a case of ‘models with parameters which see the information only from one observation each’ (from the first reference). Maybe my question should be put more openly: what model selection method should I use for the data and model at hand?
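One option I am considering, following the rats_kcv case study linked above, is K-fold cross-validation with refitting. A minimal sketch of how my model could take a fold indicator (the ‘holdout’ data variable is a hypothetical name; the 0/1 fold assignments would be prepared outside Stan):

data{
  // ...existing declarations as above...
  int<lower=0, upper=1> holdout[n]; // hypothetical: 1 = observation held out in this fold
}
model{
  // ...priors and hyperpriors as above...
  // masked likelihood: only training points contribute in this fold
  for (j in 1:n) {
    if (holdout[j] == 0) {
      target += neg_binomial_2_log_lpmf(yCDC[j] | loglambdaIN[j], phiIN[j]);
      target += neg_binomial_2_log_lpmf(yHLC[j] | loglambdaOUT[j], phiOUT[j]);
    }
  }
}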

You’re right, I should separate the indoor from the outdoor data to make things more transparent; sorry for not having done this before posting my problem. However, the parameters and the data are completely separate, so in principle this shouldn’t matter. Anyway, I will upload a new version.

Is that right? Should the per-observation parameters be indexed inside the loop, as below?
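As posted, neg_binomial_2_log_lpmf(yCDC[j] | loglambdaIN, phiIN) passes the full length-n vectors loglambdaIN and phiIN, so every log_lik[j] sums the log-likelihood over all n observations instead of picking out observation j. Presumably the intended version is:

generated quantities{
    vector[n] log_lik;
    for (j in 1:n) {
      // index the per-observation parameters instead of passing the whole vectors
      log_lik[j] = neg_binomial_2_log_lpmf(yCDC[j] | loglambdaIN[j], phiIN[j])
                 + neg_binomial_2_log_lpmf(yHLC[j] | loglambdaOUT[j], phiOUT[j]);
    }
}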


Oh dear, yes, of course. Sorry, this is a bit embarrassing; I didn’t want to use the forum for debugging my code, but I didn’t expect a problem on this level! (Also, I thought this dimensionality problem would have thrown an error, but the vectorised signature, which returns the sum over all observations, is apparently perfectly legal.)

Now the Pareto-k plot looks much more reasonable:

[pareto_k_new: updated plot of the Pareto-k diagnostics, values now in a reasonable range]

Thank you very much!


We all make these mistakes.


I’m late to the party, but when I saw that many Pareto-k’s were larger than 10, I did expect a coding error. Great that @ahartikainen spotted it.
