LOO hierarchical models: very low p_loo without warnings nor errors

I am struggling to understand the estimate of p_loo after fitting a hierarchical model and performing leave-one-out cross-validation. The model is built to fit biological data that follow a skew normal distribution. This distribution has four parameters which have been transformed into hyperparameters in order to understand more precisely how each hyperparameter depends on the biological variables of interest. The biological variables are seven and five are dummy, one is continuous and one is the interaction between a dummy variable and the continuous variable.

Each of the four hyperparameters is defined as a linear regression with one intercept and seven slopes (there are seven biological variables), so eight parameters for each hyperparameter. In total, the model contains 32 parameters (four hyperparameters times eight parameters). All of this is translated into the stan file:

data {
  int<lower=0> N;                //sample size
  int<lower=0> K;                //number of dummy variables
  matrix[N, K] X;                //matrix with dummy variables 
  vector[N] ratio;               //continuous independent variable
  int<lower=0, upper=1> y[N];    //binary dependent variable

parameters {
  real<lower=-4, upper=4> alpha_intercept;
  real<lower=0, upper=1> b_intercept;
  real<lower=-2, upper=2> c_intercept;
  real<lower=0.0001, upper=3> d_intercept;
  vector[K] alpha_coeff;       //slopes hyperparameter alpha
  vector[K] b_coeff;           //slopes hyperparameter b
  vector[K] c_coeff;           //slopes hyperparameter c
  vector[K] d_coeff;           //slopes hyperparameter d

transformed parameters {
  vector[N] alpha_hyp;
  vector[N] b_hyp;
  vector[N] c_hyp;
  vector[N] d_hyp;
  vector[N] y_hat;
  alpha_hyp = alpha_intercept + X * alpha_coeff;    //linear regression alpha
  b_hyp = b_intercept + X * b_coeff;                //linear regression b
  c_hyp = c_intercept + X * c_coeff;                //linear regression c
  d_hyp = d_intercept + X * d_coeff;                //linear regression d
  for (i in 1:N) {
    y_hat[i] = 0.01 + b_hyp[i] * exp(-0.5 * ((ratio[i] - c_hyp[i]) / d_hyp[i])^2) * (1 + erf(alpha_hyp[i] * (ratio[i] - c_hyp[i]) / (1.414214 * d_hyp[i])));

model {
  alpha_coeff ~ normal(0, 1);        //prior regression coefficients hyperparameter alpha
  b_coeff ~ normal(0, 1);            //prior regression coefficients hyperparameter b
  c_coeff ~ normal(0, 1);            //prior regression coefficients hyperparameter c
  d_coeff ~ normal(0, 1);            //prior regression coefficients hyperparameter d
  alpha_hyp ~ gamma(12, 7);          //prior hyperparameter alpha
  b_hyp ~ beta(2, 2);                //prior hyperparameter b
  c_hyp ~ normal(-0.1, 0.5);         //prior hyperparameter c
  d_hyp ~ gamma(7, 8);               //prior hyperparameter d
  y ~ bernoulli_logit(logit(y_hat));

generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(y[n] | logit(0.01 + b_hyp[n] * exp(-0.5 * ((ratio[n] - c_hyp[n]) / d_hyp[n])^2) * (1 + erf(alpha_hyp[n] * (ratio[n] - c_hyp[n]) / (1.414214 * d_hyp[n])))));

The fitting ends without warnings nor errors as well as the loo cross-validation (I have followed this vignette. And, this is the loo output:

Computed from 24000 by 4330 log-likelihood matrix

. Estimate SE
elpd_loo -2224.2 27.1
p_loo 4.7 0.1
looic 4448.5 54.2

Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).

My reasoning behind this large discrepancy between the p_loo (4.7) and the real number of parameters (32) is:

  1. Most of the posterior distributions are centred around zero.
  2. Only the four intercepts show a posterior distribution whose 95% confidence intervals do not overlap zero.

Nevertheless, there are also five-six parameters that are estimated to be close to zero but without the 95% CI spanning over zero. If my reasoning is correct, I would have expected the p_loo to be around seven or eight. Am I making sense?

  • What diagnostics can I run to explore this issue (no divergent transitions and low Pareto k estimates)?
  • Does the stan file reflect what is described in the first paragraph?

I have tried to reduced the number of hyperparameters and thus the number of parameters. For example, with only one hyperparameter, the model contains the three parameters that were not transformed into hyperparameters plus one intercept and seven slopes of the hyperparameters. So, the total number of parameters for the model with one hyperparameter is 11 (3 + 8). This returns a very similar p_loo of the one above which would make sense if my earlier reasoning is correct.

Any comments, suggestions or requests for further explanations are very appreciated.

I think everything is fine.

It’s more relevant how the posterior is different from prior and how strong posterior dependencies are.

Compare prior to posterior and examine posterior dependencies.

Looking at your model, you have very strong constraints on intercepts (lower and upper should be used only for logical or physical constraints like variances being positive and probabilities between 0 and 1, and otherwise smooth priors should be used)

What is the equation for y_hat? Couldn’t figure out, but seems like something with strong dependency between parameters.

In generated quantities you could re-use y_hat as it is transformed parameter.

Why do you have logit twice in y ~ bernoulli_logit(logit(y_hat)); ?

Hi @avehtari,

Thank you for your helpful reply and sorry for the late reaction, I have been simulating some data to check whether the estimates from the model comparisons behaved as expected. They did and I am more confident about my interpretation of the loo output.

Anyway, back to your comments, yes the posterior dependencies are strong and this is also part of the reason why there are strong constraints on the intercepts. This type of information comes from the non-hierarchical version of the model above that was instead fitted using weakly informative prior as recommended by you and the rest of the Stan team.

It defines a skew normal distribution and yes, there is strong dependency between parameters but it is the one that can best described the data.

Thanks for this tip!

Next question:

We prefer to work on the probability scale so that we can feed the Stan output directly into our downstream analyses.

See https://mc-stan.org/docs/2_21/functions-reference/bernoulli-logit-distribution.html

bernoulli_logit(alpha) = bernoulli_logit(logit^{-1}(alpha))

So you could just write
y ~ bernoulli(yhat)

and avoid extra transformations

Also working on probability scale is likely to have more numerical problems, but I couldn’t figure out what you yhat[i] assignment is doing (adding 0.01 looks suspicious)

I should have thought about using just the bernoulli function! Thanks for the advice.

0.01 looks indeed suspicious in that equation but because of the strong posterior dependencies in this hierarchical model we decided to put simply the estimate of the non-hierarchical version of this model. Perhaps it would have been better to simply state that when using the skew normal equation in a hierarchical model, strong posterior dependencies produced divergent transitions. Setting that particular parameter to 0.01 solved this problem because this parameter was found to be highly correlated with b_hyp parameter.