Reparameterizing to avoid low E-BFMI warning

I’m attempting to fit a hierarchical binary logistic regression model, but have received a low E-BFMI warning for all four chains:

## E-BFMI indicated possible pathological behavior:
##   Chain 1: E-BFMI = 0.128
##   Chain 2: E-BFMI = 0.169
##   Chain 3: E-BFMI = 0.179
##   Chain 4: E-BFMI = 0.141
## E-BFMI below 0.2 indicates you may need to reparameterize your model.

I’m quite new to Stan so I’m not really sure where to start with the suggestion to reparameterize my model. (I think) I have used non-centred parameterizations across the board, as well as following the Stan User’s Guide advice to reparameterize the half-Cauchy priors on my SDs and use Cholesky factorization for my LKJ prior:

parameters {
  real alpha0;                            // Intercept
  vector[n_issues] alphaj;                // Issue-specific effects
  real<lower=0,upper=pi()/2> sigma_issue_unif;  // Transformed SD of issue-specific effects
  vector[n_fam] gamma;                    // Random family effects
  real<lower=0,upper=pi()/2> sigma_gamma_unif;  // Transformed SD of family random effects
  cholesky_factor_corr[n_issues] L_Omega; // Cholesky factor of correlation
  real<lower=0,upper=pi()/2> sigma_eta_unif;    // Transformed SD of family-issue-specific effects
  matrix[n_issues,n_fam] z;
}
transformed parameters {
  real<lower=0> sigma_issue;              // SD of issue-specific effects
  real<lower=0> sigma_gamma;              // SD of family random effects
  real<lower=0> sigma_eta;                // SD of family-issue-specific effects
  matrix[n_issues,n_fam] eta;            // Family-issue-specific effects
  sigma_issue = 0.07 * tan(sigma_issue_unif);
  sigma_gamma = 0.07 * tan(sigma_gamma_unif);
  sigma_eta = 0.07 * tan(sigma_eta_unif);
  for (n in 1:n_fam)
    eta[,n] = sigma_eta * L_Omega * z[,n];
}
model {
  alpha0 ~ logistic(0,1);                 // Logistic prior for the intercept
  L_Omega ~ lkj_corr_cholesky(2);         // LKJ prior on Cholesky factor
  for (n in 1:n_fam)
    z[,n] ~ std_normal();
  alphaj ~ std_normal();                  // Issue-specific effects
  gamma ~ std_normal();                   // Random family effects
  concern ~ bernoulli_logit(alpha0 + sigma_issue*xissues*alphaj + sigma_gamma*xfam*gamma +
    xfull*to_vector(eta)); // Logistic regression model
}
generated quantities {
   corr_matrix[n_issues] Omega;           // Correlation between issues
   Omega = multiply_lower_tri_self_transpose(L_Omega);
}

The issue appears to be in the primitive SD parameters, specifically sigma_issue_unif and sigma_eta_unif (sigma_gamma_unif looks fine, as do all of the other parameter vs energy__ scatterplots).

The model takes a while to run so it’s a bit difficult for me to just try things to see what sticks. Am I better off using a half-Cauchy prior directly instead of reparameterizing? Would a weaker, stronger or different prior help matters (I can’t remember where I got 0.07 from)?

Any advice on speeding up my code would be great too. I’m not sure I’m using vectorization as efficiently as I could be.

Can you do plots of:

  1. energy vs sigma_eta
  2. energy vs sigma_issue
  3. sigma_eta vs. sigma_issue

What is xissues (dimensions too)? It is strange to me that covariates (if they are covariates) would go into a term like this. Do you have a centered version of this? I might be reading it wrong.

Thanks Ben for your interest in my question.

I had a bit more of a think and realised that I didn’t want/need to estimate the SD of alphaj - they are “fixed effects”. So I removed sigma_issue altogether.

I still received the E-BFMI warnings with a similar sigma_eta_unif vs energy__ plot, but saw comments elsewhere on these forums that half-Cauchys aren’t so well supported as SD priors. Thinking about it more, I agreed that these put too much weight on quite extreme values in my situation, so I changed it to a half-normal prior.

I still have quite a strong apparent correlation between sigma_eta and energy__ appearing in the plot, but now E-BFMI is up around 0.35-0.4 and I do not get any warnings. I’ll consider more carefully whether half-normal is really what I want, but for now it seems to have avoided the issue.

xissues is (like xfam and xfull) just a ‘design matrix’ of 0s and 1s to help me match up the observations to their parameters; its dimensions are n_obs * n_issues. The model doesn’t include any real covariates yet, but I plan to add them in the future, which is why I set it up like this.

I don’t have a centered version of the code, but this might make clearer what model I intended to fit

\mathrm{logit}(p_{ij}) = \alpha_0 + \alpha_j + \gamma_i + \eta_{ij}, \qquad i = 1, \dotsc, n, \; j = 1, \dotsc, J

with \alpha_0 \sim \mathrm{Logistic}(0,1), \alpha_j \sim N(0,3), \gamma_i \sim N(0,\sigma_\gamma), \sigma_\gamma \sim \text{Half-}N(0,1), \eta_{i} \sim MVN_J(0,\sigma_\eta \Omega^{1/2}), \sigma_\eta \sim \text{Half-}N(0,1), \Omega \sim LKJ(2)

The main idea being that the J responses within each i are correlated in some way, and (along with the \alpha_j s), that correlation matrix \Omega is of great interest.

Nice! Glad to hear it’s working out.

Aaah, gotcha, makes sense.

Someone in another thread just asked something about why you’d want to model a covariance like this. I think I messed up the answer. If you’re feeling charitable: Estimating covariance terms in multilevel model :D

I’m afraid I think I’m too new to Bayes to get my head around the exact question they are asking (or how you feel you messed up), but maybe if I describe why the correlation is of interest in my case, it might help?

My scenario is a survey of several potential ethical issues that might arise in a certain medical setting. Families indicated whether or not each issue was concerning to them, as a simple yes/no response.

The scientific questions of interest were, in order of complexity:

  1. What is the general level of concern about these ethical issues
  2. What is the general level of concern about each specific issue j
  3. How are the issues related to one another, e.g. are families who are likely to say ‘issue 7’ is a concern more or less likely to say ‘issue 3’ is a concern?
  4. [Still to do] What characteristics of families are associated with their general level of concern across all issues?
  5. [Still to do] Are there any family characteristics that are associated with concern about specific issues?

Using the notation from my previous post, \alpha_0 and \alpha_j help answer the first two questions; \gamma_i is a ‘random effect’ to allow families to have different levels of general concern across all issues, and \Omega is essentially there to answer the third question.

I had initially attempted to do this by putting a multivariate normal prior on the \alpha_j s and estimating their correlation, but I hope I was correct in realising (after many computational issues) that it was the correlation between issue-specific random effects within families (i.e. the \eta_{ij} s) that is actually what I wanted to model.

I agree with both of your posts in the other thread, but again I’m not sure if I’ve missed the point or not. I suppose that observing correlations in your posterior after prior independence is a different thing to obtaining a posterior for the correlation; and I hope my scenario is one in which the latter is of direct interest.

What’s happening here is that your regression is poorly identified – or even non-identified – which manifests as a likelihood function that is narrow along the identified directions but stretching towards infinity in the others. With half-Cauchy priors you allow way too much of that pathological behavior into the posterior and the HMC implementation in Stan is struggling to adequately explore it, hence the warning.*

By moving to tighter priors you cut off more of the pathology, which then induces a posterior that Stan can actually handle.

If you want more diffuse priors then consider an exponential density or a Student-t density with 4-7 degrees of freedom. A Cauchy density is never the right answer.

  • For the more technically minded, the Hamiltonian trajectories are able to explore only a tiny portion of the posterior typical set and instead the sampler is relying on the momenta resampling to diffuse it across the typical set. That diffusion, however, is too slow to be able to guarantee complete exploration which manifests as the warning.
4 Likes

Nah nah! This is good. That’s what I was looking for – something application-y. I linked your post to the other thread. That seems like a really clear description of what is happening and why you’re doing it.

Thanks for that! Also thanks @betanalpha for the follow up.