What to do about divergent transitions on the intercepts of a random effect in brm?

I am running a mixed-effects model with brm. I am following the tips I have read here before about slowly building up model complexity, since my initial complicated model did not perform well at all. Here is the brm code I am using so far:

test.m1 <- brm(bf(party ~ income + (1|scode) + (1|gender), decomp = "QR"),
               family = categorical,
               iter = 2200, warmup = 1400,
               data = megapoll, cores = nCores, chains = nCores,
               prior = testp,
               control = list(max_treedepth = 10, adapt_delta = 0.9))

And here is the corresponding stancode() output:

functions {
}
data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int Y[N];  // response variable
  int<lower=1> K_mu2;  // number of population-level effects
  matrix[N, K_mu2] X_mu2;  // population-level design matrix
  int<lower=1> K_mu3;  // number of population-level effects
  matrix[N, K_mu3] X_mu3;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_mu2_1;
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_mu2_1;
  // data for group-level effects of ID 3
  int<lower=1> N_3;  // number of grouping levels
  int<lower=1> M_3;  // number of coefficients per level
  int<lower=1> J_3[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_3_mu3_1;
  // data for group-level effects of ID 4
  int<lower=1> N_4;  // number of grouping levels
  int<lower=1> M_4;  // number of coefficients per level
  int<lower=1> J_4[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_4_mu3_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc_mu2 = K_mu2 - 1;
  matrix[N, Kc_mu2] Xc_mu2;  // centered version of X_mu2 without an intercept
  vector[Kc_mu2] means_X_mu2;  // column means of X_mu2 before centering
  // matrices for QR decomposition
  matrix[N, Kc_mu2] XQ_mu2;
  matrix[Kc_mu2, Kc_mu2] XR_mu2;
  matrix[Kc_mu2, Kc_mu2] XR_mu2_inv;
  int Kc_mu3 = K_mu3 - 1;
  matrix[N, Kc_mu3] Xc_mu3;  // centered version of X_mu3 without an intercept
  vector[Kc_mu3] means_X_mu3;  // column means of X_mu3 before centering
  // matrices for QR decomposition
  matrix[N, Kc_mu3] XQ_mu3;
  matrix[Kc_mu3, Kc_mu3] XR_mu3;
  matrix[Kc_mu3, Kc_mu3] XR_mu3_inv;
  for (i in 2:K_mu2) {
    means_X_mu2[i - 1] = mean(X_mu2[, i]);
    Xc_mu2[, i - 1] = X_mu2[, i] - means_X_mu2[i - 1];
  }
  // compute and scale QR decomposition
  XQ_mu2 = qr_thin_Q(Xc_mu2) * sqrt(N - 1);
  XR_mu2 = qr_thin_R(Xc_mu2) / sqrt(N - 1);
  XR_mu2_inv = inverse(XR_mu2);
  for (i in 2:K_mu3) {
    means_X_mu3[i - 1] = mean(X_mu3[, i]);
    Xc_mu3[, i - 1] = X_mu3[, i] - means_X_mu3[i - 1];
  }
  // compute and scale QR decomposition
  XQ_mu3 = qr_thin_Q(Xc_mu3) * sqrt(N - 1);
  XR_mu3 = qr_thin_R(Xc_mu3) / sqrt(N - 1);
  XR_mu3_inv = inverse(XR_mu3);
}
parameters {
  vector[Kc_mu2] bQ_mu2;  // regression coefficients at QR scale
  real Intercept_mu2;  // temporary intercept for centered predictors
  vector[Kc_mu3] bQ_mu3;  // regression coefficients at QR scale
  real Intercept_mu3;  // temporary intercept for centered predictors
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  vector[N_2] z_2[M_2];  // standardized group-level effects
  vector<lower=0>[M_3] sd_3;  // group-level standard deviations
  vector[N_3] z_3[M_3];  // standardized group-level effects
  vector<lower=0>[M_4] sd_4;  // group-level standard deviations
  vector[N_4] z_4[M_4];  // standardized group-level effects
}
transformed parameters {
  vector[N_1] r_1_mu2_1;  // actual group-level effects
  vector[N_2] r_2_mu2_1;  // actual group-level effects
  vector[N_3] r_3_mu3_1;  // actual group-level effects
  vector[N_4] r_4_mu3_1;  // actual group-level effects
  real lprior = 0;  // prior contributions to the log posterior
  r_1_mu2_1 = (sd_1[1] * (z_1[1]));
  r_2_mu2_1 = (sd_2[1] * (z_2[1]));
  r_3_mu3_1 = (sd_3[1] * (z_3[1]));
  r_4_mu3_1 = (sd_4[1] * (z_4[1]));
  lprior += normal_lpdf(bQ_mu2 | 0,5);
  lprior += student_t_lpdf(Intercept_mu2 | 3, 0, 2.5);
  lprior += normal_lpdf(bQ_mu3 | 0,5);
  lprior += student_t_lpdf(Intercept_mu3 | 3, 0, 2.5);
  lprior += student_t_lpdf(sd_1 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  lprior += student_t_lpdf(sd_2 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  lprior += student_t_lpdf(sd_3 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  lprior += student_t_lpdf(sd_4 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu2 = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] mu3 = rep_vector(0.0, N);
    // linear predictor matrix
    vector[ncat] mu[N];
    mu2 += Intercept_mu2 + XQ_mu2 * bQ_mu2;
    mu3 += Intercept_mu3 + XQ_mu3 * bQ_mu3;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu2[n] += r_1_mu2_1[J_1[n]] * Z_1_mu2_1[n] + r_2_mu2_1[J_2[n]] * Z_2_mu2_1[n];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu3[n] += r_3_mu3_1[J_3[n]] * Z_3_mu3_1[n] + r_4_mu3_1[J_4[n]] * Z_4_mu3_1[n];
    }
    for (n in 1:N) {
      mu[n] = transpose([0, mu2[n], mu3[n]]);
    }
    for (n in 1:N) {
      target += categorical_logit_lpmf(Y[n] | mu[n]);
    }
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(z_1[1]);
  target += std_normal_lpdf(z_2[1]);
  target += std_normal_lpdf(z_3[1]);
  target += std_normal_lpdf(z_4[1]);
}
generated quantities {
  // obtain the actual coefficients
  vector[Kc_mu2] b_mu2 = XR_mu2_inv * bQ_mu2;
  // actual population-level intercept
  real b_mu2_Intercept = Intercept_mu2 - dot_product(means_X_mu2, b_mu2);
  // obtain the actual coefficients
  vector[Kc_mu3] b_mu3 = XR_mu3_inv * bQ_mu3;
  // actual population-level intercept
  real b_mu3_Intercept = Intercept_mu3 - dot_product(means_X_mu3, b_mu3);
}

The ‘gender’ random effect is causing my divergent transitions (see attached photo) at the mu2 and mu3 intercepts. I initially thought this would be an issue with the priors, but the pp_check seems to suggest that this is not the case.

Any advice on where to go? Thanks y’all, I appreciate y’all’s time!


How many levels are there in gender? If there are only 2, then this might make the sd difficult to estimate, particularly with those heavier-tailed Student-t default priors. You might try putting gender in as a population-level effect. If you don’t want to do that, then you probably need a more informative prior for the sd.
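To make that suggestion concrete, moving gender from the group-level to the population-level part of the original formula would look something like this (a sketch based on the model above; the original prior object testp is omitted since it would need a class = b entry for the new gender contrast):

```r
# gender as a population-level (fixed) effect; 'scode' stays group-level.
# The contrast-coded gender effect replaces the varying intercept, so there
# is no longer an sd parameter being estimated from only 2 levels.
test.m2 <- brm(bf(party ~ income + gender + (1 | scode), decomp = "QR"),
               family = categorical(),
               data = megapoll,
               iter = 2200, warmup = 1400,
               chains = nCores, cores = nCores,
               control = list(adapt_delta = 0.9))
```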



Thanks for the response. I was actually just reading about this last night in a blog post by Gelman, I think. I put gender in as a group-level effect because I was under the assumption that group-level effects vary within participants while population-level effects vary across groups (e.g., US state, region). Additionally, I intend to use gender as part of an interaction as the model becomes more complex (gender:edu, for example), and honestly I am not sure whether I can interact population-level effects with group-level effects.

I am currently working on more informative priors, such as normal(0.2, 0.9), and checking these iteratively with sample_prior = "only".
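For reference, that prior-only workflow looks something like the following sketch (same model and data as above; note the argument value is lowercase "only"):

```r
# Sample from the prior alone (the likelihood is ignored), then inspect
# the outcomes the prior implies with a prior predictive check.
prior.m1 <- brm(bf(party ~ income + (1 | scode) + (1 | gender), decomp = "QR"),
                family = categorical(),
                data = megapoll,
                prior = testp,
                sample_prior = "only")
pp_check(prior.m1, ndraws = 100)
```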

(1|gender) adds a varying intercept for gender. Thus you’re specifying your model to have an intercept and then offsets from that intercept for each level of gender. These offsets are estimated by assuming that they come from a normal distribution with a mean of zero and some standard deviation that is estimated in the model. If gender has 2 levels, then you are estimating that sd from only two levels. If you include gender in the ‘population-level’ effects, then you instead estimate a parameter that is the contrast between the reference level and the other level. You can get predictions for each level of gender either way, but with group-level effects you get partial pooling, whereas with population-level effects there is no pooling. If you want to include an interaction between gender and another predictor, typically you will include them in the population-level effects.

Edit - no pooling when included in fixed effects


Ah, this makes much more sense. I think given this, it makes more sense for me to add this as a population-level effect.

One quick question, do you have any recommendations for things to read/advice on levels and group-/population-level effects? Or is there a general rule of thumb?

Michael Betancourt has a readable yet detailed explanation of hierarchical models and their implementation in Stan: Hierarchical Modeling.
For something a little bit easier to digest, you could try Richard McElreath’s YouTube lectures and/or his Statistical Rethinking book.

I’m not sure there is a “general rule of thumb”, because I am not sure that would be a good idea (although you can probably find attempts at one out on the internet). Hierarchical models (aka multilevel or random-effects models) are a way to model heterogeneity among discrete groups. The heterogeneity can come from a variety of sources (I’ll give examples below), but importantly, the reason for it is unknown to the modeler (otherwise you would model it directly!). Betancourt describes this as a way of including latent interactions between parameters for the groups.

This allows for something called partial pooling, where information is shared across groups (i.e., estimates from some groups inform estimates for other groups). This is different from the no-pooling model (including the groups in the population-level effects), where each group has a different parameter but there is no latent interaction or sharing of information, and it is different from the complete-pooling model, where the heterogeneity is completely ignored and all groups have the same parameter (akin to a brms model where ‘group’ is not in the model formula at all). Usually, in the hierarchical model, the parameters for each group are modeled as coming from a normal distribution with a standard deviation estimated from the data, as I described in an above post. An important assumption is exchangeability, which Betancourt gives a thorough explanation of in the link I gave.
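The three pooling regimes described above can be written as brms formulas (with hypothetical variable names outcome, x, and group):

```r
# complete pooling: group heterogeneity ignored entirely
outcome ~ 1 + x
# no pooling: group in the population-level effects, one contrast per level,
# no sharing of information across levels
outcome ~ 1 + x + group
# partial pooling: group-level (hierarchical) intercepts drawn from a
# common normal distribution with an estimated sd
outcome ~ 1 + x + (1 | group)
```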

So what are some examples?

Longitudinal data - Perhaps you have a study where you follow participants over time and take multiple measurements per time point. The data among participants is ‘clustered’ or correlated, because there are latent variables and functions that cause the outcomes across time to be correlated for each person. In addition, each participant can be thought of as coming from a larger population of participants. Since we don’t know what the latent variables and functions that cause the correlation are, we can use a hierarchical model to partially pool information across participants and model the heterogeneity among them. A brms model formula might look like outcome ~ 1 + time + (time|id).

Locations - You may have data that comes from distinct locations or regions. This is not the same as spatial data where you have continuous distances, in which case you might use some sort of correlation or Gaussian processes model. The locations may be state or country or farm plot or site. The data could have a nested structure so that you have regions within larger regions, like outcome ~ 1 + intervention + (1 | Country/State). Partial pooling is really nice because you may have different numbers of observations per site, and the sharing of information can make prediction tasks much more accurate for groups with little information (this applies to any partial pooling model, not just this example).

Experiments - If you are running controlled experiments in a lab, you can still have heterogeneity among batches of experiments. Similar to the above, you may want to include varying effects for each batch or experiment. Something like outcome ~ 1 + treatment + (1|batch) or outcome ~ 1 + treatment + (treatment|batch)

Many treatments - Typically one thinks of the ‘treatment’ variable as a population-level effect, but if you have a large number of treatments then it is possible that you may want to model them as a varying effect. This is particularly true if you have a less-than-desirable amount of data and want to use partial pooling to share information across treatments. For example, suppose you have a particular class of drugs that works on the same mechanism; then it could make sense to include the drugs as group-level rather than population-level effects.
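As a sketch of that last case (with hypothetical names outcome and drug):

```r
# few treatments, ample data: treatment as a population-level effect
outcome ~ 1 + treatment
# many related drugs, sparse data: partial pooling across drugs,
# each drug's effect drawn from a common distribution
outcome ~ 1 + (1 | drug)
```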

These are only a few common examples among many use cases! The main thing to remember is your assumptions and if those are valid for your data and model of it.

Final note - as I hinted at, hierarchical models are used when you don’t have enough information to do better. If you have information on how the heterogeneity occurs, then model that! Here is a made-up example - suppose you have a lab with a lot of different student workers, and you are running a study where bacteria in solution are pipetted to different growth plates and a treatment is applied in various amounts. Now suppose that the study consists of many smaller batches of experiments where a different student worker pipettes each batch. Due to systematic errors in pipetting among students, your brms model might look something like growth ~ 1 + treatment + (1|batch). Now, suppose that you discover that the pipetting error occurs more in students with very small or very large hands, and that the error increases with the discrepancy between the ‘standard’ hand size the pipette is designed for and the size of the student’s hand. Once you put that into the model, your heterogeneity disappears (the sd of batch is near zero).
That’s just a highly stylized example, but the point is, if you know how the heterogeneity among distinct groups occurs, then model it rather than automatically throwing a hierarchical model at it.
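In brms terms, that change might look like the following (hand_discrepancy is a made-up covariate name for this stylized example):

```r
# heterogeneity left latent, absorbed by the batch intercepts
growth ~ 1 + treatment + (1 | batch)
# heterogeneity modeled directly via the discovered covariate;
# the estimated sd of the batch intercepts should now shrink toward zero
growth ~ 1 + treatment + hand_discrepancy + (1 | batch)
```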

I hope that helps!


This is more than helpful! Thank you so much for your time and for such a great answer. I really appreciate it!