Low bulk ess simple random intercept model

I am trying to run a simple random intercept model with these data. Using lme4 I get these estimates and no convergence warnings:

> l = lmer(wy ~ zigdp_pc + (1|ctryear), data = dat, REML = FALSE)
> summary(l)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: wy ~ zigdp_pc + (1 | ctryear)
   Data: idat

     AIC      BIC   logLik deviance df.resid 
 -2140.2  -2118.2   1074.1  -2148.2     1822 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.5505 -0.5702 -0.0128  0.5261  5.3102 

Random effects:
 Groups   Name        Variance Std.Dev.
 ctryear  (Intercept) 0.14648  0.3827  
 Residual             0.01447  0.1203  
Number of obs: 1826, groups:  ctryear, 74

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.20660    0.04462    4.63
zigdp_pc     0.28908    0.01088   26.57

Correlation of Fixed Effects:
         (Intr)
zigdp_pc -0.036

When I run the model using brms I get:

bm = brm(wy ~ zigdp_pc + (1|ctryear), data = idat) 
> summary(bm)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: wy ~ zigdp_pc + (1 | ctryear) 
   Data: idat (Number of observations: 1826) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~ctryear (Number of levels: 74) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.39      0.03     0.34     0.45 1.02       91      241

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.21      0.04     0.13     0.30 1.16       28       90
zigdp_pc      0.29      0.01     0.27     0.31 1.02      330      625

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.12      0.00     0.12     0.12 1.00     1023     1618

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning message:
Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors. 

The diagnostic plots look a bit funky for the intercept, but still, if I increase the iterations to 15,000 I get the same issue (the bulk ess for the intercept goes up to 145).

The number of data points by group ranges from 10 to 50.
Any ideas or suggestions on how to deal with this?

Thanks!

2 Likes

So this is something I’ve already seen a few times, but currently have no super good insights about. The main issue is IMHO a negative correlation in the posterior between the main intercept and the random intercept.

pairs(bm$fit, pars = c("b_Intercept", "r_ctryear[2020.1950,Intercept]", "r_ctryear[2020.1990,Intercept]"))

This is probably related to the fact that you have a lot of data per group, which can overwhelm the hierarchical prior on the varying intercepts and make the model behave almost as if you had a dummy-coded fixed intercept for each level of ctryear (while with fixed effects you would usually not have a fixed intercept for the reference level)

What to do about this?

One think is just to fit ctryear as fixed effect: bm_fixed = brm(wy ~ zigdp_pc + ctryear, data = dat) doesn’t actually work that badly. You can even be a bit more clever about this and fit the levels with a lot of observations as fixed intercepts and keep the varying intercept only for those with few observations: you would introduce a new variable ctryear_fixed which has value “none” for those where we want to have a random effect and a specifc level otherwise, similarly ctryear_random then something like wy ~ zigdp_pc + ctryear_fixed + (1 | ctryear_random) could work better.

Mike’s case study on hierarchical modelling discusses that in such cases, you may want to use centered parametrization for some/all of your levels (brms uses the non-centered parametrization). There is an issue to let you have this in brms but it is not implemented yet: Centered and partially centered parametrization of varying effects/intercepts · Issue #1162 · paul-buerkner/brms · GitHub

One can manually edit the generated brms code and use non-centered for all parameters, but that doesn’t help. One could also add a sum-to-zero constrain on the varying intercepts to remove one degree of freedom (I think R-INLA does this by default), this also does not help (at least a simple implementation). However doing both of those things helps, here’s the manually modified Stan code:

// generated with brms 2.15.7
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  // Original code by brms
  // vector[N_1] z_1[M_1];  // standardized group-level effects
  
  // Manually added
  vector[N_1] r_1_1;  // actual group-level effects
}

// Original code by brms
// transformed parameters {
//  vector[N_1] r_1_1;  // actual group-level effects
//  r_1_1 = (sd_1[1] * (z_1[1]));
// }

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
    }
    target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, 0.1, 2.5);
  target += student_t_lpdf(sigma | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
    
  // Original code by brms
  // target += std_normal_lpdf(z_1[1]);
  
  // Manually added
  target += normal_lpdf(r_1_1 | 0, sd_1[1]);
  
  // manual soft sum to zero constraint
  target += normal_lpdf(sum(r_1_1) | 0, 0.0001 *N_1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

I then use make_standata to build data for the model and sample with cmdstanr directly (would also work with rstan), i.e.:

library(cmdstanr)
m_centered <- cmdstan_model("modified.stan")
sdat <- make_standata(wy ~ zigdp_pc + (1|ctryear), data = dat) 
fit_centered <- m_centered$sample(data = sdat)

bm_manual <- brm(wy ~ zigdp_pc + (1|ctryear), data = dat, empty = TRUE) 
bm_manual$fit <- rstan::read_stan_csv(fit_centered$output_files())
bm_manual <- rename_pars(bm_manual)
summary(bm_manual)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: wy ~ zigdp_pc + (1 | ctryear) 
   Data: dat (Number of observations: 1826) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~ctryear (Number of levels: 74) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.39      0.03     0.33     0.46 1.00     6214     2909

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.21      0.00     0.20     0.21 1.00     1651     2057
zigdp_pc      0.29      0.01     0.27     0.31 1.00      530     1027

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.12      0.00     0.12     0.12 1.00     6207     3161

Samples were drawn using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Hope that helps at least a bit!

2 Likes

Thanks so much, @martinmodrak ! Yes, this is really helpfull. I realized a couple of days ago, centered parametrization works better for these data. To add that option in brms would be great!

Thanks again for your answer!
Sebastian

2 Likes

Hello @martinmodrak, I am using your stan code to estimate a model using centered parametrization, I added your code, but it’s not working, so I am not sure what I am missing (it’s related to the correlation matrix of random effects).

I am running a brms model like this:

make_stancode(smx  | se(mx_sd, sigma = TRUE) ~ + time +  period  + time_period + (time +  period  + time_period | fips), 
    data =  sm)

Where smx is mortality rate and all the random effects correlated to each other. My code is:

// generated with brms 2.16.1
functions {
 /* compute correlated group-level effects
  * Args: 
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
  *   matrix of scaled group-level effects
  */ 
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  vector<lower=0>[N] se;  // known sampling error
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  vector<lower=0>[N] se2 = square(se);
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // original brms
  // matrix[M_1, N_1] z_1;  // standardized group-level effects
  // added manually
  vector[N_1] r_1_1;
  vector[N_1] r_1_2;
  vector[N_1] r_1_3;
  vector[N_1] r_1_4;

  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
 
// original brms transformation
// transformed parameters {
//   matrix[N_1, M_1] r_1; 
//   // using vectors speeds up indexing in loops
//   vector[N_1] r_1_1;
//   vector[N_1] r_1_2;
//   vector[N_1] r_1_3;
//   vector[N_1] r_1_4;
//   // compute actual group-level effects
//   r_1 = scale_r_cor(z_1, sd_1, L_1);
//   r_1_1 = r_1[, 1];
//   r_1_2 = r_1[, 2];
//   r_1_3 = r_1[, 3];
//   r_1_4 = r_1[, 4];
// }
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + Xc * b;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n] + r_1_4[J_1[n]] * Z_1_4[n];
    }
    target += normal_lpdf(Y | mu, sqrt(square(sigma) + se2));
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, -0.1, 2.5);
  target += student_t_lpdf(sigma | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
    - 4 * student_t_lccdf(0 | 3, 0, 2.5);

  // Original values
  // target += std_normal_lpdf(to_vector(z_1));
  // Manually added
  target += normal_lpdf(to_vector(r_1) | 0, sd_1[1]);
  
  // Manual soft sum to zero constraint
  target += normal_lpdf(sum(to_vector(r_1)) | 0, 0.0001 * N_1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1);

}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}

I have issues generating the correlation matrix. Any hint or idea would be very welcome!

Thanks so much!
Sebastian