brms: non-centered to centered parameterization

Hi,
I know that the non-centered parameterization is generally preferred over the centered parameterization, but in some cases the centered parameterization is advised:

when there is a lot of data, the centered parameterization is more efficient
https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html

Since brms follows the non-centered parameterization approach, can someone please guide me on how best to change the brms-generated Stan code from the non-centered to the centered parameterization?

Below is an example of the Stan code generated for a nonlinear model with three correlated random effects.

library(brms)

b <- c(2, 0.75)
x <- rnorm(100)
y <- rnorm(100, mean = b[1] * exp(b[2] * x))
dat1 <- data.frame(x, y)


fit_loss <- brm(
  bf(cum ~ ult * (1 - exp(-(dev/theta)^omega)),
     ult ~ 1 + (1|i|AY), 
     omega ~ 1 + (1|i|AY), 
     theta ~ 1 + (1|i|AY), 
     nl = TRUE),
  data = loss, family = gaussian(),
  prior = c(
    prior(normal(5000, 1000), nlpar = "ult"),
    prior(normal(1, 2), nlpar = "omega"),
    prior(normal(45, 10), nlpar = "theta")
  ),
  control = list(adapt_delta = 0.9)
)

# scode <- stancode(fit_loss)

scode <- 
  "
// generated with brms 2.20.3
functions {
 /* compute correlated group-level effects
  * Args:
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns:
  *   matrix of scaled group-level effects
  */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K_ult;  // number of population-level effects
  matrix[N, K_ult] X_ult;  // population-level design matrix
  int<lower=1> K_omega;  // number of population-level effects
  matrix[N, K_omega] X_omega;  // population-level design matrix
  int<lower=1> K_theta;  // number of population-level effects
  matrix[N, K_theta] X_theta;  // population-level design matrix
  // covariates for non-linear functions
  array[N] int C_1;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_ult_1;
  vector[N] Z_1_omega_2;
  vector[N] Z_1_theta_3;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  vector[K_ult] b_ult;  // regression coefficients
  vector[K_omega] b_omega;  // regression coefficients
  vector[K_theta] b_theta;  // regression coefficients
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_ult_1;
  vector[N_1] r_1_omega_2;
  vector[N_1] r_1_theta_3;
  real lprior = 0;  // prior contributions to the log posterior
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_ult_1 = r_1[, 1];
  r_1_omega_2 = r_1[, 2];
  r_1_theta_3 = r_1[, 3];
  lprior += normal_lpdf(b_ult | 5000, 1000);
  lprior += normal_lpdf(b_omega | 1, 2);
  lprior += normal_lpdf(b_theta | 45, 10);
  lprior += student_t_lpdf(sigma | 3, 0, 1963.7)
    - 1 * student_t_lccdf(0 | 3, 0, 1963.7);
  lprior += student_t_lpdf(sd_1 | 3, 0, 1963.7)
    - 3 * student_t_lccdf(0 | 3, 0, 1963.7);
  lprior += lkj_corr_cholesky_lpdf(L_1 | 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] nlp_ult = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] nlp_omega = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] nlp_theta = rep_vector(0.0, N);
    // initialize non-linear predictor term
    vector[N] mu;
    nlp_ult += X_ult * b_ult;
    nlp_omega += X_omega * b_omega;
    nlp_theta += X_theta * b_theta;
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_ult[n] += r_1_ult_1[J_1[n]] * Z_1_ult_1[n];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_omega[n] += r_1_omega_2[J_1[n]] * Z_1_omega_2[n];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_theta[n] += r_1_theta_3[J_1[n]] * Z_1_theta_3[n];
    }
    for (n in 1:N) {
      // compute non-linear predictor values
      mu[n] = (nlp_ult[n] * (1 - exp( - (C_1[n] / nlp_theta[n]) ^ nlp_omega[n])));
    }
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
}
generated quantities {
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}

"

Thank you

The R code creates dat1 with columns x and y, but it seems that the data frame loss contains cum, dev, and AY?

If you specify 0 instead of 1 for the intercept in the formula, the Stan code will use the centered parameterization.

Thanks @mitzimorris for replying.

Please ignore dat1; it was posted by mistake.

I think my original question (non-centered to centered parameterization) was not stated clearly. Sorry for that.

I am looking for ways to modify the brms-generated Stan code to implement the centered parameterization for random effects (hierarchical models).

Please see below an example for a single random effect (source: https://mc-stan.org/docs/stan-users-guide/reparameterization.html):

Centered parameterization

parameters {
  real mu_beta;
  real<lower=0> sigma_beta;
  vector[K] beta;
  // ...
}
model {
  beta ~ normal(mu_beta, sigma_beta);
  // ...
}

Non-centered parameterization

parameters {
  vector[K] beta_raw;
  // ...
}
transformed parameters {
  vector[K] beta;
  // implies: beta ~ normal(mu_beta, sigma_beta)
  beta = mu_beta + sigma_beta * beta_raw;
}
model {
  beta_raw ~ std_normal();
  // ...
}

I am looking to modify the Stan code (for multiple correlated random effects, as shown in the original post) from the non-centered parameterization to the centered parameterization.

Thank you

The brms-generated Stan code doesn't always line up with the way the Stan code is written in the user's guide. It might be easier to take a look at the brms-generated Stan code for a formula specifying the intercept both as 0 and as 1 - comparing the two should make it easier to see what needs to be changed.

@mitzimorris 0 + Intercept turns off the internal design matrix centering, but does not switch the random effects parametrization from non-centered to centered, AFAIK.

For example, both of the following still come out non-centered:

make_stancode(rating ~ 0 + Intercept + (1 + treat | subject), data = inhaler)
make_stancode(rating ~ 0 + Intercept + (0 + treat | subject), data = inhaler)
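In both cases the telltale non-centered construction is still there: the generated code declares the standardized group-level effects z_1 as parameters and builds the actual effects by scaling them, e.g. for the first formula:

matrix[M_1, N_1] z_1;  // standardized group-level effects
...
r_1 = scale_r_cor(z_1, sd_1, L_1);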

@satpal To switch from non-centered to centered, you should remove the parameter

matrix[M_1, N_1] z_1;  // standardized group-level effects

add the following as a parameter (moving it from transformed parameters):

matrix[N_1, M_1] r_1;  // actual group-level effects

And then sample the matrix parameter from a multivariate normal with a covariance matrix constructed from the parameters sd_1 and L_1.

Thank you for your replies.

@jsocolar that's exactly what I am looking for, but I am not able to figure out how best to edit the brms-generated Stan code (best in terms of efficiency and the minimal changes required).

I would really appreciate it if you could provide an example.

Thank you

For minimal changes, something like:

for (i in 1:N_1) {
  r_1[i, ] ~ multi_normal_cholesky(rep_row_vector(0, M_1), diag_pre_multiply(sd_1, L_1));
}
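One small efficiency note: diag_pre_multiply(sd_1, L_1) is recomputed on every iteration of that loop. A minimal sketch of a cheaper variant (same sd_1, L_1, and r_1 as above; L_Sigma is just a local name introduced here):

// compute the Cholesky factor of the covariance matrix once,
// instead of once per group inside the loop
matrix[M_1, M_1] L_Sigma = diag_pre_multiply(sd_1, L_1);
for (i in 1:N_1) {
  r_1[i, ] ~ multi_normal_cholesky(rep_row_vector(0, M_1), L_Sigma);
}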

Thanks @jsocolar

Worked as expected!

Although the loss data used in the above example are really small (only 10 groups, 55 observations), both the non-centered parameterization (NCP) and the centered parameterization (CP) gave similar results. As expected with such small data, the CP run produced 27 divergences (same control settings for both NCP and CP: control = list(adapt_delta = 0.98, max_treedepth = 12)).

Below I provide the edited Stan code used for the CP, and a comparison of the NCP and CP results.

As you can see below (CP), I commented out the whole transformed parameters block and moved its contents to the model block, except for matrix[N_1, M_1] r_1; (which now sits in the parameters block) and r_1 = scale_r_cor(z_1, sd_1, L_1); (which is no longer needed).

Is there any way of using ~ multi_normal_cholesky (i.e., a sampling statement) within transformed parameters via the functions block?

Non-centered parameterization

fit_loss_ncp <- brm(
  bf(cum ~ ult * (1 - exp(-(dev/theta)^omega)),
     ult ~   1 + (1|i|AY), 
     omega ~ 1 + (1|i|AY), 
     theta ~ 1 + (1|i|AY), 
     nl = TRUE),
  data = loss, family = gaussian(),
  prior = c(
    prior(normal(5000, 1000), nlpar = "ult"),
    prior(normal(1, 2), nlpar = "omega"),
    prior(normal(45, 10), nlpar = "theta")
  ),
  control = list(adapt_delta = 0.98, max_treedepth = 12), 
  normalize = FALSE, # Switch off normalizing constant for speed
  cores = 8,
  init = 'random',
  seed = 123
)


summary(fit_loss_ncp)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: cum ~ ult * (1 - exp(-(dev/theta)^omega)) 
         ult ~ 1 + (1 | i | AY)
         omega ~ 1 + (1 | i | AY)
         theta ~ 1 + (1 | i | AY)
   Data: loss (Number of observations: 55) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~AY (Number of levels: 10) 
                                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(ult_Intercept)                      764.50    299.69   328.61  1495.02 1.00     1763
sd(omega_Intercept)                      0.09      0.07     0.00     0.25 1.01      797
sd(theta_Intercept)                      4.40      2.97     0.27    11.66 1.00      624
cor(ult_Intercept,omega_Intercept)       0.26      0.45    -0.70     0.93 1.00     2865
cor(ult_Intercept,theta_Intercept)       0.03      0.43    -0.81     0.79 1.00     2435
cor(omega_Intercept,theta_Intercept)    -0.43      0.47    -0.98     0.69 1.01      868
                                     Tail_ESS
sd(ult_Intercept)                        2106
sd(omega_Intercept)                      1573
sd(theta_Intercept)                      1508
cor(ult_Intercept,omega_Intercept)       2225
cor(ult_Intercept,theta_Intercept)       2302
cor(omega_Intercept,theta_Intercept)     2420

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ult_Intercept    5243.66    316.97  4664.95  5900.01 1.00     1421     1893
omega_Intercept     1.36      0.06     1.23     1.49 1.00     1872     2175
theta_Intercept    45.66      2.88    40.51    51.99 1.00     1937     2142

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   133.32     16.27   105.18   169.87 1.00     1788     2504

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Centered parameterization

# Get stancode 
scode <- brms::stancode(fit_loss_ncp)
# Get data
sdata <- brms::standata(fit_loss_ncp)
# Get initials
sinit <- fit_loss_ncp$fit@inits

# Edit stancode 
scode <-
  "
// generated with brms 2.20.3
functions {
 /* compute correlated group-level effects
  * Args:
  *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns:
  *   matrix of scaled group-level effects
  */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K_ult;  // number of population-level effects
  matrix[N, K_ult] X_ult;  // population-level design matrix
  int<lower=1> K_omega;  // number of population-level effects
  matrix[N, K_omega] X_omega;  // population-level design matrix
  int<lower=1> K_theta;  // number of population-level effects
  matrix[N, K_theta] X_theta;  // population-level design matrix
  // covariates for non-linear functions
  array[N] int C_1;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_ult_1;
  vector[N] Z_1_omega_2;
  vector[N] Z_1_theta_3;
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  vector[K_ult] b_ult;  // regression coefficients
  vector[K_omega] b_omega;  // regression coefficients
  vector[K_theta] b_theta;  // regression coefficients
  real<lower=0> sigma;  // dispersion parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
//matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
  matrix[N_1, M_1] r_1;  // actual group-level effects
}
transformed parameters {
 // Comment out the whole chunk from the transformed parameters
  /*
  matrix[N_1, M_1] r_1;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_ult_1;
  vector[N_1] r_1_omega_2;
  vector[N_1] r_1_theta_3;
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_ult_1 = r_1[, 1];
  r_1_omega_2 = r_1[, 2];
  r_1_theta_3 = r_1[, 3];
  */
}
model {
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_ult_1;
  vector[N_1] r_1_omega_2;
  vector[N_1] r_1_theta_3;
  real lprior = 0;  // prior contributions to the log posterior
  // centered parameterization: place the prior directly on the actual group-level effects
  for (i in 1:N_1) {
    r_1[i, ] ~ multi_normal_cholesky(rep_row_vector(0, M_1), diag_pre_multiply(sd_1, L_1));
  }
  // extract the actual group-level effects
  // (r_1 = scale_r_cor(z_1, sd_1, L_1); is no longer needed)
  r_1_ult_1 = r_1[, 1];
  r_1_omega_2 = r_1[, 2];
  r_1_theta_3 = r_1[, 3];
  // likelihood not including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] nlp_ult = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] nlp_omega = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] nlp_theta = rep_vector(0.0, N);
    // initialize non-linear predictor term
    vector[N] mu;
    nlp_ult += X_ult * b_ult;
    nlp_omega += X_omega * b_omega;
    nlp_theta += X_theta * b_theta;
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_ult[n] += r_1_ult_1[J_1[n]] * Z_1_ult_1[n];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_omega[n] += r_1_omega_2[J_1[n]] * Z_1_omega_2[n];
    }
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_theta[n] += r_1_theta_3[J_1[n]] * Z_1_theta_3[n];
    }
    for (n in 1:N) {
      // compute non-linear predictor values
      mu[n] = (nlp_ult[n] * (1 - exp( - (C_1[n] / nlp_theta[n]) ^ nlp_omega[n])));
    }
    target += normal_lupdf(Y | mu, sigma);
  }
  // priors not including constants
  lprior += normal_lupdf(b_ult | 5000, 1000);
  lprior += normal_lupdf(b_omega | 1, 2);
  lprior += normal_lupdf(b_theta | 45, 10);
  lprior += student_t_lupdf(sigma | 3, 0, 1963.7);
  lprior += student_t_lupdf(sd_1 | 3, 0, 1963.7);
  lprior += lkj_corr_cholesky_lupdf(L_1 | 1);
  target += lprior;
//  target += std_normal_lupdf(to_vector(z_1));
}
generated quantities {
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
}


"


# Fit CP model using rstan

rfit_loss_cp <- rstan::stan(model_code = scode, data = sdata, init = sinit, 
                            control = list(adapt_delta = 0.98, max_treedepth = 12), 
                            seed = 123,
                            cores = 8)

# Posterior summary 
options(scipen=999)
fit_summary <- summary(rfit_loss_cp)
fit_summary_beta_sd <- fit_summary$summary[1:7, ]
print(round(fit_summary_beta_sd, 3))


# Or, for a better comparison with the fit_loss_ncp,
# Create an empty model and populate results from the above rfit_loss_cp fit

fit_loss_cp <- brm(
  bf(cum ~ ult * (1 - exp(-(dev/theta)^omega)),
     ult ~   1 + (1|i|AY), 
     omega ~ 1 + (1|i|AY), 
     theta ~ 1 + (1|i|AY), 
     nl = TRUE),
  data = loss, family = gaussian(),
  prior = c(
    prior(normal(5000, 1000), nlpar = "ult"),
    prior(normal(1, 2), nlpar = "omega"),
    prior(normal(45, 10), nlpar = "theta")
  ),
  control = list(adapt_delta = 0.98, max_treedepth = 12), 
  init = 'random',
  normalize = FALSE, # Switch off normalizing constant for speed
  empty = TRUE
)

fit_loss_cp$fit <- rfit_loss_cp
fit_loss_cp     <- brms::rename_pars(fit_loss_cp)

summary(fit_loss_cp)

Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: cum ~ ult * (1 - exp(-(dev/theta)^omega)) 
         ult ~ 1 + (1 | i | AY)
         omega ~ 1 + (1 | i | AY)
         theta ~ 1 + (1 | i | AY)
   Data: loss (Number of observations: 55) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~AY (Number of levels: 10) 
                                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(ult_Intercept)                      751.57    295.21   321.48  1456.84 1.00     1227
sd(omega_Intercept)                      0.09      0.07     0.01     0.25 1.02      294
sd(theta_Intercept)                      4.44      2.85     0.48    11.15 1.01      274
cor(ult_Intercept,omega_Intercept)       0.27      0.44    -0.69     0.92 1.00     1455
cor(ult_Intercept,theta_Intercept)       0.01      0.43    -0.83     0.77 1.00     1171
cor(omega_Intercept,theta_Intercept)    -0.44      0.47    -0.97     0.67 1.01      531
                                     Tail_ESS
sd(ult_Intercept)                        1781
sd(omega_Intercept)                       203
sd(theta_Intercept)                       316
cor(ult_Intercept,omega_Intercept)       2143
cor(ult_Intercept,theta_Intercept)       1694
cor(omega_Intercept,theta_Intercept)     1549

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
ult_Intercept    5238.25    312.07  4699.74  5894.41 1.00     1333     1451
omega_Intercept     1.36      0.06     1.24     1.49 1.00     1330     1622
theta_Intercept    45.58      2.85    40.44    51.65 1.01     1514     1717

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   133.34     16.00   105.97   167.90 1.00     2163     2539

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning message:
There were 27 divergent transitions after warmup. Increasing adapt_delta above 0.98 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

Thank you


Right, this affects the global intercept by changing whether the X matrix (the "fixed effects" design matrix) is centered or not, but it has no effect on the parameterization of the "random effects" terms. Although it would be nice to have a quick switch between the two!
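To illustrate, here is an abridged sketch of what brms typically generates (check the full make_stancode() output to verify). With ~ 1 + ..., the design matrix is centered in transformed data and a temporary intercept is used:

transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  ...
}
parameters {
  vector[Kc] b;  // regression coefficients
  real Intercept;  // temporary intercept for centered predictors
  ...
}

With ~ 0 + Intercept + ..., the raw X enters the linear predictor directly and b includes the intercept column. The group-level z_1 / scale_r_cor() machinery is identical in both cases.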
