Varying Variances


#1

Hi all,

I’m attempting to fit a model that is similar to the one described here: Separate variances for different levels of a fixed effect

The data comes from a series of studies conducted at 5 different school sites. Individuals (n=675) at each site contribute multiple grade observations. The main experimental variables are treatment condition (2 levels) and ethnicity (2 levels). In rstanarm, the model looks like this:

y~Condition*Group + cov + (1|ID) + (Condition*Group + cov | site)

Where cov is a pre-study covariate measurement.

The relatively simple version of what I would like is to have the varying intercepts subject to different variance distributions.

 \alpha_i \sim N(0, \sigma_j)

(as an aside, is it possible to have math rendered here?)

Where there’s one sigma_j for each combination of Condition and group. If I can get this working and providing sensible answers, I’d ideally like to extend it to each combination of condition, group, and site.

As a first pass, I fit three models:

Model 1: as specified above, in rstanarm. That is, random intercepts for person, with everything else allowed to vary by site. There are no varying variances.
Model 2: Attempt to introduce varying variances for each combination of condition and group
Model 3: Attempt to introduce varying variances for each combination of condition, group, and site.

Below is a plot of the posterior means and 95% uncertainty intervals:

As you can see, something crazy is going on with models 2 (varying_variances1) and 3 (varying_variances2).

My stan code for the latter two models is here:

// generated with brms 1.10.2

data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  // data for group-level effects of ID 1 
  int<lower=1> J_1[N]; // person indicator
  int<lower=1> N_1; // number of people = 675
  int<lower=1> M_1;  // how many reff intercept groups? 4 (for just 4 cells) or 20 (for 4 cells * 5 sites)
  int<lower=1> j_1[N_1]; // indicator for reff intercept groups
  vector[N] Z_1_1; 
  // data for group-level effects of ID 2 
  int<lower=1> J_2[N]; 
  int<lower=1> N_2; 
  int<lower=1> M_2; 
  vector[N] Z_2_1; 
  vector[N] Z_2_2; 
  vector[N] Z_2_3; 
  vector[N] Z_2_4; 
  vector[N] Z_2_5; 
  int<lower=1> NC_2; 
  int prior_only;  // should the likelihood be ignored? 
} 

transformed data { 
  int Kc = K - 1; 
  matrix[N, K - 1] Xc;  // centered version of X 
  vector[K - 1] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 

parameters { 
  vector[Kc] b;  // population-level effects 
  real temp_Intercept;  // temporary intercept 
  real<lower=0> sigma;  // residual SD 
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
  vector[N_1] z_1;  // unscaled group-level effects 
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations 
  matrix[M_2, N_2] z_2;  // unscaled group-level effects 
  // cholesky factor of correlation matrix 
  cholesky_factor_corr[M_2] L_2; 
} 

transformed parameters { 
  vector[N_1] r_1_1;
  // group-level effects 
  matrix[N_2, M_2] r_2 = (diag_pre_multiply(sd_2, L_2) * z_2)'; //'
  vector[N_2] r_2_1 = r_2[, 1]; 
  vector[N_2] r_2_2 = r_2[, 2]; 
  vector[N_2] r_2_3 = r_2[, 3]; 
  vector[N_2] r_2_4 = r_2[, 4]; 
  vector[N_2] r_2_5 = r_2[, 5]; 
  // group-level effects 
  for (n in 1:N_1) {
    r_1_1[n] = sd_1[j_1[n]] * (z_1[n]); 
  }
} 

model { 
  vector[N] mu = Xc * b + temp_Intercept; 
  for (n in 1:N) { 
    mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_2_1[J_2[n]]) * Z_2_1[n] + (r_2_2[J_2[n]]) * Z_2_2[n] + (r_2_3[J_2[n]]) * Z_2_3[n] + (r_2_4[J_2[n]]) * Z_2_4[n] + (r_2_5[J_2[n]]) * Z_2_5[n]; 
  } 
  // priors including all constants 
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += normal_lpdf(z_1 | 0, 1); 
  target += student_t_lpdf(sd_2 | 3, 0, 10)
    - 5 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_2 | 1); 
  target += normal_lpdf(to_vector(z_2) | 0, 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma); 
  } 
} 

} 

I generated a base model identical to model 1 using brms, and then adjusted it slightly to get the model I want. I know the usual advice is to start small and build up, but this seemed like a small enough tweak that I could get away with cheating a bit.

What am I missing? Why am I not getting sensible results?


#2

Is that the Stan code for model 2 or 1?


#3

For model 2. The two differences between model 1 and model 2 are 1) I added the j_1 indicator in the data block, and 2) the definitions of the r_1_1s looks like this:

  for (n in 1:N_1) {
    r_1_1[n] = sd_1[j_1[n]] * (z_1[n]); 
  }

rather than this:

  vector[N_1] r_1_1 = sd_1[1] * (z_1[1]); 

below is the original brms model (equivalent to model 1)

// generated with brms 1.10.2
functions { 
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  // data for group-level effects of ID 1 
  int<lower=1> J_1[N]; 
  int<lower=1> N_1; 
  int<lower=1> M_1; 
  vector[N] Z_1_1; 
  // data for group-level effects of ID 2 
  int<lower=1> J_2[N]; 
  int<lower=1> N_2; 
  int<lower=1> M_2; 
  vector[N] Z_2_1; 
  vector[N] Z_2_2; 
  vector[N] Z_2_3; 
  vector[N] Z_2_4; 
  vector[N] Z_2_5; 
  int<lower=1> NC_2; 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  int Kc = K - 1; 
  matrix[N, K - 1] Xc;  // centered version of X 
  vector[K - 1] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 
parameters { 
  vector[Kc] b;  // population-level effects 
  real temp_Intercept;  // temporary intercept 
  real<lower=0> sigma;  // residual SD 
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
  vector[N_1] z_1[M_1];  // unscaled group-level effects 
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations 
  matrix[M_2, N_2] z_2;  // unscaled group-level effects 
  // cholesky factor of correlation matrix 
  cholesky_factor_corr[M_2] L_2; 
} 
transformed parameters { 
  // group-level effects 
  vector[N_1] r_1_1 = sd_1[1] * (z_1[1]); 
  // group-level effects 
  matrix[N_2, M_2] r_2 = (diag_pre_multiply(sd_2, L_2) * z_2)'; 
  vector[N_2] r_2_1 = r_2[, 1]; 
  vector[N_2] r_2_2 = r_2[, 2]; 
  vector[N_2] r_2_3 = r_2[, 3]; 
  vector[N_2] r_2_4 = r_2[, 4]; 
  vector[N_2] r_2_5 = r_2[, 5]; 
} 
model { 
  vector[N] mu = Xc * b + temp_Intercept; 
  for (n in 1:N) { 
    mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_2_1[J_2[n]]) * Z_2_1[n] + (r_2_2[J_2[n]]) * Z_2_2[n] + (r_2_3[J_2[n]]) * Z_2_3[n] + (r_2_4[J_2[n]]) * Z_2_4[n] + (r_2_5[J_2[n]]) * Z_2_5[n]; 
  } 
  // priors including all constants 
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += normal_lpdf(z_1[1] | 0, 1); 
  target += student_t_lpdf(sd_2 | 3, 0, 10)
    - 5 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_2 | 1); 
  target += normal_lpdf(to_vector(z_2) | 0, 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma); 
  } 
} 
generated quantities { 
  // actual population-level intercept 
  real b_Intercept = temp_Intercept - dot_product(means_X, b); 
  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2); 
  vector<lower=-1,upper=1>[NC_2] cor_2; 
  // take only relevant parts of correlation matrix 
  cor_2[1] = Cor_2[1,2]; 
  cor_2[2] = Cor_2[1,3]; 
  cor_2[3] = Cor_2[2,3]; 
  cor_2[4] = Cor_2[1,4]; 
  cor_2[5] = Cor_2[2,4]; 
  cor_2[6] = Cor_2[3,4]; 
  cor_2[7] = Cor_2[1,5]; 
  cor_2[8] = Cor_2[2,5]; 
  cor_2[9] = Cor_2[3,5]; 
  cor_2[10] = Cor_2[4,5]; 
} 

#4

I’ve narrowed down the problem here. I think there’s something different between the stan model and the brms model, even without my adjustments. For instance:

p <- c(set_prior('normal(2,2)', class='Intercept', coef=''),
       set_prior('normal(0,1)', class='b'))

m.brm <- brm(gpa ~ Condition*Ethnicity + (1|ID) + 
                     (Condition*Ethnicity|site), data=testing_data, prior=p)
m.rstanarm <- stan_glmer(gpa ~ Condition*Ethnicity + (1|ID) + 
                           (Condition*Ethnicity|site), 
                         prior_intercept=normal(2,2), prior=normal(0,1), 
                         data=testing_data)

These models look identical to me, but result in wildly different precision of the non-varying estimates:

image

I started looking at some of the individual parameters to see if there were any that were totally different between the two models. It looks like the covariance parameters governing the site-level variability is a prime suspect:

image

What am I doing incorrectly here? My guess would be the LKJ prior, but both packages say that zeta is set to 1 by default, and some mild adjustments to the brms model have not changed the patterns here.


#5

I’m not sure how much @paul.buerkner (author of brms) monitors this list, so you may want to take it up on the brms list:

https://groups.google.com/forum/#!forum/brms-users