Mistake in the specificaiton of a simple linear hierarchical model


#1

Dear Stan community,

I have a simple varying-intercepts model that I usually fit in R like:

glmer(y ~ x1 + x2 + x3 + x4 + x5 + (1 | groups), data = database, family=gaussian)

It is my understanding that a simple translation of that to Stan would be like:

data{
  int<lower=1> N;
  int<lower=1> J;
  real y[N];
  real x1[N];
  int<lower=0,upper=1> x2[N];
  real x3[N];
  real x4[N];
  int<lower=0,upper=1> x5[N];
  int<lower=1> groups[N];
}

parameters{
  real fixed_a;
  real b_x1;
  real b_x2;
  real b_x3;
  real b_x4;
  real b_x5;
  vector[J] varying_a;
  real<lower=0> sigma;
}

model{
  vector[N] mu;

  sigma ~ cauchy( 0 , 2 );

  fixed_a ~ normal( 0 , 10 );
  varying_a ~ normal( 0 , 10 );
  b_x1 ~ normal( 0 , 10 );
  b_x2 ~ normal( 0 , 10 );
  b_x3 ~ normal( 0 , 10 );
  b_x4 ~ normal( 0 , 10 );
  b_x5 ~ normal( 0 , 10 );

  for ( i in 1:N ) {
    mu[i] = fixed_a + varying_a[groups[i]] + b_x1 * x1[i] + b_x2 * x2[i] + b_x3 * x3[i] + b_x4 * x4[i] + b_x5 * x5[i];
  }

  y ~ normal(mu, sigma);
}

However, to my surprise, the results of “b_x4” and “b_x5” are very different from the “glmer” estimation:

> summary(glmer(y ~ x1 + x2 + x3 + x4 + x5 + (1 | groups), data = stan.db, family=gaussian))
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x1 + x2 + x3 + x4 + x5 + (1 | groups)
   Data: stan.db

REML criterion at convergence: -1158.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.4678 -0.3357  0.0847  0.4586  2.6398 

Random effects:
 Groups   Name        Variance Std.Dev.
 groups   (Intercept) 0.006730 0.08204 
 Residual             0.004145 0.06438 
Number of obs: 513, groups:  groups, 72

Fixed effects:
             Estimate Std. Error t value
(Intercept)  2.548721   0.033433   76.23
x1          -0.016822   0.003816   -4.41
x2           0.042977   0.016200    2.65
x3          -0.006540   0.003601   -1.82
x4          -0.115440   0.032494   -3.55
x5          -0.015309   0.027933   -0.55

And:

results <- extract(model.stanfit)
> mean(results$b_x4)
[1] 0.1604067
> mean(results$b_x5)
[1] -0.5242973

Am I doing something wrong when specifying the Stan model? I send attached the data, a stan model file and a R script to reproduce all the above.

varying-intercepts.stan (792 Bytes)
script.r (1.6 KB)
data.csv (43.5 KB)


#2

If you want to stay closer to the glmer specification you probably want

varying_a ~ normal(0, sigma_a);
sigma_a ~ cauchy(0, 2); // or whatever you see appropriate. 

The brms and rstanarm packages let you specify these kind of models in Stan with mer style formulas. This could make your life a bit easier.