I’m trying to learn Stan and generated a simple synthetic data set. The data set is generated with the following R code:
library(dplyr)
library(rstan)
set.seed(12445)
n_cats <- 10
n <- 1000
alpha_mu <- 10
alpha_tau <- 2
sigma <- 0.2
df <- data.frame(cat = sample(1:n_cats, n, replace = TRUE))
alpha <- data.frame(cat = 1:n_cats,
alpha = rnorm(n_cats, alpha_mu, alpha_tau))
df1 <- left_join(df, alpha, by = "cat") %>%
mutate(y = alpha + rnorm(n, 0, sigma)) %>%
tbl_df()
data <- list(n_cats = n_cats,
n_samples = n,
cats = df$cat,
y = df1$y)
Essentially, I’m generating a mean (alpha) for every category from a reference distribution. The y values are generated by sampling from a distribution with that mean and a constant variance. In other words, this is the data generating process:
y ~ N(alpha_i, sigma)
alpha_i ~ N(alpha_mu, alpha_tau)
Fitting this model in lmer, recovers the correct parameters.
library(lme4)
lmer(y ~ (1 | cat), df1) %>% coef()
Output from this is:
(Intercept)
1 9.504471
2 10.239418
3 13.213756
4 6.875253
5 13.899031
6 7.953558
7 9.806968
8 8.089552
9 10.960454
10 8.809423
This is very close to the correct parameters.
I then tried to fit this using stan. The stan program is shown below:
data {
int<lower=1> n_cats;
int<lower=0> n_samples;
int<lower=1> cats[n_samples];
real y[n_samples];
}
parameters {
real<lower=9, upper=11> alpha_mu;
real<lower=1, upper=3> alpha_tau;
real alpha[n_cats];
real<lower=0, upper=1.0> sigma;
}
model {
for (i in 1:n_cats) {
alpha[i] ~ normal(alpha_mu, alpha_tau);
}
for (i in 1:n_samples) {
y ~ normal(alpha[cats[i]], sigma);
}
}
But when I look at the coefficients, all the alpha_i are equal to the grand mean.
mean se_mean
alpha_mu 9.95 0.01
alpha_tau 1.14 0.00
alpha[1] 9.96 0.00
alpha[2] 9.96 0.00
alpha[3] 9.96 0.00
alpha[4] 9.96 0.00
alpha[5] 9.96 0.00
alpha[6] 9.96 0.00
alpha[7] 9.96 0.00
alpha[8] 9.96 0.00
alpha[9] 9.96 0.00
alpha[10] 9.96 0.00
sigma 1.00 0.00
lp__ -2368724.67 0.10
What am I doing wrong here?