Question: Stan Modelspecification for Hierarchical Linear Model (Gelman's Radon example)

Hello there.

Disclaimer: I am using the package rstan together with Stan.

I am currently working on the radon dataset of Gelman to build HLMs but I am new to Stan and am a bit confused about the syntax. I would be happy if somebody could help me out in my mistakes/if the intuition of my code is right.

The dataset contains household data of radon and an indication of where the measurement was taken (1 = floor, 0 = basement).
Furthermore, there are uranium measurements but only 1 per county (85 counties).

Let’s call the variables:
Y = radon
X = floor
U = uranium
N = number of observations (919)
J = number of counties (85)

I have three models:

  1. Unpooled: (Varying Intercepts, Fixed Slopes)

J different models based only on the observations of each county.
Y_j = alpha_j + beta_j * X

I want to put a prior on mu_alpha, sigma_alpha, mu_beta, sigma_beta and sigma_y.

For this, I have the following Stan script:

 data {
	int<lower=1> N; // number of obs
 	int<lower=1> J; // Number of counties
 	vector[N] Y; // outcome data (log radon)
 	vector[N] X; // Inputa data (floor)
 	int<lower=1> county[N]; //county indices 
 }
 
 parameters {
 	vector[J] alpha;  
   vector[J] beta; 
 	real<lower=0> sigma_y;
 }
 
 model {
   vector[N] mu_y;
     for (i in 1:N)
     mu_y[i] = beta[county[i]] * X[i] + alpha[county[i]];
     
     // Priors
   Y ~ normal(mu_y, sigma_y);
   alpha ~ normal(0,10);
   beta ~ normal(0,10);
   sigma_y ~ normal(0,10);
 }

But, I get the following error message:

Warning messages:
1: In readLines(file, warn = TRUE) :
  incomplete final line found on 'C:\Users\..\unpool_1.stan'

Q1: Is this message neglectable? + Is this the right specification?

  1. Hierarchical Linear Model with 1 Level (Varying Intercepts and Fixed Slopes):
    J different models but now with information sharing, i.e. modelling the J different alphas as coming from the same distribution.

Y_j = alpha_j + beta * X

This is the Stan script I have for this model:

// HLM 1 
 
 data {
 	int<lower=1> N; // number of obs
 	int<lower=1> J; // Number of counties
 	vector[N] Y; // outcome data (log radon)
 	vector[N] X; // Inputa data (floor)
 	int<lower=1> county[N]; //county ID variable
 }
 
 parameters {
 	vector[J] alpha; // county j intercept 
   real beta; 
 	real mu_alpha; // prior on alpha
 	real <lower=0, upper=100> sigma_alpha; // hyperparameter sigma of alpha
 	real<lower=0, upper=100> sigma_y;// prior sigma of y in county j
 }
 
 model { // Model specifications
   vector[N] mu_y;
   
   // Full Model
 	for(i in 1:N)
 	mu_y[i] = alpha[county[i]] + beta * X[i]; 
 	
 // Priors and Hyperparams
   Y ~ normal(mu_y, sigma_y);
   sigma_y ~ cauchy (0, 2.5);
   alpha ~ normal(mu_alpha, sigma_alpha);
   mu_alpha ~ normal(0, 10);
 	sigma_alpha ~ uniform(0, 10);
 	
   beta ~ normal(0, 10);
 }

Here, I get this error message:

Warning messages:
 1: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
 http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
 2: Examine the pairs() plot to diagnose sampling problems

Q2: How can I fix this error + is the Stan specification right?

  1. Hierarchical Linear Model with 2 Levels:

Well, this is actually the one which is bothering me the most.

Y_j = alpha_j + beta * X
alpha _j = gamma_0 + gamma_1 * U

I have no clue how to specificy the second layer of the hierarchical model.

Q3: How do I specify the model in Stan + is it better to have 85 obs for Uranium or is it better when each of the 919 observation gets its own uranium observation (however would be duplicates, since there is only 1 measurement per county).

Here’s the minimal code from R:

 ## Load Data
 
 data("radon", package = "rstanarm")
 data <- radon
 
 
 # County Categorical Variable
 
 county_name <- as.vector(data$county) # create vector of county indications for each obs
 county_list <- unique(county_name) # create list of counties, drop duplicates
 J <- length(county_list) # number counties in Minessota
 
 county <- rep (NA, J) 
 for (i in 1:J){
   county[county_name==county_list[i]] <- i
 }
 
 # Seperate variables
 logradon <- data$log_radon
 loguranium <- data$log_uranium
 loguranium <- loguranium[!duplicated(loguranium)]
 floorind <- data$floor
 N <- length(logradon)
 
 # Lists for STAN
 cplist <- list(Y = logradon,
                X = floorind,
                N = N)
 
 unplist <- list(Y = logradon,
                  X = floorind,
                  N = N,
                  J = J,
                  county = county)
 
 datalist <- list(Y = logradon,
                  X = floorind,
                  U = loguranium,
                  N = N,
                  J = J,
                  county = county)

Exclaimer:
I actually feel bad for posting such an extensive questionnary, since usually people don’t answer to that but I am really hopeless. I tried it now for 2 weeks and I don’t know whether my results are correct or not and how to specify the 2-level HLM.
I don’t know anybody how uses Stan and most tutorials in the internet use Python.
(There are R tutorials but with different packages, where they just use default priors which is not what I want to do. I want to properly specify the full model to also perform sensitivity analysis later on regarding the priors.)
I would really appreciate any kind of help. Thanks in advance.

Here is the original paper for further reference.

Shouldn’t this be “Varying intercepts, varying slopes”? Or maybe the “varying” terminology isn’t specific enough (since you can have fixed/fully-pooled, varying/partially-pooled, and varying/unpooled)?

I believe that message simply means that Stan wants the model file to have a final empty newline at the end. But I wasn’t able to replicate this message (Linux, rstan 2.19.2), so try adding an empty newline on your end and report back how it goes.

Your model estimates an intercept for the radon levels independently for each county as well as an effect of sampling location independently for each county. This is a completely unpooled model, so if that’s what you want, you specified it correctly.

Specification: this model estimates a single effect of measurement location (i.e. “fully pooled”; exact same effect magnitude in each county) and separate-but-not-independent (i.e. “partial-pooled”) county-level intercepts.

How to fix the error is pretty clear from the error message and the link it provides! You want to increase the value of the max_treedepth argument to be higher than the default (so, greater than 10). Note however that with hierarchical models this error often means you should switch to the non-centered parameterization.

You want to pass in as data a vector of the 83 observed uranium values as well as an index vector of which uranium value goes with which radon values. Then model uranium however you want and then use it as a predictor by indexing into the uranium vector.

2 Likes

Yes, at it’s not an error, but a warning. If you want it to go away just put an extra line in the end of the file.

People like @stijn would be best suited to tell you about model specification, but you can do two things to improve this: (i) reduce the scale of the Cauchy prior on sigma_y or change it to a normal or something with lighter tails and (ii) increase max_treedepth to 15 or something to cope with the difficult posterior geometry.

@martinmodrak, any strong feelings?

1 Like

Thanks for your reply! Knowing that the first 2 models are correct, really pushed my confidence.

Here’s (obviously wrong) minimal code for the 2nd level HLM, do you know what I have to adjust?

The part where it says 2nd layer is wrong but I don’t know if I also have to change the part of the 1st layer…

R Code:

## Load Data

data("radon", package = "rstanarm")
data <- radon


# County Categorical Variable

county_name <- as.vector(data$county) # create vector of county indications for each obs
county_list <- unique(county_name) # create list of counties, drop duplicates
J <- length(county_list) # number counties in Minessota

county <- rep (NA, J) 
for (i in 1:J){
  county[county_name==county_list[i]] <- i
}


# Seperate variables
logradon <- data$log_radon
loguranium <- data$log_uranium
loguranium <- loguranium[!duplicated(loguranium)]
floorind <- data$floor
N <- length(logradon)
county_i <- county 
county_j <- 1:85


# Lists for STAN
datalist <- list(Y = logradon,
                 X = floorind,
                 U = loguranium,
                 N = N,
                 J = J,
                 county_i = county_i,
                 county_j = county_j)

## HLM 2-Level - Varying Intercepts, Fixed Coefficients
hlm2 <- stan(file = 'C://Users//...//hlm_2level.stan', data = datalist,
             iter = 5000, chains = 4)
print(hlm2)

Stan Code:

// HLM 2

data {
	int<lower=1> N; // number of obs
	int<lower=1> J; // Number of counties
	vector[N] Y; // outcome data (log radon)
	vector[N] X; // Inputa data Layer 1(room indication)
	vector[J] U; // Input data Layer 2 (uranium)
	int<lower=1> county_i[N]; //county ID variable for x
	int<lower=1> county_j[J]; //county ID variable for u
}

parameters {
	vector[J] alpha; // county j varying intercept 
  real beta; // fixed slope
	real mu_alpha; // prior on alpha
	real gamma_zero; // 2nd level fixed intercept 
	real gamma_one; // 2nd level fixed slope
	real <lower=0> sigma_alpha; // hyperparameter sigma of alpha
	real<lower=0> sigma_y;// prior sigma of y in county j
}

model { // Model specifications
  vector[N] mu_y;
  
   // 2nd Layer
	for(j in 1:J)
	alpha[county_i[i]] = gamma_zero + gamma_one * U[county_j];
  
  // 1st Layer
	for(i in 1:N)
	mu_y[i] = alpha[county_i[i]] + beta * X[i]; 
	
// Priors and Hyperparams
  Y ~ normal(mu_y, sigma_y);
  sigma_y ~ cauchy (0, 10);
  
  alpha ~ normal(mu_alpha, sigma_alpha);
  mu_alpha ~ normal(0, 10);
	sigma_alpha ~ cauchy(0, 10);
	
  beta ~ normal(0, 10);
  
  gamma_zero ~ normal(0, 10);
  gamma_one ~ normal(0, 10);
}

Perfect, thanks. I will try it with smaller scale paramaters in the priors.

The 1st layer is correct. The 2nd layer should be

for(j in 1:J)
  alpha[county_j[j]] = gamma_zero + gamma_one * U[county_j[j]];

(Although you have county_j[j] == j which allows additional simplification.)
alpha is defined, not sampled, so it cannot be declared in the parameters block. Move it to model block, right next to mu_y declaration. alpha also does not need a prior.

2 Likes

Thanks for your reply. It probably is close to the correct code, but there is still an error.
It says that alpha (left hand side) is real (scalar) but the right hand side is vector, so it’s not a “legal” statement. (I changed it according to your hint.)

Here’s what I have:

// HLM 2

data {
	int<lower=1> N; // number of obs
	int<lower=1> J; // Number of counties
	vector[N] Y; // outcome data (log radon)
	vector[N] X; // Inputa data Layer 1(room indication)
	vector[J] U; // Input data Layer 2 (uranium)
	int<lower=1> county_i[N]; //county ID variable for x
	int<lower=1> county_j[J]; //county ID variable for u
}

parameters {
  real beta; // fixed slope
	real gamma_zero; // 2nd level fixed intercept 
	real gamma_one; // 2nd level fixed slope
	real<lower=0> sigma_y;// prior sigma of y in county j
}

model { // Model specifications
  vector[N] mu_y;
  vector[J] alpha; // county j varying intercept 
  
   // 2nd Layer
	for(j in 1:J)
	alpha[county_j[J]] = gamma_zero + gamma_one * U[county_j];
  
  // 1st Layer
	for(i in 1:N)
	mu_y[i] = alpha[county_i[i]] + beta * X[i]; 
	
// Priors and Hyperparams
  Y ~ normal(mu_y, sigma_y);
  sigma_y ~ cauchy (0, 10);
	
  beta ~ normal(0, 10);
  
  gamma_zero ~ normal(0, 10);
  gamma_one ~ normal(0, 10);
}

That’s supposed to be U[county_j[j]] instead of U[county_j].

1 Like

I agree with @mike-lawrence, using the index variable is usually “nicer” (but hey, do what solves your problems).

Finally, I guess @razorlazor is doing this as an educational excersice to better understand Stan. Because if not, using rstanarm or brms would save you a lot of pain - they have formula syntax for hierarchical models, similar to the one used by lm :-)

Best of luck with your model!

2 Likes

Thanks it worked!

But the output shows the following:

 print(hlm2)
Inference for Stan model: hlm_2level.
4 chains, each with iter=5000; warmup=2500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

              mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta         -0.62    0.00 0.07   -0.75   -0.66   -0.62   -0.57   -0.49  8942    1
gamma_zero    1.47    0.00 0.03    1.41    1.45    1.47    1.49    1.52  8633    1
gamma_one     0.75    0.00 0.07    0.62    0.71    0.75    0.80    0.88  9623    1
sigma_y       0.74    0.00 0.02    0.71    0.73    0.74    0.75    0.78  9010    1
lp__       -185.23    0.02 1.44 -188.93 -185.91 -184.89 -184.18 -183.46  4355    1

Is there an easy method to compare the unpooled model, the HLM1 and the HLM2?
Not sure how to visualize the shrinkage effect…

You’re correct. I saw the easier versions but afaik it is terrible if you want to specify everything. Using the default priors it is really easy but if you want to go beyond that I think it’s equally cumbersome, so why not learn Stan right away especially if I can use it in future for other Bayesian projects.

1 Like

alpha and beta play the same role in pooled and unpooled model. You could compare pooled vs unpooled alpha for each county. These are the predicted mean radon levels in each county.

However, thinking more about this, the deterministic model of alpha as a linear function of uranium level only isn’t so reasonable. In the original paper you linked the hierarchical model in equation (1) is

\alpha_j \sim N\left(\gamma_0 + \gamma_1 u_j, \sigma_\alpha\right)

That means there’s additional noise and Stan code is

parameters {
  ...
  real<lower=0> sigma_alpha;
  vector[J] alpha;
}
model {
  ...
  for(j in 1:J)
    alpha[county_j[j]] ~ normal(gamma_zero + gamma_one * U[county_j[j]], sigma_alpha);
  ...
}

Btw, looking closer I noticed that you had upper case J in alpha[county_j[j]] but that would leave all alphas except the last one undefined.

1 Like

You are right! Thanks for the heads up.
Well, but now the model doesn’t converge…

// HLM 2

data {
	int<lower=1> N; // number of obs
	int<lower=1> J; // Number of counties
	vector[N] Y; // outcome data (log radon)
	vector[N] X; // Inputa data Layer 1(room indication)
	vector[J] U; // Input data Layer 2 (uranium)
	int<lower=1> county_i[N]; //county ID variable for x
	int<lower=1> county_j[J]; //county ID variable for u
}

parameters {
  real beta; // fixed slope
	real gamma_zero; // 2nd level fixed intercept 
	real gamma_one; // 2nd level fixed slope
	real<lower=0> sigma_alpha; // prior on noise of 2nd layer model
	real<lower=0> sigma_y;// prior on noise of 1st layer model
}

model { // Model specifications
  vector[N] mu_y;
  vector[J] alpha; // county j varying intercept 
  
   // 2nd Layer
	for(j in 1:J)
	alpha[county_j[j]] ~ normal(gamma_zero + gamma_one * U[county_j[j]], sigma_alpha);
  
  // 1st Layer
	for(i in 1:N)
	mu_y[i] = alpha[county_i[i]] + beta * X[i]; 
	
// Priors and Hyperparams
  Y ~ normal(mu_y, sigma_y);
  sigma_y ~ cauchy (0, 2.5);
  
  sigma_alpha ~ cauchy(0,2.5);
	
  beta ~ normal(0, 5);
  
  gamma_zero ~ normal(0, 5);
  gamma_one ~ normal(0, 5);
}

This is the error I get:

SAMPLING FOR MODEL 'hlm_2level' NOW (CHAIN 1).
Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: normal_lpdf: Random variable is nan, but must not be nan!  (in 'model3ae07c904b8c_hlm_2level' at line 27)

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: normal_lpdf: Random variable is nan, but must not be nan!  (in 'model3ae07c904b8c_hlm_2level' at line 27)

..

edit:/ Do I Have to move the alpha back to parameters now again?

edit:/ After moving it back it works but I get a warning message. Not sure if it’s “important” or not…

Warning messages:
1: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
2: Examine the pairs() plot to diagnose sampling problems

Also: How can I specify the Stan file so that I also calculate the log-likelihood of the models?
I need them to calculate the information criterion and train the model with LOO but it doesn’t work since I don’t provide them in the stanfit-objects.