I am a new Stan’s user and I would like to ask for helping to create a multilevel model.
My background is based on the Andrew Gelman books and papers about this subject. I also explored the Youtube videos about the theme such as Jonah Gabry & Lauren Kennedy Pestcontrol example.
The dataset that I am using is Males from the Ecdat package.
The help file description of the 12 variables in the dataset is as follows:
nr subject identifier
year year (1980-1987)
school years of schooling
exper year of experience (=age-6- school )
union factor; whether wage was set by collective bargaining
ethn factor with 3 levels
maried factor marital status
health factor for presence of health problems
wage log of hourly wage
industry factor with 12 levels
occupation factor with 9 levels
residence factor with 4 levels
The questions that I would like to answer are:
1 - I would like to build a model for studying the effects of covariates on wage with rstan package, taking into account the hierarchical structure of the data. In particular the relation between unionism and wage.
2 - Implement the effect of time in the Model.
3 - How could I check the model fit by using the proper tools of posterior predictive checking?
The simplest model that I could run was:
data {
int<lower=0> N; // number of observations
vector[N] x; // union
vector[N] y; // wage
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
data {
int<lower=0> N; // number of obs
int<lower=0> J; // Number of years
vector[N] y; // outcome data (log wage)
int<lower=0,upper=1> x[N]; // Input data (union)
int year[N]; //year ID variable
}
parameters {
real a[J]; // year j intercept
real b;
real mu_a; // prior on alpha
real<lower=0> sigma_y; // prior sigma of y in year j
real<lower=0> sigma_a; // hyperparameter sigma of alpha
}
model {
a ~ normal(mu_a, sigma_a);
for (n in 1:N)
y[n] ~ normal(a[year[n]] + b * x[n], sigma_y);
}
generated quantities {
vector[N] y_hat;
for (i in 1:N)
y_hat[i] = normal_rng(a[year[i]] + b * x[i], sigma_y);
}
Input data:
List of 5
N : int 4360
J : int 8 (years from 1980 to 1987)
y : num [1:4360] 1.2 1.85 1.34 1.43 1.57 … (log wage values)
x : num [1:4360] 0 1 0 0 0 0 0 0 0 0 … ( categorical variable 0 - No union; 1 - Yes Union)
I would like to create different models using this data set:
1 - Pooling
2 - no pooling
3 - partial pooling.
From the model that I have already described, implement with multiple varying intercepts and slopes.
A question that I have about the groups when applying hierarchical models in Stan:
From the examples that I have seen, the groups might be states or other geographic divisions, however is it possible to take into account the years as groups ?