Creating a hierarchical model (multilevel)

Hello,

I am a new Stan’s user and I would like to ask for helping to create a multilevel model.

My background is based on the Andrew Gelman books and papers about this subject. I also explored the Youtube videos about the theme such as Jonah Gabry & Lauren Kennedy Pestcontrol example.


The dataset that I am using is Males from the Ecdat package.

The help file description of the 12 variables in the dataset is as follows:

  • nr subject identifier
  • year year (1980-1987)
  • school years of schooling
  • exper year of experience (=age-6- school )
  • union factor; whether wage was set by collective bargaining
  • ethn factor with 3 levels
  • maried factor marital status
  • health factor for presence of health problems
  • wage log of hourly wage
  • industry factor with 12 levels
  • occupation factor with 9 levels
  • residence factor with 4 levels

The questions that I would like to answer are:

1 - I would like to build a model for studying the effects of covariates on wage with rstan package, taking into account the hierarchical structure of the data. In particular the relation between unionism and wage.
2 - Implement the effect of time in the Model.
3 - How could I check the model fit by using the proper tools of posterior predictive checking?


The simplest model that I could run was:


data {
  int<lower=0> N; // number of observations
  vector[N] x; // union
  vector[N] y; // wage
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
  y ~ normal(alpha + beta * x, sigma);
}

2 Likes

Hello,

I created a partial pooling code in Stan. Could someone tell me whether I made some mistake ?

wage = \alpha_{j[i]} + \beta * Union_i + \epsilon_i
\epsilon \sim N(0,\sigma_y^2)
\alpha_{j,[i]} \sim N(\mu_\alpha , \sigma_\alpha^2)

The code follows belllow:


data {
  int<lower=0> N; // number of obs
  int<lower=0> J; // Number of years
  vector[N] y; // outcome data (log wage)
  int<lower=0,upper=1> x[N]; // Input data (union)
  int year[N]; //year ID variable
}
parameters {
  real a[J]; // year j intercept
  real b;                           
  real mu_a; // prior on alpha
  real<lower=0> sigma_y; // prior sigma of y in year j
  real<lower=0> sigma_a; // hyperparameter sigma of alpha
}
model {
  a ~ normal(mu_a, sigma_a);            
  for (n in 1:N)
    y[n] ~ normal(a[year[n]] + b * x[n], sigma_y);
}

generated quantities {
 vector[N] y_hat;
  for (i in 1:N)
    y_hat[i] = normal_rng(a[year[i]] + b * x[i], sigma_y);
}



Input data:

List of 5

  • N : int 4360
  • J : int 8 (years from 1980 to 1987)
  • y : num [1:4360] 1.2 1.85 1.34 1.43 1.57 … (log wage values)
  • x : num [1:4360] 0 1 0 0 0 0 0 0 0 0 … ( categorical variable 0 - No union; 1 - Yes Union)
  • year: num [1:4360] 1980 1981 1982 1983 1984 …

Results:

    mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat

a[1] 1.35 0 0.02 1.31 1.34 1.35 1.37 1.39 2726 1
a[2] 1.47 0 0.02 1.43 1.45 1.47 1.48 1.51 3243 1
a[3] 1.53 0 0.02 1.48 1.51 1.53 1.54 1.57 2665 1
a[4] 1.57 0 0.02 1.53 1.56 1.57 1.59 1.62 3843 1
a[5] 1.64 0 0.02 1.60 1.63 1.64 1.66 1.69 3546 1
a[6] 1.70 0 0.02 1.66 1.68 1.70 1.71 1.74 3069 1
a[7] 1.76 0 0.02 1.72 1.74 1.76 1.77 1.80 2340 1
a[8] 1.81 0 0.02 1.77 1.80 1.81 1.83 1.86 3033 1
b 0.18 0 0.02 0.15 0.17 0.18 0.20 0.22 3052 1
sigma_y 0.51 0 0.01 0.50 0.50 0.51 0.51 0.52 4094 1


Comparing with the brms, it is the same result for the formula = wage ~ (1|year) + union

coef(bmod4)
$year
, , Intercept

 Estimate  Est.Error     Q2.5    Q97.5

1980 1.351247 0.02202636 1.308087 1.393600
1981 1.468648 0.02156949 1.426353 1.510592
1982 1.525916 0.02143721 1.482440 1.567273
1983 1.574645 0.02196081 1.531361 1.617394
1984 1.643587 0.02134702 1.601799 1.686262
1985 1.696632 0.02230993 1.653166 1.740112
1986 1.758255 0.02235138 1.714098 1.802303
1987 1.814969 0.02215333 1.771203 1.857841

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.60 0.07 1.46 1.75 1.00 923 1379
unionyes 0.18 0.02 0.15 0.22 1.00 2620 2082

I think at least one point you have to make more clear:

wage = \alpha_{j[i]} + \beta * Union_i + \epsilon_i\\ \epsilon \sim N(0,\sigma_y^2)

You write \epsilon_i and \epsilon. Is your measurement variance heterogeneous, eg.
your variance is different each year or same for all years?

I am sorry, it is only \epsilon, without i

Hello,

I would like to create different models using this data set:

1 - Pooling
2 - no pooling
3 - partial pooling.

From the model that I have already described, implement with multiple varying intercepts and slopes.

A question that I have about the groups when applying hierarchical models in Stan:
From the examples that I have seen, the groups might be states or other geographic divisions, however is it possible to take into account the years as groups ?