# Hierarchical Model for Causal Inference

Hello, I am developing a hierarchical model for causal inference in order to reproduce a well-known paper from the economics literature. I vectorized my model, but I do not understand why every iteration either diverges or exceeds the maximum tree depth. I was wondering if anyone could tell me what I am doing wrong.
This is the model I want to run:

\begin{align*} \begin{cases} & y_{n} \ | \ \beta_{g(n)}, \ \theta, \ \sigma_t, \ \sigma_c \ \sim \ \mathcal{N} \left( x_n \beta_{g(n)} + w_n\theta \ , \ w_n\sigma_t + (1-w_n)\sigma_c \right) & n = 1,\dots,N \\ & \beta_j \ | \ \gamma, \ (\sigma_k)_{k=1}^K, \ \rho \ \sim \ \mathcal{N} \left( (u \gamma)_{j} \ , \ \Sigma \right) & j=1,\dots,J \\ & \sigma_k \ \sim \ \mathcal{I}nv\text{-}\mathcal{G}amma(1, 0.01) & k=1,\dots,K,\ c,\ t \\ & \gamma_l \ \sim \ \mathcal{N}(0,5) & l=1,\dots,L \\ & \rho \ \sim \ \mathcal{U}(-1,1) \\ & \theta \ \sim \ \mathcal{N}(0,100) \end{cases} \end{align*}
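Spelling out the covariance of the $\beta_j$'s (reading it off the construction in my Stan code, and treating the $\sigma_k^2$ as variances), $\Sigma$ is the equicorrelation matrix

\begin{align*}
\Sigma_{sz} = \begin{cases} \sigma_s^2 & s = z \\ \rho\,\sigma_s\sigma_z & s \neq z \end{cases}
\end{align*}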

And here is how I coded it in Stan:

```stan
data {
  int<lower=0> N;               // num individuals
  int<lower=1> K;               // num individual predictors
  int<lower=1> J;               // num groups
  int<lower=1> L;               // num group predictors
  int<lower=1,upper=J> g[N];    // group for individual
  matrix[N, K] x;               // individual predictors
  matrix[J, L] u;               // group predictors
  vector[N] y;                  // outcomes
  vector[N] w;                  // treatment assignment
}
parameters {
  vector[K] beta[J];            // vector of school-specific coefficients
  matrix[L, K] gamma;           // school coefficients
  real<lower=-1,upper=1> rho;   // school coefficients correlation
  real<lower=0> sigma2[K];      // variances of the beta components

  real effect;                  // super-population average treatment effect
  real<lower=0> sigma2_t;       // residual SD for the treated
  real<lower=0> sigma2_c;      // residual SD for the control
}
transformed parameters {
  matrix[J, K] u_gamma;
  matrix[K, K] Sigma;

  u_gamma = u * gamma;

  for (s in 1:K) {
    for (z in 1:K) {
      if (s == z)
        Sigma[s, z] = sigma2[s];
      else
        // note: sigma2[s]^1/2 parses as (sigma2[s]^1)/2, so use sqrt() instead
        Sigma[s, z] = sqrt(sigma2[s]) * sqrt(sigma2[z]) * rho;
    }
  }
}
model {
  vector[N] m;
  vector[N] d;
  vector[K] a[J];

  // Priors on the variances and the correlation coefficient of the beta_j's
  sigma2 ~ inv_gamma(1, 0.01);
  rho ~ uniform(-1, 1);

  // Expected value of each beta_j: the j-th row of u * gamma
  for (j in 1:J)
    a[j] = row(u_gamma, j)';

  // Prior on the beta_j's (vectorized over the J groups)
  beta ~ multi_normal(a, Sigma);

  // Remaining priors
  effect ~ normal(0, 100);
  sigma2_c ~ inv_gamma(1, 0.01);
  sigma2_t ~ inv_gamma(1, 0.01);
  to_vector(gamma) ~ normal(0, 5);

  // Likelihood of the outcome variable y
  for (n in 1:N)
    m[n] = dot_product(beta[g[n]], x[n]) + effect * w[n];

  // normal() takes the standard deviation as its second argument
  d = sigma2_t * w + sigma2_c * (1 - w);

  y ~ normal(m, d);
}
```
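As an aside, the double loop that builds `Sigma` can also be written with Stan's matrix built-ins. A sketch, assuming `sigma2` holds variances (`add_diag` and `quad_form_diag` are standard Stan functions):

```stan
transformed parameters {
  matrix[K, K] Sigma;
  {
    vector[K] tau = sqrt(to_vector(sigma2));    // component standard deviations
    // equicorrelation matrix: 1 on the diagonal, rho off the diagonal
    matrix[K, K] Omega = add_diag(rep_matrix(rho, K, K), 1 - rho);
    Sigma = quad_form_diag(Omega, tau);         // diag(tau) * Omega * diag(tau)
  }
}
```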


Thanks in advance for all the help.

This reflects a “centered” parameterization for beta, which is most useful when the data are highly informative but leads to divergences when the data are less informative (see here for an in-depth description). So I’d try a non-centered parameterization.
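For this model, a non-centered version of the beta block might look like the following sketch (beta moves out of `parameters` and is derived from standard-normal innovations; `cholesky_decompose` and `std_normal` are standard Stan functions):

```stan
parameters {
  matrix[K, J] z;                 // standard-normal innovations, one column per group
  // ... other parameters unchanged ...
}
transformed parameters {
  matrix[J, K] beta;              // implied group coefficients
  // beta_j = (u * gamma)_j + L_Sigma * z_j, where Sigma = L_Sigma * L_Sigma'
  beta = u_gamma + (cholesky_decompose(Sigma) * z)';
}
model {
  to_vector(z) ~ std_normal();    // replaces beta ~ multi_normal(a, Sigma)
  // ... rest of the model unchanged ...
}
```

Note that `beta` becomes a matrix here, so `beta[g[n]]` in the likelihood is a `row_vector`; `dot_product` accepts it unchanged.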

Thank you, this article is amazing. And yes, you were right.