Hierarchical Model for Causal Inference

Hello, I am developing a hierarchical model for causal inference in order to reproduce a well-known paper in the economics literature. I vectorized my model, but I do not understand why all iterations either diverge or exceed the maximum treedepth. I was wondering if anyone could tell me what I am doing wrong.
This is the model I want to run:

\begin{align*}
\begin{cases}
\; y_{n,j} \mid \beta_j,\, (\sigma_j)_{j=1}^{J},\, \rho \;\sim\; \mathcal{N}\!\left( x_n \beta_j + w_n \theta,\; w_n \sigma_t + (1 - w_n)\,\sigma_c \right) & n = 1,\dots,N;\;\; j = 1,\dots,J \\
\; \beta_j \mid \gamma_l,\, (\sigma_j)_{j=1}^{J},\, \rho \;\sim\; \mathcal{N}\!\left( (u\,\gamma)_j,\; \Sigma \right) & j = 1,\dots,J;\;\; l = 1,\dots,L \\
\; \sigma_j \;\sim\; \mathrm{Inv\text{-}Gamma}(1,\, 0.01) & j = 1,\dots,J,\, c,\, t \\
\; \gamma_l \;\sim\; \mathcal{N}(0,\, 5) & l = 1,\dots,L \\
\; \rho \;\sim\; \mathcal{U}(-1,\, 1) \\
\; \theta \;\sim\; \mathcal{N}(0,\, 100)
\end{cases}
\end{align*}


And here is how I coded it in Stan:

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> g[N];   // group for individual
  matrix[N, K] x;              // individual predictors
  matrix[J, L] u;              // group predictors
  vector[N] y;                 // outcomes
  vector[N] w;                 // treatment assigned
}
parameters {
  vector[K] beta[J];               // school-specific coefficients
  matrix[L,K] gamma;               // coefficients on the group predictors
  real<lower=-1,upper=+1> rho;     // correlation between the school-specific coefficients
  real<lower=0> sigma2[K];         // variances of the school-specific coefficients
  real effect;                     // super-population average treatment effect
  real<lower=0> sigma2_t;          // residual SD for the treated
  real<lower=0> sigma2_c;          // residual SD for the control
}

transformed parameters {
  matrix[J,K] u_gamma;
  matrix[K,K] Sigma;

  u_gamma = u * gamma;

  // Note: sigma2[s]^1/2 parses as (sigma2[s]^1)/2 in Stan, so use sqrt() instead
  for (s in 1:K) {
    for (z in 1:K) {
      if (s == z)
        Sigma[s,z] = sqrt(sigma2[s]) * sqrt(sigma2[z]);
      else
        Sigma[s,z] = sqrt(sigma2[s]) * sqrt(sigma2[z]) * rho;
    }
  }
}
model {
  vector[N] m;
  vector[N] d;
  vector[K] a[J];

  // Priors on the variances and the correlation coefficient of the beta_j's
  sigma2 ~ inv_gamma(1, 0.01);
  rho ~ uniform(-1, 1);

  // Build the array of prior means: a[j] is the (K x 1) mean of beta_j,
  // taken from the j-th row of u_gamma
  for (j in 1:J) {
    a[j] = row(u_gamma, j)';
  }

  // Prior on the beta_j's
  beta ~ multi_normal(a, Sigma);

  // Remaining priors
  effect ~ normal(0, 100);
  sigma2_c ~ inv_gamma(1, 0.01);
  sigma2_t ~ inv_gamma(1, 0.01);
  to_vector(gamma) ~ normal(0, 5);

  // Likelihood: group-specific regression plus the treatment effect,
  // with separate residual scales for treated and control units
  for (n in 1:N) {
    m[n] = dot_product(beta[g[n]], x[n]) + effect * w[n];
  }
  d = sigma2_t * w + sigma2_c * (1 - w);

  y ~ normal(m, d);   // normal() is parameterized by the standard deviation
}

Thanks in advance for all the help.

This reflects a “centered” parameterization for beta, which works well when the data are highly informative but tends to produce divergences when they are less informative (see here for an in-depth discussion). So I’d try a non-centered parameterization.
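For concreteness, here is a minimal sketch of what a non-centered parameterization might look like for this model. It is an illustration, not the only way to do it: `z` (a matrix of auxiliary standard-normal parameters) and `L_Sigma` (a local Cholesky factor) are names introduced here; beta moves from parameters to transformed parameters, and the multi_normal statement and the loop building a are dropped from the model block.

parameters {
  matrix[K, J] z;                  // auxiliary standard normals, one column per group (new)
  // ... all other parameters as before, with beta removed from this block
}
transformed parameters {
  matrix[J, K] u_gamma = u * gamma;
  matrix[K, K] Sigma;
  vector[K] beta[J];

  // build Sigma exactly as in the original program
  for (s in 1:K)
    for (t in 1:K)
      Sigma[s, t] = sqrt(sigma2[s]) * sqrt(sigma2[t]) * (s == t ? 1.0 : rho);

  {
    matrix[K, K] L_Sigma = cholesky_decompose(Sigma);
    for (j in 1:J)
      beta[j] = row(u_gamma, j)' + L_Sigma * col(z, j);   // shift and scale the standard normals
  }
}
model {
  to_vector(z) ~ normal(0, 1);   // implies beta[j] ~ multi_normal(row(u_gamma, j)', Sigma)
  // ... priors and likelihood as before, minus the multi_normal statement for beta
}

The idea is that the sampler explores independent standard normals, and the scale/correlation structure is applied deterministically, which tends to remove the funnel geometry that causes the divergences. One separate caveat: with this equicorrelation structure, Sigma is positive definite only when rho > -1/(K-1), so for K > 2 the cholesky_decompose call can fail if rho drifts too negative; the uniform(-1, 1) prior does not rule that out.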

Thank you, this article is amazing. And yes, you were right.