Slow Hierarchical Model for Meta-analysis

Hi, I am running a 3-level hierarchical model for a meta-analysis on data from several RCTs (the whole dataset is roughly 10 million observations). Since the dataset is incredibly large, I really need to optimize the model as best as I can, because even if I run it on a server, it takes a prohibitive amount of time. Therefore, I wanted to ask you if you see anything that might be worth changing to optimize the code. I will first post a sketch of the theoretical model and then my code.

\begin{align} & Y_{j,k} \mid \alpha_j, \theta_j, \Omega_j, \alpha, \theta, \Sigma, \Pi \sim \mathcal{MN}(\alpha_j + \theta_j T_j , \Omega_j) \\[10pt] & \begin{pmatrix} \alpha_1 \\ ... \\ \alpha_J \\ \theta_1 \\ ... \\ \theta_J \end{pmatrix} \mid \alpha, \theta, \Sigma, \Pi \sim \mathcal{MN} \left(\begin{pmatrix} \alpha \\ \theta \end{pmatrix}, \Sigma \right) \\[10pt] & \begin{pmatrix} \alpha \\ \theta \end{pmatrix} \mid \Pi \sim \mathcal{MN} \left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Pi \right) \\ \end{align}

As you can see by the code, I made some simplifying assumptions such as modelling the likelihood as only heteroskedastic (with respect to treatment or control group) and intercepts and slopes are modelled independently to reduce the number of parameters. On the other hand, there is a covariance structure within slope parameters or intercepts. Moreover, I used Cholesky decomposition for var-covar matrices for slopes and intercepts and uncentered them with respect to the first hierarchical layers (e.g. \alpha and \theta) as I suspect that there is little pooling going on in the site-specific Average Treatment Effects.

data {
  int<lower=0> N;              // num observations
  int<lower=1> J;              // num RCTs
//  int<lower=0,upper=N> n[J];   // num observations in j-th RCT  
  int<lower=1,upper=J> g[N];   // group for individual
  vector[N] y;                 // outcome
  vector[N] w;                 // treatment assigned

// The parameters accepted by the model. Our model
// accepts two parameters 'mu' and 'sigma'.
parameters {
  real a;                     // common mean for the alpha's
  real t;                     // common mean for the theta's
  vector[J] z;                // vector to construct uncentered parameters
  cholesky_factor_corr[J] L_Omega_theta;
  vector<lower=0,upper=pi()/2>[J] tau_unif_theta;
  cholesky_factor_corr[J] L_Omega_alpha;
  vector<lower=0,upper=pi()/2>[J] tau_unif_alpha;

  real<lower=0,upper=pi()/2> sigma_t_unif;        // residual SD for the treated
  real<lower=0,upper=pi()/2> sigma_c_unif;        // residual SD for the control

transformed parameters{
  // Here we use the fact that if X is uniformly distributed on an appropriate
  // support (0, pi/2), then Y=tan(X) is distributed as a half-Cauchy random 
  // variable. We perform this transformation because is computationally more 
  // efficient for Stan to sample from a uniform than from a Cauchy with thick
  // tails.
  real<lower=0> sigma_t = tan(sigma_t_unif);
  real<lower=0> sigma_c = tan(sigma_c_unif);
  vector<lower=0>[J] tau_theta = tan(tau_unif_theta);
  vector<lower=0>[J] tau_alpha = tan(tau_unif_alpha);
  vector[J] alpha;            // matrix of RCT-specific intercepts
  vector[J] theta;            // vector of RCT-specific ATE's
  vector[J] one_J = rep_vector(1, J);
  theta = one_J*t + diag_pre_multiply(tau_theta,L_Omega_theta) * z;
  alpha = one_J*a + diag_pre_multiply(tau_alpha,L_Omega_alpha) * z;
  // notice that the likelihood has RCT-specific control mean (alpha) and ATE
  // (theta) 
  vector[N] m;
  for (i in 1:N) {
    m[i] = alpha[g[i]] + theta[g[i]]*w[i];

  // Here we construct the heteroskedastic variance of the likelihood, where we
  // assume treatment and control group have different variance parameters
  vector<lower=0>[N] sigma;
  vector[N] one_N = rep_vector(1, N); 
  sigma = sigma_t*w + sigma_c*(one_N-w);

model {
  a ~ normal(0,1);
  t ~ normal(0,1);
  L_Omega_alpha ~ lkj_corr_cholesky(1);
  L_Omega_theta ~ lkj_corr_cholesky(1);
  tau_unif_alpha ~ uniform(0,pi()/2);
  tau_unif_theta ~ uniform(0,pi()/2);
  z ~ normal(0,1);

  sigma_c_unif ~ uniform(0,pi()/2);          
  sigma_t_unif ~ uniform(0,pi()/2);
  y ~ normal(m, sigma);

I don’t have a CS background so I bet my code could be improved in many ways, my main sources for this model were the Stan manual this awesome resource by @betanalpha (Hierarchical Modeling). Thanks in advance for the help, this community is awesome.


When trying to optimize a fit like this there are a few considerations. Firstly are you being limited by the cost of a gradient evaluation or the number of gradient evaluations needed per Hamiltonian Monte Carlo iteration? When building models sometimes the gradient is fundamentally expensive and the most productive way forward is to work with parallelization, bootrapped adaptation, and the like. On the other hand if the gradient evaluation isn’t too expensive but you need too many of them then reparameterizations, refined prior modeling, and other approaches will be the most effective.

To distinguish between the two you can look at the number of leapfrog steps per iteration (see for example Identity Crisis for how to do this in RStan). If the distribution of leapfrog steps concentrates below 10 then your posterior geometry is about as nice as it can be and reducing the cost of evaluating the gradient is the way to go. Concentrating below 100 isn’t bad, but if you see leapfrog steps approach 1000 then your posterior geometry is degenerate enough that managing it becomes the priority.

If you’re seeing posterior degeneracies then one thing to consider is the parameterization. Here you’ve used a monolithic non-centered parameterization, but if some of the trials have a lot of data then you might need a mixed parameterization where some groups are centered and some are non-centered.