Hierarchical Model in Stan Taking Too Long

Hi all,

I’m trying to fit a simple hierarchical model but keep getting warnings about a small number of divergent transitions (~10) and a recommendation to increase adapt_delta. These divergent transitions appear below the pairs() diagonal, which the manual suggests may indicate a need for reparameterization. In addition, I’m getting very long run-times even with a small sample. I know ~10 divergent transitions isn’t many, but combined with the long runtime my hunch is that I’m just not writing the model correctly, and I’m wondering if anyone has suggestions.

I’ve attached a reproducible example. The model is as follows:

I have i=1, \dots, n subjects over time t=1, \dots, 24. For each subject, I record measurements on 5 outcomes, \{y_{1it}, \dots, y_{5it}\}. I also have a subject-level treatment indicator, A_i. The goal is to estimate an overall treatment effect, \delta^*, as well as outcome-specific treatment effects, \{\delta_k\}_{k=1}^{5}, while building in the prior belief that the outcome-level treatment effects are distributed around a common overall treatment effect.

y_{itk} \sim N \Big( \alpha_i + \beta + \delta_k A_i, \phi_k \Big) for k=1, \dots, 5

The \alpha_i are subject-level random effects. I have a common intercept \beta for now - but will let this vary with k eventually.

For priors, I have that

\delta_k \sim N( \delta^*, \phi_\delta )

with a Normal hyper-prior on \delta^* and a half-Cauchy prior on \phi_\delta. The full Stan model is:

data {
  int<lower=1> Npats;
  int<lower=1> N;
  int<lower=1> K;

  vector<lower=0, upper=1>[N] A;
  int<lower=1, upper=Npats> id[N];
  
  real y1[N];
  real y2[N];
  real y3[N];
  real y4[N];
  real y5[N];

}


parameters {
  vector[K] delta_raw;
  real delta_star;
  
  real beta;
  
  vector[Npats] alpha;
  
  vector<lower=0>[K] phi_y;

  real<lower=0> phi_alpha;
  real<lower=0> phi_delta;
}

transformed parameters{
  vector[N] mu1;
  vector[N] mu2;
  vector[N] mu3;
  vector[N] mu4;
  vector[N] mu5;
  vector[K] delta;
  
  // non-centered parameterization
  delta = delta_star + delta_raw*phi_delta;
  
  // compute linear combination for each outcome model
  mu1 = alpha[id] + delta[1]*A + beta;
  mu2 = alpha[id] + delta[2]*A + beta;
  mu3 = alpha[id] + delta[3]*A + beta;
  mu4 = alpha[id] + delta[4]*A + beta;
  mu5 = alpha[id] + delta[5]*A + beta;

}


model {
  phi_delta ~ cauchy(0, 10000);
  phi_alpha ~ cauchy(0, 10000);
  phi_y ~ cauchy(0, 10000);

  alpha ~ normal(0, phi_alpha);

  delta_star ~ normal(0, 2000);
  delta_raw ~ std_normal();
  
  beta ~ normal(0, 10000);
  

  y1 ~ normal(mu1, phi_y[1]);
  y2 ~ normal(mu2, phi_y[2]);
  y3 ~ normal(mu3, phi_y[3]);
  y4 ~ normal(mu4, phi_y[4]);
  y5 ~ normal(mu5, phi_y[5]);

}

Even if I simulate a small dataset with n=100 and t=1, \dots, 5, it takes several minutes to run. In reality I’ll have a few thousand observations and dozens of time points. It takes hours to get just 5,000 posterior samples (after 5,000 warmup). If this is just Stan being Stan, that’s fine. But if it’s due to my parameterization, I’d love to make the necessary edits to speed things up.

test_forum.r (1.4 KB) test.stan (1.1 KB)

Hi,

One thing I notice about your model is that you are using incredibly diffuse priors, for example your cauchy(0, 10000) priors. This can often lead to computational problems, as the sampler can get stuck in low-probability areas of the sample space.

I would try priors more along the lines of cauchy(0, 2) and normal(0, 10) and see if you see any improvement.
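For the model as posted, that would amount to something like this in the model block (the scales here are just the ones suggested above, not tuned to your data):

model {
  // weakly informative priors in place of the cauchy(0, 10000) / normal(0, 10000) ones
  phi_delta ~ cauchy(0, 2);
  phi_alpha ~ cauchy(0, 2);
  phi_y ~ cauchy(0, 2);
  delta_star ~ normal(0, 10);
  beta ~ normal(0, 10);

  // ... likelihood and remaining priors unchanged
}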

Best!


Thanks so much for the input! I had tried tighter priors before without success and went wider just to see if it’d make a difference. You’re right though that it did make run times even longer.

It’s still crazy to me that Stan’s run-time is so long. I think I’m using the best possible parameterization and coding practices (vectorizing and avoiding loops). Oh well.

Best

Have you considered calculating the quantities in the transformed parameters block on the fly within the model block? At a quick glance, I don’t see why any of that needs to be in the transformed parameters block. It might save you some overhead to define them as local variables in the model block.


I can give that a shot. Thanks for the suggestion!

Having read your model code more carefully, I now also see that the five linear predictors for the five outcomes share the same intercept. If the five outcomes operate on very different scales, that might cause problems, as all of the data for a particular outcome might be in the tail of a normal distribution. This is easily alleviated by having K intercepts, one for each outcome.
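Concretely, something along these lines (just the pieces that change; everything else stays as in your model, and beta would then get a vectorized prior such as beta ~ normal(0, 10)):

parameters {
  vector[K] beta;   // one intercept per outcome instead of a single shared intercept
  // ... other parameters as before
}

transformed parameters {
  // ... declarations as before
  mu1 = alpha[id] + delta[1]*A + beta[1];
  mu2 = alpha[id] + delta[2]*A + beta[2];
  // ... and similarly through mu5
}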


One other thing to note: in your simulation code you simulate from some very diffuse distributions - this is going to make it hard for any model to accurately recover parameters.


I could get a chain to run without any warnings in under 30 seconds on my computer by (1) reducing the number of iterations to the default of 2000, (2) changing the priors to normals with scales consistent with your simulated data, (3) moving the transformed parameters into the model block and (4) rescaling the parameters so that they are on a smaller scale.

Step (4) actually makes a big difference. I guess that’s because Stan initialises the parameters between -2 and 2 (on the unconstrained scale), so it can take a while for warmup to find the right scale when the parameters are around 1000. You can probably clean this up further to get some speed improvements.

model {
  vector[N] mu1;
  vector[N] mu2;
  vector[N] mu3;
  vector[N] mu4;
  vector[N] mu5;
  vector[K] delta;
  
  delta = delta_star + delta_raw*phi_delta;
  
  mu1 = alpha[id] + delta[1]*A + beta;
  mu2 = alpha[id] + delta[2]*A + beta;
  mu3 = alpha[id] + delta[3]*A + beta;
  mu4 = alpha[id] + delta[4]*A + beta;
  mu5 = alpha[id] + delta[5]*A + beta;

  phi_delta ~ normal(0, 1);
  phi_alpha ~ normal(0, 1);
  phi_y ~ normal(0, 1);

  alpha ~ normal(0, phi_alpha);

  delta_star ~ normal(0, 2);
  delta_raw ~ std_normal();
  
  beta ~ normal(0, 1);
  
  y1 ~ normal(mu1 * 1000, phi_y[1] * 1000);
  y2 ~ normal(mu2 * 1000, phi_y[2] * 1000);
  y3 ~ normal(mu3 * 1000, phi_y[3] * 1000);
  y4 ~ normal(mu4 * 1000, phi_y[4] * 1000);
  y5 ~ normal(mu5 * 1000, phi_y[5] * 1000);

}

Thanks! Great point. In my simulations I used a common intercept for simplicity, so it shouldn’t be a problem there, but in the actual model I will allow for separate intercepts.

I see, this is really helpful. I hadn’t considered scaling issues. Do you know if it’s generally recommended to standardize the outcome and covariates before passing them to Stan? Thanks a lot!

Does warmup really take 5,000 iterations, and do you really need 20,000 posterior draws (assuming you run four chains)?

Nice!

Exactly. And we need to find the right scale during adaptation. We initialize with the assumption of a unit scale meaning we assume the posterior standard deviation is one. We then adapt that if the assumption’s not correct, but that takes a lot of time.

You can probably clean this up further to get some speed improvements.

Making mu a 5 \times N matrix and using linear algebra would speed this up. Similarly, you can vectorize

y ~ normal(mu, phi_y);

if you also make y a 5-vector (or array).
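For example, here is a sketch of what that could look like, assuming the five outcomes are passed in as a single matrix[K, N] y (row k holding outcome k) instead of y1 through y5; the rep_matrix construction is my own suggestion for building all K linear predictors in one go, and each row of the loop is one vectorized sampling statement:

model {
  vector[K] delta = delta_star + delta_raw * phi_delta;   // non-centered, as before

  // all K linear predictors at once: rep_matrix() copies the subject
  // effects and intercept into each row, and delta * A' adds the
  // outcome-specific treatment effects
  matrix[K, N] mu = rep_matrix((alpha[id] + beta)', K) + delta * A';

  // ... priors as before ...

  for (k in 1:K)
    y[k] ~ normal(mu[k], phi_y[k]);   // row k of y against row k of mu
}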


A simpler model, in my view, would actually have been five separate models, one for each of the five outcomes, where each linear predictor included its own intercept and a fixed effect for the treatment (instead of a random effect shared across outcomes). That way, you could have seen what the scales of the parameters were before you start pooling.
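Something like this per-outcome model is what I have in mind (shown for y1; the names beta1, delta1 and phi_y1 are just placeholders):

data {
  int<lower=1> Npats;
  int<lower=1> N;
  vector<lower=0, upper=1>[N] A;
  int<lower=1, upper=Npats> id[N];
  real y1[N];
}

parameters {
  real beta1;                  // outcome-specific intercept
  real delta1;                 // fixed (unpooled) treatment effect
  vector[Npats] alpha;
  real<lower=0> phi_alpha;
  real<lower=0> phi_y1;
}

model {
  vector[N] mu = alpha[id] + delta1*A + beta1;

  beta1 ~ normal(0, 10);
  delta1 ~ normal(0, 10);
  phi_alpha ~ cauchy(0, 2);
  phi_y1 ~ cauchy(0, 2);
  alpha ~ normal(0, phi_alpha);

  y1 ~ normal(mu, phi_y1);
}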

A couple of things

  • Ideally the parameters are of the same order of magnitude.
  • The parameters are by default initialised between -2 and 2, so ideally that is also the order of magnitude of the parameters. If not, you can help Stan by providing your own initial values to help with the adaptation, or you can do it the quick and dirty way like I did and effectively multiply all parameters by 1000.
  • Another option is to scale the covariates so that the parameters are on the right scale. In this case that would not have helped with the intercepts.
  • You need to scale the outcome to deal with the large intercepts. I think this is fine as long as you have a linear model; it can be tricky in non-linear models because it can change the interpretation of the parameters. This is where subject-specific knowledge should help you decide what the best option is. (See the sketch after this list.)
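As a sketch of that last option, the outcomes could be standardized in a transformed data block (the _std names are just placeholders), and because the model is linear the estimates can be mapped back afterwards by multiplying by sd(y_k) and adding mean(y_k):

transformed data {
  // standardize each outcome so the intercepts and treatment effects
  // end up roughly on unit scale
  vector[N] y1_std = (to_vector(y1) - mean(y1)) / sd(y1);
  vector[N] y2_std = (to_vector(y2) - mean(y2)) / sd(y2);
  // ... and similarly for y3 through y5
}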