Code takes too long to run, despite small dataset size

We have the following Stan code for our model, and it takes quite some time to run (~10 minutes per chain) even though our dataset is small (only around 350 data points with 27 dimensions each). Is there something wrong with the code, or is there any way to speed it up? Also, the MCMC chains are running sequentially rather than in parallel. How do we make them run in parallel?

  logisticRegressionModel <- "
  data {
    int<lower=0> N;   // number of data items
    int<lower=0> K;   // number of predictors
    matrix[N, K] X;   // predictor matrix
    int y[N];      // outcome vector
  }
  parameters {
    real alpha;       // intercept
    vector[K] gamma;
    vector<lower=0>[K] tau;
    vector[K] beta;   // coefficients for predictors
    
  }
  model {
    // Priors:
    gamma ~ normal(0, 5);
    tau ~ cauchy(0, 2.5);
    alpha ~ normal(gamma, tau);
    for (k in 1:K) {
      beta[k] ~ normal(gamma, tau);
    }
    
    // Likelihood
    y ~ bernoulli_logit(alpha + X * beta);
  }
  "

Here’s the output we have:

SAMPLING FOR MODEL '1163d39e3fbcff93601a9dc8228bf4e8' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 8.2e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.82 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 269.211 seconds (Warm-up)
Chain 1: 279.788 seconds (Sampling)
Chain 1: 548.998 seconds (Total)
Chain 1:

SAMPLING FOR MODEL '1163d39e3fbcff93601a9dc8228bf4e8' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 7.4e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.74 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 278.215 seconds (Warm-up)
Chain 2: 307.572 seconds (Sampling)
Chain 2: 585.787 seconds (Total)
Chain 2:

SAMPLING FOR MODEL '1163d39e3fbcff93601a9dc8228bf4e8' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 7.4e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.74 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 318.265 seconds (Warm-up)
Chain 3: 315.994 seconds (Sampling)
Chain 3: 634.259 seconds (Total)
Chain 3:

SAMPLING FOR MODEL '1163d39e3fbcff93601a9dc8228bf4e8' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 6.5e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.65 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 312.392 seconds (Warm-up)
Chain 4: 306.04 seconds (Sampling)
Chain 4: 618.433 seconds (Total)
Chain 4:
Warning messages:
1: There were 20000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems

3: The largest R-hat is 4.24, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess

You define your hierarchical parameters as K-vectors, much like your set of predictors, but in your model block you do not index them. For example:

    for (k in 1:K) {
      beta[k] ~ normal(gamma, tau);
    }

Should maybe be:

    for (k in 1:K) {
      beta[k] ~ normal(gamma[k], tau[k]);
    }

Or perhaps you intended to have a shared prior for all of your parameters?


Also, for the sequential chains issue: is the mc.cores parameter set properly for your R session?
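
If not, a typical way to get parallel chains with rstan is something like the following (a sketch; the core count and the stan_data list are placeholders for whatever you actually pass in):

    # run chains in parallel with rstan (sketch; adjust cores to your machine)
    library(rstan)
    options(mc.cores = parallel::detectCores())  # one chain per available core
    rstan_options(auto_write = TRUE)             # cache the compiled model

    fit <- stan(model_code = logisticRegressionModel,
                data = stan_data,    # placeholder list containing N, K, X, y
                chains = 4,
                cores = 4)           # or rely on the mc.cores option above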


I think that line (the beta prior) can also be written fully vectorized:

  beta ~ normal(gamma, tau);

Unless, as @amas0 points out, you wanted shared priors.

Following the docs linked in the warning messages:

When there are divergent transitions in the model, high R-hat is often just another symptom of the problematic geometry that caused the divergences. If there are “maximum treedepth” warnings alongside high R-hat, it usually reflects a problematic geometry of the posterior that is hard to traverse efficiently and is often a side effect of setting very high adapt_delta in an attempt to resolve divergences.

What might be helpful is to simulate fake data from your model and priors, checking whether the implied distribution looks anything like your target y.
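
For example, a rough prior predictive check in R might look like the following (a sketch only: the predictor matrix is a stand-in for your real X, and the intercept prior is an assumption, since the posted model does not give alpha a scalar prior):

    # prior predictive sketch; all names and the intercept prior are illustrative
    set.seed(1)
    N <- 350; K <- 27
    X <- matrix(rnorm(N * K), N, K)     # stand-in predictors; use your real X
    gamma <- rnorm(K, 0, 5)             # draws from the priors in the model block
    tau   <- abs(rcauchy(K, 0, 2.5))    # half-Cauchy via folding
    beta  <- rnorm(K, gamma, tau)
    alpha <- rnorm(1, 0, 5)             # assumed scalar prior for the intercept
    p     <- plogis(as.vector(alpha + X %*% beta))
    y_sim <- rbinom(N, 1, p)
    table(y_sim)                        # compare with the distribution of your y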

Also, are you sure you posted the model that you compiled and ran? alpha is a real while gamma and tau are vectors; looking at the code for the Stan math library, I would expect that to throw either a compile-time or a runtime error.


I set it now and it works, thank you!

@stevebronder this compiles and runs for me on 2.29.1; I think it does something analogous to the program in the OP. It's a bit counterintuitive that it runs, and I can only make an educated guess as to what it's doing, but it does run:

parameters {
  vector[10] alpha;
  real beta;
}
model {
  alpha ~ std_normal();
  beta ~ normal(alpha, 1);
}

Yup it does run.

10,000 iterations per chain is more than you should need for most well-behaved problems. And when you see that every post-warmup transition exceeds the maximum tree depth, there are bigger problems than run time.

With your data size, the real problem is probably that you’re using a centered parameterization. You can change that to non-centered this way:

real<offset=gamma, multiplier=tau> alpha;
vector<offset=gamma, multiplier=tau>[K] beta;

Usually the intercept gets a wider prior that is unrelated to the slopes, but here I just rewrote your model as posted. The offset/multiplier don't change the target density, just the geometry over which sampling takes place.
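
Putting the thread's suggestions together, the whole model might look roughly like this (a sketch, not a drop-in replacement: the variable name is illustrative, alpha is given its own independent normal(0, 5) prior, which is an assumption, and the beta prior is vectorized):

    logisticRegressionModelNC <- "
    data {
      int<lower=0> N;   // number of data items
      int<lower=0> K;   // number of predictors
      matrix[N, K] X;   // predictor matrix
      int y[N];         // outcome vector
    }
    parameters {
      real alpha;                                    // intercept (assumed independent prior)
      vector[K] gamma;                               // prior locations for the coefficients
      vector<lower=0>[K] tau;                        // prior scales for the coefficients
      vector<offset=gamma, multiplier=tau>[K] beta;  // non-centered coefficients
    }
    model {
      gamma ~ normal(0, 5);
      tau ~ cauchy(0, 2.5);
      alpha ~ normal(0, 5);        // assumption: independent intercept prior
      beta ~ normal(gamma, tau);   // same density as the loop, easier geometry
      y ~ bernoulli_logit(alpha + X * beta);
    }
    "

The offset/multiplier declaration can only reference variables declared earlier in the parameters block, which is why gamma and tau come before beta above.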