One chain always slower (x10) than other chains

Hi,

I am new to Stan. I created the following Beta-Binomial model for daily samples, which assumes the success rate of the Binomial distribution is sampled from a Beta distribution each day, and the mean of the Beta distribution is estimated using a logistic regression. I am trying to use that to estimate the success rate for multiple binomial distributions (so the mapping variable in the Stan code). The problem I encountered is with 4 MCMC chains, I always have one chain that takes much more time to finish than the other chains (more than x10 times) wh. I am wonder if this is a model specification issue and any way to improve that.

data {
    int<lower=1> L; 
    int<lower=0> ref_block[L, 2];  

    int<lower=0> N;  
    int<lower=0> M;  

    int<lower=0> n[N]; 
    int<lower=0> y[N];  
    matrix[N, M] X;  // design matrix
    int mapping[N];
}
transformed data {
    int num_cell;
    num_cell = max(mapping);
}
parameters {
    // beta params
    vector[M] beta;
    vector<lower=0>[num_cell] phi; 
    real<lower=0> rho_sq[L];
    real<lower=0, upper=1> theta[N];
}
transformed parameters {
    vector[num_cell] mu; //the linear predictor
    vector<lower=0>[num_cell] Alpha; //the first shape parameter for the beta distribution
    vector<lower=0>[num_cell] Beta; //the second shape parameter for the beta distribution

    for (nn in 1:N) {
        mu[mapping[nn]] = inv_logit(X[nn, :] * beta); //using logit link
        Alpha[mapping[nn]] = mu[mapping[nn]] * phi[mapping[nn]];
        Beta[mapping[nn]] = (1 - mu[mapping[nn]]) * phi[mapping[nn]];
    }
}
model {
    for (nn in 1:N) {
        y[nn] ~ beta_binomial(n[nn], Alpha[mapping[nn]], Beta[mapping[nn]]);
    }

    // Specify intercept priors
    beta[1] ~ cauchy(0, 2.5); 

    for (l in 1:L) {
        beta[(ref_block[l][1]+1):(ref_block[l][2]+1)] ~ normal(0, rho_sq[l]);
    }
    rho_sq ~ cauchy(0, 10);

    phi ~ inv_gamma(1, 1);
}
generated quantities {
    real<lower=0,upper=1> theta_pred[N];
    for (nn in 1:N) {
        theta_pred[nn] = beta_rng(Alpha[mapping[nn]], Beta[mapping[nn]]);
    }
}
2 Likes

(Before people with more experience will reply) I would suggest to observe the traceplots including the warmup draws. Try to see if for any parameter there is one chain that explores incredibly large or small values that leads to that chain be less efficient.

Sometimes applying more informative priors can help to better constrain the model. Sometimes reparametrising the model (using normal rather than fatter tail distributions) helps too.

2 Likes

Would a lower init_r be likely to help? Ref: One MCMC chain not moving.

Update: I tried setting init = 1 in $sample() (using cmdstanr) and it looks like it completely solves the problem on my end! The slow chain is easily 20x faster now and keeping up with the other chains.

My only concern is that I do not fully understand what init actually does. (I read the docs but have only an intuitive superficial grasp of what HMC is doing.) I am concerned about the potential consequences of setting init too low (comment: One MCMC chain not moving).

1 Like

I’ll try to explain a bit more what this line in the CmdStanR docs actually means:

A real number x>0 initializes randomly between [-x,x] (on the unconstrained parameter space);

If you have a parameter declared with no constraints (e.g., real theta) then setting init=1 will result in Stan randomly generating an initial value uniformly at random between -1 and 1. If, on the other hand, you have a parameter with constraints like real<lower=0> sigma then the same thing applies except the initialization takes place between -1 and 1 on the log scale, because the log transformation is used internally by Stan to transform a strictly positive variable to an unconstrained variable (under the hood we need a space with no constraints for HMC). Depending on what the constraints are different transformations are used to unconstrain the parameter (if you’re interested you can find all the different transformations here: https://mc-stan.org/docs/2_25/reference-manual/variable-transforms-chapter.html).

Currently the default value used by CmdStan is init=2 (so that’s the same default for CmdStanR), which is not a magic number but is typically enough to ensure sufficiently dispersed initial values. However, in some cases you will need to reduce the range of initialization (like in your case here) because you can get extreme initial values that particular models will have a hard time recovering from (this is likely what happened with the slow chain in your case and it might be visually detectable in a traceplot that includes warmup). So while it’s not ideal to narrow the initialization range if you don’t have to, it’s not frowned upon when it’s necessary. In some cases it’s even necessary to set it substantially lower than in your case. As long as you’re running multiple chains, checking that they’re mixing well and that other diagnostics look OK then I think you should be fine.

2 Likes

Thanks, that’s super helpful!

I was originally taught that in order to be able to assess mixing, the starting values from one chain to another need to be more dispersed than the posterior. In grad school when I was custom-coding high-dimensional models from scratch, I always struggled with this. But from your answer, I take it Uniform(-2, 2)^p or even Uniform(-1, 1)^p is likely to be dispersed enough for large p. So it’s not a problem if the tails of the actual posterior reach beyond the uniform bounds?

1 Like

Glad that was helpful!

I might be misunderstanding, but it’s not clear to me how you could ever have inits that are “more dispersed than the posterior”. If you initialize somewhere with zero (or vanishingly small) posterior probability then the chain really won’t be able to move. But if you just mean more dispersed than the bulk of the posterior, then yeah. In practice you want them to be as dispersed as possible while still allowing for the chains to move. If you narrow them too much then, for example, you could potentially miss important areas (e.g. one of the modes if you have a bimodal posterior), but if you don’t narrow them enough you can end up initializing somewhere the chain won’t be able to move.

2 Likes