Common causes of slow / non-convergence?

I’m struggling to understand a problem with extremely slow convergence. The same model (a hierarchical time series) on many different datasets typically reaches the high-density / typical-set region after about 20 iterations. On this particular data, after 2000 iterations (which takes a few days, since the treedepth also blows out in this case) the logprob plot is still just slowly increasing. Can anyone suggest general causes of this sort of behavior? I can’t post the data, but I couldn’t find anything particularly unusual about it. Many parameters look something like the plot shown below: an initial fast shift towards a region, then an extremely slow wander towards each other. It seems like some kind of identifiability issue, but I’m having trouble getting clear in my head why that could be the case when the logprob is still slowly increasing.

Maybe check what’s happening with the timestep (step size) during adaptation for this data set vs. the other ones?

Treedepth could be blowing up if the timestep is getting really small. The timestep could be getting small because of numerical weirdness, but that’s just speculation at this point.
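If you’re running this through rstan, something like this will show the adapted step size and how often the treedepth cap is hit (a minimal sketch, assuming a stanfit object called fit and the default max_treedepth of 10):

library(rstan)

# One matrix per chain; columns include stepsize__, treedepth__,
# n_leapfrog__, divergent__ and energy__
sp <- get_sampler_params(fit, inc_warmup = TRUE)

# Step size trajectory during adaptation for chain 1
plot(sp[[1]][, "stepsize__"], type = "l", log = "y",
     xlab = "iteration", ylab = "step size (chain 1)")

# Fraction of post-warmup iterations at or above the treedepth cap, per chain
sapply(get_sampler_params(fit, inc_warmup = FALSE),
       function(x) mean(x[, "treedepth__"] >= 10))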

The folk theorem. If the model doesn’t fit some data set well, it can take a very very long time to sample.

The typical set is where most of the probability mass is. It’s usually not the highest-density region. In high dimensions, the sampler will never get near the mode of a multivariate normal, which has by far the highest density. See:

http://mc-stan.org/users/documentation/case-studies/curse-dims.html

for illustrative simulations and explanations.
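A quick illustrative simulation in R (not from the case study; the dimension and number of draws are arbitrary): draws from a 100-dimensional standard normal concentrate at distance about sqrt(100) = 10 from the mode at the origin, even though the mode is the highest-density point.

set.seed(1)
d <- 100
draws <- matrix(rnorm(1e4 * d), ncol = d)   # 10,000 independent draws
r <- sqrt(rowSums(draws^2))                 # distance of each draw from the mode
summary(r)                                  # concentrates around 10, never near 0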

That’s usually a sign of extreme curvature forcing the step size to be small, and then other parts of the parameter space taking a lot of steps at that step size. If you’re capping out at the max treedepth, that means the sampler just reverts back closer to a random diffusion rather than the more flow-driven exploration of HMC.
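Back-of-the-envelope arithmetic (illustrative numbers, not from this model): with a small adapted step size, even a maxed-out tree covers very little integration time per iteration, which is why exploration starts to look diffusive.

stepsize      <- 1e-4              # hypothetical adapted step size
max_treedepth <- 10                # Stan's default cap
n_leapfrog    <- 2^max_treedepth   # roughly the leapfrog steps at the cap
stepsize * n_leapfrog              # total integration time ~ 0.1 per iteration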

I forgot to mention that stronger priors can help, or changing the parameterization to something better identified if that’s a problem with the data at hand.

If you have strong priors, is it required to initialize the chains in the priors?

If you have hard constraints, then yes, you need to initialize satisfying all constraints.

If it’s a soft constraint (like a narrow normal), then you can run into numerical issues if you’re too far into the tails at initialization. It helps to scale everything so the posteriors are roughly centered around zero with unit standard deviation (i.e., standardized).
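One common way to get there is to standardize the data before it goes into Stan; a minimal sketch in R (df, x, and y are placeholder names):

# Standardize a predictor and outcome so the associated posteriors
# end up roughly on a unit scale
stan_data <- list(
  N = nrow(df),
  x = as.vector(scale(df$x)),   # (x - mean(x)) / sd(x)
  y = as.vector(scale(df$y))
)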


I have my parameters constrained like

real<lower=0.0, upper=1.0> alphaSraw;

then scale to get the transformed parameters (this one is on a log scale)

slow_alpha = 10^(-2-3*alphaSraw);

and priors like

alphaSraw ~ normal(0.5, 0.1) T[0.0, 1.0];

then I initialise the chains (in R) by randomly/independently sampling from the priors.

alphaSraw=rtruncnorm(1, 0, 1, 0.5, 0.1),
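In full, the call looks roughly like this (a sketch; model.stan and stan_data are placeholders, and the truncnorm package supplies rtruncnorm):

library(rstan)
library(truncnorm)

# One init list per chain, each drawn independently from the prior
init_fun <- function() {
  list(alphaSraw = rtruncnorm(1, a = 0, b = 1, mean = 0.5, sd = 0.1))
}

fit <- stan(file = "model.stan", data = stan_data,
            chains = 4, init = init_fun)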

Is this ok? Convergence is fairly fast without priors (though slow or non-convergent for some data sets); adding priors slows it down quite a bit.

Given the constraint, that’s not likely to cause problems—the edges are only 5 sd away from the center.

For efficiency, you don’t need the T[0, 1] part: since the location, scale, and truncation bounds are all constants, the adjustment for truncation is also constant and can be dropped.

Does the model fail if you let Stan randomly init? That may indicate missing constraints in the parameter declarations.


Some of the other parameters have a prior mean near the lower bound, so the edge could be up to 10 s.d. away. I am using an s.d. of 0.1 in all cases at present, but I might tighten some of them.

Thanks for the tip about T[0, 1].

Yes, with random init I got slow convergence; I assumed this was due to a chain initializing in a low-likelihood location. I’ll do more testing. Thanks for your help, much appreciated.

That is a possibility. There are two things going on. First, the sampler needs to find the typical set (high probability mass region of the posterior). Then it needs to spend enough time exploring there to estimate the mass matrix and step size.

If there are a lot of parameters and they’re on average several standard deviations away from where they need to be, then it can be unstable. In my experience, a few usually aren’t so bad. Of course, it depends on what else is going on in the model.
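If adaptation itself is the bottleneck, one option (illustrative values; model.stan, stan_data, and init_fun are the placeholders from above) is simply to give warmup more iterations so the mass matrix and step size have more time to settle:

fit <- stan(file = "model.stan", data = stan_data,
            chains = 4, warmup = 2000, iter = 3000,   # longer adaptation phase
            init = init_fun)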
