Chains stuck when using a larger dataset, but not a smaller one

Hello everyone,

I’m trying to fit a model with a lot of parameters to datasets of about 60,000 observations each. To start, I sampled 10,000 observations from each of the full datasets to save time and to make sure the model worked correctly. I fit the model to these 10,000 observations with 8 chains, and all of them converged for each of the datasets.

However, when I run the model on the complete datasets (~60,000 each), a few chains get “stuck” in regions and never converge to where the other chains end up. Is there any reason why this might occur with larger datasets? Should I use a longer warmup? A longer first phase of the warmup to give the chains more time to reach the typical set (some hyperparameters don’t appear to be in a reasonable place in the stuck chains)? The model takes 2-4 days to fit on the full data, so it’s not easy to test things quickly. Any ideas would be welcome!

When you add more data your posterior becomes more concentrated, and as it concentrates it becomes more sensitive to the compatibility of your model and the true data generating process. In other words, the more data you have the more your posterior manifests model misfit.

Often this model misfit manifests in awkward posterior geometries that make computation difficult, yielding behavior not unlike what you see.

Another issue is that a concentrating posterior makes non-identifiabilities worse. For example, if you have collinearity then as you add more data your posterior converges to a singular line that is extremely difficult to explore (you have to make infinitesimal jumps orthogonal to the non-identifiability, and large jumps transverse to it). Or if you have a multimodal model then adding more data will generally suppress the posterior between the modes, obstructing transitions between them.

So there are lots of possible reasons for the behavior you see, and the only way to better identify what’s going on is to work in steps. The 10k observations look good, but instead of jumping straight to the 60k try 20k. Carefully check sampling diagnostics for indications of bad posterior geometries arising and posterior predictive checks for any misfit becoming significant. As you add data incrementally the pathology should hopefully rear its head before it becomes so bad that you can’t explore the posterior at all.
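
If it helps, here is a minimal R sketch of that incremental approach, assuming an rstan setup; `model`, `full_data`, and `make_stan_data()` are hypothetical names standing in for your compiled model, full dataset, and data-preparation step:

```r
library(rstan)

sizes <- c(10000, 20000, 30000, 45000, 60000)

fits <- lapply(sizes, function(n) {
  sub <- full_data[sample(nrow(full_data), n), ]
  fit <- sampling(model, data = make_stan_data(sub),
                  chains = 8, iter = 2000)
  check_hmc_diagnostics(fit)  # divergences, tree depth, E-BFMI at each step
  fit
})
```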

Interesting… it sounds like a Bayesian really dislikes point estimates: even if the data admit a point estimate (as the data overwhelm the prior), the software surrenders for subtle reasons… kind of self-protecting us against the point estimate.

Seriously, this is interesting and I would be curious to hear what solved the problem. What Mike describes makes sense, and I would like to hear the hopefully glorious end of this tale.

Thanks a lot for this explanation Michael. I’ll see if I can iterate to get to a solution by going from 20k to 30k obs. and so on.

Another possible solution? I dropped a hard constraint on one of the hyperparameters in the model because it appeared that the “bad” chain(s) were sitting up against the constraint until far into the warmup (I know that the parameter beta_mu is greater than 0, but one chain sits against 0 for a long time when that is set as a hard constraint). After it breaks away from that spot, it appears to travel the same path as the “good” chains, but doesn’t move beyond one of the mid-points that the other chains sit in for a while but eventually break out of. Would it therefore help to lengthen the initial fast adaptation stage to give the chains more time to get to a reasonable place? Or lengthen the warmup? (Recall the model takes 2-4 days to fit, so I’m trying to manage time as well.)

I noticed in the manual that the arguments for the three adaptation intervals aren’t proportions of the warmup period, but default to adapt_init_buffer = 75, adapt_term_buffer = 50, and adapt_window = 25. Are these defaults effectively proportions, i.e. 75/150, 50/150, and 25/150 of the warmup iterations in each stage?

but doesn’t move beyond one of the mid-points that the other chains sit in for a while but eventually break out of

Any amount of non-moving is a problem, I think :D. Any chance you’ve got divergent transitions (if you’re using R, just pop the model open in ShinyStan and it’ll tell you) in the 10k chains? When I’ve had models freeze up in sampling I usually get a lot of divergent transitions, and these can start showing up a long time before the total stoppages happen.

The @betanalpha divergent transition tutorial is here: Diagnosing Biased Inference with Divergences . If you have divergences, then those can lead you to the problem (and hopefully can help you do experiments with your model that take less than a day :D).

It’s my understanding that divergent transitions happen when the leapfrog integrator in HMC goes totally unstable. You can try to make your timestep smaller or your warmup longer to get a better adapted timestep, but that won’t necessarily help. When I get divergences it’s usually because there’s some sort of weird modeling thing happening that I need to fix.

If you’re using R, the easiest way to find the divergent transitions is with ShinyStan or with the Stan pairs plot (the divergences show up as little red dots in either case). ShinyStan has a lot of other handy diagnostic information to look at (tree depth is neat), though it can be tricky if you have lots of parameters.
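
A minimal sketch of the non-interactive version in R, assuming an rstan fit object named `fit`:

```r
library(rstan)
library(shinystan)

# Count post-warmup divergent transitions in each chain
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(x) sum(x[, "divergent__"]))

# Or browse divergences, tree depths, energy, etc. interactively
launch_shinystan(fit)
```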

Edit: The easiest things to look for are non-identifiabilities. Look at pairs plots between parameters and see if there is really tight coupling between any of them.
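
Something like this, for example, where the parameter names are just placeholders for whichever hyperparameters you suspect:

```r
# Divergent transitions show up as red points; very tight, ridge-like
# clouds between two parameters suggest a non-identifiability
pairs(fit, pars = c("beta_mu", "beta_sigma", "alpha_sigma"))
```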

This screams bad priors. Not objectively bad priors, necessarily, but priors that are too wide for a given constraint. Simulate from your prior and see if you’re effectively placing a large mass near zero.
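
As a sketch of what that check looks like in R (the prior scales here are invented purely for illustration):

```r
set.seed(1)

# A "positive" hyperparameter given a hard lower bound at 0 and a wide scale
beta_mu    <- abs(rnorm(1e5, 0, 10))
beta_sigma <- abs(rnorm(1e5, 0, 5))

# Implied prior on the group-level slopes
beta_i <- rnorm(1e5, beta_mu, beta_sigma)

hist(beta_mu, breaks = 100)
mean(abs(beta_i) < 0.1)  # how much prior mass sits at or near zero
```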

Thanks for your comments Ben.

I actually don’t have any divergent transitions in any of the models run on each of the four datasets. I originally had problems with divergent transitions but was able to modify things to get around that problem.

Don’t do this, it’s a model/priors/transformation problem.

This doesn’t work as well as you might think. A different thing to try is to add a semi-parametric or random-effect component to your model to take up the slack. I had a model a while ago where adding about 25k random effects made it run faster, so it can work quite well. Of course, you then look at the patterns in those parameters to identify misfit and see how to improve your parametric model.

The model is a heterogeneous ordered logistic regression model, where the linear predictor is simply:

y_ij = alpha_i + beta_i * zeta_j

alpha_i ~ normal(0, alpha_sigma) # 0 for identification
beta_i ~ normal(beta_mu, beta_sigma)

zeta_j, the cut points, the scale, and the hyperparameters are not explicitly given priors in Stan.
zeta_[j = 1] and zeta_[j = 2] are set to -1 and +1 for identification.

After dropping the constraint on beta_mu (where I had set the minimum to 0), things look better. However, one chain for one of the 4 datasets is sitting at beta_mu = -0.05 (I’m monitoring a 2-4 day model run, so it’s hard to know where it’ll end up), whereas all the other chains are where they “should” be (~3). I could put a stronger prior on beta_mu, but it’s not immediately clear what that should be.

This is a mistake in a logistic regression unless you are dealing with a regulatory framework that forces it. It puts most of the mass of the implicit prior very far away from anything like reasonable results.

This behavior is strongly indicative of model misfit. The constraint that works well for small data is becoming inconsistent with the larger data set. Looking at the model you included below this isn’t surprising – the normal distribution assumes that the mean is unconstrained. As you get more data having a self-consistent model (i.e. if you want to model that something is positive then you need a distribution that supports only positive values like a gamma or log normal) becomes more and more important.
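
A quick R illustration of the difference (the scales below are made up): a normal that is merely truncated at zero keeps a wall of density against the boundary, while a log-normal or gamma places essentially no mass there.

```r
set.seed(1)

trunc_normal <- rnorm(2e5, 3, 2)
trunc_normal <- trunc_normal[trunc_normal > 0]  # hard lower bound at 0

log_normal  <- rlnorm(2e5, meanlog = log(3), sdlog = 0.5)
gamma_draws <- rgamma(2e5, shape = 4, rate = 4 / 3)  # mean 3

# Prior mass within 0.1 of the boundary
c(trunc_normal = mean(trunc_normal < 0.1),
  log_normal   = mean(log_normal < 0.1),
  gamma        = mean(gamma_draws < 0.1))
```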

This is a problem of mixing, not adaptation, so running a longer warmup will very likely not help.

Please go through the adaptation section again, where this is explained in detail. These are not proportions. You have an initial init_buffer of 75 iterations where you let the chain reach the typical set, then adaptation windows of 25, 50, 100, 200, etc., ending with a term_buffer of 50 iterations at the end to let the stepsize equilibrate to its final value. If the number of warmup iterations is less than 75 + 25 + 50 then it tries to allocate iterations proportionally before giving up (although in that case you are given a warning indicating as much).
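
If you do want to change them, in rstan they go in the control list (`model` and `stan_data` below are placeholder names); the values shown are just the defaults:

```r
fit <- sampling(model, data = stan_data, chains = 8,
                iter = 2000, warmup = 1000,
                control = list(adapt_init_buffer = 75,
                               adapt_term_buffer = 50,
                               adapt_window = 25))
```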

Uniform priors are bad and you should feel bad for using them. Half-kidding – don’t feel bad, but definitely don’t use uniform priors in any circumstance.

What is likely happening is that your improper priors are inducing an awkward posterior geometry with a sink near zero that traps anything that gets too close, preventing it from escaping and exploring the more appropriate parts of the posterior.

As with any analysis, if you can say that the values you’re getting are “wrong” then there is information that you have that you’re not encoding in the prior and you shouldn’t expect the posterior to be able to know that!

Thanks everyone for all of your suggestions.

A quick follow-up:

I dropped all of the hard constraints on the parameters, and set what I figure are reasonable priors. I still ran into the same problem.

I had a feeling that about 3 parameters were driving the problem, because they refer to survey responses that all respondents answer. Once I set reasonable initial values for these 3 parameters, with all others initialized to runif(n, -2, 2), I ended up with no problems at all. The chains for the four separate datasets all appear to converge as expected.
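
In case it helps someone else, the pattern in R looks roughly like this; the parameter names and starting values below are placeholders for the three parameters I fixed:

```r
make_inits <- function() {
  # Explicit starting values for the three troublesome parameters;
  # rstan initializes any parameter not listed here randomly on (-2, 2)
  list(zeta_free = c(1.5, -0.5, 0.8),
       beta_mu = 3)
}

fit <- sampling(model, data = stan_data, chains = 8,
                init = lapply(1:8, function(i) make_inits()))
```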

I believe I’ve had to do something similar with ordinal item-response theory models, where constraining the sign of one of the discrimination parameters should be sufficient for identification, but where in practice one has to input initial values or constrain more parameters. This happens with simulated data also, so it’s not a problem of model misfit. If I ever code one of those models again, I’ll post it here as a “hard case” for Stan.

I’d try a lower initial stepsize (try stepsize = 0.1 or 0.01), a higher target acceptance rate (adapt_delta = 0.95 or 0.99), and a longer warmup. If you have a lot of parameters, then you may need to be more conservative in initialization rather than using the default (-2, 2) on the unconstrained scale.
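
In rstan terms, that would look something like this (the specific numbers are only illustrative):

```r
fit <- sampling(model, data = stan_data, chains = 8,
                iter = 3000, warmup = 2000,      # longer warmup
                init_r = 0.5,                    # narrower random inits than the default 2
                control = list(stepsize = 0.01,  # smaller initial stepsize
                               adapt_delta = 0.99))
```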

Also, if you used a non-centered parameterization, you may find the centered one works better if there’s a lot of data in each group.
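
For concreteness, the two parameterizations of the group-level slopes look roughly like the Stan fragments below (written here as R strings, with beta_mu and beta_sigma assumed to be declared elsewhere in the model):

```r
# Non-centered: sample standardized offsets and scale them by the hyperparameters
noncentered <- "
parameters { vector[I] beta_raw; }
transformed parameters { vector[I] beta = beta_mu + beta_sigma * beta_raw; }
model { beta_raw ~ normal(0, 1); }
"

# Centered: often mixes better when each group has a lot of data
centered <- "
parameters { vector[I] beta; }
model { beta ~ normal(beta_mu, beta_sigma); }
"
```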

And if you share your code, we can see if there’s any way to speed up the log density evaluations.

What n_eff are you targeting that it runs for days?

@Gregory

Do you think that Greta would have been a better choice for this problem?

https://www.rdocumentation.org/packages/greta/versions/0.3.0
