Correlation of markov chains

I am in the middle of testing a model and encounter a behavior that seems weird to me when changing the fit parameters.
My model converges nicely with \hat R = 1 and n_{eff} = >90 \% N when choosing iter in a range of 1000 to 8000, and chains in a range of 1 to 5. However, especially when I choose chains to go above this range my n_{eff} drops dramatically for all parameters and my \hat R is way above 1 for all parameters.

I am very curious about this behavior. The way I understood the chains parameter is that it initializes n separate chains that are entirely uncorrelated to each other. Also, the way I always understood Markov chains is that the higher iter is the smaller the impact of the chain’s initial value.

Could anybody help me by understanding this behavior, especially what the parameter chains actually does? When looking at the documentation I sadly couldn’t find information about what Stan actually does with the value that is given as input to the chains parameter.

Thanks in advance!

1 Like

If your neffs are all the vicinity of 90 with 5000 iterations, say, I would dispute the claim that this is doing nicely. While neffs in that range might be enough for your purposes, but this does not strike me as good efficiency.

Regarding things getting worse as you increase the number of independent chains, this could be due to some chains finding (and getting stuck in) tough parts of the parameter space. It could be a sign your posterior is actually quite tough, it just takes many chains to diagnose that.

@jsocolar might have further suggestions.

1 Like

Sorry, there was a backslash missing in the formatting of my initial post.
My n_{eff} is 90% of N an not a total amount of 90. An n_{eff} that is at least 90% of all the sampling iterations seems to me like good efficiency.

1 Like

Sure, efficiencies of .90 are usually quite good. Now, everything else I said still applies: using multiple chains allows you to explore several regions of the ambient space. If some chains get stuck, this will manifest in high Rhats, usually. My advice is to pick a few parameters/quantities of interest and do trace plots to see if some chains are getting stuck at different configurations.

1 Like

Strongly agree with @maxbiostat . It’s important to remember that passing convergence diagnostics does not guarantee that you are computing an accurate posterior. That is to say, the diagnostics don’t always succeed in flagging problems, even when problems exist. One way to increase the sensitivity of diagnostics is to run more chains. On the other hand, it’s also worth noting that you can get false-positives if some chains get stuck in minor modes that make negligible contributions to the posterior. Thus, it’s really important to follow @maxbiostat 's advice to examine the parameters, including the target density lp__ to understand what is happening.

If you can post your model or describe it, then somebody might have intuition for what flavors of problem might be cropping up, and what parameters might be productive to explore.

2 Likes

Sorry for the late reply,
at the moment I am testing my stan model by generating artificial data that is made with the same likelihood function as in stan and find them with my stan model. So currently it is possible to determine if stan found the right values by comparing the generated parameters with the guessed.
The weird behavior I am facing at the moment is that by increasing the number of chains it can happen that my effective sample sizes drops dramatically and I don’t understand why this happens. As I understand it at the moment is that every chain works independently and contributes effective samples to the model based on autocorrelation of the parameters within a chain.
So if I have 4 chains (which all computing an accurate posterior) and get an n_eff value of 4000 and then I increase the amount of chains that run to 7 I suddenly get an n_eff value of 2 or 4.
This makes me wonder if there is some sort of cross validation between the chains that could ultimately cause this.

In case my model could help you can find a detailed description in this post here. The model is still pretty much the same except for applying the changes suggested in the post.
Thank you for your help!

This is a sign that some chains are exploring different parts of the ambient space and getting stuck. As I said before, produce traceplots of selected quantities (including __lp as @jsocolar suggests) and you’ll be able to tell whether this hypothesis of mine holds any water.

Yes, I produced a couple of trace plots for different configurations and your hypothesis was absolutely right. However, even though the trace plots unravel that some chains get stuck, it doesn’t really help me understand if there is any correlation between chains (e.g. how a chain chooses the starting point for exploring) and doesn’t lead me towards a working stan model.
The latter is incredibly frustrating because the stan model should be able to find the correct parameters with the likelihood function I define when the data the stan model gets has been generated with the exact same likelihood function
I attached one of the generated plots here where the most important parameter and lp__ has been plotted

You’re right, it doesn’t. But it was important to confirm this behaviour. Now, as far as I know the chains are initalised independently and at random; for each chain, each coordinate in unconstrained space is picked at random from [-2, 2]. @betanalpha, @nhuurre or @andrjohns can correct me if I got that wrong.

You see, it’s not quite that easy. Even if the likelihood is well-behaved and the priors are appropriate, there is no guarantee that the posterior will cover the true generating values with any degree of certainty. That’s not what Bayesian methods are designed to do. With a bit of luck and thoughtful priors, though, things concentrate to the right places. My concrete advice for now is that you think about the prior for tau. Can you make it tighter? What does it (the prior on tau) imply in terms of possible data sets [i.e., have you run prior predictive simulations to generate ‘fake’ data?]?

2 Likes

@maxbiostat is right about initialization up to a typo; chains are initialized as described but in unconstrained space.

As an aside, note that ESS estimators in the Stan ecosystem get penalized for nonconvergence, which is not done in all packages that compute ESS but is definitely a good way to think about ESS. This is why the ESS estimator shows a multi chain dependency.

2 Likes

Thank you for clarifying @maxbiostat and @jsocolar .

Can you make it tighter? What does it (the prior on tau) imply in terms of possible data sets [i.e., have you run prior predictive simulations to generate ‘fake’ data?]?

Yes, I have run prior predictive simulations to generate my fake data. Explicitly, for the parameter of tau this means that my prior for the prior predictive check is normal(50,2) and stan gets the same normal distribution as a prior. This is probably the tightest I have attempted and previous runs were made with higher values of standard deviation.

Just wanted to emphasize a few points that @maxbiostat and @jsocolar have made.

In Stan each Markov chain is initialized and effectively evolves independently* from all other chains.
This means that increasing the number of Markov chains doesn’t change the behavior of any previously generated Markov chains**. Consequently the behavior encountered here almost always indicates that the new Markov chains were able to explore parts of the model configuration space that the previous Markov chains had largely ignored. In this case the only way to move forwards is to investigate this new exploration, identify if it’s meaningful or pathological, and then followup based on whether or not you want to incorporate this exploration into your posterior (i.e. if it’s meaningful behavior) or exclude it (i.e if it’s pathological and inconsistent with domain expertise).

* Pedantry: Stan achieves this independence by partitioning the states of a pseudo-random number generator as discussed for example in Rumble in the Ensemble. Technically if one uses a lot of very long Markov chain then the pseudo-random number generate states might start to overlap and induce subtle correlations between the individual Markov chains but this is a very extreme circumstance.

** More Pedantry: If you fix the seed argument and run on the same machine twice, once with n_chains=4 and once with n_chains=8, then the the first four Markov chains will be identical. If you’re not fixing the seed argument then running with an increased n_chains will not preserve any of the previously generated Markov chains. In this case running again with a larger n_chains argument effectively generates an entirely new set of Markov chains which may or may not have an opportunity to explore behavior not encountered by any previous Markov chains.

Effective sample size is a quantity defined in the context of a Markov chain Monte Carlo central limit theorem. If such a central limit theorem doesn’t hold for an expectand then there is no corresponding effective sample size. In this case one can still construct effective sample size estimators, which is what Stan does, but those estimators don’t correspond to any meaningful property. The split \hat{R} diagnostic is useful precisely because it is sensitive to failures of a Markov chain Monte Carlo central limit theorem; if \hat{R} \gg 1 is inconsistent with a central limit theorem holding in which case the effective sample size estimators are meaningless.

In other words the key issues here isn’t the effective sample size estimator plummeting as more Markov chains are added but rather \hat{R} suddenly increasing from 1. Once you see that the effective sample size estimator isn’t worth considering. This is made all the most confusing by the fact that Stan’s effective sample size estimator incorporates a completely heuristic modification that pushes the estimator to zero if hat{R} is large. Again in this case it’s not that the effective sample size is small but rather meaningless.

Finally there is no general guarantee that a posterior distribution will concentrate around the true model configuration, even in simulation settings where one simulates data from the same model used to construct the posterior. Intuitively the problem is that the likelihood function concentrates around model configurations consistent with the observed data which is not always near the true model configuration. When the data fluctuate in conspiratorial ways the likelihood function can concentrate away from the true model configuration, which is exactly the source of over-fitting. For nice models, where these conspiratorial fluctuations are rare or the prior model is able to suppress their influence, the posterior distribution will typically concentrate around the true model configuration, but this is not a guarantee and has to be verified for each analysis.

The consequence for a simulation study like this is that comparing an approximate posterior quantification to the true model configuration results in ambiguity. It could be that the exact posterior concentrates around the true model configuration, in which case any deviations indicate computational problems, or it could be that the posterior approximation is accurate but the exact posterior doesn’t concentrate around the true model configuration!

Here Stan’s empirical diagnostics clearly indicate that each Markov chain is not recovering a consistent posterior approximation, which strongly suggests computational problems in particular multimodality. Why the posterior from the simulated data is exhibiting such multimodality depends on the particular details of your model and the simulated data you are using.