 # Non-converging chains for a set of parameters

I am trying to fit a set of parameters (\theta_1 and \theta_2) for subjects (s) over different time points (t) with a Gaussian dynamic process. My generative model looks like this:

\mathbf{x}_{t+1}^s \sim \mathcal{N}(\mathbf{A} \mathbf{x}_t^s + \mathbf{B} \mathbf{u}_t^s, \mathbf{Q}) \\
\boldsymbol{\theta}_t^s \sim \mathcal{N}(\boldsymbol{\mu} + \mathbf{C} \mathbf{x}_t^s, \mathbf{R}) \\
\mathbf{y}_t^s \sim \mathrm{Bernoulli}(\boldsymbol{\theta}_t^s)

where the initial state follows:

\mathbf{x}_1^s \sim \mathcal{N}(0,\mathbf{V})

y is the choice data for subject s at time t that is being fitted.
u is also observed data for subject s at time t.

A, B, C, Q, R, and V are parameters of the Gaussian dynamic process, and x is the latent state.
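
To make the process concrete (and to produce simulated data for parameter-recovery checks), here is a rough NumPy sketch of the generative model. All dimensions and parameter values are invented for illustration, and the Bernoulli step uses a logit link on the first component of \theta, which is an assumption on my part:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and parameter values (all made up)
S, T, D, K = 3, 50, 2, 2       # subjects, time points, state dim, theta dim
A = np.diag([0.9, 0.8])        # state transition (diagonal)
B = 0.1 * np.eye(D)            # input matrix
C = 0.5 * np.ones((K, D))      # loading from state to theta
mu = np.zeros(K)
Q, R, V = 0.1, 0.1, 1.0        # noise scales (scalars for simplicity)

x = np.zeros((S, T, D))
theta = np.zeros((S, T, K))
y = np.zeros((S, T), dtype=int)
u = rng.normal(size=(S, T, D))              # observed inputs

for s in range(S):
    x[s, 0] = rng.normal(0.0, np.sqrt(V), size=D)   # x_1^s ~ N(0, V)
    for t in range(T):
        if t > 0:
            mean = A @ x[s, t - 1] + B @ u[s, t - 1]
            x[s, t] = rng.normal(mean, np.sqrt(Q))  # x_{t+1}^s ~ N(Ax + Bu, Q)
        theta[s, t] = rng.normal(mu + C @ x[s, t], np.sqrt(R))
        # Bernoulli observation; a logit link on the first theta component
        # is assumed here purely for illustration
        p = 1.0 / (1.0 + np.exp(-theta[s, t, 0]))
        y[s, t] = rng.binomial(1, p)
```

Fitting the Stan model to data simulated this way (with known A, B, C) is one direct way to check whether those parameters are recoverable at all.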

I implement this process in Stan in the following way:

```stan
model {

  // ... (priors omitted, as in the original post) ...

  for (s in 1:S) {
    for (t in 1:T) {
      vector[2] params_pr;
      params_pr[1] = theta1[s, t];
      params_pr[2] = theta2[s, t];

      if (t == 1) {                                  // initial state
        to_vector(X[s, t, ]) ~ normal(0, sigma_v);
      } else {
        to_vector(X[s, t, ]) ~ normal(diag_matrix(A) * to_vector(X[s, t - 1, ])
                                      + B * to_vector(U[s, t - 1, ]), Q);
      }
      params_pr ~ normal(mu_d + C * to_vector(X[s, t, ]), sigma_r);

      y[s, t, ] ~ bernoulli_logit(theta1[s, t] * dat[s, t, ] / (1 + theta2[s, t]));
    }
  }
}
```


My overall chain diagnostics (bin/diagnose from CmdStan) are not bad: no transitions hitting max treedepth, almost no divergent transitions (0 to 3), and satisfactory E-BFMI and ESS.
This is my typical bin/diagnose output:

```
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory for all transitions.

Effective sample size satisfactory.
```


Parameters \theta_1 and \theta_2 are fitted just fine: stable chain traces (see examples below, upper two), R-hat of 1.00, and my posterior predictive checks look great.
My problem is that A, B, C, and X, on the other hand, are a complete disaster, as you can see in the example trace plots (bottom two), which is also reflected in huge R-hats.
I tried changing the number of draws, adapt_delta, and step_size, putting stricter priors on the parameters, and implementing a non-centered reparametrization. Nothing helped.

Can this be an identifiability issue, especially given that I am able to recover the original behavioral data? Or maybe I am missing something simple in how I specify my model in Stan? Any suggestions are welcome!

My system:
python 3.7
cmdstanpy 0.9.63
cmdstan 2.25.0
CentOS7
gcc 9.2.0

That’s where I’d put my money. Just a little bit of non-identifiability can make something super hard to sample.

I don’t know of any reliable way to diagnose this sort of thing, though. Pairs plots are the basic tool, but they are only really practical with a small number of parameters. You could compute the correlation matrix of your posterior draws and see if there are any patterns in there.

Also, maybe pick out a few parameters with particularly low ESS or high Rhat and make pairs plots or correlations only for those (to give yourself less stuff to look at).
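
That screening step might look something like this. The sketch below uses simulated draws standing in for the real posterior (with a correlated pair planted so something shows up); with CmdStanPy, `fit.draws()` flattened to an iterations-by-parameters array would supply the real input:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake posterior draws: 1000 iterations x 5 parameters, with parameters
# 1 and 3 strongly (negatively) correlated -- a planted non-identifiability
draws = rng.normal(size=(1000, 5))
draws[:, 3] = -draws[:, 1] + 0.05 * rng.normal(size=1000)

corr = np.corrcoef(draws, rowvar=False)

# Flag parameter pairs with |correlation| above a threshold
i, j = np.triu_indices_from(corr, k=1)
suspicious = [(a, b, r) for a, b, r in zip(i, j, corr[i, j]) if abs(r) > 0.9]
print(suspicious)  # the planted (1, 3) pair shows up with r near -1
```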

In a model like this with big matrices, though, I’d expect you could get a lot of weird correlations that won’t show up in two-parameter correlations or pairs plots. Can you try a staged approach: first set A, B, C, Q, and R to known values and see how that goes? If sampling is still hard, maybe u and x are unidentified; if it is easy, free up another parameter and see how that goes, etc.


Since your data is y and we know we are searching for unidentifiabilities, there’s probably also a way to do some thinking here to theorize where the non-identifiability is coming from.

The data is y, and that is determined from theta, and even though there’s a non-linear transform there that can confuse things, the two components of theta are \mu and Cx. So in what ways can \mu and Cx be the same thing? If there’s a glimmer of a possibility, the sampler will probably find it and be really slow.
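
One way this kind of trade-off bites: in \theta = \mu + Cx, shifting \mu and absorbing the opposite shift into the state leaves \theta, and hence the Bernoulli likelihood, unchanged. A toy check of that, in the scalar case with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)          # some latent states
mu, c = 0.3, 1.2                  # one parameterization
shift = 5.0

theta_a = mu + c * x
# Shift mu up and compensate by shifting the state down: x -> x - shift/c
theta_b = (mu + shift) + c * (x - shift / c)

# Identical thetas -> identical likelihoods -> a flat direction for the sampler
assert np.allclose(theta_a, theta_b)
```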


Thanks @bbbales2 ! It took me some time to try all of what you suggested, but I still can’t get convergence! I have no idea what is going on and how to solve it.

Unfortunately, if a model doesn’t work, there are only a few reliable options:

1. Simplify it
2. Use simulated data
3. Bring more domain specific knowledge

This might have some more ideas: Bayesian Workflow « Statistical Modeling, Causal Inference, and Social Science

It was a long weekend of trying many different things! I defined some of the equations to be deterministic, which resulted in the following trace plots. It looks like warmup is stable most of the time, but then the chains start to diverge in the last warmup phase, continuing into sampling. Any ideas why this could be, and how to solve it?

All warmup settings were set to their defaults.

Thanks!

Hmm that looks more like warmup is never actually finishing.

Like this is an early-stage warmup behavior I saw once: https://aws1.discourse-cdn.com/standard14/uploads/mc_stan/original/2X/1/12f5696de99fc37f010f252c838375d9ca6ca521.png

In that case it would take like 300-400 draws for things to get going.

Maybe run 5000 iterations of warmup and see if the behavior changes?

I’m not sure what to do, but maybe this would tell us if it’s just taking a long time for this parameter to get going.

It took a while, but in the previous case I had made a coding mistake. Now it seems like I am dealing with multimodality. I found old threads about this - have there been any improvements in Stan for dealing with it? Each chain’s diagnostics are great - no max treedepth hits, no divergences, good E-BFMI.

For multimodality, you either gotta constrain your problem so that only one mode is possible or live with it.

The second is not great cause then your chains never really mix, but in general it can be hard to come up with constraints to limit yourself to one mode.
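
Numerically, “never really mix” shows up in split R-hat even when every per-chain diagnostic is clean. A toy illustration with a hand-rolled R-hat for transparency (in practice bin/diagnose or ArviZ computes this for you):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
# Two chains, each mixing well *locally*, but stuck around different modes
chain1 = rng.normal(-1.0, 0.1, size=n)
chain2 = rng.normal(+1.0, 0.1, size=n)
chains = np.stack([chain1, chain2])        # (n_chains, n_draws)

W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
var_plus = (n - 1) / n * W + B / n
rhat = np.sqrt(var_plus / W)
print(rhat)  # far above the usual ~1.01 threshold
```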

Are these multiple modes mathematically equivalent? Or do they actually represent different solutions to the problem?

Yes… all chains recover the data well. I am now calculating LOO values for each chain to see if those are comparable as well. But I guess I don’t know what you mean by “mathematically equivalent”.

Mode swapping is the standard example.

Say I have a mixture of two normals, one with mean -1 and one with mean 1.

I generate data and do inference, and my chains might tell me the first mode has mean 1 or -1 (in which case the second has mean -1 or 1 respectively).

In this case the ordering doesn’t matter, and you can put a constraint on the means to make sure every chain gets the same solution.
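
In Stan that constraint is the `ordered` type on the vector of means; equivalently, for a symmetric case like this, you can relabel the draws after sampling. A sketch of the post-hoc relabeling on simulated draws (not from any real fit):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Chain 1 found (mu1, mu2) ~ (-1, 1); chain 2 found the swapped labels
chain1 = np.column_stack([rng.normal(-1, 0.05, n), rng.normal(1, 0.05, n)])
chain2 = np.column_stack([rng.normal(1, 0.05, n), rng.normal(-1, 0.05, n)])
draws = np.vstack([chain1, chain2])   # (draws, 2 component means)

# Enforce mu_1 < mu_2 within each draw, merging the two labelings
relabeled = np.sort(draws, axis=1)

print(relabeled.mean(axis=0))  # both chains now agree: roughly [-1, 1]
```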

You can also have problems where the multiple modes are not representing the same thing.

I see – how can I put such constraints in Stan? The puzzling part is that the theta parameters converge well (upper left), while the Gaussian system parameters (all the rest) do not mix. I would expect either all parameters to mix or none of them to.

The specific constraints you need will come from the specifics of your model and data (dunno what those are yet).

In the case of the means, ordered mu is sufficient. This is a write up on the problem in general: Identifying Bayesian Mixture Models

That might give you some ideas on where to look/what to try.


Thanks! I am looking into this.

Hmm, I don’t have an example off the top of my head, but in a very hand-wavy way, if I say it took me 11 hours to drive from NY to MN, then there are multiple roads I can take.

An actual statistical example would be better but that’s a handwavy answer for you :D.


@bbbales2 not so sure about that example :D

Yo speaking of maps, check this out: A map from Great Circle Mapper - Great Circle Mapper

The Chicago airport is basically on the geodesic between SFO and JFK.

Wooof, I thought 11 hours would be too much! Those lakes are so big!


Yes, I get this, thanks! The only thing I found about it suggests just picking one of the solutions (the “when ordering is not enough” section). Is that still best practice?

Under the assumption that there is no way to fix this (which is a big assumption – I don’t know what the problem is in terms of the application but maybe there is a fix) – anyway under that assumption (model fits, not all parameters mix), probably the biggest thing you lose is on the communication front.

So you have a model and you can talk about a posterior p(\theta | x) and expectations of things computed with that posterior, but since the chains aren’t mixing (even if each one looks okay individually) you kinda know you aren’t sampling the posterior and the expectations you compute might be off.

So any time you talk about posterior expectations, you sorta need to throw in that you had these mixing problems. It kinda makes it annoying to talk about a problem, cause if you didn’t have these mixing problems you could just say “this is my model and these are MCMC estimates I computed” and as long as everyone trusts Stan it’s fine.

Now you gotta say “this is my model and these are MCMC estimates I computed but also I’m ignoring the multimodality in C that pops up and I assume it’s not much of a problem.” Depending on the application, that may or may not be concerning. Without any other information, it’s more concerning than not though.
