Convergence of multiple chains

Hello everyone!

I am fitting a rather complex hierarchical model in Stan, and after a lot of trouble I was able to get some decent results. Still, there are some small issues I am trying to solve, and I would like to know if any of you have suggestions. The skeleton of the model is the following:

Y_{i,j} \sim \text{distribution centered at } A \,\text{diag}(W_{i,j})\, B^T \\
A_k \sim \text{some distribution} \\
B_k \sim \text{some distribution} \\
W_{i,j}[k] \sim \text{Gamma}(\nu_{k,j}, \beta_{k,j}) \\
\alpha \sim \text{(unimportant)} \\
m_{k,j} \sim \text{Gamma}(\hat\nu_m, \hat\beta_m) \\
\nu_{k,j} \sim \text{Gamma}(\hat\nu_\nu, \hat\beta_\nu),

where i is the group index and j is the within-group sample index. Global parameters such as \hat\nu_{\nu} are fixed. I include details only for W because I think it's the only part of the model left that could be improved.
The main issue is that different chains reach very slightly different log-posterior values; when I look at other important quantities of the model across chains, they are clearly comparable. Yet the log-posterior differs slightly. The image is below:

As you can see, the posterior values are very similar, yet they do not seem to “mix”, which is a little concerning to me. Again, if I look at the parameters I want to do inference on, the behavior is much better, like:

The relevant model diagnostics are all good, except that I hit the maximum treedepth a lot, with max_treedepth set to 11.

Any tips?

That’s a very big difference in the log posterior. The fact that two chains find that point suggests that they get stuck in that part of posterior space and cannot explore it effectively. For this to happen, the estimates for at least some parameters must also differ. Have you looked at all the parameters and been able to identify which ones differ between the chains?

I suspect the problem is similar to what is discussed here: Stan User’s Guide

Similar to what is described in the user guide for other distributions, you can try to reparametrize the gamma distributions by sampling from a gamma with a fixed rate and then rescaling (see here: Gamma distribution - Wikiwand).
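To make the idea concrete, here is a minimal NumPy sketch of the scaling identity behind this reparameterization: if W_raw ~ Gamma(nu, rate = 1), then W_raw / beta ~ Gamma(nu, rate = beta). The values of nu and beta below are hypothetical; in Stan the same trick would look like `w_raw ~ gamma(nu, 1); w = w_raw / beta;`.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, beta = 2.5, 4.0   # hypothetical shape and rate
n = 200_000

# Direct parameterization: Gamma(shape=nu, rate=beta).
# NumPy's gamma takes a scale parameter, so scale = 1 / rate.
w_direct = rng.gamma(nu, 1.0 / beta, size=n)

# Reparameterized: sample with the rate fixed to 1, then rescale.
w_raw = rng.gamma(nu, 1.0, size=n)
w_scaled = w_raw / beta

# Both should have mean nu/beta = 0.625 and variance nu/beta^2 = 0.15625.
print(w_direct.mean(), w_scaled.mean())
print(w_direct.var(), w_scaled.var())
```

The point of doing this in Stan is that the sampler then works with `w_raw`, whose scale does not depend on the (also sampled) rate parameter, which can reduce the posterior correlations that make exploration difficult.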



Thanks for the answer and for the reference. It's a very weird situation, in the sense that I do indeed detect different convergence points for some parameters of the distribution. These, though, are essentially nuisance parameters, while every quantity of real interest behaves consistently across chains.

I guess this is due to the fact that I am dealing with a pretty big model, where some portions of the landscape are very badly posed and poorly connected to the bulk of the distribution for some parameters.

I have tried reparametrizing the Gamma, and although I do not detect any pathological trace plots, I do get some “banana-shaped” pairs plots for some of the shape parameters, so I doubt that reparametrizing the rate of the Gamma will really help. I have also tried switching to a log-normal without success; moreover, the log-normal has some features that I really do not like for my modelling purposes.

Also, in my case every global scale parameter is fixed through a sort of moment-matching procedure, so I should not have any funnel-like behavior in this model. I think I need to smooth the landscape more, but I realize that is a very vague statement.

Any suggestions are very much appreciated!

Can you try multi-path Pathfinder (available in CmdStanR and CmdStanPy)? One of its use cases is finding better initial values and avoiding getting stuck in minor modes.
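For reference, a minimal CmdStanPy sketch of this workflow might look like the following; the file names `model.stan` and `data.json` are placeholders for your own files, and this obviously requires CmdStan to be installed.

```python
from cmdstanpy import CmdStanModel

# Placeholder file names -- substitute your own model and data.
model = CmdStanModel(stan_file="model.stan")

# Multi-path Pathfinder: runs several single-path Pathfinders and
# importance-resamples the combined draws.
pf = model.pathfinder(data="data.json", num_paths=8, seed=1234)

# Use distinct Pathfinder draws as initial values, one per chain.
fit = model.sample(
    data="data.json",
    chains=4,
    inits=pf.create_inits(chains=4),
    seed=1234,
)
```

Because `create_inits()` hands each chain a different resampled draw, the chains start from plausible but non-identical points, which is exactly what you want when checking mixing.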

Thanks for the suggestions!

Today I am travelling, but tomorrow I will try your suggestions! I have a way to give good initial values for a single chain, but I am not sure how to extend this to multiple chains, since it is a moment-matching procedure and hence outputs just one set of initial values.

Would starting all chains from there be sensible, or would it be unsound given that the initial values would likely be very close to each other?


Overdispersion is beneficial, but not required. It’s better to start with different initial values that are underdispersed than to start all chains from a single initial value. Convergence diagnostics will still work in this case unless you run very short chains.
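One simple way to turn a single moment-matched set of initial values into several distinct (if underdispersed) ones is to jitter it multiplicatively, which also keeps positivity-constrained parameters like the Gamma shapes and rates valid. The parameter names and sizes below are purely illustrative:

```python
import numpy as np

def jitter_inits(base_inits, n_chains=4, rel_sd=0.05, seed=0):
    """Build one init dict per chain by multiplicatively jittering a
    single set of positive initial values. rel_sd controls how spread
    out the inits are; all parameter names here are illustrative."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_chains):
        jittered = {
            name: np.asarray(value)
            * np.exp(rel_sd * rng.standard_normal(np.shape(value)))
            for name, value in base_inits.items()
        }
        out.append(jittered)
    return out

# Hypothetical moment-matched values for shape/rate parameters.
base = {"nu": np.full((3, 2), 2.0), "beta": np.full((3, 2), 4.0)}
inits = jitter_inits(base, n_chains=4)
```

The resulting list of dicts can be passed directly as the `inits` argument to `model.sample()` in CmdStanPy (or the analogous argument in CmdStanR).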