Do these plots look concerning?

I have run some MCMC calibration of a multi-level 2PL IRT model. My calibration converged but I got the following plots which I find kind of concerning, but not sure how to explain them. Any help is much appreciated.
alpha[1] and alpha[8] 10_10_0.6_TwoPL_Alpha_Traceplot_1.pdf (26.3 KB)

mu[4:8]: 10_10_0.6_TwoPL_Mu_Traceplot_1.pdf (26.2 KB)

Auto-correlation plot after thinning every 10th: 10_10_0.6_TwoPL_Sigma_ACplot_1.pdf (7.3 KB)

These results are based on 2000 trains with 1000 burn-ins. Should I extend the trains longer say 5000 and 2500 burn-ins?


Looks fine at first sight. You might want to plot the post-warmup iterations only to see the hairy caterpillar. ;)

What about Rhats, energy? Any divergences? Neff reasonably high?

Edit: oh, and to be on the safe side, running more chains is recommended.

What makes you think that the traceplots look concerning?

1 Like

Thank you Max for the feedback. I am concerned because I see there is still some autocorrelation in the lags even though I have thinned at every 10th iteration. The hairy caterpillar is concerning too because the two chains does not seem to be that stationary (still going up and down quite a bit). Am I overconcerned? The other diagnostics looks OK to me. Could you help me take a quick look? Please find attached. 10_10_0.6_TwoPL_Diagnostics_1.txt (3.8 KB) The rhat at each individual parameter estimate looks good. Is there any way to get an overall estimate of the rhat across all the chains we run? I am not sure how big the effect sample size is regarded as big enough. Please see here (partial results): 10_10_0.6_TwoPL_Par_Ests_1.pdf (63.8 KB) I ran 2 chains of 1,000 post-warmup iterations with sample size of 2,000. But in the effect my effective sample size is around 200 for each parameter. Is that enough to get stable estimates? The MC error and the standard deviation of all the estimates looks decent. Thanks.

That looks all good. The Neff is pretty low, though…

But first: Rhats look fine. There is no overall measure for it afaik. Just make sure that all Rhat < 1.1. This indicates that the chains converged - it’s a necessary, but not sufficient condition!

The movement in the plot indicates autocorrelation, and the relatively low Neff is also indicative of that. The autocorrelation plots on the other hand do not look that bad. And while the Neff is not great, it’s not super terrible - but this really depends on your application! If you want to compute extreme quantiles of the posterior, then Neff of around 200 is certainly too low. If your interested in posterior means/medians, you should be fine.

You can run longer chains (and more!!) and just crank up the Neff in a “brute force” way to the level you need for you analysis. I’d only use thinning if the number of posterior draws gets too big to handle conveniently - otherwise you’re throwing away information.

The more elegant way is to parameterize your model in a way that you don’t have these autocorrelations in the first place. I’m afraid I don’t know enough about these kind of models to be of help here. :(

1 Like

what he said - thinning is not recommended - We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling | Statistical Modeling, Causal Inference, and Social Science

visual inspection of the traceplots is not a great way to check convergence. if you’re in the R universe, consider using the bayesplot package -

cmdstan’s diagnose utility is also useful, although it’s rather annoying in that if your model has converged and all is well, it returns silently. (diagnose give summary at end; if no problems found, say so · Issue #659 · stan-dev/cmdstan · GitHub)


Thank you for the quick feedback, Max. This is super reassuring!!! I will try to run a long chain to increase my Neff, which hopefully consequently get rid of the autocorrelation. Why HMC sampler does not recommend thinning? I do not quite get it. As I did notice the difference between thinning and not thinning with the same parameter setting. There are some very high autocorrelation without thinning, while the AC looks much more acceptable after thinning. Maybe it is because my chain is not long enough. BTW, how many chains do you guys often run? How many iterations do you use in your field? Any knowledge about the effective sample size in education/sociology/psychology? Thank you very much.

I think thinning or not is a dilemma. Any way, maybe HMC is very efficient so that it actually self correct autocorrelation? Thank you for the extra resource. I found them all useful. Another neat tool is shinystan for visualize model diagnostics and estimates. Thank you very much. :-)

4 chains is the default in rstan and what I normally use. Most of the time the number of CPU cores is the most reasonable pick, for obvious reasons, but in any case, I think two is not enough. You also want more chains to check whether they really converged: each chain starts at different initial values and if they all converge to the same parameter space, you can be a bit more confident about overall convergence.

Quick question: Are you using HMC or NUTS? (The default in rstan is NUTS.) Asking, because you really should use NUTS if you are not already.

…I’ll continue my answer when I get home. ;)

1 Like

Thanks Max. I think I am using the default setting if I did not change. So I think it is NUTS. I am running 2 chains to start with because it takers forever to run 2 chains of 1000 post warmup iterations. Is there anyway to speed up the estimation? I tried desktop computer as well, only somewhat faster than my laptop. It is a simulation study. How many repetitions will you use if you run simulation study. Thanks. :-)

That’s good. If your laptop (or desktop) has more than 2 physical cores, then running the model will probably not take that much longer, though. Did you specify thin = 10, or did I get that wrong? Did you get warnings when not using thinning?

HMC has low autocorrelation, because it explores the parameter space efficiently. NUTS even more so. Check out this great lecture by Richard McElreath for intuition for why NUTS is working so well. This presentation by Michael Betancourt is pretty awesome as well (check after ca. 30 min. especially). HOWEVER, it’s possible (loosely speaking) that the way you implemented the model leads to a posterior geometry that induces autocorrelation - it is hard for the chains to reach new/different regions of the posterior fast.

If you are at liberty to disclose your model code, I would suggest to start a new thread posting you code with the request to help to make the model more efficient.


My computer/laptop both have 4 cores. So running 4 chains work just fine. But not sure why it takes forever to finish any chain over 1000 iterations in less than 5 hours? I did specify thin = 10 in rstan model specification as follows:

fit1 <- stan(
file = paste0(path,“TwoPL.stan”), # Stan program
data = data, # named list of data
chains = 2, # number of Markov chains
warmup = 1000, # number of warmup iterations per chain
iter = 2000, # total number of iterations per chain
thin = 10,
cores = 4, # number of cores (using 2 just for the vignette)
refresh = 1, # show progress every ‘refresh’ iterations
control = list(max_treedepth = 20,adapt_delta=0.99)

As my model is hierarchical, maybe I really should think reparameterizing my model to make it more efficient?

Thank you for the two presentations as well. :-)

is max_treedepth=20 necessary?
use the diagnostics on the treedepth to set a proper limit.

I don’t thing you get any value add from having more cores than chains - maybe in RStan there’s some magic? @jonah or @bgoodri would know.

are you doing the non-centered parameterization?


I tried 10 max_treedepth. It did not converge, but I am not sure if it is completely because of that parameter or something else.

I specified 4 cores only because I ran 4 chains before. Did not bother to change to 2 cores when I ran 2 chains.

I am trying non-centering today to see if it will speed up my estimation.

Thank you very much. :-)

I assume speeding up the model and getting convergence is pretty much the same problem here (this more recent talk by @betanalpha also gives good intuition for that) and reparameterizing is most probably the solution (to both).

1 Like

I did reparameterize today. It converged and ran pretty fast. Thank you very much.

1 Like

Great! :)

1 Like