Different execution times between chains

Hello everyone,

A few weeks ago I started learning Stan through rstan. I work with ordinal variables and therefore with ordinal regression models. I am still getting familiar with the language and the output of this software.

I have a question about the execution time of each chain in rstan. I ran a very simple ordinal model, like the one in the Stan User’s Guide. I run 5 chains in parallel, each with 1500 iterations, of which 500 are warmup (burn-in), with thinning of 5. I use 5 of the 8 cores available on my laptop (checked with htop in a Linux terminal). That leaves 200 draws per chain, that is, 1000 draws in total for each parameter of interest.
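As a quick sanity check of the numbers above, here is a minimal shell sketch of the arithmetic. The function name `retained_per_chain` is made up for illustration, and it assumes the post-warmup iterations divide evenly by the thinning interval, as they do here.

```shell
#!/bin/sh
# Retained draws per chain = (iter - warmup) / thin.
# Assumption: (iter - warmup) is an exact multiple of thin.
retained_per_chain() {
    # $1 = iter, $2 = warmup, $3 = thin
    echo $(( ($1 - $2) / $3 ))
}

chains=5
per_chain=$(retained_per_chain 1500 500 5)
total=$(( chains * per_chain ))
echo "per chain: $per_chain, total: $total"   # prints: per chain: 200, total: 1000
```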

My question is about the different run times of the chains. When I run 5 chains, 3 of them usually finish at about the same time, and after a while the remaining 2 finish, also at about the same time. This seems strange to me: with WinBUGS (the only Bayesian inference software I had used before), all chains usually finish at approximately the same time. In Stan, when the first 3 are done, the remaining 2 chains still have quite a few iterations left, as the output of rstan::stan() shows. However, if I run only 3 chains on 3 cores, they all finish together. The problem appears when I ask for 5 chains on 5 cores.

I left the other arguments of rstan::stan() at their defaults and used the No-U-Turn Sampler (NUTS). I should say that convergence is very good for any number of chains, both in effective sample size (there does not seem to be any autocorrelation problem) and in Rhat (I allow enough warmup). In fact, the traceplots suggest the warmup could be much shorter, but I wanted to be safe with the model diagnostics. I mention this because many of the recommendations I have seen are aimed at diagnostic problems or sampling difficulties, and that does not seem to be my case.

Is this common in Stan? Do I need to change any of the rstan::stan() arguments that I left at their defaults? I am a bit confused. Thank you in advance for your help.

Does your machine have 8 physical cores, or 4 cores with hyperthreading? This can be checked with lscpu in the terminal. For example, on my machine it says:
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1

This means I only have 4 “true” cores, but htop will show 8. You generally can’t expect perfect scaling per logical thread, only per core.
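If you want that number directly, a small sketch like the following multiplies the two relevant lscpu fields. The function name `physical_cores` is just illustrative, and it assumes English field labels (run `LC_ALL=C lscpu` if your locale translates them). The sample text is the output quoted above, not a new measurement.

```shell
#!/bin/sh
# Physical cores = "Core(s) per socket" * "Socket(s)",
# read from lscpu-style "Key: value" text on stdin.
physical_cores() {
    awk -F: '
        /Core\(s\) per socket/ { cores = $2 + 0 }
        /Socket\(s\)/          { sockets = $2 + 0 }
        END { print cores * sockets }
    '
}

# Sample input taken from the lscpu output quoted in this thread.
sample='CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1'
printf '%s\n' "$sample" | physical_cores   # prints 4

# On a Linux machine with util-linux installed:
#   LC_ALL=C lscpu | physical_cores
```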

My machine reports exactly the same as yours. I understand that my computer has 4 “true” cores with 8 hardware threads, so it can run up to 8 processes at the same time. Is that right? If so, there should be no problem, because when I run x chains, htop shows x threads at 100% CPU. I would expect the 5 chains to finish at the same time because they run as parallel processes, or am I wrong?

The way hyperthreading/SMT generally works is that the resources of a core that are not currently used by process 1 can be used by process 2. This makes it look like you can run twice the number of processes. In this context, however, I would expect those 5 processes to require essentially the same resources, so the fifth process doesn’t have anything to work with, as everything it needs is already in use.
If one of the processes were waiting for I/O, for example, then all the compute units on that core could be used by the other process.
So I fear that your (mis)understanding of HT/SMT is based on marketing rather than on how it actually works.

The fact that 3 of the chains finish at the same time and 2 later probably stems from the fact that the two slower ones share a core while the rest each have a core for themselves.
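To see which logical CPUs are siblings on the same physical core, one option is to group the output of `lscpu -p=CPU,CORE`. The sketch below does that grouping; the function name `siblings_by_core` is hypothetical, and the sample mapping is a typical 4-core/8-thread layout assumed for illustration, not read from anyone's machine in this thread.

```shell
#!/bin/sh
# Group logical CPUs by physical core.
# Input: "cpu,core" lines as printed by `lscpu -p=CPU,CORE`
# (lines starting with # are header comments and are skipped).
siblings_by_core() {
    awk -F, '
        /^#/   { next }
        NF < 2 { next }
        { cpus[$2] = (cpus[$2] == "" ? $1 : cpus[$2] " " $1) }
        END { for (core in cpus) print "core " core ": cpus " cpus[core] }
    ' | sort
}

# Assumed sample mapping: logical CPUs 0-3 and 4-7 pair up on cores 0-3.
printf '0,0\n1,1\n2,2\n3,3\n4,0\n5,1\n6,2\n7,3\n' | siblings_by_core
# core 0: cpus 0 4
# core 1: cpus 1 5
# core 2: cpus 2 6
# core 3: cpus 3 7

# On a Linux machine:
#   lscpu -p=CPU,CORE | siblings_by_core
```

If the two slow chains show up in htop on two CPUs listed for the same core, that would be consistent with the explanation above. Keep in mind that the scheduler can move processes between CPUs during a run, so this is only a rough check.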

I think I understand what you’re telling me. I’m going to keep running tests, as I think that may be the reason. My laptop has 4 “true” cores, but my project team has a compute server that supports 96 processes. I expect that there the 5 chains would finish at the same time without any problem. I will report back when I can run those chains.