Multicore speedups are different between models

That’s only a couple percent variation, which is normal for any kind of speed comparison unless you take extreme measures to make sure no background processes ever run.

I'm starting to think that being memory-bandwidth bound is the real explanation. The 12.6 GB/s figure is for a single core, so 4 cores should have higher aggregate bandwidth. I haven't measured it, but it may be several times higher.
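Not a real benchmark, but here is a rough sketch of how one could sanity-check that from R on a Unix machine (mclapply forks; the bytes-per-element count assumes one read and one write of each double):

```r
# Rough single-core vs. 4-core bandwidth estimate (not a proper STREAM run).
library(parallel)
n <- 1e8                                       # ~0.8 GB of doubles
x <- numeric(n)
bw_gbs <- function() {
  t <- system.time(y <- x + 0)[["elapsed"]]    # forces a full read + write pass
  16 * n / t / 1e9                             # 8 bytes read + 8 written per element
}
bw_gbs()                                                        # one core
sum(unlist(mclapply(1:4, function(i) bw_gbs(), mc.cores = 4)))  # aggregate over 4 cores
```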

That’s right. The variation looks normal. There shouldn’t be any random scheduling problem.

And what if you start 4 processes of ad_advertisement with CmdStan? Does it differ from RStan?
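For instance, a minimal sketch of what that test could look like from R on a Unix machine ("./ad_advertisement" as the compiled model binary and "data.json" are placeholder names; mclapply forks one worker per chain):

```r
# Sketch: run 4 independent CmdStan processes and time the total wall clock,
# for comparison against rstan with chains = 4, cores = 4.
library(parallel)
t0 <- Sys.time()
mclapply(1:4, function(i) {
  system2("./ad_advertisement",
          args = c("sample",
                   paste0("id=", i),
                   "data", "file=data.json",
                   "output", paste0("file=output_", i, ".csv")))
}, mc.cores = 4)
Sys.time() - t0
```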

I have a 64-core server, so I ran a fit with 60 chains. One chain behaved very badly (it was off by 2 orders of magnitude, which may be a hint that we should have been fitting a log-scaled parameter), and my Rhat values were pretty bad. However, the other 59 chains converged.
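For context, a sketch of how the offending chain can be spotted, assuming fit is the stanfit object and "sigma" stands in for the misbehaving parameter:

```r
library(rstan)
samp <- as.array(fit)                              # iterations x chains x parameters
chain_means <- apply(samp[, , "sigma"], 2, mean)   # per-chain posterior mean
round(log10(abs(chain_means)), 1)                  # the bad chain stands out by ~2
print(fit, pars = "sigma")                         # summary includes Rhat and n_eff
```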

When I compared the model predictions to the data, they actually did a pretty good job.

So my question is: is there ever a case where it is OK to ignore one bad chain? Or is the best practice to fix the problem (e.g., run a longer warmup or rescale parameters) and re-fit?

In general, fix the problem. The situations where the one "bad" chain indicates a problematic part of the parameter space that the "good" chains never encountered far outnumber the situations where the one bad chain can be safely ignored. I would think about the scaling and also consider specifying a smaller value of init_r.
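For example, something like this in RStan (a sketch; "model.stan" and stan_data are placeholders):

```r
library(rstan)
fit <- stan("model.stan", data = stan_data,
            chains = 60, cores = 60,
            init_r = 0.5)  # draw initial values from (-0.5, 0.5) on the
                           # unconstrained scale instead of the default (-2, 2)
```

Narrower random inits keep all chains starting closer to the typical set, which makes it less likely that one chain gets stuck in a pathological corner of the parameter space during warmup.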