Some chains do not stop within a reasonable time

Symptom:
One or more chains do not stop within a reasonable time (more than 10x the runtime of the fastest chain).

Platform: brms on rocker-based Docker containers (earthlab/r-greta:latest, methodsconsultants/tidyverse-h2o:latest)

I have tried different models, both with and without future set to TRUE (to see whether that made a difference), and I keep getting chains that ‘hang’ with no visible progress but full CPU usage.
Terminating the process results in the loss of information from the other chains.
Both adapt_delta and max_treedepth have already been increased, as suggested by warnings from prior test runs.


The image shows the latest attempt with the following code.

chains = 8
model5 <- brm(
  formula = y ~ -1 + antal_0 + antal_1 + antal_2 + antal_3 + antal_4 + antal_5
    + antal_6 + antal_7 + antal_8 + antal_9 + antal_10 + antal_11 + antal_12 + antal_13
    + antal_14 + antal_15 + antal_16 + (1 | omNavn) + (1 | monthNr),
  data = brmData,
  chains = chains, iter = 50000, cores = chains,
  control = list(max_treedepth = 20, adapt_delta = 0.9999),
  inits = initfun, prior = set_prior("exponential(0.1)")
)

Number of rows: 84; omNavn has 7 levels; monthNr has 12 levels.

Selected output:
Fastest:
Chain 6: Elapsed Time: 612.074 seconds (Warm-up)
Chain 6: 35.4076 seconds (Sampling)
Chain 6: 647.482 seconds (Total)
Slowest (but finished):
Chain 5: Elapsed Time: 1687.63 seconds (Warm-up)
Chain 5: 196.166 seconds (Sampling)
Chain 5: 1883.79 seconds (Total)
Non completed chains (last message):
Chain 1: Iteration: 20000 / 50000 [ 40%] (Warmup)
Chain 3: Iteration: 5000 / 50000 [ 10%] (Warmup)
Chain 4: Iteration: 35000 / 50000 [ 70%] (Sampling)

I hope you can help me clarify where to search for the cause, and whether there are ways to end a chain without losing information from the completed chains (or whether there is a model problem, if the slower chains are exploring hard-to-diagnose regions of the distribution).
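One possible workaround (a sketch only, not tested on this model) is to launch each chain as its own single-chain brm() call and merge the finished fits with brms::combine_models(), so a hung chain can simply be abandoned. The shortened formula below is illustrative; the real one is in the post above:

```r
# Sketch only: run each chain as a separate single-chain fit so a hung chain
# can be abandoned without losing the others. `brmData` and the control
# settings are taken from the post above; the formula is shortened here.
library(brms)

fits <- lapply(1:8, function(i) {
  brm(y ~ -1 + antal_0 + (1 | omNavn) + (1 | monthNr),
      data = brmData, chains = 1, iter = 50000, seed = i,
      control = list(max_treedepth = 20, adapt_delta = 0.9999))
})

# Drop any fits that did not finish, then merge the rest:
model5 <- combine_models(mlist = fits, check_data = TRUE)
```

With parallel backends (e.g. future), the individual jobs could be killed independently without touching the completed ones.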

I have seen the problem with this model, but also with other types of models; searching the net does not give any real idea as to why some chains ‘die’ on me.

Kind regards

@paul.buerkner Any interest in trying to get the killable-parallelism employed by ezStan into brms for situations like this?

Haven’t looked at this approach yet, but will definitely do so. @bgoodri and @jonah, any thoughts on this?

I haven’t added it to ezStan yet, but it would be easy to set up a time-out feature, or even an adaptive time-out (if the final chain is taking X times longer than its siblings, kill it). I recall seeing at least one other user hoping for a feature like that; as I remember, they were also running in a server environment, fitting daily models where one chain occasionally got hung up. Obviously it would also be important to help users dive into what the stalled chain was doing, to identify model misspecifications, but that should be straightforward when all the chains write their output to file, as is done in ezStan.
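The adaptive rule itself could be as simple as the following base-R sketch (decision logic only, with hypothetical names; no process management shown, and this is not ezStan's actual implementation):

```r
# Hypothetical sketch of the adaptive time-out rule: once at least one chain
# has finished, flag any still-running chain whose elapsed time exceeds
# `factor` times the fastest finished chain's runtime.
flag_stragglers <- function(elapsed, finished, factor = 10) {
  stopifnot(any(finished))                  # need a reference runtime first
  cutoff <- factor * min(elapsed[finished])
  !finished & elapsed > cutoff              # TRUE = candidate for killing
}

# Using runtimes like those in the post above (two chains still stuck):
flag_stragglers(elapsed  = c(647, 1884, 7200, 7200),
                finished = c(TRUE, TRUE, FALSE, FALSE))
# -> FALSE FALSE TRUE TRUE
```

The killing itself would then be done by whatever process layer launched the chains (ezStan uses background R processes writing to file).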


@mike-lawrence: Thanks for your suggestion; I will try ezStan to see if it helps with diagnosing the models.
My first attempts ran into some difficulty, but that might be because I am testing on a Windows platform:
Warning messages:
1: In file.remove("stan_temp") : cannot remove file 'stan_temp', reason 'Permission denied'
2: In dir.create("stan_temp") : 'stan_temp' already exists
(Manually deleting the directory solves this, but clean_stan() runs into the same problem.)

And using a model built with brms gives several problems:
bfmAgg.stan (1.8 KB)
library(loggr)
library(ezStan)
my_mod = build_stan("bfmAgg.stan")
start_stan(brmData, my_mod, iter = 40000, chains = 4, max_treedepth = 20, adapt_delta = 0.9999)

The data is not in the right format (my error; I need to look at the brms code to make the right transformations).
Where is the number of chains read from? The output shows 16 chains, although the brms model file was created with 32 chains and I tried to specify 4. (Output is only shown for chain 16; it is the same across chains.)
[chain16:] Registered S3 methods overwritten by ‘ggplot2’:
[chain16:] method from
[chain16:] [.quosures rlang
[chain16:] c.quosures rlang
[chain16:] print.quosures rlang
[chain16:] data with name omNavn is not numeric and not used
[chain16:] Unable to convert event to a log event.
[chain16:] failed to create the sampler; sampling not done

Hi Thorvall,

I noticed similar issues with brms chains, and by trying to observe in which situations this happens I came to the conclusion that it most likely has something to do with the “divergent transitions after warmup” issue, which is explained here: https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup. In short (as I understand it), during warm-up the sampler adapts its step size, but if the parameterization of the model is ‘not nice’ (i.e. the distribution of the estimated parameters is geometrically hard to explore), the sampler tends to take steps that are too large and does not converge. One proposed remedy is to increase adapt_delta (the target acceptance rate, 0.80 by default and 0.9999 in your model), which forces a smaller step size, and to raise max_treedepth (also very large in your model). But these two settings make the estimation of your model incredibly slow, and I would assume that the necessity to push them to such extremes rather points towards a parameterization issue, which cannot be solved by changing max_treedepth or adapt_delta. In other words, the model still runs into sampling problems for some chains (not finding ‘solutions’ during warm-up), but does so incredibly slowly on top, because of the settings.
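If you want to check whether divergences really are the issue, a quick check on a fitted brms model is possible (a sketch assuming an existing brmsfit object called fit):

```r
# Sketch: count post-warmup divergent transitions in a fitted brms model
# (`fit` is assumed to be an existing brmsfit object).
library(brms)
np <- nuts_params(fit)
sum(subset(np, Parameter == "divergent__")$Value)
# A non-zero count points to the geometry problems described above.
```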

But this is just a guess. :))
Maybe re-parameterizing the model is worth a try,
e.g. with or without an intercept, or using contrast coding of the variables instead. In my experience this sometimes already does the trick (no slow chains anymore).
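For instance, a hedged sketch of what such a reparameterization could look like (the factor antal is hypothetical and would have to be constructed from the antal_0 … antal_16 columns; whether this matches the substantive question is of course for you to judge):

```r
# Sketch only: an intercept plus sum-to-zero contrasts on a single factor,
# instead of 17 separate dummy terms, with control settings back near their
# defaults. `antal` is a hypothetical factor built from antal_0 ... antal_16.
library(brms)
contrasts(brmData$antal) <- contr.sum(nlevels(brmData$antal))

model_alt <- brm(
  y ~ 1 + antal + (1 | omNavn) + (1 | monthNr),
  data = brmData,
  chains = 4, iter = 4000,
  control = list(adapt_delta = 0.95)  # only raise further if warnings persist
)
```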

Hope this helps.
Best, René


Ah, shoot, sorry, I didn’t intend to imply that you should/could use ezStan to investigate your situation, as it for sure wasn’t designed to work with brms models. I was more taking the opportunity to query whether brms might use some features from ezStan in the future. I should have made a new thread and simply linked back to this one.