Some chains do not stop within a reasonable time

Symptom:
One or more chains do not stop within a reasonable time (more than 10x the runtime of the fastest chain).

Platform: brms on rocker-based Docker containers (earthlab/r-greta:latest, methodsconsultants/tidyverse-h2o:latest)

I have tried different models, both with and without future set to TRUE (to see whether that made a difference), and I keep getting chains that ‘hang’ with no visible progress but full CPU usage.
Terminating the process results in the loss of information from the other chains.
Both adapt_delta and max_treedepth have already been increased, as suggested by warnings from prior test runs.


The image shows the latest attempt with the following code.

chains = 8
model5 <- brm(
  formula = y ~ -1 + antal_0 + antal_1 + antal_2 + antal_3 + antal_4 + antal_5
    + antal_6 + antal_7 + antal_8 + antal_9 + antal_10 + antal_11 + antal_12 + antal_13
    + antal_14 + antal_15 + antal_16 + (1 | omNavn) + (1 | monthNr),
  data = brmData,
  chains = chains, iter = 50000, cores = chains,
  control = list(max_treedepth = 20, adapt_delta = 0.9999),
  inits = initfun, prior = set_prior("exponential(0.1)")
)

Number of rows: 84; omNavn has 7 levels; monthNr has 12 levels.

Selected output:
Fastest:
Chain 6: Elapsed Time: 612.074 seconds (Warm-up)
Chain 6: 35.4076 seconds (Sampling)
Chain 6: 647.482 seconds (Total)
Slowest (but finished):
Chain 5: Elapsed Time: 1687.63 seconds (Warm-up)
Chain 5: 196.166 seconds (Sampling)
Chain 5: 1883.79 seconds (Total)
Non completed chains (last message):
Chain 1: Iteration: 20000 / 50000 [ 40%] (Warmup)
Chain 3: Iteration: 5000 / 50000 [ 10%] (Warmup)
Chain 4: Iteration: 35000 / 50000 [ 70%] (Sampling)

I hope you can help me clarify where to search for the cause, and whether there are ways to end a chain without losing information from the completed chains (or whether there is a model problem, if the slower chains are exploring hard-to-diagnose regions of the distribution).
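One possible workaround (a sketch only, not tested on this model) is to launch each chain as its own single-chain brm() call and merge the finished fits with brms::combine_models(), so a hung chain can simply be abandoned. The shortened formula below is illustrative; the real one is in the post above:

```r
# Sketch only: run each chain as a separate single-chain fit so a hung chain
# can be abandoned without losing the others. `brmData` and the control
# settings are taken from the post above; the formula is shortened here.
library(brms)

fits <- lapply(1:8, function(i) {
  brm(y ~ -1 + antal_0 + (1 | omNavn) + (1 | monthNr),
      data = brmData, chains = 1, iter = 50000, seed = i,
      control = list(max_treedepth = 20, adapt_delta = 0.9999))
})

# Drop any fits that did not finish, then merge the rest:
model5 <- combine_models(mlist = fits, check_data = TRUE)
```

With parallel backends (e.g. future), the individual jobs could be killed independently without touching the completed ones.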

I have seen the problem with this model, but also with other types of models; searching the net does not give any real idea as to why some chains ‘die’ on me.

Kind regards

@paul.buerkner Any interest in trying to get the killable-parallelism employed by ezStan into brms for situations like this?

Haven’t looked at this approach yet, but will definitely do so. @bgoodri and @jonah, any thoughts on this?

I haven’t added it to ezStan yet, but it would be easy to set up a time-out feature, or even an adaptive time-out (if the final chain is taking X times longer than its siblings, kill it). I recall seeing at least one other user hoping for a feature like that; as I remember, they were also running in a server environment, fitting daily models where one chain occasionally got hung up. Obviously it would also be important to help users dive into what the stalled chain was doing, to identify model misspecifications, but that should be straightforward when all the chains write their output to file, as is done in ezStan.
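The adaptive rule itself could be as simple as the following base-R sketch (decision logic only, with hypothetical names; no process management shown, and this is not ezStan's actual implementation):

```r
# Hypothetical sketch of the adaptive time-out rule: once at least one chain
# has finished, flag any still-running chain whose elapsed time exceeds
# `factor` times the fastest finished chain's runtime.
flag_stragglers <- function(elapsed, finished, factor = 10) {
  stopifnot(any(finished))                  # need a reference runtime first
  cutoff <- factor * min(elapsed[finished])
  !finished & elapsed > cutoff              # TRUE = candidate for killing
}

# Using runtimes like those in the post above (two chains still stuck):
flag_stragglers(elapsed  = c(647, 1884, 7200, 7200),
                finished = c(TRUE, TRUE, FALSE, FALSE))
# -> FALSE FALSE TRUE TRUE
```

The killing itself would then be done by whatever process layer launched the chains (ezStan uses background R processes writing to file).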


@mike-lawrence: Thanks for your suggestion; I will try ezStan to see if it helps with diagnosing the models.
My first attempts ran into some difficulty, but that might be because I am testing on a Windows platform:
Warning messages:
1: In file.remove("stan_temp") : cannot remove file 'stan_temp', reason 'Permission denied'
2: In dir.create("stan_temp") : 'stan_temp' already exists
(Manually deleting the directory solves this, but clean_stan() runs into the same problem.)

And using a model built with brms gives several problems:
bfmAgg.stan (1.8 KB)
library(loggr)
library(ezStan)
my_mod = build_stan("bfmAgg.stan")
start_stan(brmData, my_mod, iter = 40000, chains = 4, max_treedepth = 20, adapt_delta = 0.9999)

The data is not in the right format (my error; I need to look at the brms code to make the right transformations).
Where is the number of chains read from? The output shows 16 chains, although the brms model file was created with 32 chains and I tried to specify 4. (Output is only shown for chain 16; it is the same across chains.)
[chain16:] Registered S3 methods overwritten by ‘ggplot2’:
[chain16:] method from
[chain16:] [.quosures rlang
[chain16:] c.quosures rlang
[chain16:] print.quosures rlang
[chain16:] data with name omNavn is not numeric and not used
[chain16:] Unable to convert event to a log event.
[chain16:] failed to create the sampler; sampling not done

Hi Thorvall,

I noticed similar issues with brms chains, and by trying to observe in which situations this happens I came to the conclusion that it most likely has something to do with the “divergent transitions after warmup” issue, which is explained here: https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup. In short (as I understand it), during warm-up the sampler adapts its step size, but if the parameterization of the model is ‘not nice’ (i.e. the distribution of the estimated parameters is geometrically hard to explore), the sampler tends to take steps that are too large and does not converge. One proposed remedy is to increase adapt_delta (the target acceptance rate, 0.80 by default and 0.9999 in your model), which forces a smaller step size, and to raise max_treedepth (also very large in your model). But these two settings make the estimation of your model incredibly slow, and I would assume that the necessity to push them to such extremes rather points towards a parameterization issue, which cannot be solved by changing max_treedepth or adapt_delta. In other words, the model still runs into sampling problems for some chains (not finding ‘solutions’ during warm-up), but does so incredibly slowly on top, because of the settings.
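If you want to check whether divergences really are the issue, a quick check on a fitted brms model is possible (a sketch assuming an existing brmsfit object called fit):

```r
# Sketch: count post-warmup divergent transitions in a fitted brms model
# (`fit` is assumed to be an existing brmsfit object).
library(brms)
np <- nuts_params(fit)
sum(subset(np, Parameter == "divergent__")$Value)
# A non-zero count points to the geometry problems described above.
```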

But this is just a guess. :))
Maybe re-parameterizing the model is worth a try,
e.g. with or without an intercept, or using contrast coding of the variables instead. In my experience this sometimes already does the trick (no slow chains anymore).
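For instance, a hedged sketch of what such a reparameterization could look like (the factor antal is hypothetical and would have to be constructed from the antal_0 … antal_16 columns; whether this matches the substantive question is of course for you to judge):

```r
# Sketch only: an intercept plus sum-to-zero contrasts on a single factor,
# instead of 17 separate dummy terms, with control settings back near their
# defaults. `antal` is a hypothetical factor built from antal_0 ... antal_16.
library(brms)
contrasts(brmData$antal) <- contr.sum(nlevels(brmData$antal))

model_alt <- brm(
  y ~ 1 + antal + (1 | omNavn) + (1 | monthNr),
  data = brmData,
  chains = 4, iter = 4000,
  control = list(adapt_delta = 0.95)  # only raise further if warnings persist
)
```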

Hope this helps.
Best, René


Ah, shoot, sorry, I didn’t intend to imply that you should/could use ezStan to investigate your situation, as it for sure wasn’t designed to work with brms models. I was more taking the opportunity to query whether brms might use some features from ezStan in the future. I should have made a new thread and simply linked back to this one.