Error when using brms with reduce_sum

I am trying to use the new reduce_sum functionality in brms (which is a huge, appreciated upgrade!), but am getting a non-intuitive (to me) error message.

The model I am trying to fit is complex and has taken weeks to run using brms in the past. A few weeks ago, when I learned about the new reduce_sum functionality, I amended the model to leverage it and started it again. The model finished running at the 3-week mark, but did not return the expected brms object; rather, I received this disappointing error message:

The model has successfully run to completion with brms before using across-chain parallelization (but not within-chain parallelization), and I can run the model using the epilepsy data set in the brms vignette with within-chain parallelization successfully.

The code I used to run the model is:

output <- brm([model specified here],
              data = data, family = bernoulli(link = "logit"),
              prior = priors_df,
              refresh = 10, control = list(adapt_delta = 0.5),
              threads = threading(threads = 3, static = TRUE),
              backend = "cmdstanr",
              chains = 4, cores = 4, iter = 4000, seed = 1989)

  • Operating System: Windows 10 64-bit
  • brms Version: 2.14.0
  • rstan Version: 2.21.2

Could anyone help me figure out what is going on here? Thanks!

Speeding up a Bernoulli logit model with reduce_sum is hard, and whether you actually gain anything depends on the details. So you should first check on a sub-sampled data set whether your model really speeds up.
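A rough sketch of how such a sub-sample could be drawn in base R. The data frame `full_data` and the helper `subsample_rows()` are made up for illustration, since the real data set is not shown in this thread:

```r
# Hypothetical stand-in for the real data set (not shown in the thread).
set.seed(1989)
full_data <- data.frame(y = rbinom(1000, 1, 0.3), x = rnorm(1000))

# Draw a random subset of at most n rows, without replacement.
subsample_rows <- function(df, n) {
  keep <- sample(nrow(df), size = min(n, nrow(df)))
  df[keep, , drop = FALSE]
}

small_data <- subsample_rows(full_data, 200)
dim(small_data)
```

You could then pass `small_data` as the `data` argument to the same brm() call, once with and once without the `threads` argument, and compare the wall time of the two short runs (for example with `system.time()`).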

Sorry to hear that your long run simply crashed. I would suggest you download the more recent version of brms from CRAN, or even go with the GitHub version of brms, as there were a few fixes for reduce_sum. It's still odd to hear that the model ran for a long time just fine and then crashed.

Since you are on Windows and you are struggling with runtime… you may want to consider using the WSL emulation of Linux and running things in that environment (it's still under Windows). People have reported significant speedups doing that.


Thanks for the advice! The runtime speed-up was substantial with reduce_sum (~1 month for 3000 iterations with 4 chains vs. 3 weeks for 4000 iterations with 4 chains and 3 cores per chain); it was just disappointing to crash at the end.

I’ll try the developer version of brms first and report back if I still have an issue.


Have you restarted the R session or RStudio since then? If not, there might still be ways of obtaining the files.

That’s great to hear that things go from almost impossible to fit to something better.

Still… you should try using a Linux platform! Threading has a large performance penalty under Windows (which you avoid with WSL).

I did… oops :/

I think this might suggest that the file was quite large, maybe too large for Windows to handle?

Can you run this for only a few iterations and see whether it runs fine, returns the fit and all?
How many parameters does the model have?

I ran the same model for a small number of iterations (200) with across- and within-chain parallelization. The model again finished running, but yielded the following error message and did not return a brms object:

Error: Supplied CSV file is corrupt!

I still have the R session open.

For parameters, the model has in excess of 100k with a little over 300k observations.

What would you advise as a next step?

Can you run tempdir()? You should see something like "C:\\Users\\Rok\\AppData\\Local\\Temp\\RtmpWWF8hi". Then check the folder above the reported one, in my case "C:\\Users\\Rok\\AppData\\Local\\Temp\\", and see whether any of the subfolders have recently generated .csv files.
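If clicking through the folders is tedious, something along these lines should list the candidates from within R. It is only a sketch: `recent_csv_files()` is a helper name I made up, and it uses nothing beyond base R:

```r
# List all .csv files below a directory, newest first, with their sizes.
recent_csv_files <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", recursive = TRUE,
                      full.names = TRUE)
  info <- file.info(files)
  out <- data.frame(file = files, size = info$size, mtime = info$mtime,
                    stringsAsFactors = FALSE)
  out[order(out$mtime, decreasing = TRUE), ]
}

# Look in the folder one level above the session's temporary directory.
csv_overview <- recent_csv_files(dirname(tempdir()))
head(csv_overview)
```

A `size` of 0 means the file is empty, so nothing was written to it.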

Yes, here is what is in the designated folder:

[Screenshot of the folder contents]

The .csv files are all empty.

I updated my brms install to the latest developer version and tried a short run (200 iterations) again. The model again ran but, when finished, provided the same error:

Error: Supplied CSV file is corrupt!

Thoughts on next steps to debug?

Popping back in to provide an update: I followed @wds15’s suggestion to try WSL, and the model successfully executed (still not quite converged, but progress)! So, I assume there is an issue with Windows per @rok_cesnovar’s comment?
