Parallel chains not working; different behavior in RStudio and R from Terminal

This is a continuation of the problem that I poorly described here: Complaints about slot "mode" rearing their head again

Hopefully, this post clarifies things.

First off, platform and versions: macOS High Sierra, version 10.13.6; R, version 3.6.1; RStudio, version 1.1.456; RStan, version 2.18.2. [ETA: I’ve updated RStan to version 2.19.2 (and am still seeing the same problems)].

If I sample a couple of my models from within RStudio (given simulated data), then one of two things happens:

  1. If mc.cores = parallel::detectCores(), then some or all of the chains die prematurely. With one model, 2 out of 4 chains complete; with my other model, all of the chains die. After a chain dies, I get the error message “Error in unserialize(socklist[[n]]) : error reading from connection”. Depending on the model, I may also get “Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal”, “Calls: <Anonymous> ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize”, and “Execution halted”. When each chain starts, I get a (presumably benign) message like “starting worker pid=6810 on localhost:11280 at 09:48:12.300”.

  2. If I don’t set mc.cores, then the chains run serially, one after the other, and they all finish without any apparent problems (a rough sketch of both cases follows this list).
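Concretely, with model and stan_data standing in for my actual objects:

options(mc.cores = parallel::detectCores())      # case 1: chains launched in parallel; some or all die
fit_par <- rstan::sampling(model, data = stan_data, chains = 4)

# case 2: mc.cores left at its default of 1; the chains run serially and all finish
fit_ser <- rstan::sampling(model, data = stan_data, chains = 4)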

If I open up a terminal and run R from that terminal, I get different behavior:

  1. If mc.cores = parallel::detectCores(), then the behavior is almost the same as with RStudio, but the messages differ. The error messages that I get are “Error in FUN(X[[i]], ...) : trying to get slot "mode" from an object of a basic class ("NULL") with no slots” and “In mccollect(jobs) : N parallel jobs did not deliver results” (where N is 2 or 4). I also don’t see a message like “starting worker pid=6810 on localhost:11280 at 09:48:12.300”.

  2. If I don’t set mc.cores, then sampling ends prematurely with “Abort trap: 6”, and the R session itself also terminates. This happens regardless of the number of chains I try to use.

These models used to work just fine. [ETA: After taking a closer look at the timestamps of the files containing MCMC samples, it looks like my models may never have worked with RStan on my MacBook. Those files were generated on a previous machine of mine (a Linux box) and then transferred to the MacBook.] Unfortunately, it’s been a few months since I last worked with them, and my system has gone through several updates in the meantime. It’s not clear to me which update(s) broke things.

PyStan has not given me any of these problems, in any of the versions that I’ve tried, from PyStan 2.17.1.0 to 2.19.0.0. This is when using the same models on similar simulated data. [ETA: Despite the “ETA” bit in the last paragraph, I did confirm that PyStan still works on my MacBook.]

For now, I need to be cautious about posting the models publicly, unfortunately.

Any ideas on what is going on here?


The first thing to check in these cases is whether you ran out of memory. It may be that running 4 chains simultaneously is too onerous for your system. A way to see if this is the case is to try with 2 cores, and if that’s fine, try 3.
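For example, something along these lines, with model and stan_data standing in for your actual objects:

fit2 <- rstan::sampling(model, data = stan_data, chains = 4, cores = 2)  # at most 2 chains at once
fit3 <- rstan::sampling(model, data = stan_data, chains = 4, cores = 3)  # then, if that works, 3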

It doesn’t look like a memory problem. Last time, 2 out of 4 parallel chains finished. When I reduced the number of parallel chains to 2, 1 out of 2 finished. So I’m apparently using less memory but still getting chains that end prematurely.

Do you set a random seed when you run these? If you see a failure on 4 cores with a certain random seed, would you still see the failure on 1 core using the same seed? This could help in identifying the problem, if it can be reproduced in serial (tracebacks from R are almost useless if an error happens in a parallelized region).
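Something like the following comparison, with placeholder object names and an arbitrary seed:

# same seed and data, serial vs. parallel; a failure that also shows up in the
# serial run should give a much more useful traceback
fit_serial   <- rstan::sampling(model, data = stan_data, chains = 4, cores = 1, seed = 12345)
fit_parallel <- rstan::sampling(model, data = stan_data, chains = 4, cores = 4, seed = 12345)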

Yes, I do.

RStudio and the R terminal do behave differently in parallel because the former uses sockets and the latter uses forking. I think that would explain why the error messages could be different but not why the errors are happening in the first place.
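To illustrate the distinction with base R’s parallel package (this is just the general idea, not necessarily what rstan does internally):

library(parallel)

# forking: child processes start as copies of the current R session (not available on Windows)
res_fork <- mclapply(1:4, function(i) i^2, mc.cores = 2)

# sockets: fresh R worker processes are started and talk to the parent over local sockets
cl <- makePSOCKcluster(2)
res_sock <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)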

Tentatively, that would suggest that there’s some problem in how my R installation is set up to do parallelism. That doesn’t seem to explain why, when I try to sample from my model from the R terminal, I get “Abort trap: 6”, which from what I’ve googled indicates either some kind of memory violation or a failed assertion.

I wish there were some way to set up R so that it produced a traceback when that kind of error was encountered.
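For R-level errors, something like the following does that, though I gather it wouldn’t help with “Abort trap: 6”, since that is a SIGABRT raised in compiled code and it kills the whole process before any R error handler gets a chance to run:

options(error = recover)                  # interactive sessions: browse the call frames on error
options(error = function() traceback(2))  # non-interactive scripts: print the call stack on error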

I did write some code that, to my surprise, triggered all the problems I had mentioned before (the “Abort trap: 6” error, running fine in RStudio only when the chains run serially, etc.):

test_line.stan:

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}

parameters {
  real a;
  real b;
  real<lower=0> SD;  // noise scale, constrained to be positive
}

model {
  y ~ normal(a*x + b, SD);
}

test_line.R:

#!/usr/bin/env Rscript

library(rstan)

set.seed(12345)

# simulate N points from a noisy line y = a*x + b
N <- 20

SD <- 10.0
a <- 5.0
b <- 2.0

x <- seq(-10.0, 10.0, length.out = N)
y <- rnorm(N, mean = a*x + b, sd = SD)

# compile and sample with rstan's defaults (mc.cores is not set here)
model <- stan_model("test_line.stan")
fit <- sampling(model,
                data = list(N = N, x = x, y = y),
                seed = 54321)
print(fit)

I’m surprised that this triggered all the problems, since I couldn’t get the usual “8 schools” example to do it.

This example runs fine for me on Linux. I’ve also never seen anyone report an “Abort trap: 6” with Stan. There have been vaguely similar reports on Windows where antivirus software mistakes the compiled Stan program for a virus.

Do you get a reasonable traceback if you get the same error when running in serial?

Marco

Unfortunately, I can’t figure out how to get a traceback for the “Abort trap: 6” case. If I source test_line.R from the R prompt at the terminal, R itself dies.

This suggests to me that the problem isn’t with Stan proper. Judging from the different responses between terminal-based R and RStudio, I’m guessing that the problem may have something to do with how parallel chains are being launched, but I don’t know if that explains the “Abort trap: 6” error.

After looking at the RStan code, I suspect the parallel chains thing may be something of a red herring: if cores = 1, nothing from R’s parallel package gets launched (a quick way to test this is sketched at the end of this post). I suspect that what’s happening is that when the sampler is called, it dies for some reason. That would explain why the “Abort trap: 6” message appears after the sampler has already printed messages like:

SAMPLING FOR MODEL 'test_line' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.7e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:      1 / 200000 [  0%]  (Warmup)
Abort trap: 6

If that’s the case, the question is why the sampler is dying. It’s as if, when running serially in RStudio, the sampler is protected from being killed, but when terminal-based R runs it, or when RStudio launches it in a separate worker process, the sampler is somehow exposed to something that kills it. I suppose that it could be an overzealous antivirus. Unfortunately, if that’s the case, there may be little I can do. My Mac is a work laptop, and I’m not the one who installs the antivirus.
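One way to test that, reusing the test_line example from above, would be something along these lines:

# a single chain with cores = 1: nothing from the parallel package is involved,
# so if this still aborts, the parallel machinery really is a red herring
fit1 <- sampling(model,
                 data = list(N = N, x = x, y = y),
                 chains = 1, cores = 1,
                 seed = 54321)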

Tentatively, I seem to have found a solution.

Before, I had largely used the R installation from Anaconda using its default channels. The only exception to that was when I upgraded RStan to 2.19.2 in a failed attempt to stop the crashes; in that case, I uninstalled the conda packages r-rstan and r-stanheaders and then installed the rstan and StanHeaders packages using install.packages() instead.

However, if I used the R installation from the conda-forge channel (in its own conda environment, BTW), then RStan seemed to behave properly. I’m not sure why that makes a difference, though I notice that with conda-forge, the Clang version is 9.0, whereas in mainline Anaconda it’s only 4.0.

I have also had some problems with the default Anaconda distribution (mainly Python stuff), so keeping your environment on conda-forge only sounds reasonable.

Do you know whether somebody is keeping the conda-forge packages up to date?

Do you use the system compiler or the conda-forge compiler?

(See the C++ part and use the -c conda-forge flag when installing.)

https://github.com/stan-dev/pystan/blob/develop/doc/installation_beginner.rst

I actually haven’t had any need for conda-forge when using PyStan. It’s just been RStan that’s been giving me trouble.

I am not privy to who in particular is maintaining the conda-forge packages.

As far as I know, I use whatever compiler $CXX points to, and that environment variable is set by the current Anaconda environment.
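For what it’s worth, here is roughly how I check that (assuming a Unix-like shell; ~/.R/Makevars may not exist, and these environment variables are separate from R’s own compiler configuration):

Sys.getenv(c("CXX", "CXX14"))        # compiler-related environment variables, if set
system("$CXX --version")             # version of whatever $CXX points to
system("R CMD config CXX14")         # what R itself is configured to use for C++14 packages
if (file.exists("~/.R/Makevars")) {
  cat(readLines("~/.R/Makevars"), sep = "\n")  # user-level compiler overrides, if any
}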