Problems running LOO on >1 core

I just tried updating R and all packages to the latest version, and I’m still encountering what the others have posted recently. I ran the traceback command and warnings, and this is what I got:

> loo3=loo(Main_EffectsModel, save_psis=TRUE,cores=16)

Error in get(name, envir = envir) : object 'draws' not found

traceback()
12: get(name, envir = envir)
11: serialize(data, node$con)
10: sendData.SOCKnode(con, list(type = type, data = value, tag = tag))
9: sendData(con, list(type = type, data = value, tag = tag))
8: postNode(con, "EXEC", list(fun = fun, args = args, return = return,
tag = tag))
7: sendCall(cl[[i]], fun, list(...))
6: clusterCall(cl, gets, name, get(name, envir = envir))
5: parallel::clusterExport(cl, "draws")
4: relative_eff.function(x = likfun, chain_id = chain_id, data = args$data,
draws = args$draws, cores = cores, ...)
3: loo::relative_eff(x = likfun, chain_id = chain_id, data = args$data,
draws = args$draws, cores = cores, ...)
2: loo.stanreg(Main_EffectsModel, save_psis = TRUE, cores = 16)
1: loo(Main_EffectsModel, save_psis = TRUE, cores = 16)
There were 16 warnings (use warnings() to see them)
warnings()
Warning messages:
1: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 19 (<-DESKTOP-NIBF9LB:11815)
2: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 18 (<-DESKTOP-NIBF9LB:11815)
3: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 17 (<-DESKTOP-NIBF9LB:11815)
4: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 16 (<-DESKTOP-NIBF9LB:11815)
5: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 15 (<-DESKTOP-NIBF9LB:11815)
6: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 14 (<-DESKTOP-NIBF9LB:11815)
7: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 13 (<-DESKTOP-NIBF9LB:11815)
8: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 12 (<-DESKTOP-NIBF9LB:11815)
9: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 11 (<-DESKTOP-NIBF9LB:11815)
10: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 10 (<-DESKTOP-NIBF9LB:11815)
11: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 9 (<-DESKTOP-NIBF9LB:11815)
12: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 8 (<-DESKTOP-NIBF9LB:11815)
13: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 7 (<-DESKTOP-NIBF9LB:11815)
14: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 6 (<-DESKTOP-NIBF9LB:11815)
15: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 5 (<-DESKTOP-NIBF9LB:11815)
16: In .Internal(grep(as.character(pattern), x, ignore.case, … :
closing unused connection 4 (<-DESKTOP-NIBF9LB:11815)

It does work with brms at least, so it may not be an issue with loo itself but perhaps with rstanarm?

Hmm, yeah possibly an rstanarm issue then. I’ll look into it but unfortunately I don’t have access to a Windows machine at the moment, so I might not be able to reproduce this.

Because I don’t have access to Windows at the moment, it would help me a lot if someone who is getting this error could try running the code below. This should help me isolate whether this is a problem in rstanarm or in the loo package:

library(rstanarm)
fit <- stan_glm(mpg ~ wt, data = mtcars, chains = 1, iter = 1000)

loo(fit, cores = 2) # this will probably error since cores > 1

# now manually extract log_lik, compute r_eff, and run loo.matrix
loglik <- log_lik(fit)
reff <- loo::relative_eff(loglik, chain_id = rep(1, 500), cores = 2)
loo::loo.matrix(loglik, r_eff = reff, cores = 2)

Question: does either of the last two lines (loo::relative_eff or loo::loo.matrix) result in the error?

And one more code chunk to run if you don’t mind (to test the loo.function method):

# Simulate data and draw from posterior
N <- 50; K <- 10; S <- 100; a0 <- 3; b0 <- 2
p <- rbeta(1, a0, b0)
y <- rbinom(N, size = K, prob = p)
a <- a0 + sum(y); b <- b0 + N * K - sum(y)
fake_posterior <- as.matrix(rbeta(S, a, b))
dim(fake_posterior) # S x 1
fake_data <- data.frame(y,K)
dim(fake_data) # N x 2

llfun <- function(data_i, draws) {
  # each time called internally within loo the arguments will be equal to:
  # data_i: ith row of fake_data (fake_data[i,, drop=FALSE])
  # draws: entire fake_posterior matrix
  dbinom(data_i$y, size = data_i$K, prob = draws, log = TRUE)
}

reff <- loo::relative_eff(llfun, chain_id = rep(1, S), cores = 2, 
                          data = fake_data, draws = fake_posterior)
loo_with_fn <- loo::loo.function(llfun, r_eff = reff, cores = 2, 
                                 draws = fake_posterior, data = fake_data)

Does either of the last two lines here result in the error?

Thanks a lot and hopefully this will help us fix this!

@paul.buerkner This seems similar to this loo issue

Also, I glanced at the brms code and it seems like you’re using the matrix method for loo::relative_eff() and not the function method. Is that correct? If so, that’s a difference between brms and rstanarm here. So it’s possible this is all coming from the loo::relative_eff.function() method, like we saw in that issue I linked to.

Here’s the first block:

fit <- stan_glm(mpg ~ wt, data = mtcars, chains = 1, iter = 1000)

> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
> Chain 1: 
> Chain 1: Gradient evaluation took 0 seconds
> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
> Chain 1: Adjust your expectations accordingly!
> Chain 1: 
> Chain 1: 
> Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
> Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
> Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
> Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
> Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
> Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
> Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
> Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
> Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
> Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
> Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
> Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
> Chain 1: 
> Chain 1:  Elapsed Time: 0.106 seconds (Warm-up)
> Chain 1:                0.065 seconds (Sampling)
> Chain 1:                0.171 seconds (Total)
> Chain 1: 
> > loo(fit, cores = 2) # this will probably error since cores > 1
> Error in get(name, envir = envir) : object 'draws' not found
> > loglik <- log_lik(fit)
> > reff <- loo::relative_eff(loglik, chain_id = rep(1, 500), cores = 2)
> > loo::loo.matrix(loglik, r_eff = reff, cores = 2)
> 
> Computed from 500 by 32 log-likelihood matrix
> 
>          Estimate  SE
> elpd_loo    -83.6 4.2
> p_loo         3.3 1.2
> looic       167.2 8.5
> ------
> Monte Carlo SE of elpd_loo is 0.1.
> 
> Pareto k diagnostic values:
>                          Count Pct.    Min. n_eff
> (-Inf, 0.5]   (good)     31    96.9%   161       
>  (0.5, 0.7]   (ok)        1     3.1%   146       
>    (0.7, 1]   (bad)       0     0.0%   <NA>      
>    (1, Inf)   (very bad)  0     0.0%   <NA>      
> 
> All Pareto k estimates are ok (k < 0.7).
> See help('pareto-k-diagnostic') for details.
> Warning messages:
> 1: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
>  
> 2: In for (i in seq_len(n)) { :
>   closing unused connection 5 (<-DESKTOP-NIBF9LB:11494)
> 3: In for (i in seq_len(n)) { :
>   closing unused connection 4 (<-DESKTOP-NIBF9LB:11494)

It looks like that actually worked! I’m going to run the second block in a minute

Less luck with the second block, I got the error again:

> reff <- loo::relative_eff(llfun, chain_id = rep(1, S), cores = 2, 
+                           data = fake_data, draws = fake_posterior)
Error in get(name, envir = envir) : object 'draws' not found
> loo_with_fn <- loo::loo.function(llfun, r_eff = reff, cores = 2, 
+                                  draws = fake_posterior, data = fake_data)
Error: 'r_eff' must have one value per observation.

Thanks @Longshot408, that’s super helpful. I think these results confirm my suspicion about the loo:::relative_eff.function() method. Seems like this problem again

but the previous fix wasn’t actually a fix apparently.

@paul.buerkner rstanarm is calling the function method of relative_eff and I think brms is not, so that explains why this only seems to affect rstanarm. I can change to the matrix method in rstanarm temporarily but hopefully we can figure out how to fix this in loo. Any ideas why the fix when we closed that issue seems not to work in some cases?
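
One plausible reading of the traceback at the top of the thread (a sketch only, not checked against the loo source): `parallel::clusterExport()` looks variable names up in `.GlobalEnv` unless an `envir` argument says otherwise, so a `draws` object that exists only as a function argument is not found. Socket (PSOCK) clusters are presumably only used on Windows, which would explain why the error shows up there. The helper names `export_from_global` and `export_from_local` are made up for this illustration.

```r
library(parallel)

# Mimics the failing pattern: 'draws' exists only inside the function,
# but clusterExport() searches .GlobalEnv by default.
export_from_global <- function(draws) {
  cl <- makePSOCKcluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, "draws")
  unlist(clusterEvalQ(cl, length(draws)))
}

# Same code, but the lookup is pointed at the calling function's environment.
export_from_local <- function(draws) {
  cl <- makePSOCKcluster(2)
  on.exit(stopCluster(cl))
  clusterExport(cl, "draws", envir = environment())
  unlist(clusterEvalQ(cl, length(draws)))
}

# In a session with no global object named 'draws', the first call errors
# and the second returns the length of 'draws' as seen by each worker.
res_global <- tryCatch(export_from_global(1:10),
                       error = function(e) conditionMessage(e))
res_local <- export_from_local(1:10)
print(res_global)
print(res_local)
```

If a `draws` object happens to exist in the global environment, the first helper silently exports that one instead, which may be why the problem does not appear in every session.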

Glad I could help. Is there any known way to fix?

I’m a little bummed to find out that the extra cores I paid for when I upgraded my CPU this spring can’t really be used :(

So a short term solution for anyone affected by this is to extract the log likelihood matrix from the fitted model object (rstanarm stanreg object) using log_lik(fit) and pass that to loo::loo.matrix().
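
A minimal sketch of that workaround, using the small `mtcars` model from earlier in the thread rather than anyone’s real model. It assumes the rows of the `log_lik()` matrix are stacked chain by chain, so `chain_id` can be built with `rep()`, and it follows the loo documentation in passing likelihood values (`exp(ll)`) to `relative_eff()`. Supplying `r_eff` also avoids the "Relative effective sample sizes not specified" warning.

```r
library(rstanarm)
library(loo)

# Small example model: 2 chains x 500 post-warmup draws each
fit <- stan_glm(mpg ~ wt, data = mtcars, chains = 2, iter = 1000, refresh = 0)

# S x N matrix of pointwise log-likelihood values
ll <- log_lik(fit)

# Assumption: draws are stacked chain by chain, equal length per chain
n_chains <- 2
chain_id <- rep(seq_len(n_chains), each = nrow(ll) / n_chains)

# The matrix methods do not go through relative_eff.function(),
# which is where the 'draws' not found error was raised
r_eff <- relative_eff(exp(ll), chain_id = chain_id, cores = 2)
loo_fit <- loo(ll, r_eff = r_eff, cores = 2, save_psis = TRUE)
print(loo_fit)
```

For a model with a different number of chains or iterations, only `n_chains` needs to change, as long as every chain contributes the same number of draws.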

Can you confirm that this would be the correct method then?

> Main_EffectsModel=stan_glm(Accept_Reject~Discount+Floor, 
+                            family = binomial(link = "logit"), 
+                            data=sonadata_clean, 
+                            prior = Priors_MEmodel,
+                            #prior_intercept = normal(), 
+                            #prior_PD = TRUE, 
+                            algorithm = c("sampling"), 
+                            mean_PPD = TRUE,
+                            adapt_delta = 0.95, 
+                            #QR = FALSE, 
+                            #sparse = FALSE,
+                            chains=3,iter=550,cores=3)
> lik=log_lik(Main_EffectsModel)
> loo::loo(lik, save_psis=TRUE,cores=16)

Computed from 825 by 633 log-likelihood matrix

         Estimate   SE
elpd_loo   -231.6 16.3
p_loo         4.0  0.4
looic       463.3 32.6
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
Warning message:
Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and 
MCSE estimates will be over-optimistic.

Looks good!

The other difference is that brms uses loo.matrix by default while rstanarm uses loo.function. The brms loo method has a pointwise argument to switch to loo.function. When I activate that together with cores > 1, the loo computation does not terminate in a reasonable time (a few minutes) for a model that takes 2 seconds in pointwise evaluation when cores = 1. So there may be another problem with loo.function and cores on Windows more generally. I will try to look into it in more detail later.

I have opened a PR to fix the problem with relative_eff.function on Windows (https://github.com/stan-dev/loo/pull/152)

On a somewhat related note, is this the only way to get my model itself to run in parallel on Windows?

@Longshot408 No I think parallelization when running a model should work fine. The only recent issue with parallelization I know of is

Are you getting that or other errors? If so can you open a separate topic and we’ll try to sort it out there?

Thanks, that would be helpful!

Nope, no errors, just disappointing benchmarks. After switching from an i5-6500 to a Ryzen 7 3700X I wanted to see how much more efficient my model run times were going to be; turns out the only benefit was the increased clock speed of the newer CPU. Benchmarks stopped improving after setting “cores=3”.

After doing some more digging, is this because the cores setting is limited by how many chains you run?

Yeah that’s right. But you can leverage the extra cores by using certain features of the Stan language that allow for within-chain parallelization. Here’s a good tutorial from @bbbales2 on using one of those functions:

https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html

Unfortunately, we haven’t yet updated the models in rstanarm to use those functions.
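
As a rough illustration of the point about chains (a sketch on the `mtcars` example, not a benchmark): during sampling each chain runs on one core, so requesting more cores than chains does not help. The variable names `chains`, `n_avail`, and `cores` are just for this example.

```r
library(rstanarm)

chains <- 4

# detectCores() can return NA on some platforms, so fall back to 1
n_avail <- parallel::detectCores()
if (is.na(n_avail)) n_avail <- 1L

# Cores beyond the number of chains would sit idle during sampling
cores <- min(chains, n_avail)

fit <- stan_glm(mpg ~ wt, data = mtcars, chains = chains, cores = cores,
                iter = 1000, refresh = 0)
```

With 3 chains, as in the model above, this gives `cores = 3` on an 8-core CPU, which matches the benchmarks levelling off at that setting.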

I think the error with multiple cores on Windows is now fixed on GitHub. So it’s possible to either use the workaround I mentioned above

or to install the development version of the loo package from GitHub, in which case the workaround shouldn’t be necessary anymore:

devtools::install_github("stan-dev/loo")