Using multiprocessing for kfold with custom families

  • Operating System: macOS 10.14.5
  • brms Version: 2.9.0

When I run kfold with plan(multiprocess) on a model with a custom family, it can’t seem to find log_lik functions that are defined in the global environment. I could only get it to work with my custom family if I turned the log_lik into a single function and passed it via the log_lik argument when creating the custom family. If I run it normally (with plan(sequential)), this isn’t a problem, so it seems to be an issue with how objects from the global environment are passed to the worker processes in future.

Here is a reproducible example that uses only code from the vignettes (just to make sure it wasn’t an issue with my code). I made one small edit so that size is passed in as part of the regression formula rather than as a stanvar. (Without this edit, I ran into the same issue discussed here: No samples when using reloo on custom_family brmsfit. Is there, by any chance, a more general solution now than the one proposed there?)

```r
library(brms)
data("cbpp", package = "lme4")

# Pointwise log-likelihood of observation i, evaluated for every posterior draw
log_lik_beta_binomial2 <- function(i, draws) {
  mu <- draws$dpars$mu[, i]
  phi <- draws$dpars$phi
  N <- draws$data$trials[i]
  y <- draws$data$Y[i]
  beta_binomial2_lpmf(y, mu, phi, N)
}

# Custom family; the number of trials is passed as an extra variable
beta_binomial2 <- custom_family(
  "beta_binomial2", dpars = c("mu", "phi"),
  links = c("logit", "log"), lb = c(NA, 0),
  type = "int", vars = "trials[n]",
  log_lik = log_lik_beta_binomial2
)

# Stan definitions of the lpmf and rng functions used by the family
stan_funs <- "
  real beta_binomial2_lpmf(int y, real mu, real phi, int T) {
    return beta_binomial_lpmf(y | T, mu * phi, (1 - mu) * phi);
  }
  int beta_binomial2_rng(real mu, real phi, int T) {
    return beta_binomial_rng(T, mu * phi, (1 - mu) * phi);
  }
"

stanvars <- stanvar(scode = stan_funs, block = "functions")

fit2 <- brm(
  incidence | trials(size) ~ period + (1|herd), data = cbpp,
  iter = 200,
  family = beta_binomial2, stanvars = stanvars
)
# Make the Stan functions (e.g. beta_binomial2_lpmf) callable from R
expose_functions(fit2, vectorize = TRUE)

loo(fit2)

# Sequential kfold
kfold(fit2, chains = 1)

# kfold with parallel futures: this is where the problem shows up
library(future)
plan(multiprocess)
kfold(fit2)
```

I don’t see a fix for the multiprocess issue right now, as multiprocess seems to use separate environments that do not have the global environment of the main R process as a parent environment.
But it seems you already found a solution via the `log_lik` argument.
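
For anyone reading along, a single-function version of the log_lik could look roughly like the sketch below. It assumes the same structure of `draws` as in the example above and replaces the call to the exposed Stan function `beta_binomial2_lpmf` with the beta-binomial log-pmf written out in base R, so that nothing has to be looked up in the global environment. It has not been run through kfold here, so treat it as an illustration rather than a tested fix.

```r
# Sketch: log_lik for beta_binomial2 that depends only on base R functions.
# Assumes draws$dpars$mu is a (draws x observations) matrix, draws$dpars$phi
# is a vector with one value per draw, and draws$data holds trials and Y.
log_lik_beta_binomial2 <- function(i, draws) {
  mu <- draws$dpars$mu[, i]
  phi <- draws$dpars$phi
  N <- draws$data$trials[i]
  y <- draws$data$Y[i]
  # Same parameterization as in the Stan code: shape parameters of the beta part
  a <- mu * phi
  b <- (1 - mu) * phi
  # Beta-binomial log-pmf: log choose(N, y) + log B(y + a, N - y + b) - log B(a, b)
  lchoose(N, y) + lbeta(y + a, N - y + b) - lbeta(a, b)
}
```

This function would then be passed as `log_lik = log_lik_beta_binomial2` in the call to `custom_family()`, exactly as in the original example.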

Passing stuff via stanvar for newdata is a little bit tedious but you may use the new_objects argument for this purpose.

Thanks for the quick response! Re: new_objects, I’m not sure I totally understand where that would go in the call to kfold. Would I also have to edit the kfold function somehow like @bmfazio did in their solution?

You are right, new_objects may not be helpful for kfold as it is not appropriately subsetted inside kfold. Generally, subsetting new_objects automatically is more or less impossible as brms does not know what is passed there. Basically, for use in kfold, I would recommend passing all data via data and not using stanvars if possible.

Yeah, I’ve been trying to figure out exactly how to do that. Basically, what I need is a way to read in one or more additional variables of length n and pass them to the lpmf of the custom family. Is there any functionality for including custom additional response information (like trials(), weights(), etc.)?

Not that I am aware of. What we could do is implement an addition argument that takes in vectors of values without checking them, which could then be used inside a custom family. For instance:

y | real(z) ~ x

where z is an additional (real) variable to be used in the custom family. Would something like that solve your problem?

Yes, that would be perfect!

Would you mind opening an issue about this on https://github.com/paul-buerkner/brms/issues?

Yep, just did. See here: https://github.com/paul-buerkner/brms/issues/707