No samples when using reloo on custom_family brmsfit

  • Operating System: Windows 10
  • brms Version: 2.7.0

I’m using brms to fit an endpoint-inflated binomial distribution. I’ve included the log_lik, predict and fitted functions to take advantage of the post-processing methods.

All of them seem to be working just fine, but if I get high Pareto k observations in loo, using reloo = TRUE results in the following error:

Fitting model 1 out of 1 (leaving out observation 497)
Start sampling
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
[1] "error occurred during calling the sampler; sampling not done"

 Error: The model does not contain posterior samples. 

Below is the code I’m using to define the model:

fit_eiBinomialSM <- function(formula_, df){
  eiBinomialSM <- custom_family(
    "eiBinomialSM", dpars = c("mu", "so", "sm"), # unrestricted mixture scores
    links = c("logit", "identity", "identity"),
    type = "int", vars = "trials[n]",
    log_lik = function(i, draws) {
      mu <- draws$dpars$mu[, i]
      so <- draws$dpars$so
      sm <- draws$dpars$sm
      trials <- draws$data$trials[i]
      y <- draws$data$Y[i]
      eiBinomialSM_lpmf(y, mu, so, sm, trials)
    },
    predict = function(i, draws, ...) {
      mu <- draws$dpars$mu[, i]
      so <- draws$dpars$so
      sm <- draws$dpars$sm
      trials <- draws$data$trials[i]
      eiBinomialSM_rng(mu, so, sm, trials)
    },
    fitted = function(draws) {
      mu <- draws$dpars$mu
      so <- draws$dpars$so
      sm <- draws$dpars$sm
      p <- t(apply(cbind(matrix(c(so, sm), ncol = 2), 0), 1, function(x)(exp(x)/sum(exp(x)))))
      p[,2] + p[,3]*(1/(1+exp(-mu)))
    }
  )
  
  stan_funs <- "
  vector linkfunc(real so, real sm){
  return softmax([so, sm, 0]');
  }
  
  int[] ei(int y, int ntrial) {
  return {(1 - min({1, y})), (1 - min({1, ntrial - y}))};
  }
  
  real eiBinomialSM_lpmf(int y, real mu, real so, real sm, int trials) {
  int yom[2] = ei(y, trials);
  vector[3] pom = linkfunc(so, sm);
  
  return log(yom[1]*pom[1] + yom[2]*pom[2] + pom[3]*exp(binomial_logit_lpmf(y | trials, mu)));
  }
  
  int eiBinomialSM_rng(real mu, real so, real sm, int trials) {
  int which_component = categorical_rng(linkfunc(so, sm));
  
  if (which_component == 1) {
  return 0;
  }
  if (which_component == 2) {
  return trials;
  }
  
  return binomial_rng(trials, inv_logit(mu));
  }
  "
  
  stanvars <- stanvar(scode = stan_funs, block = "functions") +
    stanvar(as.integer(wf$bindenom), name = "trials")
  
  brm(formula = formula_,
      data = df,
      family = eiBinomialSM,
      stanvars = stanvars)
  }

You may have to specify init_r to some number less than 2 when you first call brm in order to get it to use small enough values when it starts leaving observations out. That and check whether those observations are coded correctly in the data because they seem to be having a large influence on the posterior distribution.

I just tried init = "0" and init_r = 0.1 but the exact same error pops up.

I checked the single observation getting dropped and there doesn’t seem to be anything special about it. I was able to refit the model by removing the high-k observations altogether and the posteriors don’t show any major change either.

Are there other steps I could take to help troubleshoot this issue? I can share the data (it’s a relatively small dataset) and anything else needed to reproduce it.

Okay, I figured out the issue.

Following this tutorial, the number of trials for the full dataset is passed via stanvar:

stanvars <- stanvar(scode = stan_funs, block = "functions") +
  stanvar(as.integer(cbpp$size), name = "trials")

The issue seems to occur because loo tries to exclude the high-k observations but is still using the original trials data, which results in a mismatch between counts and trials, where some of the former are larger than the latter.

So I suppose the question now is how to allow loo and stanvar to work together to pass the appropriate subsets of data into the model.

For anyone who finds this thread in the future, here is my fix:

tmp.fun <- brms:::reloo.loo
body(tmp.fun)[[12]][[4]][7:9] <- body(tmp.fun)[[12]][[4]][6:8]
body(tmp.fun)[[12]][[4]][[6]] <- substitute(
  if(!is.null(fit_j$stanvars$trials)){
    fit_j$stanvars$trials$sdata <- fit_j$stanvars$trials$sdata[-omitted]
    fit_j$stanvars$trials$scode <- "  int trials[639];"
  })
assignInNamespace("reloo.loo", value = tmp.fun, ns = "brms")

It’s specific to my model, but the general idea is that I modified reloo.loo so that it checks for my variable inside the fit’s stanvars and if present, drops the corresponding observation and modifies the corresponding section of the data block.

Hi @bmfazio, I’m having a similar problem with a cumulative probit model. Great that you found a solution for your code. I’m still a bit of a novice with R functions - are you able to indicate which parts of your new function are specific to your purposes and would need to be changed? Sorry if this is a question that should be obvious!

With:

assignInNamespace(“reloo.loo”, value = tmp.fun, ns = “brms”)

Are you specifying that when brms calls reloo, it simply uses your function instead of the normal brms reloo, or does the new function need to be called in some other way?

Any help would be greatly appreciated!