Loo throwing error when computing ELPDs for problematic observations

loo

#1

I’ve been working with rstanarm and loo for several months, just updated to the newest versions a couple weeks ago, and since then have started seeing this error message when running the loo function:

>   l_full5 <- loo(full5, k_threshold = 0.7)
1 problematic observation(s) found.
Model will be refit 1 times.

Fitting model 1 out of 1 (leaving out observation 2252)
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) : 
  Exception: model_bernoulli_namespace::model_bernoulli: N[k0__] is 0, but must be greater than or equal to 1  (in 'model_bernoulli' at line 12)

failed to create the sampler; sampling not done
Error in check_stanfit(stanfit) : 
  Invalid stanfit object produced please report bug

This only happens when a problematic observation(s) is/are found, and furthermore, seems to happen every time a problematic observation is found. This is happening for the more complex models I’m using.

Here are the data I’m using: dat_with_soil2.csv (865.8 KB),
and code that should reproduce the problem:

dat$BAc5 <- scale(dat$BAc5)
dat$BAh5 <- scale(dat$BAh5)
dat$n_con <- scale(dat$n_con)
dat$n_het <- scale(dat$n_het)
dat$gap <- scale(dat$gap)
dat$ph <- scale(dat$ph)
dat$tec <- scale(dat$tec)
dat$org <- scale(dat$org)

full5 <- stan_glm(surv ~ 
                    n_con + n_het +
                    BAc5*gap + BAh5*gap +
                    BAc5*ph + BAh5*ph + 
                    BAc5*tec + BAh5*tec +
                    BAc5*org + BAh5*org,
                  prior = student_t(df=7, location=0, autoscale=TRUE),
                  prior_intercept = student_t(df=7, location=0, autoscale=TRUE),
                  adapt_delta = 0.99,
                  init_r = 0.1,
                  family = binomial(link = "logit"),
                  data = dat)

l_full5 <- loo(full5, k_threshold=0.7)

I hugely appreciate any guidance you might have. Thank you so much for the work you do!

  • Operating System: Windows 10
  • rstanarm Version: 2.17.4
  • loo Version: 2.0.0
  • R Version: 3.3.1

#2

Is observation 2252 the only one where the outcome is zero?


#3

Thanks for the quick reply!
I think I might misunderstand your question - by “outcome” do you mean the response variable (survival)? If so, then no, 2252 would not be the only zero; there should be 841 zeros and 3436 ones.


#4

Hi @ABrown, I don’t think you misunderstood @bgoodri’s question. He asked that because of the particular error message you got. N[k0__] is 0, but must be greater than or equal to 1 is referring to a variable in the underlying Stan code that contains the number of 0s observed. We should catch this in the R code and throw an informative error before calling Stan so it doesn’t spit out all this jibberish.

In any case, this may be a bug in rstanarm. I’ll report back when I’ve had a chance to run your code.
Thanks for sharing the data and reproducible example!


#5

Ok I figured out the problem. It’s a bug in rstanarm that should be easy to fix. There’s a line in one of the internal functions in rstanarm that is causing na.omit to be called on the data before refitting the model. When observation 2252 is left out and then na.omit is called, it turns out that there are no more 0s in the data:

table(dat$surv)

   0    1 
 841 3436 


table(na.omit(dat)$surv)

   1 
3364 

The NAs are in variables that aren’t actually included in your model formula, which shouldn’t be a problem. However, in the code used to refit the model when there is a high k value we aren’t treating those NAs differently from NAs in the variables actually used in the model (the observations are being dropped but they shouldn’t be). This should be easy to fix for the next release of rstanarm.

Until then you should be able to get around this problem by only including variables you use in the model formula when passing a data frame to the data argument of stan_glm(). Does that work for you?


#6

I just opened an issue for this so we get it permanently fixed:

I think that workaround I mentioned in my previous post should be a good short term solution.


#7

Yes, that makes sense @jonah. Thank you so much for your timely help!!