Problem with rstanarm::posterior_epred with interaction term

Hello, I’m having a problem making prediction for a 2-variable interaction model in which one of the combinations is not present in the data.
This is the data:

    structure(list(Year = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("2019", 
    "2020"), class = "factor"), Period = structure(c(1L, 2L, 3L, 
    4L, 1L, 3L, 4L), .Label = c("Jan - Mar 10", "Mar 11 - May 4", 
    "May 5 - Aug", "Sep - Dec"), class = "factor"), Tot = c(27L, 
    26L, 37L, 37L, 30L, 37L, 82L), Cases = c(3L, 6L, 5L, 6L, 4L, 
    11L, 18L)), class = "data.frame", row.names = c(NA, -7L))

and the model call:

> model.fixef.test <- stan_glm(
    Cases ~ Year * Period,
    family = poisson(),
    data = train_data.simp %>% as.data.frame(),
    iter = 1000, cores = 8, chains = 2,
    prior = student_t(df = 1, scale = 2.5, autoscale = T),
    show_messages = F
)

I get the following warning:

In center_x(x, sparse) :
      Dropped empty interaction levels: Year2020:PeriodMar 11 - May 4

(I also get other warnings because I put few samples and chains just to show the problem; with 10000 samples on 4 chains no other warnings appear)

Finally, when I try to produce prediction with posterior_epreds it fails:

> Preds <- posterior_epred(model.fixef.test)
    Error in stanmat[, beta_sel, drop = FALSE] : subscript out of bounds

I tried many things, like converting the variables to characters but nothing

Eventually I solved by specifying an interaction variable with interaction(Year, Period) since I was interested only in the predictions and not in the parameter estimates, but it’s a hacky solution.

1 Like

This looks like some sort of bug in rstanarm as this really looks like undesirable behavior (the prediction should definitely not error-out). Consider reporting this at rstanarm GitHub repo - unless there is already an issue for a similar problem…

Just to clarify (and maybe that’s something you already know): when fitting a model like Cases ~ Year * Period this is the same as Cases ~ Year + Period + Year:Period. So you can only rebuild the Year:Period term by hand. Using base R this would look something like:

year_ref_level <- levels(train_data.simp$Year)[1]
period_ref_level <- levels(train_data.simp$Period)[1]

train_data.simp$YearPeriodInt <- 
  ifelse(train_data.simp$Year == year_ref_level | train_data.simp$Period == period_ref_level, "base", as.character(interaction(train_data.simp$Year, train_data.simp$Period)))

train_data.simp$YearPeriodInt <- relevel(factor(train_data.simp$YearPeriodInt), ref = "base") 

model.fixef.test <- stan_glm(
    Cases ~ Year + Period + YearPeriodInt,
    family = poisson(),
    data = train_data.simp %>% as.data.frame(),
    iter = 1000, cores = 8, chains = 2,
    prior = student_t(df = 1, scale = 2.5, autoscale = T),
    show_messages = F
)

(but maybe I am overlooking some tricks to make the code simpler).

This gives you a model with exactly the same coefficients as glm(Cases ~ Year * Period) would fit - except glm will report Year2020:PeriodMar 11 - May 4 as NA instead of completely ignoring it.

Best of luck with your model!