Clarifying how to calculate predicted response values from fit model

So I am trying to fit a relatively simple GLMM with a single categorical fixed effect, some random effects, and a negative binomial distribution. I'm still relatively new to rstanarm and want to make sure I'm going about generating the posterior predictions the right way, in the same spirit as what you would do in a frequentist approach. My data are a number of years of count data, and I want yearly model-predicted estimates of counts. I know one year has much higher counts than the others. Here's a workable replica of those data:

library(glmmTMB)
library(dplyr)
library(magrittr)
library(ggplot2)
library(rstanarm)

df <- data.frame(
    expand.grid(
        week = c(1:18),
        year = c(2005:2020),
        site = c(1:15)
    )
) %>%
    dplyr::rowwise() %>%
    dplyr::mutate(
        n_obs = sample(c(4:10), 1, replace = TRUE)
    )
# Vector specifying the number of times to replicate each row
replication_times <- df$n_obs

# Replicate rows according to the specified pattern
replicated_df <- df[
    rep(row.names(df), times = replication_times),
    c("week", "site", "year")
]
replicated_df$count <- stats::rnbinom(
    nrow(replicated_df),
    size = 0.112,
    mu = 0.214
)
replicated_df$count[which(replicated_df$year == 2015)] <- stats::rnbinom(
    nrow(replicated_df[which(replicated_df$year == 2015), ]),
    size = 0.45,
    mu = 3.0
)

df <- replicated_df %>%
    dplyr::mutate(
        week = as.factor(week),
        year = as.factor(year),
        site = as.factor(site)
    )
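
Just as a quick sanity check that the simulated 2015 bump is actually in there, here is the mean raw count per year (nothing fancy, just a dplyr summary):

# Mean raw count per year; 2015 should stand out from the rest
df %>%
    dplyr::group_by(year) %>%
    dplyr::summarise(mean_count = mean(count))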

So now, in a frequentist framework (I'm familiar with glmmTMB, so that's what I use), I would just fit the model and get the predictions:

freq_mod <- glmmTMB::glmmTMB(
    count ~ year + (1 | week) + (1 | site),
    data = df,
    family = nbinom2
)
predict_data <- data.frame(
    year = as.character(c(2005:2020)),
    week = NA,
    site = NA
)
predicted_yearly <- data.frame(
    year = as.character(c(2005:2020)),
    stats::predict(
        object = freq_mod,
        newdata = predict_data,
        re.form = ~0,
        se.fit = TRUE,
        type = "response"
    )
)
ggplot(predicted_yearly) +
    geom_errorbar(aes(x = year, ymin = fit - se.fit, ymax = fit + se.fit)) +
    geom_point(aes(x = year, y = fit)) +
    theme_bw()

Bayes comparison

As far as I understand it, for the posterior predictions I was following this guide. I want to use the tidybayes::epred_draws() function, which should operate similarly to stats::predict() (and, I assume, exactly the same as rstanarm::posterior_predict(), just returned as a tidy data frame?).

However, herein lies my confusion: the newdata argument. In the stats::predict() case I would just put NA values for my random effects, set re.form = NA and type = "response", and get the model-predicted yearly values given my model structure. In the Bayesian case I can't have NAs in the newdata, so (as I understand it) you just arbitrarily pick a value for the two random effects that's actually IN the dataset and use that, then ensure re_formula = NULL (or re.form = NULL), and that will give you the conditional means. I've tried that here:

# Bayesian version of the model
bayes_fit <- rstanarm::stan_glmer(
    count ~ year + (1 | week) + (1 | site),
    data = df,
    family = rstanarm::neg_binomial_2(link = "log"),
    iter = 5000,
    warmup = 2000,
    cores = 4
)

predict_data_bayes <- data.frame(
    year = as.character(c(2005:2020)),
    week = "1",
    site = "1"
)
x <- tidybayes::epred_draws(
    object = bayes_fit,
    newdata = predict_data_bayes,
    re_formula = NULL
)
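
As a sanity check on my assumption above, my understanding is that epred_draws() is essentially a tidy wrapper around rstanarm::posterior_epred() (the expected value of the posterior predictive distribution), so something like this should agree with the tidy output (just a sketch; epred_mat is only for this comparison):

# My understanding: epred_draws() reshapes posterior_epred() into long format,
# so the per-year posterior means from the two should match
epred_mat <- rstanarm::posterior_epred(bayes_fit, newdata = predict_data_bayes)
colMeans(epred_mat)  # one value per row of predict_data_bayes, i.e. per year

x %>%
    dplyr::group_by(year) %>%
    dplyr::summarise(mean_epred = mean(.epred))
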
ggplot(
    x,
    aes(
        x = .epred, y = year, fill = year
    )
) +
    tidybayes::stat_halfeye(.width = 0.95) +
    scale_fill_manual(values = c(rep("lightpink", 16))) +
    labs(
        x = "Count", y = "Year",
        subtitle = "Posterior predictions"
    ) +
    theme(legend.position = "bottom") +
    theme_bw()

Which gives me something like this:

This looks right, but I'm worried I'm confusing something here: picking arbitrary values to go in the newdata feels like I'm doing something wrong and somehow getting a conditional effect for THOSE specific values rather than across all random effect levels. I'm hoping someone can tell me whether this is a correct interpretation/implementation of how to move between the two approaches, to get what I know is not the SAME thing, but the analogue of a model prediction from a frequentist model.


OK so I don’t know how to actually FIX this, but I am now 99.9999% sure I’m doing this wrong. I made a particular level of one of the random effects (site == 2) MUCH larger in terms of the mean count value:

replicated_df$count[which(replicated_df$year == 2015 &
    replicated_df$site == 2)] <- stats::rnbinom(
    nrow(replicated_df[
        which(replicated_df$year == 2015 &
            replicated_df$site == 2),
    ]),
    size = 1.5,
    mu = 18.0
)

Now, if I use the frequentist approach (after re-fitting the model, obviously):

predict_data <- data.frame(
    year = as.character(c(2005:2020)),
    week = NA,
    site = "2"
)
predicted_yearly <- data.frame(
    year = as.character(c(2005:2020)),
    stats::predict(
        object = freq_mod,
        newdata = predict_data,
        re.form = ~0,
        se.fit = TRUE,
        type = "response"
    )
)

whether I put "1" or "2" for the site value in the newdata object, I get the same predictions, which look like this:

Which makes sense. And if I change the code to drop re.form = ~0 (leaving the default, which as I understand it is the same as re_formula = NULL) and use site "2" like this,

predict_data <- data.frame(
    year = as.character(c(2005:2020)),
    week = NA,
    site = "2"
)
predicted_yearly <- data.frame(
    year = as.character(c(2005:2020)),
    stats::predict(
        object = freq_mod,
        newdata = predict_data,
        #re.form = ~0,
        se.fit = TRUE,
        type = "response"
    )
)

I get this:

which also makes sense, because in this case the prediction is conditioning on the random-effect estimate for that particular site.
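
A quick sketch of how I convinced myself of both behaviours (assuming freq_mod has been re-fit on the updated df; pd1 and pd2 are just throwaway objects for this check):

# With re.form = ~0 the random effects are dropped entirely, so swapping the
# site value in newdata should change nothing:
pd1 <- data.frame(year = as.character(c(2005:2020)), week = NA, site = "1")
pd2 <- data.frame(year = as.character(c(2005:2020)), week = NA, site = "2")
all.equal(
    stats::predict(freq_mod, newdata = pd1, re.form = ~0, type = "response"),
    stats::predict(freq_mod, newdata = pd2, re.form = ~0, type = "response")
)  # expect TRUE

# With the default re.form (conditioning on the random effects), the same swap
# should change the predictions because the site-specific intercept is used:
all.equal(
    stats::predict(freq_mod, newdata = pd1, type = "response"),
    stats::predict(freq_mod, newdata = pd2, type = "response")
)  # expect a reported difference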

If I reproduce this poorly formed test with the Bayesian method, first with site "1" in the newdata and re_formula = NULL:

predict_data_bayes <- data.frame(
    year = as.character(c(2005:2020)),
    week = "1",
    site = "1"
)
x <- tidybayes::epred_draws(
    object = bayes_fit,
    newdata = predict_data_bayes,
    re_formula = NULL
)

I can plot and get

Which does NOT make sense: first of all, the values are far different from the maximum likelihood estimates, which shouldn't happen; but also, if I keep re_formula = NULL and put site = "2" in the newdata, I get this:

which indicates the predictions ARE being conditioned on whatever random-effect levels I've put in the newdata object, even though re_formula = NULL.

Confusion reigns!

Can you add the code to generate bayes_fit? That will make your example reproducible and also ensure that we know exactly how that model fit was generated.

Whoops, sorry @joels, I missed copying that over; just edited the post to include it. Thanks!

After some more troubleshooting and finding this helpful answer on a related post from @bwiernik here, I suspect the right answer is just that re_formula = NA instead of NULL will do what I want? So something like this?

x <- df %>%
    # dplyr::filter(!is.na(site), !is.na(week)) %>%
    modelr::data_grid(year, site, week) %>%
    tidybayes::add_epred_draws(bayes_fit, re_formula = NA)

ggplot(x, aes(x = .epred, y = year, fill = year)) +
    tidybayes::stat_halfeye(.width = 0.95) +
    scale_fill_manual(values = c(rep("lightpink", 16))) +
    labs(
        x = "Count", y = "Year",
        subtitle = "Posterior predictions"
    ) +
    theme(legend.position = "bottom") +
    theme_bw()

which gives this:

Which feels right??
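
One extra check that helped me convince myself: summarising those population-level draws per year and putting them next to the glmmTMB population-level (re.form = ~0) predictions, after re-fitting freq_mod on the updated data. They are not the same quantity, but they should tell a similar story (predicted_yearly_pop is just a new object for this comparison):

# Frequentist population-level predictions (re.form = ~0), recomputed so it is
# clear which version of the glmmTMB predictions we are comparing against
predicted_yearly_pop <- data.frame(
    year = as.character(c(2005:2020)),
    stats::predict(
        object = freq_mod,
        newdata = data.frame(
            year = as.character(c(2005:2020)),
            week = NA,
            site = NA
        ),
        re.form = ~0,
        se.fit = TRUE,
        type = "response"
    )
)

# Posterior mean of the population-level expectation per year; with
# re_formula = NA the site/week columns in the grid do not change .epred,
# so averaging over them is harmless
bayes_summary <- x %>%
    dplyr::ungroup() %>%
    dplyr::mutate(year = as.character(year)) %>%
    dplyr::group_by(year) %>%
    dplyr::summarise(bayes_mean = mean(.epred))

dplyr::left_join(
    bayes_summary,
    predicted_yearly_pop %>% dplyr::select(year, freq_fit = fit),
    by = "year"
)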