So I am trying to fit a relatively simple GLMM with a single categorical fixed effect, some random effects, and a negative binomial distribution. I’m still relatively new to rstanarm
and want to make sure if I’m going about generating the posterior predictions the right way, in the same spirit of what you would do with your frequentist approach. In my data I have a number of years of count data and I want yearly model-predicted estimates of counts. I know one year has much higher counts than others. Here’s a workable replica of those data:
library(glmmTMB)
library(dplyr)
library(magrittr)
library(ggplot2)
library(rstanarm)
df <- data.frame(
expand.grid(
week = c(1:18),
year = c(2005:2020),
site = c(1:15)
)
) %>%
dplyr::rowwise() %>%
dplyr::mutate(
n_obs = sample(c(4:10), 1, replace = TRUE)
)
# Vector specifying the number of times to replicate each row
replication_times <- df$n_obs
# Replicate rows according to the specified pattern
replicated_df <- df[
rep(row.names(df), times = replication_times),
c("week", "site", "year")
]
replicated_df$count <- stats::rnbinom(
nrow(replicated_df),
size = 0.112,
mu = 0.214
)
replicated_df$count[which(replicated_df$year == 2015)] <- stats::rnbinom(
nrow(replicated_df[which(replicated_df$year == 2015), ]),
size = 0.45,
mu = 3.0
)
df <- replicated_df %>%
dplyr::mutate(
week = as.factor(week),
year = as.factor(year),
size = as.factor(site)
)
So now in a frequentist framework (I’m familiar with glmmTMB so I use that) I would just fit the model, and get the predictions:
df <- replicated_df %>%
dplyr::mutate(
week = as.factor(week),
year = as.factor(year),
size = as.factor(site)
)
freq_mod <- glmmTMB::glmmTMB(
count ~ year + (1 | week) + (1 | site),
data = df,
family = nbinom2
)
predict_data <- data.frame(
year = as.character(c(2005:2020)),
week = NA,
site = NA
)
predicted_yearly <- data.frame(
year = as.character(c(2005:2020)),
stats::predict(
object = freq_mod,
newdata = predict_data,
re.form = ~0,
se.fit = TRUE,
type = "response"
)
)
ggplot(predicted_yearly) +
geom_errorbar(aes(x = year, ymin = fit - se.fit, ymax = fit + se.fit)) +
geom_point(aes(x = year, y = fit)) +
theme_bw()
bayes comparison
As far as I understand it for the posterior predictive check, I was following this guide I want to use the tidybayes::epred_draws()
function which will operate similarly to stats::predict()
(and I assume exactly the same as rstanarm::posterior_predict()
but just in a tidy dataframe?
However, herein lies my confusion. For the newdata
argument, in the stats::predict()
case I would just put NA
values for my random effects, set re.form = NA, type = "response"
and I get the model-predicted yearly values given my model structure. In this case, I can’t have NA
’s in the newdata
so you just arbitrarily pick a value for the two random effects that’s actually IN the dataset and use that, then ensure re_formulation = NULL
or re.form = NULL
and that will give you the conditional means. I’ve tried that here:
# bayesian version of the model
bayes_fit <- rstanarm::stan_glmer(
count ~ year + (1 | week) + (1 | site),
data = df,
family = rstanarm::neg_binomial_2(link = "log"),
iter = 5000,
warmup = 2000,
cores = 4
)
predict_data_bayes <- data.frame(
year = as.character(c(2005:2020)),
week = "1",
site = "one"
)
x <- tidybayes::epred_draws(
object = bayes_fit,
newdata = predict_data_bayes,
re_formula = NULL
)
ggplot(
x,
aes(
x = .epred, y = year, fill = year
)
) +
tidybayes::stat_halfeye(.width = 0.95) +
scale_fill_manual(values = c(rep("lightpink", 16))) +
labs(
x = "Count", y = "Year",
subtitle = "Posterior predictions"
) +
theme(legend.position = "bottom") +
theme_bw()
Which gives me something like this:
This looks right, but I’m worried I’m confusing something here, and the picking of something arbitrary to go in the newdata
feels like I’m doing something wrong and just somehow getting a conditional effect for THOSE specific values not across all random effect levels. Hoping someone can tell me if this is a correct interpretation/implementation of how to move between these two approaches to get what I know is not the SAME thing, but the analogous thing to a model-prediction from a frequentist model.