MRP with uniform weighting across levels, despite MNAR in observed data


I have a sample that is (unfortunately) unbalanced with respect to day-of-week, week-of-year, and other calendrical variables that are also important predictors.

I would like to know whether there’s a “shortcut” that avoids creating a full historical or future calendar and passing it as `newdata` in a post-stratification step. I’ve tried three approaches (see below). The third might be OK if I can understand why the credible interval is so wide.

Apologies if this is too off-topic for this forum. You’ve all been gracious in answering my other questions! I’d be glad to post elsewhere if this is the case.


Here’s an example where the true θ varies by day-of-week, such that \Pr(Y = 1) increases as the week goes on. But the probability of missingness also increases as the week goes on.

This reprex illustrates three approaches:

  1. Full calendar as counterfactual newdata
  2. Passing wts based on observed (vs known true) frequencies
  3. Using allow_new_levels = TRUE and making up a level

In the actual setting, there are day-of-week, week-of-year, hour-of-day, and other such variables, but this example only includes one.

With many such variables, #1 is too computationally intensive, #2 gets complex and therefore prone to operator error, and #3 seems OK, except that the width of the credible interval doesn’t match the question I think I’m asking.
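To give a sense of scale for approach #1: the counterfactual grid grows multiplicatively with each added calendar variable. With the (hypothetical) dimensions below it is already ~9k rows, every one of which must be pushed through the posterior:

```r
# Hypothetical full counterfactual grid for approach #1: one row per combination
calendar_full <- expand.grid(
  wday = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  week = seq_len(53),
  hour = 0:23
)
nrow(calendar_full)  # 7 * 53 * 24 = 8904 rows, before adding anything else
```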

Is there a simple argument I’m overlooking, that I could pass to marginaleffects::avg_predictions(), or some other interface that yields PPD draws, to get what basically corresponds to marginalizing across these “calendrical” random variables with uniform weights across the known levels? I need to be able to specify other things in the counterfactual (e.g. year = 2020, …) at the same time.

Happy to add clarification or additional details as needed. Would be grateful to be pointed to a canonical example in some relatively accessible text, along the lines of Statistical Rethinking. Thank you in advance for your time and attention!


library(tidyverse)
library(brms)
library(marginaleffects)

set.seed(1)  # for a reproducible reprex

# Adjust to suit your preferred backend
fit_logistic <- function (...) {
  brm(..., backend = "cmdstanr", family = bernoulli())
}

# Helper functions
rbernoulli <- function (n, p = 0.5) runif(n) < p  # TRUE with probability p
wday_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
fct_wday <- function (x) factor(x, levels = wday_levels)

#' Higher theta for days that are later in the week
#' But also higher probability of missing observations
wday_theta <- set_names(seq(0.3, 0.7, length.out = 7), wday_levels)
wday_pr_na <- set_names(seq(0.2, 0.8, length.out = 7), wday_levels)

# All the data we wish we had
all_data <-
  tibble(
    wday = fct_wday(wday_levels),
    theta = wday_theta[wday],
    Y = map(theta, ~ rbernoulli(1000, p = .))) %>%
  unnest(cols = "Y") %>%
  mutate(Y = as.integer(Y))  # keep Y as 0/1 for brms

# Observed data: drop each row with probability `wday_pr_na[wday]`,
# so later (higher-theta) days are under-represented
obs_data <-
  filter(all_data, rbernoulli(n(), p = 1 - wday_pr_na[wday]))

summarise(all_data, n = n(), mean = mean(Y)) # expect 0.5
summarise(obs_data, n = n(), mean = mean(Y)) # biased low
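A quick analytic check, using only the simulation constants defined above, shows how large the bias in the raw observed mean should be:

```r
# True marginal mean: uniform average of the seven per-day thetas
theta_true <- seq(0.3, 0.7, length.out = 7)
# Retention probability per day (complement of the missingness probability)
p_keep <- 1 - seq(0.2, 0.8, length.out = 7)

mean(theta_true)                            # 0.5: the target estimand
sum(theta_true * p_keep) / sum(p_keep)      # ~0.447: what the raw observed mean targets
```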

fit <- fit_logistic(Y ~ (1|wday), data = obs_data)

avg_predictions(fit) # biased low, as expected

#' This approach does the job efficiently in the simple case.
#' But, this approach gets expensive as the counterfactual `calendar`
#' gains more dimensions (day-of-week, week-of-year, hour-of-day, ...).
calendar <- tibble(wday = fct_wday(wday_levels))
avg_predictions(fit, newdata = calendar)

#' Using `wts` also works in this simplified example, but hand-computing
#' correct weights across several crossed calendar variables is error-prone
#' in the non-simplified real case
wday_obs <- deframe(count(obs_data, wday))
wday_wts <- sum(wday_obs) / wday_obs
avg_predictions(fit, wts = wday_wts[fit$data$wday])
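As a sanity check on the weighting logic (with hypothetical observed counts standing in for `count(obs_data, wday)`), inverse-frequency weights of this form make every day’s weighted share exactly uniform:

```r
# Hypothetical per-day counts; the real ones come from count(obs_data, wday)
n_d <- c(Sun = 800, Mon = 700, Tue = 600, Wed = 500, Thu = 400, Fri = 300, Sat = 200)
w <- sum(n_d) / n_d            # same construction as `wday_wts` above
n_d * w / sum(n_d * w)         # each day's weighted share is 1/7
```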

#' `allow_new_levels = TRUE` seems to yield the desired expectation (?)
#' but the credible interval seems too wide: it includes between-level
#' uncertainty about a hypothetical *new* day-of-week, whereas I know the
#' levels and just want to average over them uniformly
avg_predictions(fit, allow_new_levels = TRUE, newdata = tibble(wday = "(all)"))