MRP with uniform weighting across levels, despite MNAR in observed data


I have a sample that is (unfortunately) unbalanced with respect to day-of-week, week-of-year, and other calendrical variables that are also important predictors.

I would like to know whether there’s a “shortcut” that avoids creating a full historical or future calendar and passing it as `newdata` in a post-stratification step. I’ve tried three approaches (see below). The third might be OK if I can understand why the credible interval is so wide.

Apologies if this is too off-topic for this forum. You’ve all been gracious in answering my other questions! I’d be glad to post elsewhere if this is the case.


Here’s an example where the true θ varies by day-of-week, such that \Pr(Y = 1) increases as the week goes on. But the probability of missingness also increases as the week goes on.

This reprex illustrates three approaches:

  1. Full calendar as counterfactual newdata
  2. Passing wts based on observed (vs known true) frequencies
  3. Using allow_new_levels = TRUE and making up a level

In the actual setting, there are day-of-week, week-of-year, hour-of-day, and other such variables, but this example only includes one.

With many such variables, #1 is too computationally intensive, #2 gets complex and therefore prone to operator error, and #3 seems OK, except that the width of the credible interval doesn’t match the question I think I’m asking.
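To give a sense of scale for approach #1: the counterfactual grid grows multiplicatively with each added calendar variable. With the (hypothetical) dimensions below it is already ~9k rows, every one of which must be pushed through the posterior:

```r
# Hypothetical full counterfactual grid for approach #1: one row per combination
calendar_full <- expand.grid(
  wday = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  week = seq_len(53),
  hour = 0:23
)
nrow(calendar_full)  # 7 * 53 * 24 = 8904 rows, before adding anything else
```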

Is there a simple argument I’m overlooking, that I could pass to marginaleffects::avg_predictions(), or some other interface that yields PPD draws, to get what basically corresponds to marginalizing across these “calendrical” random variables with uniform weights across the known levels? I need to be able to specify other things in the counterfactual (e.g. year = 2020, …) at the same time.

Happy to add clarification or additional details as needed. Would be grateful to be pointed to a canonical example in some relatively accessible text, along the lines of Statistical Rethinking. Thank you in advance for your time and attention!


library(tidyverse)
library(brms)
library(marginaleffects)

set.seed(1)  # for a reproducible reprex

# Adjust to suit your preferred backend
fit_logistic <- function (...) {
  brm(..., backend = "cmdstanr", family = bernoulli())
}

# Helper functions
rbernoulli <- function (n, p = 0.5) runif(n) < p  # TRUE with probability p
wday_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
fct_wday <- function (x) factor(x, levels = wday_levels)

#' Higher theta for days that are later in the week
#' But also higher probability of missing observations
wday_theta <- set_names(seq(0.3, 0.7, length.out = 7), wday_levels)
wday_pr_na <- set_names(seq(0.2, 0.8, length.out = 7), wday_levels)

# All the data we wish we had
all_data <-
  tibble(
    wday = fct_wday(wday_levels),
    theta = wday_theta[wday],
    Y = map(theta, ~ rbernoulli(1000, p = .))) %>%
  unnest(cols = "Y") %>%
  mutate(Y = as.integer(Y))  # keep Y as 0/1 for brms

# Observed data: drop each row with probability `wday_pr_na[wday]`,
# so later (higher-theta) days are under-represented
obs_data <-
  filter(all_data, rbernoulli(n(), p = 1 - wday_pr_na[wday]))

summarise(all_data, n = n(), mean = mean(Y)) # expect 0.5
summarise(obs_data, n = n(), mean = mean(Y)) # biased low
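A quick analytic check, using only the simulation constants defined above, shows how large the bias in the raw observed mean should be:

```r
# True marginal mean: uniform average of the seven per-day thetas
theta_true <- seq(0.3, 0.7, length.out = 7)
# Retention probability per day (complement of the missingness probability)
p_keep <- 1 - seq(0.2, 0.8, length.out = 7)

mean(theta_true)                            # 0.5: the target estimand
sum(theta_true * p_keep) / sum(p_keep)      # ~0.447: what the raw observed mean targets
```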

fit <- fit_logistic(Y ~ (1|wday), data = obs_data)

avg_predictions(fit) # biased low, as expected

#' This approach does the job efficiently in the simple case.
#' But, this approach gets expensive as the counterfactual `calendar`
#' gains more dimensions (day-of-week, week-of-year, hour-of-day, ...).
calendar <- tibble(wday = fct_wday(wday_levels))
avg_predictions(fit, newdata = calendar)

#' Using `wts` also works in this simplified example, but hand-computing
#' correct weights across several crossed calendar variables is error-prone
#' in the non-simplified real case
wday_obs <- deframe(count(obs_data, wday))
wday_wts <- sum(wday_obs) / wday_obs
avg_predictions(fit, wts = wday_wts[fit$data$wday])
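As a sanity check on the weighting logic (with hypothetical observed counts standing in for `count(obs_data, wday)`), inverse-frequency weights of this form make every day’s weighted share exactly uniform:

```r
# Hypothetical per-day counts; the real ones come from count(obs_data, wday)
n_d <- c(Sun = 800, Mon = 700, Tue = 600, Wed = 500, Thu = 400, Fri = 300, Sat = 200)
w <- sum(n_d) / n_d            # same construction as `wday_wts` above
n_d * w / sum(n_d * w)         # each day's weighted share is 1/7
```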

#' `allow_new_levels = TRUE` seems to yield the desired expectation (?)
#' but the credible interval seems too wide: it includes between-level
#' uncertainty about a hypothetical *new* day-of-week, whereas I know the
#' levels and just want to average over them uniformly
avg_predictions(fit, allow_new_levels = TRUE, newdata = tibble(wday = "(all)"))