MRP with uniform weighting across levels, despite MNAR in observed data


I have a sample that is (unfortunately) unbalanced with respect to day-of-week, week-of-year, and other calendrical variables that are also important predictors.

I would like to know whether there’s a “shortcut” that avoids creating a full historical or future calendar and passing it as `newdata` in a post-stratification step. I’ve tried three approaches (see below). The third might be OK if I can understand why the credible interval is so wide.

Apologies if this is too off-topic for this forum. You’ve all been gracious in answering my other questions! I’d be glad to post elsewhere if this is the case.


Here’s an example where the true θ varies by day-of-week, such that \Pr(Y = 1) increases as the week goes on. But the probability of missingness also increases as the week goes on.

This reprex illustrates three approaches:

  1. Full calendar as counterfactual newdata
  2. Passing wts based on observed (vs known true) frequencies
  3. Using allow_new_levels = TRUE and making up a level

In the actual setting, there are day-of-week, week-of-year, hour-of-day, and other such variables, but this example only includes one.

With many such variables, #1 is too computationally intensive, #2 gets complex and therefore prone to operator error, and #3 seems OK, except that the width of the credible interval doesn’t match the question I think I’m asking.
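To give a sense of scale for approach #1: the counterfactual grid grows multiplicatively with each added calendar variable. With the (hypothetical) dimensions below it is already ~9k rows, every one of which must be pushed through the posterior:

```r
# Hypothetical full counterfactual grid for approach #1: one row per combination
calendar_full <- expand.grid(
  wday = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  week = seq_len(53),
  hour = 0:23
)
nrow(calendar_full)  # 7 * 53 * 24 = 8904 rows, before adding anything else
```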

Is there a simple argument I’m overlooking, that I could pass to marginaleffects::avg_predictions(), or some other interface that yields PPD draws, to get what basically corresponds to marginalizing across these “calendrical” random variables with uniform weights across the known levels? I need to be able to specify other things in the counterfactual (e.g. year = 2020, …) at the same time.

Happy to add clarification or additional details as needed. Would be grateful to be pointed to a canonical example in some relatively accessible text, along the lines of Statistical Rethinking. Thank you in advance for your time and attention!


library(tidyverse)
library(brms)
library(marginaleffects)

set.seed(1)  # for a reproducible reprex

# Adjust to suit your preferred backend
fit_logistic <- function (...) {
  brm(..., backend = "cmdstanr", family = bernoulli())
}

# Helper functions
rbernoulli <- function (n, p = 0.5) runif(n) < p  # TRUE with probability p
wday_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
fct_wday <- function (x) factor(x, levels = wday_levels)

#' Higher theta for days that are later in the week
#' But also higher probability of missing observations
wday_theta <- set_names(seq(0.3, 0.7, length.out = 7), wday_levels)
wday_pr_na <- set_names(seq(0.2, 0.8, length.out = 7), wday_levels)

# All the data we wish we had
all_data <-
  tibble(
    wday = fct_wday(wday_levels),
    theta = wday_theta[wday],
    Y = map(theta, ~ rbernoulli(1000, p = .))) %>%
  unnest(cols = "Y") %>%
  mutate(Y = as.integer(Y))  # keep Y as 0/1 for brms

# Observed data: drop each row with probability `wday_pr_na[wday]`,
# so later (higher-theta) days are under-represented
obs_data <-
  filter(all_data, rbernoulli(n(), p = 1 - wday_pr_na[wday]))

summarise(all_data, n = n(), mean = mean(Y)) # expect 0.5
summarise(obs_data, n = n(), mean = mean(Y)) # biased low
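A quick analytic check, using only the simulation constants defined above, shows how large the bias in the raw observed mean should be:

```r
# True marginal mean: uniform average of the seven per-day thetas
theta_true <- seq(0.3, 0.7, length.out = 7)
# Retention probability per day (complement of the missingness probability)
p_keep <- 1 - seq(0.2, 0.8, length.out = 7)

mean(theta_true)                            # 0.5: the target estimand
sum(theta_true * p_keep) / sum(p_keep)      # ~0.447: what the raw observed mean targets
```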

fit <- fit_logistic(Y ~ (1|wday), data = obs_data)

avg_predictions(fit) # biased low, as expected

#' This approach does the job efficiently in the simple case.
#' But, this approach gets expensive as the counterfactual `calendar`
#' gains more dimensions (day-of-week, week-of-year, hour-of-day, ...).
calendar <- tibble(wday = fct_wday(wday_levels))
avg_predictions(fit, newdata = calendar)

#' Using `wts` also works in this simplified example, but hand-computing
#' correct weights across several crossed calendar variables is error-prone
#' in the non-simplified real case
wday_obs <- deframe(count(obs_data, wday))
wday_wts <- sum(wday_obs) / wday_obs
avg_predictions(fit, wts = wday_wts[fit$data$wday])
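As a sanity check on the weighting logic (with hypothetical observed counts standing in for `count(obs_data, wday)`), inverse-frequency weights of this form make every day’s weighted share exactly uniform:

```r
# Hypothetical per-day counts; the real ones come from count(obs_data, wday)
n_d <- c(Sun = 800, Mon = 700, Tue = 600, Wed = 500, Thu = 400, Fri = 300, Sat = 200)
w <- sum(n_d) / n_d            # same construction as `wday_wts` above
n_d * w / sum(n_d * w)         # each day's weighted share is 1/7
```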

#' `allow_new_levels = TRUE` seems to yield the desired expectation (?)
#' but the credible interval seems too wide: it includes between-level
#' uncertainty about a hypothetical *new* day-of-week, whereas I know the
#' levels and just want to average over them uniformly
avg_predictions(fit, allow_new_levels = TRUE, newdata = tibble(wday = "(all)"))