## Summary

I have a sample that is (unfortunately) unbalanced with respect to day-of-week, week-of-year, and other calendrical variables that are also important predictors.

I would like to know if there’s a “shortcut” to creating a full historical or future calendar and then using it as `newdata` in a post-stratification step. I’ve tried three approaches (see below). The third might be OK if I can understand why the credible interval is so large.

Apologies if this is too off-topic for this forum. You’ve all been gracious in answering my other questions! I’d be glad to post elsewhere if this is the case.

## Example

Here’s an example where the true θ varies by day-of-week, such that Pr(Y = 1) increases as the week goes on. But the probability of missingness also increases as the week goes on.
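With the numbers used in the reprex below, the size of the bias can be worked out directly: the true marginal mean averages θ uniformly over the seven days, while the observed data average it with weights proportional to each day’s retention probability, 1 − Pr(missing). A quick base-R sanity check:

```r
wday_theta <- seq(0.3, 0.7, length.out = 7)  # true Pr(Y = 1) by day
wday_pr_na <- seq(0.2, 0.8, length.out = 7)  # Pr(observation missing) by day

# True marginal mean: uniform weights over the seven days
true_mean <- mean(wday_theta)                          # 0.5

# Expected observed mean: weights proportional to retention 1 - Pr(NA)
obs_mean <- weighted.mean(wday_theta, 1 - wday_pr_na)  # ~0.447, biased low

c(true = true_mean, observed = obs_mean)
```

So even with the model perfectly recovering each day’s θ, a naive average over the observed rows lands near 0.447 rather than 0.5.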

This reprex illustrates three approaches:

1. Full `calendar` as counterfactual `newdata`
2. Passing `wts` based on observed (vs. known true) frequencies
3. Using `allow_new_levels = TRUE` and making up a level

In the actual setting, there are day-of-week, week-of-year, hour-of-day, and other such variables, but this example only includes one.

With many such variables, #1 is too computationally intensive, #2 gets really complex and therefore prone to operator error, and #3 seems OK, but I don’t think its credible interval reflects the question I actually want to ask.
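To make the cost of #1 concrete: the full crossed calendar grows multiplicatively with each calendrical dimension. A purely illustrative grid (the `week` and `hour` dimensions here are hypothetical, not part of the reprex) with day-of-week, week-of-year, and hour-of-day:

```r
# Illustrative only: crossing three calendrical dimensions
wday_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
calendar <- expand.grid(
  wday = wday_levels,
  week = 1:53,
  hour = 0:23)

nrow(calendar)  # 7 * 53 * 24 = 8904 rows, each needing full posterior draws
```

With, say, 4000 posterior draws, that is already tens of millions of predictions per call, before any further dimensions are added.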

Is there a simple argument I’m overlooking that I could pass to `marginaleffects::avg_predictions()`, or some other interface that yields PPD draws, to get what basically corresponds to marginalizing across these “calendrical” random variables with uniform weights across the known levels? I need to be able to specify other things in the counterfactual (e.g. `year = 2020`, …) at the same time.
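For reference, here is the manual computation I believe corresponds to what I’m asking for, on a toy draws-by-levels matrix (the real matrix would come from something like `posterior_epred(fit, newdata = calendar)`):

```r
set.seed(1)
# Toy stand-in for a draws x levels matrix of per-level probabilities,
# e.g. posterior_epred(fit, newdata = calendar): 4000 draws x 7 wday levels
epred <- matrix(runif(4000 * 7, 0.3, 0.7), nrow = 4000, ncol = 7)

# Uniform weights across the known levels: average within each draw ...
marginal_draws <- rowMeans(epred)

# ... then summarise across draws for the point estimate and interval
c(mean = mean(marginal_draws),
  quantile(marginal_draws, c(0.025, 0.975)))
```

The interval I want is the one over these per-draw marginal means, not one that also includes new-group uncertainty.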

Happy to add clarification or additional details as needed. Would be grateful to be pointed to a canonical example in some relatively accessible text, along the lines of *Statistical Rethinking*. Thank you in advance for your time and attention!

```r
library(tidyverse)
library(marginaleffects)
library(brms)

# Adjust to suit your preferred backend
fit_logistic <- function(...) {
  brm(..., backend = "cmdstanr", family = bernoulli())
}

# Helper functions; return 0/1 integers for bernoulli()
rbernoulli <- function(n, p = 0.5) as.integer(runif(n) < p)
wday_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
fct_wday <- function(x) factor(x, levels = wday_levels)

#' Higher theta for days that are later in the week,
#' but also higher probability of missing observations
wday_theta <- set_names(seq(0.3, 0.7, length.out = 7), wday_levels)
wday_pr_na <- set_names(seq(0.2, 0.8, length.out = 7), wday_levels)

# All the data we wish we had
all_data <-
  tibble(
    wday = fct_wday(wday_levels),
    theta = wday_theta[wday],
    Y = map(theta, ~ rbernoulli(1000, p = .))) %>%
  unnest(cols = "Y")

# Observed data: sampling is biased against late-week days
obs_data <-
  sample_frac(
    all_data,
    1 / 3,
    weight = 1 - wday_pr_na[all_data$wday])

summarise(all_data, n = n(), mean = mean(Y)) # expect ~0.5
summarise(obs_data, n = n(), mean = mean(Y)) # biased low

fit <- fit_logistic(Y ~ (1 | wday), data = obs_data)
summary(fit)
avg_predictions(fit) # biased low, as expected

#' Approach 1: full counterfactual `calendar` as `newdata`.
#' This does the job efficiently in the simple case, but gets expensive
#' as the calendar gains more dimensions
#' (day-of-week, week-of-year, hour-of-day, ...).
calendar <- tibble(wday = fct_wday(wday_levels))
avg_predictions(fit, newdata = calendar)

#' Approach 2: `wts` also works in this simplified example, but is
#' tricky to get right in the non-simplified real case.
wday_obs <- deframe(count(obs_data, wday))
wday_wts <- sum(wday_obs) / wday_obs
avg_predictions(fit, wts = wday_wts[fit$data$wday])

#' Approach 3: `allow_new_levels = TRUE` seems to yield the desired
#' expectation (?), but the credible interval seems too big: it's not
#' that I don't know which level we're talking about and want that
#' uncertainty too.
avg_predictions(fit, allow_new_levels = TRUE, newdata = tibble(wday = "(all)"))
```