Computing average marginal effects for logistic regression when exploiting sufficient statistics

Hello,
I’m trying to compute average marginal effects for binary logistic regression using brms. There is no problem when working with the full data, but when trying run the model on aggregated data (to save on computation time), I can’t seem to get it quite right.

Here are the models and packages used:

library(dplyr)
library(palmerpenguins)
library(brms)
library(marginaleffects)

#Aggregating data
penguins_agg <- penguins |> 
  mutate(total = 1,
         male = sex == "male") |> 
  summarise(male = sum(male),
            total = sum(total),
            .by = body_mass_g)


m_full <- brm(sex ~ body_mass_g, data = penguins, family = bernoulli()) # Model on "full" dataset
m_agg <- brm(male | resp_trials(total) ~ body_mass_g, data = penguins_agg, family = binomial()) # Model on aggregated data

Both of these model return the same predicted probabilities, so they should be equivalent. However, I can’t seem to get the AMEs right. I understand that with the aggregated data, cases have to be weighted. I tried weighting by the proportion of observations in the full dataset, but that doesn’t work:

w <- penguins_agg$total / sum(penguins_agg$total) # n / N weights

avg_slopes(m_full, variables = "body_mass_g")
#       Term Estimate    2.5 %   97.5 %
# body_mass_g 0.000254 0.000203 0.000297

avg_slopes(m4, variables = "body_mass_g", wts = w)
#        Term Estimate   2.5 %  97.5 %
# body_mass_g  0.00143 0.00113 0.00169

My question is, what the correct weight should be for the AMEs to match? Thanks in advance!

2 Likes

I’m not sure whether you can do it by weighting. An alternative is to use the original data as the newdata grid in the avg_slopes call.

Prepare the original data frame, so that it has the same columns as the aggregate data frame: male, total (column of 1s) and body_weight:

data_01s <- penguins |>
  # Create an outcome column and a predictor column
  # of the same types as in the aggregate data.
  mutate(
    male = 1L * (sex == "male"),
    total = 1L
  ) |>
  # There are some rows with NaNs;
  # drop them before aggregating.
  drop_na(
    male, body_mass_g
  )

To get avg_slopes reproduce the disaggregated marginal estimates from the aggregated fit, use the original data_01s as its newdata argument. (The model m_agg expects two columns: the predictor body_mass_g & the sample size total; that’s why we added a column of 1s called total.)

avg_slopes(m_agg, newdata = data_01s, variables = "body_mass_g")
#>         Term Estimate    2.5 %   97.5 %
#>  body_mass_g 0.000254 0.000204 0.000297

Update: There are no weights w that give the correct estimate of the marginal effect with avg_slopes(m_agg, wts = w).

You can use one of these two calculations instead.

# Model `m_agg` uses weights. In the original, un-collapsed dataset
# each observation is a subset of sample size of 1.
# avg_slopes(m_agg, newdata = data_01s %>% mutate(n = 1))
#>  Term Estimate    2.5 %   97.5 %
#>     x 0.000256 0.000205 0.000297

# The sample sizes `n` get in the way of the `posterior_epred` calculation.
# So "turn those off" by setting `n = 1` and use the weights argument instead.
avg_slopes(m_agg, newdata = data_agg %>% mutate(n = 1), wts = data_agg$n)
#>  Term Estimate    2.5 %   97.5 %
#>     x 0.000256 0.000205 0.000297

More details in this marginaleffects issue which I submitted (unnecessarily, it turns out as there is no bug).

Amazing, thank you very much!