# Computing average marginal effects for logistic regression when exploiting sufficient statistics

Hello,
I’m trying to compute average marginal effects for binary logistic regression using brms. There is no problem when working with the full data, but when trying run the model on aggregated data (to save on computation time), I can’t seem to get it quite right.

Here are the models and packages used:

``````
library(dplyr)
library(palmerpenguins)
library(brms)
library(marginaleffects)

# Aggregate the data: one row per unique body mass,
# with the number of males and the total count.
penguins_agg <- penguins |>
  mutate(total = 1,
         male = sex == "male") |>
  summarise(male = sum(male),
            total = sum(total),
            .by = body_mass_g)

# Model on the "full" (disaggregated) dataset
m_full <- brm(sex ~ body_mass_g, data = penguins, family = bernoulli())

# Model on the aggregated data
m_agg <- brm(male | trials(total) ~ body_mass_g, data = penguins_agg, family = binomial())
``````
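One way to check that the two fits are equivalent is to compare their posterior mean probabilities on a common grid (a sketch, assuming the fits above have run; note that for the binomial model `posterior_epred()` returns expected counts `total * p`, so set `total = 1` to get probabilities directly):

``````
grid <- data.frame(body_mass_g = seq(3000, 6000, by = 500), total = 1)
p_full <- colMeans(posterior_epred(m_full, newdata = grid))
p_agg  <- colMeans(posterior_epred(m_agg,  newdata = grid))
round(cbind(p_full, p_agg), 3)  # columns should agree up to MCMC error
``````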

Both of these models return the same predicted probabilities, so they should be equivalent. However, I can’t seem to get the AMEs to match. I understand that with the aggregated data, rows have to be weighted. I tried weighting by the proportion of observations in the full dataset, but that doesn’t work:

``````
w <- penguins_agg$total / sum(penguins_agg$total) # n / N weights

avg_slopes(m_full, variables = "body_mass_g")
#        Term Estimate    2.5 %   97.5 %
# body_mass_g 0.000254 0.000203 0.000297

avg_slopes(m_agg, variables = "body_mass_g", wts = w)
#        Term Estimate   2.5 %  97.5 %
# body_mass_g  0.00143 0.00113 0.00169
``````

My question is: what should the weights be for the AMEs to match? Thanks in advance!


I’m not sure whether you can do it by weighting. An alternative is to use the original data as the `newdata` grid in the `avg_slopes` call.

Prepare the original data frame so that it has the same columns as the aggregated data frame: `male`, `total` (a column of 1s), and `body_mass_g`:

``````
library(tidyr) # for drop_na()

data_01s <- penguins |>
  # Create an outcome column and a trials column
  # of the same types as in the aggregated data.
  mutate(
    male = 1L * (sex == "male"),
    total = 1L
  ) |>
  # Some rows have NAs; drop them.
  drop_na(male, body_mass_g)
``````

To get `avg_slopes` to reproduce the disaggregated marginal estimates from the aggregated fit, pass the original `data_01s` as its `newdata` argument. (The model `m_agg` expects two columns: the predictor `body_mass_g` and the number of trials `total`; that’s why we added a column of 1s called `total`.)

``````avg_slopes(m_agg, newdata = data_01s, variables = "body_mass_g")
#>         Term Estimate    2.5 %   97.5 %
#>  body_mass_g 0.000254 0.000204 0.000297
``````

Update: there are no weights `w` for which `avg_slopes(m_agg, wts = w)` gives the correct estimate of the marginal effect.

You can use one of these two calculations instead.

``````
# Option 1: use the original, un-collapsed dataset as `newdata`;
# there each observation is a "sample" of size 1 (total = 1).
avg_slopes(m_agg, newdata = data_01s, variables = "body_mass_g")
#>         Term Estimate    2.5 %   97.5 %
#>  body_mass_g 0.000256 0.000205 0.000297

# Option 2: the trial counts `total` get in the way of the
# `posterior_epred` calculation, so "turn them off" by setting
# total = 1 and pass the counts through the weights argument instead.
avg_slopes(m_agg,
           newdata = penguins_agg %>% mutate(total = 1),
           wts = penguins_agg$total,
           variables = "body_mass_g")
#>         Term Estimate    2.5 %   97.5 %
#>  body_mass_g 0.000256 0.000205 0.000297
``````
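To see why `total` has to be set to 1, here is a sketch of the arithmetic with made-up coefficients (not the fitted values). For the binomial model, `posterior_epred()` returns the expected count `total * p`, so the per-row slope is `total * p * (1 - p) * b`, while the AME on the probability scale is the count-weighted average of `p * (1 - p) * b`. Weighting the count-scale slopes therefore counts `total` twice:

``````
b0 <- -27; b1 <- 0.0065               # hypothetical logit coefficients
x  <- c(3500, 4000, 4500, 5000)       # unique body masses
n  <- c(10, 20, 15, 5)                # penguins per unique mass
p  <- plogis(b0 + b1 * x)             # P(male) at each mass
slope <- p * (1 - p) * b1             # dp/dx on the probability scale

# The target AME: count-weighted average of probability-scale slopes
ame <- weighted.mean(slope, w = n)

# What weighting the aggregated fit averages instead: slopes of the
# expected count n * p, i.e. n * p * (1 - p) * b1, which is too big
wrong <- weighted.mean(n * slope, w = n)

c(ame = ame, wrong = wrong)  # `wrong` overstates the AME
``````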

More details are in this `marginaleffects` issue, which I submitted (unnecessarily, it turns out, as there was no bug).

Amazing, thank you very much!