What grid does passing a brms model to emmeans use when a predictor grid is not provided?

I am confused about what emmeans is averaging over with brms.

After fitting a generalized linear model with brms, I can pass the model object to emmeans() to obtain the posterior for marginal means the outcome on the probability or response scale. Since probabilities are non-linear in the parameters, the distribution of predictors affects these predictions. Accordingly, we need to specify the distribution/grid we want to average over for the marginal means.

To illustrate the source of confusion, let’s simulate some data and fit a beta regression. We’re interested in the marginal means of y for values of x, averaging over z.

library(brms)
library(emmeans)
set.seed(123)

# simulate data
df <- data.frame(y = runif(100),
           x = sample(0:1, size = 100, replace = T),
           z = as.character(sample(1:4, size = 100, replace = T)))

# fit beta model
mod <- brms::brm(formula = y ~ x + z + x:z,
                 data = df, 
                 family = Beta(link = "logit", link_phi = "log"),
                 seed = 123)

Doesn’t include predictors in grid

If we don’t specify anything about the predictor grid, the resulting object does not contain any information about the distribution of the predictor we’re averaging over z. What distribution is it implicitly assuming?

# Specify nothing about grid
nogrid <- emmeans(object = mod, ~x, epred = TRUE)
nogrid 
#>  x emmean lower.HPD upper.HPD
#>  0  0.545     0.466     0.622
#>  1  0.459     0.386     0.530
#> 
#> Point estimate displayed: median 
#> HPD interval probability: 0.95

nogrid@grid 
#>   x .wgt.
#> 1 0    46
#> 2 1    54

If we explicitly tell emmeans to weight each cell of the grid by its frequency in the sample, we get identical implied weights and still no mention of z in the grid/output.

# Specify that expected values should be based on probabilities for each cell
# of the predictor grid, where the cell's contribution is weighted by frequency
weighted <- emmeans(object = mod, ~x, weights = "cells", epred = TRUE)
weighted 
#>  x emmean lower.HPD upper.HPD
#>  0  0.541     0.460     0.614
#>  1  0.462     0.393     0.535
#> 
#> Point estimate displayed: median 
#> HPD interval probability: 0.95
weighted@grid 
#>   x .wgt.
#> 1 0    46
#> 2 1    54

It also does not appear possible to use the at argument either to fix this, as the below code produces the same issue (although somewhat different estimates).

emmeans(object = mod, ~x, weights = "cells", 
                     at = list(z = unique(df$z)), epred = TRUE)

Expected grid

While none of the above appear to average over z, I was expecting them to use a predictor grid like the below:

# What I was expecting:
grid_expected <- ref_grid(mod, weights = "cells")
grid_expected@grid
#>   x z .wgt.
#> 1 0 1    15
#> 2 1 1    17
#> 3 0 2    13
#> 4 1 2    11
#> 5 0 3     9
#> 6 1 3    10
#> 7 0 4     9
#> 8 1 4    16

Rather than pass the grid I produced with ref_grid above to emmeans(), I’d like to understand what is going on with the other code.

Figured I’d ping @Russ_Lenth

Honestly, all this stuff is pretty thoroughly documented.

The important things to know relative to this inquiry are:

  1. Unless specified otherwise, marginal means are computed with equal weights. The .wgt. column in the @grid slot contains what might be needed in case the user doesn’t want equal weights. See the help for emmeans with attention to the weights argument, and the section devoted to weighting. See also vignette excerpts Basics of estimated marginal means and Working with messy data.

  2. With generalized linear models, averaging is done on the link scale, then back-transformed. If you want it back-transformed before averaging, use regrid() (not to be confused with ref_grid) on the reference grid before calling emmeans(). See again the documentation for emmeans and for ref_grid and regrid, and the vignette Transformations and link functions in emmeans.

  3. By the way, there is also an index to vignette topics to make things easier to find: Index of vignette topics

I think those two points cover the confusion expressed here.

Russ

PS – the call ref_grid(mod, weights = "cells") makes no sense. weights is an argument to emmeans, not to ref_grid, so it is ignored in the ref_grid call.

Thank you for the thoughtful reply and catching the call to weights in ref_grid(). I understand what weights are being used.

I realized my misunderstanding was due to how specs and at relate to each other. While at allows the user to choose a set of values for a (sub)set of predictors over which to predict, with each cell of the joint distribution of the unspecified variables that enter the model otherwise equally weighted by default, specs is what actually determines the values for which you want to obtain the marginal means - averaging over the predictor distribution for each of the variables not passed to specs.

Accordingly, while I was wondering why the summary of the emmeans output was not showing the variable z when I specified at = list(z = unique(df$z)), my estimates were changing because this meant that I obtained an link-scale prediction for each value of unique(df$z) for each combination of the other variables in the model, those values were averaged, and then the prediction was returned on the response scale for each value of x. If I wanted to get epred values for the joint distribution of x and unique(df$z), I should have used ~ x + z or c("x","z").

I tried obtaining the reference grid with and without specifying the z levels:

> ref_grid(mod)
'emmGrid' object with variables:
    x = 0, 1
    z = 1, 2, 3, 4
Transformation: “logit” 

> ref_grid(mod, at = list(z = unique(df$z)))
'emmGrid' object with variables:
    x = 0, 1
    z = 2, 4, 1, 3
Transformation: “logit” 

That is, it makes no difference, except for the order of levels of z. I also have:

> (emm <- emmeans(mod, "x"))
NOTE: Results may be misleading due to involvement in interactions
 x emmean lower.HPD upper.HPD
 0  0.181    -0.128     0.535
 1 -0.170    -0.472     0.143

Point estimate displayed: median 
Results are given on the logit (not the response) scale. 
HPD interval probability: 0.95 

… and the same result with adding at = list(z = unique(df$z)). But what I also notice is that there is no message saying that the results were averaged over z, which is an annotation shown when we do a frequentist summary:

> emmeans(mod, "x", frequentist = TRUE)
NOTE: Results may be misleading due to involvement in interactions
 x emmean    SE  df asymp.LCL asymp.UCL
 0  0.183 0.170 Inf    -0.150     0.517
 1 -0.171 0.157 Inf    -0.478     0.137

Results are averaged over the levels of: z 
Results are given on the logit (not the response) scale. 
Confidence level used: 0.95 

This appears to be an oversight in the hpd.summary function. I will look at it and fix it, as it can create confusion if we don’t say what variables were averaged over. And the information is there:

> emm@misc$avgd.over
[1] "z"
1 Like