Missing data of main effects in model with interaction terms

Imagine that I’m considering a model with an interaction term between main effects, X1 and X2. However, X2 has missing data.

Whereas I’ve seen multiple examples of missing data in Stan models, including this brms vignette, those examples rarely include missing data that are part of interaction terms.

My intuition is that for each iterated sample, a value for the interaction term should be generated in the “transformed data” block in the Stan model, with this value constructed from the sample-specific estimates for the respective main effects.

I’d be curious to hear further thoughts on that as well as details on the syntax and model specifications of the brms package.

So there’s two ways to handle missing data:

  1. Multiple Imputation
  2. Modeling the missing data mechanism

With multiple imputation you generate multiple hypothetical datasets that have all the missing values imputed, you do your inference on each hypothetical dataset separately in Stan, then you combine the results from your posteriors. For this approach I would recommend the mice package in R, and the excellent accompanying book “Flexible Imputation of Missing Data” by Stephen Van Burren.

The second way is to treat your missing data as a random variable to be inferred by including it in your Stan parameters block. Then in the model block you can assign that random variable a distribution, or you can even have it be determined based on a regression using data that we do observed. For example in trauma data, we often have blood tests that are missing because a patient was in too bad of a shape for the clinicians to collect a blood sample. So you can treat the blood test result as a parameter to be inferred and have it be determined using a regression on the injury severity score of the patient.

So in your case it sounds like you have a regression of the form

\mathbb{E} [y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}

where x_{i2} is occasionally missing. In that case you either multiply impute x_{i2} outside of Stan in a package like mice, or you can you can include it in the parameters block and infer it. If you do the latter it’s perfectly fine to have something like the following in Stan

data {
    real y;
    real x1;
}
parameters {
    real b0;
    real b1;
    real b2;
    real b3;
    real x2;
}
model {
   x2 ~ normal(1.2, 2.0);
   y ~ normal(b0 + b1*x1 + b2*x2 + b3*x1*x2 , sigma);
}

This way every time you sample a new possible value of the missing value, x2, will be drawn and the regression is inferred that way, since the x2 is declared as a parameter. The transformed data block is only calculated once at the beginning of inference, and that same value is used throughout.

Thanks, Arya. That makes sense about the transformed data block.

In this case, though, there are some observations of x2, and I’m assuming those would need to be declared in the data block, correct?

1 Like

Go for

bf(
  Y ~ X1 * mi(X2),
  X2 ~ ...
)

Where ... contains the variables you want to use to estimate missing values in X2 on the fly.

1 Like

To confirm, Paul, that brms syntax would solve the issue of keeping the estimated missing values together with the corresponding interaction terms throughout the sampling? Thanks much.

Exactly. To see the underlying stan code, use stancode() on your fitted model object or directly use make_stancode().

1 Like

And if there were three main effects, two of which had missing data, then it could be:

bf(
  Y ~ X1 * mi(X2) + mi(X3),
  X2 ~ ...
  X3 ~ ...
)

Is that right?

And would it hold if X2 and X3 were used to estimate the missing values in those respective covariates? As in:

bf(
  Y ~ X1 * mi(X2) + mi(X3),
  X2 ~ X1 * X3
  X3 ~ X1 * X2
)

Or am I missing something by trying to extend the logic and modeling in this way?

Yes. The data that has been observed needs to be in the data block. That data wasn’t observed should be in the parameters. Then in the transformed parameters block you can combine them in to a single variable so you can carry about regression normally:

data {
    int N_missing;
    int N_observed;
    vector[N_observed] x_observed;
    vector[N_observed + N_missing] y;
}
parameters {
    real b0;
    real b1;
    vector[N_missing] x_missing;
}
transformed parameters {
   vector[N_observed + N_missing] x = append_row(x_observed, x_missing);
}
model {
   x_missing ~ normal(1.2, 2.0);
   y ~ normal(b0 + b1*x);
}
2 Likes

Sorry I misspecified the model. It should be

bf(Y ~ X1 * mi(X2)) +
  bf(X2 | mi() ~ ...)

With regard to your proposed extention, you need to specify

bform <- bf(Y ~ X1 * mi(X2) + mi(X3)) + 
  bf(X2 | mi() ~ X1 * mi(X3)) + 
  bf(X3 | mi() ~ X1 * mi(X2))

Basically all variables with missing values that appear on the right-hand side of a formula must be wrapped around mi() or otherwise rows with missing values will be dropped.

2 Likes

For the multiple imputation approach (the arbitrariness of which can be a reason to favor the full modeling approach), one paper showed that, as counterintuitive as it seems, it is better to impute the product term rather than imputing component variables then taking the product of imputed values. I reference this in my full course notes Missing Data chapter.

3 Likes

Wow that is interesting and indeed counterintuitive! I have to read in to why that is exactly. Thanks for sharing.

As a follow up question, Paul, what if one or both of the covariates with missing data is a binary (categorical) predictor?

I ask because there are methods for marginalizing over the missingness in Stan models, but I wasn’t sure if those methods are “under the hood” in BRMS.

Thanks.

Unfortunately, brms doesn’t yet support missing data estimation on the fly for categorical variables. You may still use multiple imputation and then brm_multiple, though.

@paul.buerkner I want to follow-up on this thread. What if instead of
bf(Y ~ X1X2) I want bf(Y ~ X1:X2) (aka I only want X1X2 as opposed to X1 + X2 + X1*X2), where Y and X1 are complete and X2 is incomplete. I’ve tried writing:

bf(Y ~ mi(X1:X2)) +
bf(X2 | mi() ~ …)

but it doesn’t seem to work. Does it make sense to write:

bf(Y ~ mi(X1:X2)) +
bf(X1:X2 | mi() ~ …)

Thank you!

how about mi(x1):mi(x2)?

Thank you so much for taking the time to respond.

When I do as you suggest, I get an error message that says:

Error: The parameter ‘relsatisfactcx’ is not a valid distributional or non-linear parameter. Did you forget to set ‘nl = TRUE’?

In case anyone is interested in a solution, I’ve decided it makes sense to code the interaction into the dataset so that I can use the mi() function on a single interaction variable instead of dealing with the two variables separately. Now I’m not dealing with the error message above, but I have a new one:

Compiling Stan program…
Start sampling
Error in FUN(X[[i]], …) : Stan does not support NA (in X_parel) in data
failed to preprocess the data; sampling not done

Any thoughts on a solution?? The model fits just fine without applying the mi() function (i.e., dropping rows with missing covariate data from the analysis).

Here’s my code in case it’s relevant to a solution:

Code interaction into dataset

data$FemaleRelSat ← data$Dfdata$relsatisfactc.x
data$MaleRelSat ← data$Dm
data$relsatisfactc.x

Treat Df and Dm as factors

data$Df ← as.factor(data$Df)
data$Dm ← as.factor(data$Dm)

Compile and fit model

lsm_bf ← bf(pa_rel ~ 0 + Df + Dm + Df:pa_1pc + Dm:pa_1pc +
Df:pa_partnerpc + Dm:pa_partnerpc +
Df:pa_partnerpmc + Dm:pa_partnerpmc +
FemaleRelSat + MaleRelSat +
(1 |C| ID_INDIV),
sigma ~ 0 + Df + Dm + Df:na_partnerpc + Dm:na_partnerpc +
Df:na_partnerpmc + Dm:na_partnerpmc +
FemaleRelSat + MaleRelSat +
(1 |C| ID_INDIV)) +
bf(FemaleRelSat | mi() ~ 1) +
bf(MaleRelSat | mi() ~ 1) +
set_rescor(FALSE)

lsm_brm ← brm(lsm_bf, data = data, warmup = 1000, iter = 5000, init = 0, seed = 1)

there is a syntax error. you need multiple bf() calls, one for each univariate response, and then add them via a +

1 Like
lsm_bf <- bf(pa_rel ~ 0 + Df + Dm + Df:pa_1pc + Dm:pa_1pc + Df:pa_partnerpc + Dm:pa_partnerpc + Df:pa_partnerpmc + Dm:pa_partnerpmc + mi(FemaleRelSat) + mi(MaleRelSat) +
(1 |C| ID_INDIV), sigma ~ 0 + Df + Dm + Df:na_partnerpc + Dm:na_partnerpc + Df:na_partnerpmc + Dm:na_partnerpmc + mi(FemaleRelSat) + mi(MaleRelSat) + (1 |C| ID_INDIV))
Fbf <- bf(FemaleRelSat | mi() ~ 1)
Mbf <- bf(MaleRelSat | mi() ~ 1)

lsm_brm ← brm(lsm_bf + Fbf + Mbf + set_rescor(FALSE), data = data, warmup = 1000, iter = 5000, init = 0, seed = 1)

In addition to needing multiple bf() calls, you also forgot to specify your missing terms in the lsm_bf model. I think the above is closer to what you want.

1 Like