Missing data of main effects in model with interaction terms


Imagine that I’m considering a model with an interaction term between main effects, X1 and X2. However, X2 has missing data.

Whereas I’ve seen multiple examples of missing data in Stan models, including this brms vignette, those examples rarely include missing data that are part of interaction terms.

My intuition is that for each iterated sample, a value for the interaction term should be generated in the “transformed data” block in the Stan model, with this value constructed from the sample-specific estimates for the respective main effects.

I’d be curious to hear further thoughts on that as well as details on the syntax and model specifications of the brms package.


So there’s two ways to handle missing data:

  1. Multiple Imputation
  2. Modeling the missing data mechanism

With multiple imputation you generate multiple hypothetical datasets that have all the missing values imputed, you do your inference on each hypothetical dataset separately in Stan, then you combine the results from your posteriors. For this approach I would recommend the mice package in R, and the excellent accompanying book “Flexible Imputation of Missing Data” by Stephen Van Burren.

The second way is to treat your missing data as a random variable to be inferred by including it in your Stan parameters block. Then in the model block you can assign that random variable a distribution, or you can even have it be determined based on a regression using data that we do observed. For example in trauma data, we often have blood tests that are missing because a patient was in too bad of a shape for the clinicians to collect a blood sample. So you can treat the blood test result as a parameter to be inferred and have it be determined using a regression on the injury severity score of the patient.

So in your case it sounds like you have a regression of the form

\mathbb{E} [y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}

where x_{i2} is occasionally missing. In that case you either multiply impute x_{i2} outside of Stan in a package like mice, or you can you can include it in the parameters block and infer it. If you do the latter it’s perfectly fine to have something like the following in Stan

data {
    real y;
    real x1;
parameters {
    real b0;
    real b1;
    real b2;
    real b3;
    real x2;
model {
   x2 ~ normal(1.2, 2.0);
   y ~ normal(b0 + b1*x1 + b2*x2 + b3*x1*x2 , sigma);

This way every time you sample a new possible value of the missing value, x2, will be drawn and the regression is inferred that way, since the x2 is declared as a parameter. The transformed data block is only calculated once at the beginning of inference, and that same value is used throughout.


Thanks, Arya. That makes sense about the transformed data block.

In this case, though, there are some observations of x2, and I’m assuming those would need to be declared in the data block, correct?


Go for

  Y ~ X1 * mi(X2),
  X2 ~ ...

Where ... contains the variables you want to use to estimate missing values in X2 on the fly.


To confirm, Paul, that brms syntax would solve the issue of keeping the estimated missing values together with the corresponding interaction terms throughout the sampling? Thanks much.


Exactly. To see the underlying stan code, use stancode() on your fitted model object or directly use make_stancode().


And if there were three main effects, two of which had missing data, then it could be:

  Y ~ X1 * mi(X2) + mi(X3),
  X2 ~ ...
  X3 ~ ...

Is that right?

And would it hold if X2 and X3 were used to estimate the missing values in those respective covariates? As in:

  Y ~ X1 * mi(X2) + mi(X3),
  X2 ~ X1 * X3
  X3 ~ X1 * X2

Or am I missing something by trying to extend the logic and modeling in this way?


Yes. The data that has been observed needs to be in the data block. That data wasn’t observed should be in the parameters. Then in the transformed parameters block you can combine them in to a single variable so you can carry about regression normally:

data {
    int N_missing;
    int N_observed;
    vector[N_observed] x_observed;
    vector[N_observed + N_missing] y;
parameters {
    real b0;
    real b1;
    vector[N_missing] x_missing;
transformed parameters {
   vector[N_observed + N_missing] x = append_row(x_observed, x_missing);
model {
   x_missing ~ normal(1.2, 2.0);
   y ~ normal(b0 + b1*x);


Sorry I misspecified the model. It should be

bf(Y ~ X1 * mi(X2)) +
  bf(X2 | mi() ~ ...)

With regard to your proposed extention, you need to specify

bform <- bf(Y ~ X1 * mi(X2) + mi(X3)) + 
  bf(X2 | mi() ~ X1 * mi(X3)) + 
  bf(X3 | mi() ~ X1 * mi(X2))

Basically all variables with missing values that appear on the right-hand side of a formula must be wrapped around mi() or otherwise rows with missing values will be dropped.


For the multiple imputation approach (the arbitrariness of which can be a reason to favor the full modeling approach), one paper showed that, as counterintuitive as it seems, it is better to impute the product term rather than imputing component variables then taking the product of imputed values. I reference this in my full course notes Missing Data chapter.


Wow that is interesting and indeed counterintuitive! I have to read in to why that is exactly. Thanks for sharing.


As a follow up question, Paul, what if one or both of the covariates with missing data is a binary (categorical) predictor?

I ask because there are methods for marginalizing over the missingness in Stan models, but I wasn’t sure if those methods are “under the hood” in BRMS.



Unfortunately, brms doesn’t yet support missing data estimation on the fly for categorical variables. You may still use multiple imputation and then brm_multiple, though.