Imagine that I’m considering a model with an interaction term between main effects, X1 and X2. However, X2 has missing data.
Whereas I’ve seen multiple examples of missing data in Stan models, including this brms vignette, those examples rarely include missing data that are part of interaction terms.
My intuition is that for each iterated sample, a value for the interaction term should be generated in the “transformed data” block in the Stan model, with this value constructed from the sample-specific estimates for the respective main effects.
I’d be curious to hear further thoughts on that as well as details on the syntax and model specifications of the brms package.
With multiple imputation you generate multiple hypothetical datasets that have all the missing values imputed, you do your inference on each hypothetical dataset separately in Stan, then you combine the results from your posteriors. For this approach I would recommend the mice package in R, and the excellent accompanying book “Flexible Imputation of Missing Data” by Stephen Van Burren.
The second way is to treat your missing data as a random variable to be inferred by including it in your Stan parameters block. Then in the model block you can assign that random variable a distribution, or you can even have it be determined based on a regression using data that we do observed. For example in trauma data, we often have blood tests that are missing because a patient was in too bad of a shape for the clinicians to collect a blood sample. So you can treat the blood test result as a parameter to be inferred and have it be determined using a regression on the injury severity score of the patient.
So in your case it sounds like you have a regression of the form
where x_{i2} is occasionally missing. In that case you either multiply impute x_{i2} outside of Stan in a package like mice, or you can you can include it in the parameters block and infer it. If you do the latter it’s perfectly fine to have something like the following in Stan
data {
real y;
real x1;
}
parameters {
real b0;
real b1;
real b2;
real b3;
real x2;
}
model {
x2 ~ normal(1.2, 2.0);
y ~ normal(b0 + b1*x1 + b2*x2 + b3*x1*x2 , sigma);
}
This way every time you sample a new possible value of the missing value, x2, will be drawn and the regression is inferred that way, since the x2 is declared as a parameter. The transformed data block is only calculated once at the beginning of inference, and that same value is used throughout.
To confirm, Paul, that brms syntax would solve the issue of keeping the estimated missing values together with the corresponding interaction terms throughout the sampling? Thanks much.
Yes. The data that has been observed needs to be in the data block. That data wasn’t observed should be in the parameters. Then in the transformed parameters block you can combine them in to a single variable so you can carry about regression normally:
data {
int N_missing;
int N_observed;
vector[N_observed] x_observed;
vector[N_observed + N_missing] y;
}
parameters {
real b0;
real b1;
vector[N_missing] x_missing;
}
transformed parameters {
vector[N_observed + N_missing] x = append_row(x_observed, x_missing);
}
model {
x_missing ~ normal(1.2, 2.0);
y ~ normal(b0 + b1*x);
}
Basically all variables with missing values that appear on the right-hand side of a formula must be wrapped around mi() or otherwise rows with missing values will be dropped.
For the multiple imputation approach (the arbitrariness of which can be a reason to favor the full modeling approach), one paper showed that, as counterintuitive as it seems, it is better to impute the product term rather than imputing component variables then taking the product of imputed values. I reference this in my full course notes Missing Data chapter.
As a follow up question, Paul, what if one or both of the covariates with missing data is a binary (categorical) predictor?
I ask because there are methods for marginalizing over the missingness in Stan models, but I wasn’t sure if those methods are “under the hood” in BRMS.
Unfortunately, brms doesn’t yet support missing data estimation on the fly for categorical variables. You may still use multiple imputation and then brm_multiple, though.
@paul.buerkner I want to follow-up on this thread. What if instead of
bf(Y ~ X1X2) I want bf(Y ~ X1:X2) (aka I only want X1X2 as opposed to X1 + X2 + X1*X2), where Y and X1 are complete and X2 is incomplete. I’ve tried writing:
bf(Y ~ mi(X1:X2)) +
bf(X2 | mi() ~ …)
but it doesn’t seem to work. Does it make sense to write:
When I do as you suggest, I get an error message that says:
Error: The parameter ‘relsatisfactcx’ is not a valid distributional or non-linear parameter. Did you forget to set ‘nl = TRUE’?
In case anyone is interested in a solution, I’ve decided it makes sense to code the interaction into the dataset so that I can use the mi() function on a single interaction variable instead of dealing with the two variables separately. Now I’m not dealing with the error message above, but I have a new one:
Compiling Stan program…
Start sampling
Error in FUN(X[[i]], …) : Stan does not support NA (in X_parel) in data
failed to preprocess the data; sampling not done
Any thoughts on a solution?? The model fits just fine without applying the mi() function (i.e., dropping rows with missing covariate data from the analysis).
Here’s my code in case it’s relevant to a solution:
In addition to needing multiple bf() calls, you also forgot to specify your missing terms in the lsm_bf model. I think the above is closer to what you want.