Estimation of covariance over variables in a regression model

Guido_Biele · April 1, 2019, 3:44pm

In Full information maximum likelihood estimation (FIML) of a regression with missing data one jointly estimates

the regression model parameters and
the covariance between variables involved in the regression plus some auxiliary variables.

Can this be done in brms?

Example:
I want to estimate the effect of E on O.
There is missingness in E, which depends on M. M is also a mediator between E and O.

To obtain an unbiased estimates for the effect of E I would like to estimate jointly the regression model O ~ E, while also imputing the missing data in O and estimating the covariances between O, E, and M.

Thanks in advance, Guido

PS: I can do this directly in Stan, but I am looking for a solution for people who don’t want to write a Stan program.

martinmodrak · April 4, 2019, 11:03am

I don’t think you can do exactly what you are speaking about, but the vignette on missing values shows approaches that work with brms: 1) using linear predictor for missing data and 2) multiple imputation. For what it’s worth a linear predictor for missing values is probably very close to estimating covariance, maybe even identical if only one predictor contains missing data, but I didn’t check the math.

Guido_Biele · April 26, 2019, 1:47pm

Thanks for the suggestion Martin. I am familiar with the vignette and was looking to do something different (because the methods shown there won’t work for my problem).

I finally had time to look into this and explore how to use the stanvars function to do what I want to do.

Before presenting my (half working) solution, I’ll recap the problem:

I want to estimate the total effect of e on y, but y has missing data.
The full data generating process is (omitting error variances):
m \sim N(e,\sigma_m) \\ y \sim N(e\cdot l + m,\sigma_y) where l\cdot e stands for the interaction of variables l and e. Further, the probability of missing data is
P_{miss_y} \sim bern(inv.logit(m + l))

Inspired by the full information maximum likelihood (FIML) approach to dealing with missing data, one way to deal with missing data is to simultaneously

imputed the missing values and
estimate y \sim e
estimate the co-variances of y,m,e,l

Such a model can be implemented in brms as follows (with all variables in the data.frame my_data):

(a) Specify code to be added to the parameters and model blocks of the basic brms model.

parameters_code = "
corr_matrix[4] Omega;
vector<lower=0>[4] tau;
vector[4] mus;
"

Variables are now in capital letters. Yl is the brms-generated variable with imputed missing values.

model_code = "
mus ~ normal(0,2);
tau ~ normal(0,2);
Omega ~ lkj_corr(2.0);

for (i in 1:N) {
  target += multi_normal_lpdf(to_vector({M[i],iEL[i],Yl[i]}) | 
                              mus,
                              quad_form_diag(Omega,tau));
}
"

put model code to be added and auxiliary variables in a stan_vars object:


stanvars = 
  stanvar(x = as.vector(my_data$E),
          name = "E") +
  stanvar(x = as.vector(my_data$M),
          name = "M") +
  stanvar(x = as.vector(my_data$E*my_data$L),
          name = "iEL") +
  stanvar(block = "parameters",
          scode = parameters_code) +
  stanvar(block = "model",
          scode = model_code)

(Try to) run the modified brms model-

sf <- brm(bf(Y | mi() ~ E),
          data = mar_interaction_data,
          stanvars = stanvars)

I write “try to run” because there remain 2 problems:

brms puts the additional model code between the variable declarations and the rest of the original brms code, i.e. before missing values are put into the vector Yl. I can work around this by manually modifiying the brms code, but maybe @paul.buerkner has an idea that is more elegant than what is described here?
Even if there is an elegant way to solve this coding problem, I found that the estimation of the co-variance and the effect model interfere with each other. Only when I multiply the log posterior for the covariance estimation with a large number (e.g. 100) am I able to recover the correct parameter values. I’ll describe this problem in a different thread.

paul.buerkner · May 4, 2019, 12:19pm

Sorry for the slow response. I get way too many brms questions these days.

This is model is so far out of the scope of brms (at least in the currently specified way; I cannot judge if we can get it closer to what brms already offers), that I am not aware of any elegant solution to keep everything in the brms framework.

Guido_Biele · May 4, 2019, 12:33pm

Edit: Corrected to sentence about imputation of outcome values.
No worries.

The key problem here is not so much what brms can do out of the box (I do get it into brms, it just involves a few additional lines of code). The key problem is that if one estimates (1) regression coefficients and (2) covariance of outcomes with predictors and an auxiliary variables, these two things interfere with each other (see details here). Maybe you have an idea why this is happening?

About the scope of brms: I think it would be great if brms ~~had some facility to deal with missing outcome variables~~ would allow to condition imputed outcome values on auxiliary variables, i.e. variables that are not part of the regression model. The reason is that if one does a complete case analysis when only predictor values are missing, the only effect will be that the variance of the regression weights will be overestimates. That’s not nice, but it’s not so bad either. However, if outcome values are missing systematically, the situation is much worse because the expectation of the regression weights will be biased.

paul.buerkner · May 4, 2019, 12:38pm

(1) To make both identifiable you need to clearly differenitate regression coefficients from residual correlations by including other predictors in the models of (perhaps both) response variables. If we just did bf(y ~ x) + b(x ~ 1) + set_rescor(TRUE) the model would essenitally not be identified. In other words, the residual correlation must measure something else than the regression coefficient. Not sure if this broad statement is of any help to you.

(2) brms can do that already by adding y | mi() ~ ... to your model of the primary outcome. I am sure you know that so I am confused what is the difference between this and the approach you have in mind?

Guido_Biele · May 4, 2019, 12:46pm

Thanks, for the quick feedback. Even If the statement (1) is broad, it gives me food for though :-)!

You are of course right considering (2).[I’ll edit that in a minute so as to not confuse other readers] .The thing is that a lot can be gained by using auxiliary information when imputing outcome values, which as far as I can tell is currently not possible.

paul.buerkner · May 4, 2019, 1:38pm

After your edits, I now understand what you mean, that is separating the imputing from the prediction model, right? Feel free to open an issue about this on github. I am not sure though, if this is something I will actually implement in the end.

Guido_Biele · May 4, 2019, 2:47pm

Yes, separating the imputation from the prediction model captures it well.
I’ll hold off with opening an issue until I have figured out how to do this properly.

Also, thanks again for your great work with brms! I really appreciate the work you are doing in providing the software and the support!!

Topic		Replies	Views
Estimating missing response data in multivariate model brms	4	814	November 24, 2019
"Interference" when simultaneously estimating regression model and co-variance between variables Modeling fitting-issues	2	570	April 26, 2019
Missing data in multiple correlated predictors brms	2	407	May 6, 2020
Missing data with brms mi() brms missing-data	2	829	June 5, 2020
Data imputation in multilevel meta analysis brms brms meta-analysis , missing-data	4	927	June 25, 2020

Estimation of covariance over variables in a regression model

Related topics