Transformation of mi() in the main outcome model

Hi there,

For my thesis I have to perform multiple imputation during model fitting, and compare the results to a model without multiple imputation during model fitting. Unfortunately, I cannot manage to linearly transform the mi() variable in the main outcome model. In short, I want to do the following:

  1. Estimate the missing Troponin_Second_Measure values in a ‘sub-model’
  2. Calculate (not estimate) the Troponin_Difference based on the Troponin_First_Measure (observed, this variable does not contain missing values) and Troponin_Second_Measure (observed + imputed) values.
  3. Estimate the main outcome model component

Since multiple imputation during model fitting is a ‘one-step approach’, all above mentioned steps have to be performed at once. Currently, the two models (with and without multiple imputation during model fit) in R look as follows:

model1 <-
  quiet(
    brm(
      data = df,
      formula = Cardiac_Outcome ~ log(Troponin_First_Measure) + I(Troponin_Second_Measure-Troponin_First_Measure) + Physiological_Platec +  Angiogram  + Gender + Physiological_Creatine + (Troponin_First_Measure|ECG_Ischaemia),
      prior =  c(set_prior("student_t(3,0,5)"),
                set_prior("lkj(2)", class = "cor")),
      family = categorical(refcat = "Normal"),
      iter = 500, chains = 4, cores = 4,
      control = list(adapt_delta = 0.96, max_treedepth = 12)))

&

model2 <-
 quiet(
brm(
  data = df,
  formula = bf(Cardiac_Outcome ~ log(Troponin_First_Measure) + I(mi(Troponin_Second_Measure) - Troponin_First_Measure) + Physiological_Platec +  Angiogram  + Gender + Physiological_Creatine +  (Troponin_First_Measure|ECG_Ischaemia)) + 
             bf(Troponin_Second_Measure| mi() ~ Troponin_First_Measure + Physiological_Creatine +  Physiological_Platec + Gender) + set_rescor(FALSE),
  prior = c(set_prior("student_t(3,0,5)"),
            set_prior("lkj(2)", class = "cor")),
  family = list(categorical(refcat = "Normal"),
                gaussian()),
  iter = 500, chains = 4, cores = 4,
  control = list(adapt_delta = 0.96, max_treedepth = 12)))

Unfortunately, I get the following error when running the second model:

Error in stanc(file = file, model_code = model_code, model_name = model_name, : failed to parse Stan model ‘2f963b3c6056c1414f07e7c5ec2b7f85’ due to the above error

The subtraction

I(mi(Troponin_Second_Measure)-Troponin_First_Measure)

in the second model is not possible whereas the subtraction

I(Troponin_Second_Measure)-Troponin_First_Measure)

in the first model is possible. Hence, it is most likely that the error is caused by trying to perform a linear transformation on the mi() variable. Here it says that the formula doesn’t currently allow imputed variables to have non-linear effects in the main outcome model. However, is there any way of performing the subtraction mentioned above in either R or STAN? In other words, are linear transformations of the imputed variables possible?

Note: Calculating Troponin_Difference before fitting the model and imputing this variable leads to suboptimal results since fitting Troponin_Second_Measure in the ‘sub-model’ is more suitable because of the nature of the data. Namely, Troponin_Difference is harder to model based on domain knowledge/literature and it contains many zeroes.

Note 2: The I() function allows arithmetic operators in a formula in R.

Note 3: Unfortunately, I cannot share the data since it is confidential patient data. However, I have added the STAN code from a model where Troponin_Difference is calculated before fitting the model and then imputed (this model runs without any errors). See the R code and STAN code (see uploaded file) below. My guess is that in the model{} block something has to be adjusted in the lines under

// vector combining observed and missing responses

(see text highlighted in red). However, I’m not familiar with writing STAN code and I have no idea on how to proceed.

Thanks in advance!
Marty

model3 <-
quiet(
brm(
  data = df,
  formula = bf(Cardiac_Outcome ~ log(Troponin_First_Measure) + mi(Troponin_Difference) + Physiological_Platec +  Angiogram  + Gender + Physiological_Creatine +  (Troponin_First_Measure|ECG_Ischaemia)) + 
             bf(Troponin_Difference| mi() ~ Troponin_First_Measure + Physiological_Creatine +  Physiological_Platec + Gender) + set_rescor(FALSE),
  prior = c(set_prior("student_t(3,0,5)"),
            set_prior("lkj(2)", class = "cor")),
  family = list(categorical(refcat = "Normal"),
                gaussian()),
  iter = 500, chains = 4, cores = 4,
  control = list(adapt_delta = 0.96, max_treedepth = 12)))

model3.stan.txt (17.6 KB)

1 Like

Hi, sorry for not getting to you earlier. It seems this is not currently supported and I don’t think this could be done easily - when using I(a - b) (with non-missing data), the subtraction is done when preprocessing the data, but that’s not an option for missing data. In fact, just taking a function (e.g. having I(mi(Troponin_Second_Measure) ^ 2)) works.

If you really need this, then you might be able to edit the raw Stan code generated by brms to make this happen (feel free to ask for direction on how to achieve this).

For posterity, I’ll also share an example to show this problem on example data:

So, using a function (works):

library(brms)
options(brms.backend = "cmdstanr")
data("nhanes", package = "mice")
bform <- bf(bmi | mi() ~ I(mi(chl)^2)) +
 bf(chl | mi() ~ age) + set_rescor(FALSE)
fit_imp2 <- brm(bform, data = nhanes)

Using data (does not work):

bform <- bf(bmi | mi() ~ I(mi(chl) - age)) +
 bf(chl | mi() ~ age) + set_rescor(FALSE)

fit_error <- brm(bform, data = nhanes)

Best of luck with your analysis!

Two things I forgot to note:

  1. If you instead model your data with + mi(Troponin_Second_Measure) + Troponin_First_Measure , you get the desired model as a special case where b_Troponin_Second_Measure = -b_Troponin_First_Measure. So if you can tolerate the bit of additional flexibility and the two measures are not too strongly correlated, this could actually work quite nice.
  2. You have adjusted adapt_delta and max_treedepth, presumably to address some divergences. It is quite likely those point at some problem with the model that if resolved could help you learn more from your data. We’d be happy to help you in diagnosing this, but we’d need to see some summaries from the model (e.g. pairs plots for suitable subsets of the parameters) and know more about the data.