Imputing missing responses in multivariate ordinal data

Hi Paul and others,

I was wondering if there is a way to do multiple imputation in a fully Bayesian way using brms in the case of discrete (ordinal) variables? Maybe a latent variable approach and some other “tricks” would do the job. I’m completely new to brms and would be thankful if someone could show an example or give some useful hints. Imputing missing values in the response variable Y (ordinal) is of main interest in my case. There are some missing discrete X’s as well, but these are not the main issue.

Thanks for any input!

2 Likes

I think there is rough agreement in the original thread that when you care about missing responses, you can’t do much better than just inferring the missing values from the model. If there is a reason you believe this is not appropriate for your case, could you explain why you think so?

Quoting the relevant excerpts:

Best of luck with your model!

3 Likes

Thank you, Martin, for replying. In my case I’m dealing with a multivariate Y. All y’s are discrete (ordinal or categorical). Some y’s have a significant proportion of missing values. The y’s are correlated. I need to utilize all non-missing data (as well as the correlations between y’s), so simply excluding rows where some y’s are missing is not an option in my case.
For the same reason (Y is multivariate), inferring missing values from “the model” is also not an option. I need a way to deal with missing values before (or in the process of) building “the model”.

I fully agree that there would be no point in doing MI if Y were not multivariate or the y’s were not correlated. Sorry for not being specific in my first post.

I also have some discrete (categorical and ordinal) predictors - x’s with missing values. Having a way to do MI of discrete variables with brms would be useful in any case.

I was planning to use MICE (or some alternative) for “imputation before model fitting” and to use brms afterwards. However, I would prefer “imputation during model fitting”, as both options seem to be supported by brms (Handle Missing Values with brms). Unfortunately, MI of discrete variables is not supported. So I’m looking for an elegant and simple way around this. If there is none, I will stick to “imputation before model fitting” and use the other great brms functionality (built on top of Stan) afterwards.
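If I end up going that route, I suppose the workflow would look roughly like this (just a sketch - my_data, Y, X1 and X2 are placeholders):

library(mice)
library(brms)

# Impute with mice, which handles categorical/ordinal variables,
# then fit the same brms model to each completed dataset and pool the draws
imp <- mice(my_data, m = 5)
fit_imp <- brm_multiple(Y ~ X1 + X2, data = imp, family = cumulative("logit"))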

A simple, straightforward way is to use the weighted average approach explained by @richard_mcelreath in the second edition of his book Statistical Rethinking (chapter 15).

1 Like

Thank you for the pointer. I have already looked there but still do not have a clue how to define the procedure in brms. I am new to brms and not familiar with rethinking or Stan. I was wondering if someone had a working example (or code snippet) implementing something like this in brms. I have also looked at 15 Missing Data and Other Opportunities | Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition, but with no success.

Anyway, thanks a lot again!

1 Like

So one way to do this with multivariate data in brms is to convert it to “long format” (each dimension-value pair gets its own row). Then one can build a “sort-of” multivariate model by having a varying intercept for each row of the original data. The model is not fully identified by default, but one can fix the sd of the varying intercept to make it identifiable.

So assuming simulated data like:

library(dplyr)
library(tidyr)

# Draw N ordinal responses from a cumulative logit model: add logistic
# noise to the mean, then count how many thresholds the latent value exceeds
rcumulative_logit <- function(N, mean, thresholds) {
  raw <- rlogis(N) + mean
  res <- rep(1, N)
  for(t in thresholds) {
    res <- res + (t < raw)
  }
  res
}


set.seed(58224522)
N_patients <- 7
simple_test <- data.frame(patient = factor(1:N_patients)) %>%
  crossing(time_from_diagnosis = c(0, 3, 5)) %>%
  mutate(q1 = rcumulative_logit(n(), rnorm(N_patients)[as.integer(patient)] + time_from_diagnosis * 0.5, thresholds = c(-1,-0.3,0,1)),
         q2 = rcumulative_logit(n(), rnorm(N_patients, sd = 0.5)[as.integer(patient)] + time_from_diagnosis * 0.1, thresholds = c(-0.4,-0.1,0.5,1.5)),
         q3 = rcumulative_logit(n(), rnorm(N_patients, sd = 2)[as.integer(patient)] + time_from_diagnosis * 0.3, thresholds = c(-2,-0.8,-0.2,1.1)),
         obs_id = 1:n()
  )

simple_test_long <- simple_test %>% 
  pivot_longer(starts_with("q"), names_to = "question", values_to = "answer")

obs_id now binds together all the responses that were previously in the same row.

The idea is that we add a varying intercept that corresponds to the correlated residuals, i.e. we add one parameter per observed answer as a varying intercept of the form (0 + question | obs_id) where obs_id uniquely identifies each row in the wide dataset. To keep the model identified, we fix the sd of this varying intercept.

We then increase the disc parameter of the cumulative("logit") distribution (which I currently do by adding disc ~ 1 to the formula and setting a constant prior on this intercept). This is almost the same as using an actual multivariate distribution; the difference is that the observation noise now has two components: one independent, from the logistic distribution of ordered_logistic (with roughly sd = 1/disc), and one correlated, from the varying intercept (where we set sd = 1). So as disc -> inf we get a closer and closer approximation to the multivariate probit, but also an increased risk of computational issues. disc = 10 seems to work quite OK.

Note: it does not really matter which link we put in the cumulative response (as we are trying to minimize its effect), but the logit link is much more numerically stable than the probit link, so we use it. Still, this is effectively a probit model, because the main noise component is normally distributed.

fit_thres_rescor_approx <- brm(
  bf(
    # question-specific thresholds + correlated "residuals" via obs_id
    answer | thres(gr = question) ~ time_from_diagnosis + (1 | patient) +
      (0 + question | obs_id),
    disc ~ 1,
    family = cumulative(link = "logit", link_disc = "identity")
  ),
  prior = c(
    # fix the sd of the per-observation varying intercept for identifiability
    prior(constant(1), class = "sd", group = "obs_id"),
    prior(student_t(3, 0, 5*2.5), class = "Intercept"),
    # large constant disc -> closer to a pure multivariate probit
    prior(constant(10), class = "Intercept", dpar = "disc")
  ),
  data = simple_test_long
)


I know this is not a very good approximation if you have many questions (roughly more than 5, but I didn’t investigate thoroughly); still, it models some correlation even in the higher-dimensional case. The model will also have trouble initializing properly for large data - reducing the constant(10) makes it better behaved but also further from a pure multivariate probit. But it is multivariate, ordinal, correlated, works with brms out of the box, and I verified it is a faithful approximation to the multivariate probit for 3 questions.

For fitting, you simply drop the rows with missing question-answer pairs from the long dataset. To impute them, you predict for the whole dataset.
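In code, this is roughly (a sketch reusing the objects above - nothing is actually missing in the simulated data, so treat the NAs as placeholders for your real data):

# Fit only on the rows where the answer was observed
observed_long <- simple_test_long %>% filter(!is.na(answer))
# ... fit fit_thres_rescor_approx with data = observed_long ...

# Impute: posterior draws of the answer for every row, including missing ones
pred <- posterior_predict(fit_thres_rescor_approx, newdata = simple_test_long)
missing_rows <- which(is.na(simple_test_long$answer))
imputed_draws <- pred[, missing_rows]  # one imputed category per posterior draw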

This code was developed and tested as part of an ongoing project with @jroon, so if you end up using it, some form of acknowledgement would be great :-) (I feel a bit ashamed asking for this, but my employer has recently been a bit more inquisitive about my publication output).

You can hack brms to work with latent variables - I’m actually working on this in the project, but I don’t need to handle missing values, which would complicate things a bit more. However, it gets quite complex quite fast and requires you to be comfortable with both brms internals and coding in Stan, so if the above version works well for you, I would stick with it.

I also took the liberty of moving this to a new thread, as I think it is getting far from the original question.

4 Likes

Hi Martin,

Thank you very much for sharing. This is amazing and shows how to introduce correlated residuals for GLMs in brms. Really a nice “trick”. To make sure I understand how (and whether) it works, I have done some simulations similar to your example but closer to my problem. It seems to work quite well. I will definitely use these ideas at some stage of my project.

Unfortunately, it does not solve all the issues I have. For simplicity let’s say there are 3 Y’s, Y = (Y1, Y2, Y3). The Y’s are ordinal. To make things even simpler, let’s assume each Y[k] has only 2 levels (is an ordered 2-level factor). The levels are highly unbalanced, e.g. the ratio #[Y3 == L1] : #[Y3 == L2] is about 9 : 1.

Two of them, say Y1 and Y2, are fully observed, and the third one - Y3 - is not observed for 90-95% of cases. My task is to construct a (regression) model P(Y3|X); for simplicity let’s say it would be linear in X, i.e. Y3 ~ XB. My estimator (not sure that’s the right name for this in the Bayesian world) - the procedure to estimate the parameters of this regression model - should use all the information at hand.

If I were to follow your suggestion, I would use neither X[is.na(Y3)] nor Y1[is.na(Y3)] nor Y2[is.na(Y3)]. I would stick to using only (X[!is.na(Y3)], Y2[!is.na(Y3)]), which means using only 5% of the data and throwing away the rest.

To utilize all the data, I could use your suggested procedure as the first step “of a loop”. In the second step I could impute the missing Y3’s, run another brms model, and then do another iteration until some convergence criterion is met. I would also need to adjust the prior for step i+1, maybe by some Gaussian approximation of the posterior from step i. However, I think this would be like trying to implement the mi() brms functionality for ordinal (categorical) variables in an extremely inefficient way. It would also not be feasible from a computation-time perspective, as my sample size is on the order of 10^5.

To sum up, I would like to utilize X[is.na(Y3)], Y1[is.na(Y3)] and Y2[is.na(Y3)]. This could be thought of as some kind of semi-supervised learning task. Not a classical semi-supervised task, however, as there are Y1 and Y2 to be used for additional “supervision”, not only X[is.na(Y3)] for “regularization”. This additional supervision and regularization are necessary not only because I have an extremely large proportion of missing Y3’s but also because the levels are unbalanced (#L1 >> #L2) when Y3 is observed.

I have reasons to believe that the MAR (missing at random) assumption is satisfied for Y3 conditional on X, Y1 and Y2. So my plan was to do multiple imputation and let Bayes do the rest. I just wanted to do everything in a “fully Bayesian way” using brms.

Would appreciate any additional thoughts regarding multiple imputation of categorical (ordinal) variables in brms or any semi-supervised learning examples/hints for implementation using brms.

As for latent variables, it would be nice if you could share anything you can. It could be very useful - if not for handling missing data, then for SEM. I was going to try incorporating “structural constraints” of this sort using brms. Scott Claessens | Latent Variable Modelling in brms is the only example I came across.

To make things even more complicated, I can say that I have discrete measurement errors in Y1 and Y2, and even in Y3[!is.na(Y3)]. Would appreciate any thoughts on how to deal with this using brms, if you have any. me(se) and mi(se) can be used for continuous variables only.

Thank you for your effort to help!

That’s not completely true - you totally can include Y1[is.na(Y3)] and Y2[is.na(Y3)] in the model (as it is in long form, you can filter NAs after converting to long format). However, those would influence the model coefficients for Y3 only very indirectly, by informing parts of the correlation structure of the (0 + question | obs_id) term (which is not surprising - unless you have some additional assumptions on how those observations should influence your understanding of Y3, this is the best we can do).

In a broader context, I see two ways to interpret your goals:

  1. You care about recovering the unobserved Y3 for the given dataset, because that’s a primary quantity of interest.
  2. You only care about the unobserved Y3 to the extent that it lets you better inform the model and make inferences about future Y3.

If it’s 2, then I don’t think you can do any better than just fitting the model as described (but with all the data) and using the fit for predictions. If you fit the model, then use its results to predict new Y3, and then use those predictions to fit a new model, you are conditioning twice on the same data and will likely end up with meaningless results. Note that if it were possible to somehow impute the data AND use it for inferring your model, you could IMHO, for example, improve our understanding of cancer by adding data from a high blood pressure study to a cancer study, as long as the cancer registry contains blood pressure measurements.

Note that when imputing predictors, the situation is different, as the model for the missing values is separate from the model of interest and would usually not use response information at all.

If you need 1, then there is a further step I missed in my first description, sorry for the confusion - you will use Y1[is.na(Y3)] and Y2[is.na(Y3)] in predicting Y3: you have an estimate of the correlation, so for each missing Y3 you can draw Y3 | X, Y1, Y2, fit. If this were a multivariate Gaussian, you could easily work out the conditional density of Y3. Things get trickier for a multivariate probit, but I think simple rejection sampling could be good enough in 3 dimensions with few categories - you generate a prediction for Y1, Y2, Y3 | X, fit; if Y1 and Y2 match the observed values, you keep it. If not, you generate a new prediction.
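A minimal sketch of that rejection step (one_obs_long, y1_obs and y2_obs are made-up names; one_obs_long holds the three long-format rows of a single obs_id whose Y3 is missing):

# Joint posterior predictions for (Y1, Y2, Y3) of this observation
pred <- posterior_predict(fit_thres_rescor_approx, newdata = one_obs_long)
# Keep only the draws consistent with the observed Y1 and Y2
ok <- pred[, 1] == y1_obs & pred[, 2] == y2_obs
y3_draws <- pred[ok, 3]  # approximate draws from Y3 | X, Y1, Y2, fit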

Does that make sense?

1 Like

Thank you for the swift response.

Answering your question regarding my goals: both goals you described are of interest at different stages of my project. The first goal is to build a good model M1 = P(Y3|X) for future predictions of Y3 for existing (in-sample) and new (out-of-sample) subjects and future X’s. The second goal is inferring the missing Y3’s and using them for further statistical analysis. This second task would be done on a different sample - not the one used to build model M1.

Yes, I understand that Y1 and Y2 would influence the P(Y3|X) model through the residual correlation in the long-format specification. However, this residual correlation (if significant or even present) is not enough and will not stabilize or regularize the estimates of the regression coefficients of P(Y3|X) sufficiently unless I enforce it explicitly in another way. This becomes possible by using the long-format “trick” you suggested together with the nonlinear brms specification to explicitly introduce hand-engineered parameter restrictions (hard parameter sharing or soft parameter pooling) between the P(Y1|X), P(Y2|X) and P(Y3|X) models. However, I have my reasons why this is not the way to go in my case.

Anyway, I understand that you don’t have any suggestions for how MI could be implemented in brms to handle categorical (ordinal) variables in a similar (or different) way to what was described by @richard_mcelreath in Statistical Rethinking (chapter 15), as pointed out by @torkar. Let’s say for imputing only categorical (ordinal) predictors, which is also a problem to solve in my project :)

I am not an expert in medicine nor in modeling related to medicine. However, based on my knowledge, there is a potentially huge benefit to semi-supervised learning, which enables using labeled pairs of data (Y, X) together with unlabeled data (X), especially if the sample of the first is much smaller than that of the second. In my case, the big second data sample includes not only (X) but also (Y1, Y2) - known to be strongly related to Y3. I have looked over the multiple imputation literature and have not found anything about the non-admissibility of MI for response variables. I am not saying that MI is the only or the ideal solution. Do you have any experience implementing semi-supervised learning procedures using brms or Stan?

Imagine another situation where I have Y4 = f(Y1, Y2, Y3), where f is a known deterministic nonlinear function. Suppose I need a model P(Y4|X). If this does not convince you that I need some iterative procedure like MI for inferring missing Y3 values during the model training process, versus case-wise deletion plus inferring afterwards, then I have no further arguments that would convince you 😊

I have another question regarding your suggested long-format “trick”. It works for the cumulative type of ordinal regression model, but it does not work for the sequential or adjacent-category model types together with category-specific effects specified via cs(x). This is because of the use of “answer | thres(gr = question)” on the left side of the formula to define question-specific thresholds. Any ideas how to overcome this limitation?

Error: Cannot use category specific effects in models with multiple thresholds.

Thanks for your insights and patience!

So it is possible I am missing something completely basic. I admit I am no expert on imputation, just using some statistical common sense (which can be fallible). All of the imputation tasks I’ve worked with so far are of the form:

[1] outcomes ~ something_with_missing_values + other_predictors
[2] something_with_missing_values ~ subset_of_other_predictors + subset_of_outcomes

Where some of the subsets might be empty, so [2] can even be an intercept-only model. So the variable with missing values always figures in two (or more) relationships: at least once on the left-hand side and at least once on the right-hand side. In my understanding, this is what allows us to impute it while using the imputed values for inference. If there were only one relationship, we would be stuck. You seem to have only one relationship. So I think that to do anything beyond just fitting the model and using its predictions, you need additional constraints/information/assumptions. It seems you have some intuition for why there should be additional information about Y3 in the data beyond just P(Y3[!is.na(Y3)] | X) and the residual correlations between Y1, Y2 and Y3 - could you explain a bit more why you believe this is the case?
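For concreteness, here is roughly what those two relationships look like in brms for a continuous variable, via mi() (a sketch with made-up names - this is exactly the construct that currently has no discrete counterpart):

library(brms)

# [1] the outcome regressed on the partially missing variable
# [2] the partially missing variable gets its own model
bform <- bf(outcome ~ mi(x_miss) + other_pred) +
  bf(x_miss | mi() ~ other_pred) +
  set_rescor(FALSE)
fit_mi <- brm(bform, data = my_data)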

I checked my copy of Statistical Rethinking (I have the first edition, so this might have changed) and my understanding is that the imputation procedure described there also uses two relationships for each of the imputed variables, not just one.

If we are speaking about imputing predictors, and those are ordinal, and you can assume a cumulative model for them (i.e. they are categorizations of a continuous latent variable), then I think you can hack around brms to introduce latent continuous variables that are constrained by some thresholds (when the predictor is observed) and unconstrained (when the predictor is not observed), and put some predictors on those as you would when imputing a continuous value. I would not expect this to provide substantial benefits over just using mice to impute the predictors, and it would be tricky to do, but I’d be happy to get you started if you want to go in this direction.

I however believe this is a completely separate problem - when you are imputing responses, the problems I see in your case would apply in the same way to continuous values.

I don’t think I understand why you need an iterative procedure. As I described above, you can use the fitted long model to impute Y3 | Y1, Y2, X, and this should IMHO be perfectly fine for making probabilistic predictions of f (and I don’t see how you could do better). I however don’t think you can use the imputed values of Y3 to gain any further insight into the relationship between Y3 and X.

I have only limited experience with non-cumulative ordinal models, but my understanding is that cs in non-cumulative models handles all the use cases handled by thres in cumulative models (and some more), so I see no reason why you would need both in the same model. IMHO answer | thres(gr = question) ~ x with the cumulative family should correspond quite directly to answer ~ x + cs(question) in a non-cumulative family. I think the hack I use to introduce correlations is a bit harder to interpret in non-cumulative models, though.
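E.g. a sketch of that non-cumulative analogue, using the sequential sratio family on the simulated data from above:

# cs(question) gives question-specific category effects instead of
# question-specific thresholds
fit_sratio <- brm(answer ~ time_from_diagnosis + cs(question),
                  family = sratio("logit"),
                  data = simple_test_long)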

Best of luck with the model!

1 Like

Hi @mantas.tartenas and friends.

There’s another option.

I’ve been working with a multivariate ordinal data set with a similar problem, where there is uneven missingness among my DVs and I really don’t want to drop cases. The solution is to use the subset() helper function (documented under ?resp_subset). In your data set, make a new column for each of your ordinal DVs. Since it appears you have 3 DVs (Y1 through Y3), we’ll call these new columns sub1 through sub3. In the sub1 column, every row with a missing Y1 value is coded FALSE, while all others are coded TRUE. Follow the same coding procedure for sub2 and sub3.
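A quick sketch of building those indicator columns (assuming your data frame is called my_data):

library(dplyr)

# TRUE where the DV was observed, FALSE where it is missing
my_data <- my_data %>%
  mutate(sub1 = !is.na(Y1),
         sub2 = !is.na(Y2),
         sub3 = !is.na(Y3))

Now you can fit a multivariate ordinal model with syntax like this: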

my_fit <- brm(
  data = my_data,
  family = cumulative(probit),
  mvbf(
    # subset() keeps only the rows where the indicator is TRUE for that DV;
    # the |i| syntax correlates the random effects across the three DVs
    bf(Y1 | subset(sub1) ~ 1 + (1 |i| id)),
    bf(Y2 | subset(sub2) ~ 1 + (1 |i| id)),
    bf(Y3 | subset(sub3) ~ 1 + (1 |i| id)),
    rescor = FALSE
  )
)

This will use all available cases and allow for correlations among the random effects of all DVs. In my data, this has worked shockingly well. I couldn’t be happier. You can even go wild and let the discrimination parameters vary by person, too:

another_fit <- brm(
  data = my_data,
  family = cumulative(probit),
  mvbf(
    bf(Y1 | subset(sub1) ~ 1 + (1 |i| id),
       disc ~ 1 + (1 |i| id)),
    bf(Y2 | subset(sub2) ~ 1 + (1 |i| id),
       disc ~ 1 + (1 |i| id)),
    bf(Y3 | subset(sub3) ~ 1 + (1 |i| id),
       disc ~ 1 + (1 |i| id)),
    rescor = FALSE
  )
)

If you want to use LOO or WAIC, just make sure to use the resp argument.
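For example (names as above):

# model comparison criteria computed per response
loo(my_fit, resp = "Y1")
waic(my_fit, resp = "Y3")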

8 Likes

@torkar If you see or know of an example of this approach implemented in brms, please let us know! That part of the chapter is difficult to follow, and I’m sure there are many people who would love to see a worked brms example :)

I don’t think brms supports missingness in cumulative() families?

Agreed, @jonnations. I would love to see a weighted average approach to missing data implemented with brms. If @torkar or anyone else has example brms code for this, you have instant GOAT status in my eyes.

1 Like

Sorry, but I haven’t gotten around to providing a simple example. Perhaps I’ll do it during the summer if I have the time :)

2 Likes

Hi all

May I suggest a solution outside of brms? Since you are thinking of something like an SEM (you mentioned the post about latent variables with brms), you can do those models with the blavaan package. It runs SEMs with Stan in the backend, handles categorical variables as nominal, and even with categorical variables it can handle missing data, similar to FIML (Full Information Maximum Likelihood) in frequentist SEM.