Modeling missing discrete covariates in regression model?

I am looking to model missing covariates in a regression model using Stan and/or brms.

I am able to do this when the covariate is continuous, for example using the mi() syntax in brms as in the “Imputation during model fitting” section here: https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html

However, when the covariate is discrete, I don’t believe this is possible directly, because Stan can’t sample discrete parameters (HMC requires gradients). I’ve read about marginalization as a way around this issue, but I haven’t had much luck trying that approach.

For example, what if the missing covariate is binary, and I want to model it with a logistic regression? Or what if it is a count variable, and I want to model it as binomial or Poisson? Or even an ordinal variable with a proportional odds model? Because these missing discrete covariates would be treated as parameters in the main regression model of interest, I don’t see how to achieve this in Stan.

Would it be recommended to try another probabilistic programming language like NIMBLE or PyMC, which can combine HMC with other MCMC samplers that handle discrete parameters? For example, the R package JointAI (https://www.jstatsoft.org/article/view/v100i20) uses JAGS to achieve the type of imputation I am seeking.

Any guidance on this issue would be greatly appreciated.

How about using mice, which can impute discrete covariates? There is a brms vignette, “Handle Missing Values with brms”, which shows how to use mice with brms, but it doesn’t mention that mice can be used for discrete data, so you need to read the mice documentation too.
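For example, a minimal sketch of that workflow (the data frame dat and the formula y ~ x1 + x2 are placeholders; by default mice chooses methods per column type, e.g. "logreg" for binary factors and "polyreg"/"polr" for unordered/ordered factors):

library(mice)
library(brms)

# multiply impute; mice picks an imputation method per column type,
# e.g. logistic regression ("logreg") for binary factors
imp <- mice(dat, m = 5)

# fit the brms model to each completed data set and pool the posterior draws
fit <- brm_multiple(y ~ x1 + x2, data = imp)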

Thanks @avehtari! For the specific problem I’m working on, multiple imputation with mice doesn’t adequately handle the missing-data issue, while the fully Bayesian implementation does (related to the work here: https://doi.org/10.1002/sim.6944). I’ve verified that the Stan version works and the mice version doesn’t when the missing covariates are continuous. The issue is that I want to extend this to discrete missing covariates, and while I’d like to keep using Stan, I’m not seeing a way forward without switching to another framework.

What was the problem? Was it difficult to implement? Or were you able to implement it but got wrong results?

In the case of a single missing binary (or, more generally, finite-support) predictor, you are essentially modelling the outcomes with a missing predictor as a finite mixture (if I understand the task correctly), which should work just fine in Stan.

If there are potentially multiple missing predictors, the mixture needs to be over the Cartesian product of the possible missing values, which can get out of hand pretty quickly.

If there are variables with large/infinite support, they can usually be pretty well approximated by a continuous variable, potentially handling important special cases as a mixture (e.g. for a negative-binomially distributed variable with low mean, you may want to treat the very low counts as discrete and the tail as a Gamma-distributed continuous variable).
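To make the Cartesian-product idea concrete, here is a hypothetical Stan sketch for two potentially missing binary predictors with independent, intercept-only imputation models (all names — x1_obs, theta1, b1, and so on — are illustrative; missing values are passed as a placeholder of 0 and flagged by the *_obs indicators):

data {
  int<lower=0> N;
  vector[N] y;
  array[N] int<lower=0, upper=1> x1_obs;  // 1 if x1[n] is observed
  array[N] int<lower=0, upper=1> x2_obs;  // 1 if x2[n] is observed
  array[N] int<lower=0, upper=1> x1;      // placeholder 0 when missing
  array[N] int<lower=0, upper=1> x2;      // placeholder 0 when missing
}
parameters {
  real alpha;
  real b1;
  real b2;
  real<lower=0> sigma;
  real<lower=0, upper=1> theta1;          // imputation model: Pr[x1 = 1]
  real<lower=0, upper=1> theta2;          // imputation model: Pr[x2 = 1]
}
model {
  for (n in 1:N) {
    if (x1_obs[n] == 1 && x2_obs[n] == 1) {
      x1[n] ~ bernoulli(theta1);          // train the imputation models
      x2[n] ~ bernoulli(theta2);
      y[n] ~ normal(alpha + b1 * x1[n] + b2 * x2[n], sigma);
    } else if (x1_obs[n] == 0 && x2_obs[n] == 0) {
      // both missing: marginalize over the 2 x 2 Cartesian product
      vector[4] lp;
      int k = 1;
      for (v1 in 0:1) {
        for (v2 in 0:1) {
          lp[k] = bernoulli_lpmf(v1 | theta1) + bernoulli_lpmf(v2 | theta2)
                  + normal_lpdf(y[n] | alpha + b1 * v1 + b2 * v2, sigma);
          k += 1;
        }
      }
      target += log_sum_exp(lp);
    } else if (x1_obs[n] == 0) {
      // only x1 missing: a two-component mixture
      x2[n] ~ bernoulli(theta2);
      target += log_mix(theta1,
                        normal_lpdf(y[n] | alpha + b1 + b2 * x2[n], sigma),
                        normal_lpdf(y[n] | alpha + b2 * x2[n], sigma));
    } else {
      // only x2 missing: a two-component mixture
      x1[n] ~ bernoulli(theta1);
      target += log_mix(theta2,
                        normal_lpdf(y[n] | alpha + b1 * x1[n] + b2, sigma),
                        normal_lpdf(y[n] | alpha + b1 * x1[n], sigma));
    }
  }
}

With k missing binary predictors the mixture has 2^k components per observation, which is exactly why this gets out of hand so quickly.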

Hope that helps at least a little

Thank you @martinmodrak! I am unsure how to implement the marginalization.

In the case of a single missing binary (or, more generally, finite-support) predictor, you are essentially modelling the outcomes with a missing predictor as a finite mixture (if I understand the task correctly), which should work just fine in Stan.

Do you know of any examples that show Stan code for this?

The Stan User’s Guide chapter on Finite Mixtures has the relevant background. Note that the mixture component indices there are explicitly treated as unknown discrete variables that (a) can be recovered and (b) can depend on data and other model parameters.

Does that make sense?

EDIT: Just to make the connection to mixtures explicit. If your model is

y_i \sim N(\mu_i + \beta X_i, \sigma)

where X_i is a potentially missing binary predictor and \mu_i is the contribution of all other (non-missing) predictors, then whenever X_i is missing, y_i is a mixture of N(\mu_i, \sigma) and N(\mu_i + \beta, \sigma), with the mixing proportion (the probability that X_i = 1) derived from your imputation model.
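In Stan code, a minimal sketch of exactly this mixture (hypothetical names; here \mu_i is just alpha + beta_z * z[n] for a single fully observed covariate z, theta is the imputation probability, and missing x values are passed as a placeholder of 0):

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] z;                            // a fully observed covariate
  array[N] int<lower=0, upper=1> x_obs;   // 1 if x[n] is observed
  array[N] int<lower=0, upper=1> x;       // placeholder 0 when missing
}
parameters {
  real alpha;
  real beta_z;
  real beta;                              // effect of the binary predictor
  real<lower=0> sigma;
  real<lower=0, upper=1> theta;           // imputation model: Pr[x = 1]
}
model {
  for (n in 1:N) {
    real mu = alpha + beta_z * z[n];
    if (x_obs[n] == 1) {
      x[n] ~ bernoulli(theta);            // train the imputation model
      y[n] ~ normal(mu + beta * x[n], sigma);
    } else {
      // y[n] is a mixture of N(mu, sigma) and N(mu + beta, sigma)
      target += log_mix(theta,
                        normal_lpdf(y[n] | mu + beta, sigma),   // x[n] = 1
                        normal_lpdf(y[n] | mu, sigma));         // x[n] = 0
    }
  }
}

Observations with observed x contribute the usual likelihood and also train the imputation model; observations with missing x contribute the two-component mixture.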

You could use a latent variable approach. @scott.claessens’s blog post shows how to do latent variable modeling with brms. The blog post uses only continuous observations, but you can also use discrete observations. I made the following changes: replace two of the observed variables with binary variables

z2 <- rnorm(n = 100, mean = x, sd = 0.5) > 0
z3 <- rnorm(n = 100, mean = x, sd = 0.5) > 0

and use the Bernoulli data model for them

bf2 <- bf(z2 ~ 0 + mi(X), family=bernoulli()) # factor loading 2
bf3 <- bf(z3 ~ 0 + mi(X), family=bernoulli()) # factor loading 3

Finally, you use the latent continuous X as a covariate in your model of interest

bf5 <- bf(y ~ 0 + mi(X))  # added regression

This is easy with brms when there is no missing data in z. With missing data, I think there is the problem of how to handle a varying number of observations, and you would probably need to write the model directly in Stan, but hopefully this brms latent variable example conveys the basic idea.

In this example, there is only one latent variable, but you could also use a multivariate latent variable for additional expressiveness if needed.

This kind of latent variable model can have a challenging posterior shape, which may require quite long sampling times, so if you can marginalize instead, that will be computationally faster.
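For reference, a rough end-to-end sketch assembling these pieces (a hypothetical adaptation of the blog post’s setup: z1 is a continuous indicator, X is an all-NA column holding the latent variable, and the constant priors pinning down the latent scale and one loading may need adjusting for your model):

library(brms)

set.seed(1)
x  <- rnorm(100)                                   # true latent variable
z1 <- rnorm(n = 100, mean = x, sd = 0.5)           # continuous indicator
z2 <- rnorm(n = 100, mean = x, sd = 0.5) > 0       # binary indicator
z3 <- rnorm(n = 100, mean = x, sd = 0.5) > 0       # binary indicator
y  <- rnorm(n = 100, mean = 0.5 * x, sd = 0.3)     # outcome of interest
dat <- data.frame(z1, z2 = as.numeric(z2), z3 = as.numeric(z3),
                  y, X = NA_real_)                 # X holds the latent variable

bf1 <- bf(X | mi() ~ 0)                            # latent variable, mean 0
bf2 <- bf(z1 ~ 0 + mi(X))                          # factor loading 1
bf3 <- bf(z2 ~ 0 + mi(X), family = bernoulli())    # factor loading 2
bf4 <- bf(z3 ~ 0 + mi(X), family = bernoulli())    # factor loading 3
bf5 <- bf(y ~ 0 + mi(X))                           # added regression

fit <- brm(
  bf1 + bf2 + bf3 + bf4 + bf5 + set_rescor(FALSE),
  data = dat,
  prior = c(prior(constant(1), class = "sigma", resp = "X"),  # fix latent scale
            prior(constant(1), class = "b", resp = "z1"))     # fix one loading
)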

@avehtari’s answer highlights an important consideration I missed:

The fact that your data contains a binary (discrete) variable doesn’t mean you need to impute a binary (discrete) variable.

Imputing as continuous makes sense when the discretization actually loses information.

The clearest example is when your predictor is “age > 40”: there’s no point in imputing this as binary; you can just impute age directly (in fact, it likely makes sense to impute age even for the observations where the age category is available).

A more relevant example is something like “takes antihypertensives” when predicting general health outcomes. Most likely the variable serves as a proxy for overall cardiovascular health, and it makes sense to impute it as a continuous “propensity to take antihypertensives”.

In such examples, the latent variable approach as described by Aki makes a lot of sense and will likely work better than imputing the discrete variable. One could argue that this is the more common case.

On the other hand, sometimes the discretization adds information. E.g., for general health outcomes, a binary “suffered a stroke” will likely have predictive value beyond any sort of “propensity for strokes”: actually having a stroke meaningfully changes the patient’s outlook for the future.

When the discretization carries information, I think imputing the discrete variable explicitly will improve your model (e.g. with the mixture approach).

An interesting mixed case is categorical predictors like “country”: depending on your goals, you could argue they directly carry discrete information, or you could argue they are a proxy for continuous variables like “socioeconomic status” or “rule of law” that you may want to impute.

As @martinmodrak pointed out, we can marginalize. The more general treatment is the Latent Discrete Parameters chapter of the Stan User’s Guide.

If it’s just a single covariate that’s missing, it should be easy to marginalize; that gives you the exact answer and will also be the most efficient thing to do. If you wind up with multiple missing covariates, it gets trickier when their regressions depend on each other, as they do in mi(). Those might not even define a coherent joint density.

When the missing variables have dependencies, as in variable selection for regression, the marginalization quickly becomes intractable.

Just to make this concrete, let’s suppose we have:

y[n] ~ normal(alpha + beta * x[n] + gamma[sex[n] + 1], sigma);

and there are two potential values for sex, coded 0 and 1 (hence the index sex[n] + 1). We can write a regression for sex in terms of the other, continuous predictor x. Let’s say that looks like this conceptually, using gamma0 for the intercept so it doesn’t clash with the vector gamma above:

sex[n] ~ bernoulli_logit(gamma0 + delta * x[n]);

Because sex is discrete, we can’t declare its missing values as parameters and sample them directly. What we have to do instead is marginalize:

p(y[n] | alpha, beta, gamma, x[n], sigma)
  = Pr[sex[n] = 1] * p(y[n] | alpha, beta, gamma, x[n], sex[n] = 1, sigma)
  + Pr[sex[n] = 0] * p(y[n] | alpha, beta, gamma, x[n], sex[n] = 0, sigma)

If you are OK training the imputation model on just the observed sex data, that looks like this:

for (n in 1:N)
  if (observed_sex[n])
    sex[n] ~ bernoulli_logit(gamma0 + delta * x[n]);

Then for y, whenever sex[n] is missing, you’d have this:

real log_pr_sex_1 = bernoulli_logit_lpmf(1 | gamma0 + delta * x[n]);
real log_pr_sex_0 = log1m_exp(log_pr_sex_1);
real log_p_yn_if_1 = normal_lpdf(y[n] | alpha + beta * x[n] + gamma[2], sigma);
real log_p_yn_if_0 = normal_lpdf(y[n] | alpha + beta * x[n] + gamma[1], sigma);
target += log_sum_exp(log_pr_sex_1 + log_p_yn_if_1, log_pr_sex_0 + log_p_yn_if_0);
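
Assembled into one complete program (a hypothetical sketch; missing sex entries are passed as a placeholder of 0 and flagged via observed_sex), that might look like:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  array[N] int<lower=0, upper=1> observed_sex;  // 1 if sex[n] is observed
  array[N] int<lower=0, upper=1> sex;           // placeholder 0 when missing
}
parameters {
  real alpha;
  real beta;
  vector[2] gamma;        // gamma[1] for sex = 0, gamma[2] for sex = 1
  real gamma0;            // imputation model intercept
  real delta;             // imputation model slope
  real<lower=0> sigma;
}
model {
  // note: alpha and gamma are only jointly identified;
  // in practice, pin gamma[1] = 0 or drop alpha
  for (n in 1:N) {
    if (observed_sex[n] == 1) {
      sex[n] ~ bernoulli_logit(gamma0 + delta * x[n]);
      y[n] ~ normal(alpha + beta * x[n] + gamma[sex[n] + 1], sigma);
    } else {
      real log_pr_sex_1 = bernoulli_logit_lpmf(1 | gamma0 + delta * x[n]);
      real log_pr_sex_0 = log1m_exp(log_pr_sex_1);
      target += log_sum_exp(
        log_pr_sex_1 + normal_lpdf(y[n] | alpha + beta * x[n] + gamma[2], sigma),
        log_pr_sex_0 + normal_lpdf(y[n] | alpha + beta * x[n] + gamma[1], sigma));
    }
  }
}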