Missing data in categorical data models

bsmangel · May 24, 2022, 7:18pm

Hello everyone,

I started learning Stan through rstan a few weeks ago. I work with ordinal variables and therefore with ordinal regression models, and I am still getting familiar with the language and the output of this software. I’ve started by running some simple models, but I have run into a problem I never had with WinBUGS (the only Bayesian inference software I had used before): missing values in a response variable.

I’ve been reading the Stan User’s Guide, and I think there would be no problem if I were dealing with a continuous response variable. From what I have read in the manual, discrete or categorical missing values cannot be handled directly, because Stan does not use Gibbs sampling as WinBUGS does (WinBUGS has no problem here, since it draws the missing values from the corresponding likelihood at each iteration). As I understand it, the obstacle is that integer parameters cannot be declared in the parameters block.

Is there a quick way to fix this problem? My question also applies to regression models for binary data (such as logistic regression) and to Poisson regression. I have searched for information on this for several days, and the usual recommendation seems to be to read section 7.2 (Change point models) of the Stan User’s Guide. Do you recommend the same, or can you suggest a different approach? Thank you in advance for your help.

andre.pfeuffer · May 31, 2022, 4:39am

If you have a missing response, then it carries no information. You omit it.
If it is partially known, you may introduce a continuous parameter for it.
This parameter can be seen as a predictive inference.
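One way to read the advice above, as a minimal sketch and not a definitive implementation: fit the model to the rows with an observed response, then draw the missing responses in generated quantities, where integer values are allowed. The example uses a Poisson regression. The names N_obs, N_mis, X_obs, X_mis, y_obs, y_mis and the priors are assumptions made for illustration and do not come from the thread. The array syntax assumes Stan 2.26 or later.

```stan
data {
  int<lower=0> N_obs;                 // rows with an observed response
  int<lower=0> N_mis;                 // rows with a missing response
  int<lower=1> D;                     // number of predictors
  array[N_obs] int<lower=0> y_obs;    // observed counts
  matrix[N_obs, D] X_obs;             // predictors for observed rows
  matrix[N_mis, D] X_mis;             // predictors for missing rows
}
parameters {
  real alpha;
  vector[D] beta;
}
model {
  // Placeholder priors (assumption); adapt them to the application.
  alpha ~ normal(0, 5);
  beta ~ normal(0, 2);
  // Only the observed responses enter the likelihood.
  y_obs ~ poisson_log(alpha + X_obs * beta);
}
generated quantities {
  // One posterior predictive draw per missing response and per iteration.
  array[N_mis] int<lower=0> y_mis;
  for (n in 1:N_mis) {
    y_mis[n] = poisson_log_rng(alpha + X_mis[n] * beta);
  }
}
```

Across iterations, the draws of y_mis[n] form the posterior predictive distribution of the missing response given its predictors, which plays the role of the imputed values that WinBUGS generates.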

bsmangel · May 31, 2022, 7:22am

In my case, I have a response variable with more than 5000 observations, of which no more than 50 are missing. My predictor variables have no missing values. If my response variable were continuous, I could treat these missing values as random variables by declaring continuous parameters in the parameters block. However, my response variable is categorical (ordinal), so I would need an integer parameter, which Stan does not allow. I think the same issue applies to missing values in binary, binomial, or Poisson variables, for example.

andre.pfeuffer · June 1, 2022, 12:49am

To impute a response, you may add a continuous parameter, draw samples in the generated quantities block, and use these draws to estimate the mode.

logistic regression

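parameters {
  real<lower=0, upper=1> y_miss;  // continuous stand-in for the missing 0/1 response
}
model {
  // mu_miss: linear predictor (logit scale) of the missing row, defined elsewhere
  target += y_miss * log_inv_logit(mu_miss) + (1 - y_miss) * log1m_inv_logit(mu_miss);
}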

credit goes to:

Poisson distribution

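parameters {
  real<lower=0> y_miss;  // continuous stand-in for the missing count
}
model {
  // mu_miss: linear predictor (log scale) of the missing row, defined elsewhere.
  // lgamma(y_miss + 1) is log(y_miss!); it depends on the parameter y_miss,
  // so it cannot be dropped as a constant.
  target += y_miss * mu_miss - exp(mu_miss) - lgamma(y_miss + 1);
}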

Ordinal probit/logit

The following refers to the Stan manual: 1.8 Ordered logistic and probit regression | Stan User’s Guide.
Use a simplex parameter y_miss of dimension K, so that each y_miss[k] has its corresponding theta[k].

parameters {
  simplex[K] y_miss;  // one weight per category of the missing response
}
model {
  vector[K] theta;
  // ... this is for the missing value; eta and c are defined elsewhere
  theta[1] = 1 - Phi(eta - c[1]);
  for (k in 2:(K - 1)) {
    theta[k] = Phi(eta - c[k - 1]) - Phi(eta - c[k]);
  }
  theta[K] = Phi(eta - c[K - 1]);
  // ...
  target += sum(y_miss .* log(theta));
}

For an ordinal logit model, replace the function Phi with inv_logit.

saudiwin · June 1, 2022, 4:29am

Thanks for the shoutout, but I don’t think fractional logit will help in this case. @bsmangel will need to marginalize over all the categories in the outcome. The Stan manual’s chapter on mixture models shows how this is done with a binary variable. Essentially, for each row with a missing outcome, you compute the probability of the outcome separately for each possible value of the categorical outcome and sum those probabilities.
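The marginalization described above can be sketched as follows for an ordinal logit model. This is an illustration under assumptions that are not in the thread: each row n carries a range y_lo[n] to y_hi[n] of categories that are still possible (both equal to the observed category for an observed response, 1 and K for a fully missing one), and the priors are placeholders. The array syntax assumes Stan 2.26 or later.

```stan
data {
  int<lower=2> K;                          // number of ordered categories
  int<lower=0> N;                          // number of rows
  int<lower=1> D;                          // number of predictors
  matrix[N, D] X;
  // Hypothetical encoding: categories y_lo[n]..y_hi[n] are still possible.
  // Assumes y_lo[n] <= y_hi[n] for every row.
  array[N] int<lower=1, upper=K> y_lo;
  array[N] int<lower=1, upper=K> y_hi;
}
parameters {
  vector[D] beta;
  ordered[K - 1] c;                        // cut points
}
model {
  vector[N] eta = X * beta;
  // Placeholder priors (assumption); adapt them to the application.
  beta ~ normal(0, 2);
  c ~ normal(0, 5);
  for (n in 1:N) {
    int M = y_hi[n] - y_lo[n] + 1;         // number of possible categories
    vector[M] lp;
    for (m in 1:M) {
      // log probability of each category that is still possible
      lp[m] = ordered_logistic_lpmf(y_lo[n] + m - 1 | eta[n], c);
    }
    // log of the summed probabilities: the marginal likelihood of row n
    target += log_sum_exp(lp);
  }
}
```

For a fully missing row the sum runs over all K categories, whose probabilities add up to 1, so the row contributes log(1) = 0 and leaves the posterior unchanged. The marginalization only adds information for partially known responses.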

In general, though, I would recommend doing something like multiple imputation, fitting the model to each dataset, and then combining the chains. The R package brms can do this for you (and really can fit many if not most models people want to fit much more easily).

andre.pfeuffer · June 1, 2022, 5:26am

The probability of each possible value of the categorical outcome is given by the simplex.
One may estimate the expected value by \sum_{k=1}^{K} k \, y_k. I’d recommend using categorical_rng to sample discrete values for the missing value.
@saudiwin Maybe you can share a code example. I’m always keen to learn about alternative ways.
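For reference, a minimal sketch of drawing discrete values with categorical_rng and computing an expected category in an ordinal probit model. It is a variation on the approach in this post, not the same code: the category probabilities theta are computed from the fitted parameters in generated quantities, and no simplex parameter is declared. The names N_obs, N_mis, X_obs, X_mis, y_obs, y_mis, y_mis_mean and the priors are assumptions made for illustration. The array syntax assumes Stan 2.26 or later.

```stan
data {
  int<lower=2> K;                              // number of ordered categories
  int<lower=0> N_obs;                          // rows with an observed response
  int<lower=0> N_mis;                          // rows with a missing response
  int<lower=1> D;                              // number of predictors
  array[N_obs] int<lower=1, upper=K> y_obs;
  matrix[N_obs, D] X_obs;
  matrix[N_mis, D] X_mis;
}
parameters {
  vector[D] beta;
  ordered[K - 1] c;                            // cut points
}
model {
  // Placeholder priors (assumption); adapt them to the application.
  beta ~ normal(0, 2);
  c ~ normal(0, 5);
  y_obs ~ ordered_probit(X_obs * beta, c);
}
generated quantities {
  array[N_mis] int<lower=1, upper=K> y_mis;    // sampled category
  vector[N_mis] y_mis_mean;                    // expected category
  for (n in 1:N_mis) {
    real eta = X_mis[n] * beta;
    vector[K] theta;
    // Category probabilities, as in the User's Guide ordered probit example.
    theta[1] = 1 - Phi(eta - c[1]);
    for (k in 2:(K - 1)) {
      theta[k] = Phi(eta - c[k - 1]) - Phi(eta - c[k]);
    }
    theta[K] = Phi(eta - c[K - 1]);
    y_mis[n] = categorical_rng(theta);
    y_mis_mean[n] = 0;
    for (k in 1:K) {
      y_mis_mean[n] += k * theta[k];
    }
  }
}
```

The posterior frequencies of y_mis[n] estimate the probability of each category for the missing response, and the most frequent value gives its mode.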

Mauricio_Garnier-Villarre · June 1, 2022, 2:38pm

What type of categorical regressions are you running?

Because if you can use brms, you can use multiple imputation (keeping the categories) and estimate the regressions with the imputed datasets. For example, the brms vignette uses the package mice.

This could be a good solution that relies on already-built packages.

Lisdoon_Varna · August 12, 2023, 2:25am

This may be exactly what you need, using brm and MRP (Gelman et al.)
