Marginalising out missing categorical response variable cases provides inaccurate predictor estimates

cgoold · July 11, 2019, 9:28pm

The model in the example above is an ordinal probit model. That is, the dependent variable is an ordered categorical variable with J=3 levels, in this example, which is assumed generated by a latent, normally distributed scale cut into K=J-1 thresholds:

y_i \sim Categorical(\pi_{ij})

\pi_{ij}= \Phi[\frac{\theta_k-\mu_i}{\sigma}] - \Phi[\frac{\theta_{k-1}-\mu_i}{\sigma}]

where \pi_{ij} is the probability of case i of ordinal category j \in (1,2,3), \Phi is the standard normal CDF with mean \mu_i and standard deviation \sigma, and \theta_k is the threshold. Here, there are two thresholds. I use the parameterisation where the first and last thresholds are fixed (i.e. both are fixed in this example to 1.5 and 2.5, respectively), allowing for \mu_i and \sigma to be estimated.

In the case of missing y_i data, I want to sum over range the possibilities that y_i could take. So I (think) I want \sum_i \sum_j P(y_i = j | \pi_{ij}) = P(y_i=j) \cdot P(j | \pi_{ij}), i.e. the probability that the missing dependent variable is category j given the estimated probability of category j, for all j.

For a Bernoulli variable z_n \in (0,1) with probability \phi, we can marginalise over the missing data by computing:

\phi \cdot P(z_n = 1 | \phi) + (1-\phi) \cdot P(z_n = 0 | \phi)

Stan’s log_mix function can be used here as (see also Log_mix for missing categorical data):

  target += log_mix(phi, 
                    bernoulli_lpmf(1 | phi),
                    bernoulli_lpmf(0 | phi)
                   )

which evaluates to:

 target += log_sum_exp(
              log(phi) + bernoulli_lpmf(1 | phi),
              log(1-phi) + bernoulli_lpmf(0 | phi)
             )

By extension, for missing 3-category ordinal data cases, I have been using:

target += log_sum_exp(
       log(phi[1]) + categorical_lpmf(1 | phi), 
       log(phi[2]) + categorical_lpmf(2 | phi), 
       log(phi[3]) + categorical_lpmf(3 | phi)
     )

However, this is what is causing the inaccurate parameter estimation.

Topic		Replies	Views
Missing data in categorical data models Modeling rstan	7	1302	August 12, 2023
Imputing missing responses in multivariate ordinal data Modeling ordinal-response , brms , missing-data	15	2356	May 1, 2023
Model Poll data (Categorical Likelihood with Dirichlet Prior) Modeling rstan	1	325	April 21, 2023
Discrete Marginalization of Missing Categorical Variable Modeling techniques	1	571	April 5, 2022
Marginalize missing binary outcome variable for GLM Modeling	11	1079	January 31, 2020

Marginalising out missing categorical response variable cases provides inaccurate predictor estimates

Related topics