Marginalising out missing categorical response variable cases provides inaccurate predictor estimates

The model in the example above is an ordinal probit model. That is, the dependent variable is an ordered categorical variable with J=3 levels, in this example, which is assumed generated by a latent, normally distributed scale cut into K=J-1 thresholds:

y_i \sim Categorical(\pi_{ij})

\pi_{ij}= \Phi[\frac{\theta_k-\mu_i}{\sigma}] - \Phi[\frac{\theta_{k-1}-\mu_i}{\sigma}]

where \pi_{ij} is the probability of case i of ordinal category j \in (1,2,3), \Phi is the standard normal CDF with mean \mu_i and standard deviation \sigma, and \theta_k is the threshold. Here, there are two thresholds. I use the parameterisation where the first and last thresholds are fixed (i.e. both are fixed in this example to 1.5 and 2.5, respectively), allowing for \mu_i and \sigma to be estimated.

In the case of missing y_i data, I want to sum over range the possibilities that y_i could take. So I (think) I want \sum_i \sum_j P(y_i = j | \pi_{ij}) = P(y_i=j) \cdot P(j | \pi_{ij}), i.e. the probability that the missing dependent variable is category j given the estimated probability of category j, for all j.

For a Bernoulli variable z_n \in (0,1) with probability \phi, we can marginalise over the missing data by computing:

\phi \cdot P(z_n = 1 | \phi) + (1-\phi) \cdot P(z_n = 0 | \phi)

Stan’s log_mix function can be used here as (see also Log_mix for missing categorical data):

  target += log_mix(phi, 
                    bernoulli_lpmf(1 | phi),
                    bernoulli_lpmf(0 | phi)
                   )

which evaluates to:

 target += log_sum_exp(
              log(phi) + bernoulli_lpmf(1 | phi),
              log(1-phi) + bernoulli_lpmf(0 | phi)
             )

By extension, for missing 3-category ordinal data cases, I have been using:

target += log_sum_exp(
       log(phi[1]) + categorical_lpmf(1 | phi), 
       log(phi[2]) + categorical_lpmf(2 | phi), 
       log(phi[3]) + categorical_lpmf(3 | phi)
     )

However, this is what is causing the inaccurate parameter estimation.