The model in the example above is an ordinal probit model. That is, the dependent variable is an ordered categorical variable with J=3 levels, in this example, which is assumed generated by a latent, normally distributed scale cut into K=J-1 thresholds:
y_i \sim Categorical(\pi_{ij})
\pi_{ij}= \Phi[\frac{\theta_k-\mu_i}{\sigma}] - \Phi[\frac{\theta_{k-1}-\mu_i}{\sigma}]
where \pi_{ij} is the probability of case i of ordinal category j \in (1,2,3), \Phi is the standard normal CDF with mean \mu_i and standard deviation \sigma, and \theta_k is the threshold. Here, there are two thresholds. I use the parameterisation where the first and last thresholds are fixed (i.e. both are fixed in this example to 1.5 and 2.5, respectively), allowing for \mu_i and \sigma to be estimated.
In the case of missing y_i data, I want to sum over range the possibilities that y_i could take. So I (think) I want \sum_i \sum_j P(y_i = j | \pi_{ij}) = P(y_i=j) \cdot P(j | \pi_{ij}), i.e. the probability that the missing dependent variable is category j given the estimated probability of category j, for all j.
For a Bernoulli variable z_n \in (0,1) with probability \phi, we can marginalise over the missing data by computing:
\phi \cdot P(z_n = 1 | \phi) + (1-\phi) \cdot P(z_n = 0 | \phi)
Stan’s log_mix
function can be used here as (see also Log_mix for missing categorical data):
target += log_mix(phi,
bernoulli_lpmf(1 | phi),
bernoulli_lpmf(0 | phi)
)
which evaluates to:
target += log_sum_exp(
log(phi) + bernoulli_lpmf(1 | phi),
log(1-phi) + bernoulli_lpmf(0 | phi)
)
By extension, for missing 3-category ordinal data cases, I have been using:
target += log_sum_exp(
log(phi[1]) + categorical_lpmf(1 | phi),
log(phi[2]) + categorical_lpmf(2 | phi),
log(phi[3]) + categorical_lpmf(3 | phi)
)
However, this is what is causing the inaccurate parameter estimation.