# Marginalising out missing categorical response variable cases provides inaccurate predictor estimates

I am working on a number of hierarchical models with ordinal response variables. The response variables have missing data that I assume are related to both additional predictor variables (e.g. time of observation) as well as the value of the response variable at the time point in which it is missing, i.e. I have non-ignorable, not missing at random cases.

I have been trying to marginalise out the missing categorical response data, but I am finding I am getting inaccurate estimates for other model parameters.

I have attached an R script that simulates a simple case of (non-hierarchical) ordinal data from an ordinal probit model (with fixed threshold parameterisation similar to John Kruschke’s approach in Doing Bayesian Data Analysis, which includes an fmax term which I know is not the best code-of-practice in Stan but nevertheless seems to be fine in the applications I have used the model in). Subsequently, I induce missingness following a Bernoulli regression model dependent on a predictor variable.

In the attached Stan script, this is the model (excluding priors here) and generated quantities block I am using to marginalise out the missing data and impute plausible values from the posterior:

model{
vector[J] theta;          // vector to hold probabilities of ordinal categories
vector[N] eta;           // linear predictor
vector[N] pi;            // probability of missingness
vector[3] logP_y_eta;   // ln P(y = {1,2,3}, eta)
real logPy;             // ln P(y)

// priors go here

// the likelihood statement
for(n in 1:N){
// first, model the ordinal category probabilities
eta[n] = alpha + beta * x[n];

theta[1] = normal_cdf(thresh[1], eta[n], sigma);
for( l in 2:K ){
theta[l] = fmax(0,
normal_cdf(thresh[l], eta[n],  sigma) -
normal_cdf(thresh[l-1], eta[n],  sigma)
);
}

theta[J] = 1 - normal_cdf(thresh[K], eta[n],  sigma);

// then, compute the likelihood if y is observed
if(y[n] > 0){
y[n] ~ categorical(theta);
}

else{
// we want the log probabilities of each ordinal category given eta
// i.e. P(y | eta) = P(y, eta) / P(eta)

// here is log(P(y,eta)) for each category
logP_y_eta[1] = log(theta[1]) + categorical_lpmf(1 | theta); // P(y=1)P(y=1|eta)
logP_y_eta[2] = log(theta[2]) + categorical_lpmf(2 | theta); // P(y=2)P(y=2|eta)
logP_y_eta[3] = log(theta[3]) + categorical_lpmf(3 | theta); // P(y=3)P(y=3|eta)

// now we marginalise over the log probabilities to get P(y)
logPy = log_sum_exp(logP_y_eta);

// add it to target
target += logPy;
}

// finally, we want to add a missing data model
pi[n] = alpha_miss + beta_miss * x[n];
is_missing[n] ~ bernoulli_logit( pi[n] );
}
}

generated quantities{
matrix[N,J] y_ord_impute_probs;
vector[N] y_ord_impute;
vector[J] theta;
vector[N] eta;
vector[3] logP_y_eta;
real logPy;

for(n in 1:N){
// first, model the ordinal category probabilities
eta[n] = alpha + beta * x[n];

theta[1] = normal_cdf(thresh[1], eta[n], sigma);
for( l in 2:K ){
theta[l] = fmax(0,
normal_cdf(thresh[l], eta[n],  sigma) -
normal_cdf(thresh[l-1], eta[n],  sigma)
);
}

theta[J] = 1 - normal_cdf(thresh[K], eta[n],  sigma);

// then, compute the likelihood
// if y is observed
if(y[n] > 0){
y_ord_impute_probs[n, 1:3] = [0,0,0]; // set probs to 0
y_ord_impute_probs[n, y[n]] = 1;   // set probs to 1
y_ord_impute[n] = y[n];
}
else{
// we want the log probabilities of each ordinal category given eta
// i.e. P(y | eta) = P(y, eta) / P(eta)
logP_y_eta[1] = log(theta[1]) + categorical_lpmf(1 | theta);
logP_y_eta[2] = log(theta[2]) + categorical_lpmf(2 | theta);
logP_y_eta[3] = log(theta[3]) + categorical_lpmf(3 | theta);

// now we marginalise over the log probabilities to get P(y)
logPy = log_sum_exp(logP_y_eta);

// now we compute P(y,eta) / P(eta)
y_ord_impute_probs[n, 1] = exp( logP_y_eta[1] - logPy );
y_ord_impute_probs[n, 2] = exp( logP_y_eta[2] - logPy );
y_ord_impute_probs[n, 3] = exp( logP_y_eta[3] - logPy );
y_ord_impute[n] = categorical_rng(y_ord_impute_probs[n,1:3]');
}
}
}


At the moment, the \beta cofficient (the effect of the x predictor on the observed data) I am estimating in this model is not accurate, particularly when I make missingness dependent on the x predictor variable. I am unsure whether this is just because the amount of missingness that I am inducing is causing uncertainty in the responses and, therefore, leads to different slope estimates or whether there is a model mis-specification causing the issue.

For instance, here are the \alpha, \beta and \sigma coefficient estimates for a representative run between the missing (i.e. marginalising missingness) and non-missing (true, observed data) models.

Could anyone clarify if this approach is correct?

Furthermore, does anyone know if the imputed value could be used as a predictor in the model block in this case?

Here are the attachments:
Stan-marginalise-missing-categorical.stan (3.6 KB)
Marginalise-missing-categorical.R (3.8 KB)
Ordinal.stan (687 Bytes)

EDITS: clarity, forgot to upload files

Just giving this post a bump to see if anyone has suggestions.

Another bump for this topic.

Would it be possible for you to write your model and approach to marginalisation in mathematical language? That’d make it easier to see what is going on.

The model in the example above is an ordinal probit model. That is, the dependent variable is an ordered categorical variable with J=3 levels, in this example, which is assumed generated by a latent, normally distributed scale cut into K=J-1 thresholds:

y_i \sim Categorical(\pi_{ij})

\pi_{ij}= \Phi[\frac{\theta_k-\mu_i}{\sigma}] - \Phi[\frac{\theta_{k-1}-\mu_i}{\sigma}]

where \pi_{ij} is the probability of case i of ordinal category j \in (1,2,3), \Phi is the standard normal CDF with mean \mu_i and standard deviation \sigma, and \theta_k is the threshold. Here, there are two thresholds. I use the parameterisation where the first and last thresholds are fixed (i.e. both are fixed in this example to 1.5 and 2.5, respectively), allowing for \mu_i and \sigma to be estimated.

In the case of missing y_i data, I want to sum over range the possibilities that y_i could take. So I (think) I want \sum_i \sum_j P(y_i = j | \pi_{ij}) = P(y_i=j) \cdot P(j | \pi_{ij}), i.e. the probability that the missing dependent variable is category j given the estimated probability of category j, for all j.

For a Bernoulli variable z_n \in (0,1) with probability \phi, we can marginalise over the missing data by computing:

\phi \cdot P(z_n = 1 | \phi) + (1-\phi) \cdot P(z_n = 0 | \phi)

Stan’s log_mix function can be used here as (see also Log_mix for missing categorical data):

  target += log_mix(phi,
bernoulli_lpmf(1 | phi),
bernoulli_lpmf(0 | phi)
)


which evaluates to:

 target += log_sum_exp(
log(phi) + bernoulli_lpmf(1 | phi),
log(1-phi) + bernoulli_lpmf(0 | phi)
)


By extension, for missing 3-category ordinal data cases, I have been using:

target += log_sum_exp(
log(phi[1]) + categorical_lpmf(1 | phi),
log(phi[2]) + categorical_lpmf(2 | phi),
log(phi[3]) + categorical_lpmf(3 | phi)
)


However, this is what is causing the inaccurate parameter estimation.

It looks theta (in the original model; phi in your snippet) is acting as both the expected value of y and as your mixing parameter; in pretty much every other discrete missing data / mixture model I’ve seen, these are handled by two different sets of parameters.

Try adding some of the following:

data {
int n_missing; // total number of missing observations
int<lower = 1, upper = N> which_missing[n_missing]; // which of your observations are missing?  Easier to compute in R and pass as data than do in stan.
}
parameters {
simplex[3] mixing_probs[n_missing]; // This gets a dirichlet(1,1,1) prior by default.
}
model {
...
for(n in 1:N){
...
//
logP_y_eta[1] = categorical_lpmf(1 | theta);
logP_y_eta[2] = categorical_lpmf(2 | theta);
logP_y_eta[3] = categorical_lpmf(3 | theta);
target += log_sum_exp(logP_y_eta + log(mixing_probs[which_missing[n]]));
}
}

1 Like

Hi @Christopher-Peterson. Thanks!

I did wonder about this, and enquired about it at the end of this post: Log_mix for missing categorical data

The example of missing binary data seems to use the bernoulli probability estimated by the model as the mixing proportion, which obviously evaluates to P(z_n =1) P(1 | \phi) = \phi \cdot \phi.

I will have a go at including the mixing proportions and let you know.