I am working on a number of hierarchical models with ordinal response variables. The response variables have missing data that I assume are related to both additional predictor variables (e.g. time of observation) as well as the value of the response variable at the time point in which it is missing, i.e. I have non-ignorable, not missing at random cases.
I have been trying to marginalise out the missing categorical response data, but I am finding I am getting inaccurate estimates for other model parameters.
I have attached an R script that simulates a simple case of (non-hierarchical) ordinal data from an ordinal probit model (with fixed threshold parameterisation similar to John Kruschke’s approach in Doing Bayesian Data Analysis, which includes an fmax
term which I know is not the best code-of-practice in Stan but nevertheless seems to be fine in the applications I have used the model in). Subsequently, I induce missingness following a Bernoulli regression model dependent on a predictor variable.
In the attached Stan script, this is the model (excluding priors here) and generated quantities block I am using to marginalise out the missing data and impute plausible values from the posterior:
model{
vector[J] theta; // vector to hold probabilities of ordinal categories
vector[N] eta; // linear predictor
vector[N] pi; // probability of missingness
vector[3] logP_y_eta; // ln P(y = {1,2,3}, eta)
real logPy; // ln P(y)
// priors go here
// the likelihood statement
for(n in 1:N){
// first, model the ordinal category probabilities
eta[n] = alpha + beta * x[n];
theta[1] = normal_cdf(thresh[1], eta[n], sigma);
for( l in 2:K ){
theta[l] = fmax(0,
normal_cdf(thresh[l], eta[n], sigma) -
normal_cdf(thresh[l-1], eta[n], sigma)
);
}
theta[J] = 1 - normal_cdf(thresh[K], eta[n], sigma);
// then, compute the likelihood if y is observed
if(y[n] > 0){
y[n] ~ categorical(theta);
}
else{
// we want the log probabilities of each ordinal category given eta
// i.e. P(y | eta) = P(y, eta) / P(eta)
// here is log(P(y,eta)) for each category
logP_y_eta[1] = log(theta[1]) + categorical_lpmf(1 | theta); // P(y=1)P(y=1|eta)
logP_y_eta[2] = log(theta[2]) + categorical_lpmf(2 | theta); // P(y=2)P(y=2|eta)
logP_y_eta[3] = log(theta[3]) + categorical_lpmf(3 | theta); // P(y=3)P(y=3|eta)
// now we marginalise over the log probabilities to get P(y)
logPy = log_sum_exp(logP_y_eta);
// add it to target
target += logPy;
}
// finally, we want to add a missing data model
pi[n] = alpha_miss + beta_miss * x[n];
is_missing[n] ~ bernoulli_logit( pi[n] );
}
}
generated quantities{
matrix[N,J] y_ord_impute_probs;
vector[N] y_ord_impute;
vector[J] theta;
vector[N] eta;
vector[3] logP_y_eta;
real logPy;
for(n in 1:N){
// first, model the ordinal category probabilities
eta[n] = alpha + beta * x[n];
theta[1] = normal_cdf(thresh[1], eta[n], sigma);
for( l in 2:K ){
theta[l] = fmax(0,
normal_cdf(thresh[l], eta[n], sigma) -
normal_cdf(thresh[l-1], eta[n], sigma)
);
}
theta[J] = 1 - normal_cdf(thresh[K], eta[n], sigma);
// then, compute the likelihood
// if y is observed
if(y[n] > 0){
y_ord_impute_probs[n, 1:3] = [0,0,0]; // set probs to 0
y_ord_impute_probs[n, y[n]] = 1; // set probs to 1
y_ord_impute[n] = y[n];
}
else{
// we want the log probabilities of each ordinal category given eta
// i.e. P(y | eta) = P(y, eta) / P(eta)
logP_y_eta[1] = log(theta[1]) + categorical_lpmf(1 | theta);
logP_y_eta[2] = log(theta[2]) + categorical_lpmf(2 | theta);
logP_y_eta[3] = log(theta[3]) + categorical_lpmf(3 | theta);
// now we marginalise over the log probabilities to get P(y)
logPy = log_sum_exp(logP_y_eta);
// now we compute P(y,eta) / P(eta)
y_ord_impute_probs[n, 1] = exp( logP_y_eta[1] - logPy );
y_ord_impute_probs[n, 2] = exp( logP_y_eta[2] - logPy );
y_ord_impute_probs[n, 3] = exp( logP_y_eta[3] - logPy );
y_ord_impute[n] = categorical_rng(y_ord_impute_probs[n,1:3]');
}
}
}
At the moment, the \beta cofficient (the effect of the x predictor on the observed data) I am estimating in this model is not accurate, particularly when I make missingness dependent on the x predictor variable. I am unsure whether this is just because the amount of missingness that I am inducing is causing uncertainty in the responses and, therefore, leads to different slope estimates or whether there is a model mis-specification causing the issue.
For instance, here are the \alpha, \beta and \sigma coefficient estimates for a representative run between the missing (i.e. marginalising missingness) and non-missing (true, observed data) models.
Could anyone clarify if this approach is correct?
Furthermore, does anyone know if the imputed value could be used as a predictor in the model block in this case?
Here are the attachments:
Stan-marginalise-missing-categorical.stan (3.6 KB)
Marginalise-missing-categorical.R (3.8 KB)
Ordinal.stan (687 Bytes)
EDITS: clarity, forgot to upload files