Log_mix for missing categorical data


In the past I have had success modelling discrete missing outcome variables by marginalizing over the missing values with log_mix. However, I don’t know how to extend this approach to more than two categories. I had been dealing with missing binary data with the following:

target += log_mix( theta[i],
                   binomial_logit_lpmf( 1 | 1, p[i] ),
                   binomial_logit_lpmf( 0 | 1, p[i] ) );

Where theta is the probability that y = 1, and p is a linear model of theta on the logit scale. How could I extend this strategy to more than two levels, say if I wanted to marginalize over three possible categorical values of y?
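For context, here is a minimal sketch of how a log_mix statement like this can sit inside a complete program. It is not the model from the post above: it assumes the missing binary variable is a covariate x that shifts the mean of a continuous outcome y, and every name in it (x_obs, y_obs, y_mis, alpha, beta, sigma) and every prior is a placeholder chosen for illustration. It also uses the newer array syntax, so it assumes a recent Stan release.

```stan
data {
  int<lower=0> N_obs;                        // rows where x is observed
  int<lower=0> N_mis;                        // rows where x is missing
  array[N_obs] int<lower=0, upper=1> x_obs;  // observed binary covariate
  vector[N_obs] y_obs;                       // outcome for rows with x observed
  vector[N_mis] y_mis;                       // outcome for rows with x missing
}
parameters {
  real<lower=0, upper=1> theta;  // Pr(x = 1)
  real alpha;                    // mean of y when x = 0
  real beta;                     // shift in the mean of y when x = 1
  real<lower=0> sigma;           // residual standard deviation
}
model {
  // Placeholder priors (assumptions, not taken from the thread)
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ exponential(1);

  // Rows with x observed contribute the usual likelihood terms
  x_obs ~ bernoulli(theta);
  y_obs ~ normal(alpha + beta * to_vector(x_obs), sigma);

  // Rows with x missing: sum the likelihood over x = 1 and x = 0,
  // weighting the two cases by theta and 1 - theta
  for (i in 1:N_mis) {
    target += log_mix(theta,
                      normal_lpdf(y_mis[i] | alpha + beta, sigma),
                      normal_lpdf(y_mis[i] | alpha, sigma));
  }
}
```

The first lpdf passed to log_mix is the one weighted by theta, so it has to correspond to x = 1.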


The log_mix function is basically just a wrapper for log_sum_exp in the two-component case. There was a PR merged into develop that generalized this to more than two components. See

but it is essentially just

target += log_sum_exp(log(theta) + log(PMFs));
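Written out, the one-liner above is the log of a K-component mixture. The notation here is an assumption added for illustration: z_i stands for the missing category of observation i, and theta_k for the probability that z_i = k.

```latex
\log p(y_i)
  = \log \sum_{k=1}^{K} \theta_k \, p(y_i \mid z_i = k)
  = \log \sum_{k=1}^{K} \exp\bigl( \log \theta_k + \log p(y_i \mid z_i = k) \bigr)
```

The right-hand side is log_sum_exp applied to the vector with entries log(theta_k) + log p(y_i | z_i = k). With K = 2, theta_1 = theta and theta_2 = 1 - theta, it reduces to the two-component log_mix.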


Thanks! For those who might be trying to do this with a categorical logit model with a K-1 parameterization, this was my solution:

target += log_sum_exp( log_softmax(p) + [ categorical_logit_lpmf( 1 | p ), categorical_logit_lpmf( 2 | p ), categorical_logit_lpmf( 3 | p ) ]' );

Where p is a vector of probabilities on the logit scale with the last category set to a constant (0). The log_softmax function converts p to probabilities that sum to 1 and returns their natural logarithms.
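To show the K-1 parameterization in a full program, here is a minimal sketch. It assumes the missing variable is a K-category covariate x (K = 3 in this thread) used to model a binary outcome y. It is not the poster's code: the names p_free, log_pi, beta, x_obs, y_obs and y_mis and the priors are all hypothetical, and it assumes a recent Stan release with the array syntax.

```stan
data {
  int<lower=2> K;                            // number of categories (3 in this thread)
  int<lower=0> N_obs;                        // rows where x is observed
  int<lower=0> N_mis;                        // rows where x is missing
  array[N_obs] int<lower=1, upper=K> x_obs;  // observed categories
  array[N_obs] int<lower=0, upper=1> y_obs;  // outcome for rows with x observed
  array[N_mis] int<lower=0, upper=1> y_mis;  // outcome for rows with x missing
}
parameters {
  vector[K - 1] p_free;  // free logits; category K is the reference
  vector[K] beta;        // log-odds of y = 1 within each category
}
transformed parameters {
  vector[K] p = append_row(p_free, 0);  // last logit fixed at 0
  vector[K] log_pi = log_softmax(p);    // log category probabilities
}
model {
  // Placeholder priors (assumptions, not taken from the thread)
  p_free ~ normal(0, 2);
  beta ~ normal(0, 2);

  // Rows with x observed
  x_obs ~ categorical_logit(p);
  y_obs ~ bernoulli_logit(beta[x_obs]);

  // Rows with x missing: marginalize over the K possible categories
  for (i in 1:N_mis) {
    vector[K] lp;
    for (k in 1:K) {
      lp[k] = log_pi[k] + bernoulli_logit_lpmf(y_mis[i] | beta[k]);
    }
    target += log_sum_exp(lp);
  }
}
```

If the probability of each category for a missing row is of interest, softmax(lp) computed the same way in generated quantities gives it, conditional on the parameters.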


Hi Erik,

Would you be so kind as to upload the complete code, please?


Attached is an R script that demonstrates how to do this with a missing categorical predictor.
multi_missing.R (2.9 KB)
