Log_mix for missing categorical data

In the past I have had success modelling discrete missing outcome variables by marginalizing over the missing values with log_mix. However, I don’t know how to extend this approach to more than two categories. I had been dealing with missing binary data with the following:

target += log_mix( theta[i],
                   binomial_logit_lpmf( 1 | 1, p[i] ),
                   binomial_logit_lpmf( 0 | 1, p[i] ) );

Here theta is the probability that y = 1, and p is a linear model of theta on the logit scale. How could I extend this strategy to more than two levels, for example to marginalize over three possible categorical values of y?

The log_mix function is basically just a wrapper for log_sum_exp in the two-component case. A PR merged into develop generalizes it to more than two components. See
https://github.com/stan-dev/math/pull/751
but it is essentially just

target += log_sum_exp(log(theta) + log(PMFs));
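To make the relationship concrete, here is a minimal numerical sketch in Python (not Stan) of what the two-component and K-component versions compute. The helper names mirror the Stan functions but are hand-written stand-ins, and all numbers are hypothetical values chosen only for illustration.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_mix2(theta, lp1, lp2):
    """Two-component case: log(theta * exp(lp1) + (1 - theta) * exp(lp2))."""
    return log_sum_exp([math.log(theta) + lp1, math.log1p(-theta) + lp2])

def log_mix(thetas, lps):
    """K-component case: log(sum_k thetas[k] * exp(lps[k])).

    thetas is assumed to be a simplex (non-negative, sums to 1).
    """
    return log_sum_exp([math.log(t) + lp for t, lp in zip(thetas, lps)])

# Hypothetical values, for illustration only.
theta = 0.3
lp1, lp2 = math.log(0.8), math.log(0.2)
two = log_mix2(theta, lp1, lp2)
general = log_mix([theta, 1 - theta], [lp1, lp2])  # same quantity, K = 2

# Three categories: one mixing weight and one log-likelihood term per category.
weights3 = [0.2, 0.5, 0.3]
lps3 = [math.log(0.1), math.log(0.6), math.log(0.3)]
three = log_mix(weights3, lps3)
```

With three categories, the only change is that the mixing weights form a length-3 simplex and there is one log-likelihood term per category.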

Hi Erik,

Would you be so kind as to upload the complete code, please?

Hi @Erik_Ringen,

I realise this is an old topic. However, I’m curious about this approach to marginalising out the categorical response variable.

In:

target += log_sum_exp( log_softmax(p) +
                       [ categorical_logit_lpmf( 1 | p ),
                         categorical_logit_lpmf( 2 | p ),
                         categorical_logit_lpmf( 3 | p ) ]' );

Doesn’t log_softmax( p )[1] == categorical_logit_lpmf( 1 | p ), log_softmax( p )[2] == categorical_logit_lpmf( 2 | p ), etc., so that we are effectively counting the same probability twice in each term? Is this a correct interpretation?

I would have thought we need something like P(y=1)P(y=1|p), P(y=2)P(y=2|p), etc. instead, i.e. P(y=k)P(y=k|p) for each category k, with P(y=k) != P(y=k|p).
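The identity in question can be checked numerically. Below is a minimal Python sketch (not Stan), assuming hypothetical logits p and hand-written stand-ins for log_softmax and categorical_logit_lpmf. It reads the expression elementwise, so term k is the log weight for category k plus the log-likelihood for category k.

```python
import math

def log_softmax(p):
    """Stable log of softmax(p), returned as a list."""
    m = max(p)
    lse = m + math.log(sum(math.exp(v - m) for v in p))
    return [v - lse for v in p]

def categorical_logit_lpmf(k, p):
    """log P(y = k) for 1-based k, computed directly from the definition."""
    return math.log(math.exp(p[k - 1]) / sum(math.exp(v) for v in p))

p = [0.5, -1.0, 2.0]  # hypothetical logits, for illustration only
ls = log_softmax(p)

# Largest difference between log_softmax(p)[k] and categorical_logit_lpmf(k | p).
max_gap = max(abs(ls[k - 1] - categorical_logit_lpmf(k, p)) for k in (1, 2, 3))

# Elementwise reading of the expression: log-weight k plus log-likelihood k.
terms = [ls[k - 1] + categorical_logit_lpmf(k, p) for k in (1, 2, 3)]
m = max(terms)
mixture = m + math.log(sum(math.exp(t - m) for t in terms))

# Log of the sum of squared category probabilities, for comparison.
log_sum_sq = math.log(sum(math.exp(v) ** 2 for v in ls))
```

In this sketch the two quantities agree up to floating-point error, so under that reading the expression evaluates to the log of the sum of squared category probabilities.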

Thanks for any help you have!

Check out the solution in this thread: Imputation of a 3 category covariate to model a binary outcome

Thanks @Erik_Ringen, I appreciate the time getting back to me. However, I am still struggling to see how that post applies to a missing categorical response variable.

In the post you linked to, the probability of the response variable given the different levels of the covariate was marginalised over, i.e. P(X=x) \cdot P(y|X=x). I understand this.
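For comparison, the covariate case described above can be sketched numerically. This is a minimal Python illustration (not Stan) under assumed, hypothetical names and values: pi is the distribution of a 3-category covariate X, and the binary outcome follows a logistic model with intercept alpha and one effect beta[x] per level. None of these names come from the linked thread.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def bernoulli_logit_lpmf(y, eta):
    """log P(y | eta) with P(y = 1) = 1 / (1 + exp(-eta))."""
    if y == 1:
        return -math.log1p(math.exp(-eta))
    return -math.log1p(math.exp(eta))

# Hypothetical parameter values, for illustration only.
pi = [0.2, 0.5, 0.3]    # P(X = x) for the three covariate levels
alpha = -0.5            # intercept on the logit scale
beta = [0.0, 1.0, 2.0]  # effect of each covariate level

def marginal_lp(y):
    """log sum_x P(X = x) * P(y | X = x), with the missing X summed out."""
    return log_sum_exp(
        [math.log(pi[x]) + bernoulli_logit_lpmf(y, alpha + beta[x]) for x in range(3)]
    )

lp1 = marginal_lp(1)
lp0 = marginal_lp(0)

# Direct computation of P(y = 1) for comparison.
direct_p1 = sum(pi[x] / (1.0 + math.exp(-(alpha + beta[x]))) for x in range(3))
```

Here the mixing weights P(X=x) and the likelihood terms P(y|X=x) come from different parts of the model, which is why the two factors are distinct in the covariate case.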

When there is a missing categorical response variable Y \in \{1, 2, \ldots, K\}, we want P(Y=y) \cdot P(Y=y|x) for each category y, where x is a function determining y, e.g. a linear equation estimated from the observed data. However, P(Y=y) is the same thing as P(Y=y|x), as far as I can tell.

I can just start a new thread for this topic if it’s easier. I am probably just missing something!