# Log_mix for missing categorical data

In the past I have had success modelling missing discrete outcome variables by marginalizing over the missing values with log_mix. However, I don’t know how to extend this approach to more than two categories. For missing binary data I had been using the following:

target += log_mix( theta[i],
                   binomial_logit_lpmf( 1 | 1, p[i] ),
                   binomial_logit_lpmf( 0 | 1, p[i] ) );


Where theta is the probability that y = 1, and p is a linear model of theta on the logit scale. How could I extend this strategy to more than two levels, say if I wanted to marginalize over three possible categorical values of y?

The log_mix function is basically just a wrapper for log_sum_exp in the two component case. There was a PR merged into develop that generalized this to the case of more than two components. See
https://github.com/stan-dev/math/pull/751
but it is essentially just

target += log_sum_exp(log(theta) + log(PMFs));
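To make that one-liner concrete, here is a minimal sketch of the K-category case, not a definitive implementation. Everything beyond `theta` and `log_sum_exp` is an assumption for illustration: `z` is a hypothetical observed quantity whose distribution depends on the missing category, and `mu` and `sigma` are hypothetical parameters of that distribution. The vector `lp` plays the role of `log(theta) + log(PMFs)` above.

```stan
data {
  int<lower=2> K;          // number of categories
  int<lower=0> N_mis;      // cases whose category is missing
  vector[N_mis] z;         // hypothetical: observed data that depend on the category
}
parameters {
  simplex[K] theta;        // theta[k] = P(y = k)
  ordered[K] mu;           // hypothetical: category-specific mean of z
  real<lower=0> sigma;     // hypothetical: residual scale of z
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);    // half-normal because of the lower bound
  for (i in 1:N_mis) {
    // start from the log mixing weights, then add each component's log density
    vector[K] lp = log(theta);
    for (k in 1:K) {
      lp[k] += normal_lpdf(z[i] | mu[k], sigma);
    }
    // log of sum_k theta[k] * p(z[i] | y = k)
    target += log_sum_exp(lp);
  }
}
```

If I recall correctly, recent Stan releases also accept `log_mix(theta, lp)` with a simplex of weights and a vector of component log densities, in which case `lp` would hold only the `normal_lpdf` terms and the `log(theta)` step is handled inside the function.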


Hi @Erik_Ringen,

I realise this is an old topic. However, I’m curious about this approach to marginalising out the categorical response variable.

In:

target += log_sum_exp( log_softmax(p) +
                       categorical_logit_lpmf( 1 | p ) +
                       categorical_logit_lpmf( 2 | p ) +
                       categorical_logit_lpmf( 3 | p ) );


doesn’t log_softmax( p )[1] == categorical_logit_lpmf( 1 | p ), log_softmax( p )[2] == categorical_logit_lpmf( 2 | p ) etc., so we are effectively summing the same things twice? Is this a correct interpretation?

I would have thought we need something like P(y=1)P(y=1|p), P(y=2)P(y=2|p), etc. instead, i.e. P(y=k)P(y=k|p) for each category k, with P(y=k) != P(y=k|p).
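One way to check this on paper, assuming `p` is a length-3 vector of logits and writing $\pi_k = \mathrm{softmax}(p)_k$. The symbol $D$ below is my own notation for any observed data that depend on $y$; it does not appear in the snippet.

$$
\texttt{categorical\_logit\_lpmf}(k \mid p) = \log \pi_k = \texttt{log\_softmax}(p)_k
$$

So the two terms are indeed the same quantity. As the snippet is written, the three lpmf values are scalars added to every element of the vector, which gives

$$
\log \sum_{j=1}^{3} \exp\!\Big(\log \pi_j + \sum_{k=1}^{3} \log \pi_k\Big)
= \sum_{k=1}^{3} \log \pi_k + \log \sum_{j=1}^{3} \pi_j
= \sum_{k=1}^{3} \log \pi_k .
$$

An elementwise version (each lpmf added only to its own element) would instead give $\log \sum_{k} \pi_k^2$. Neither matches the usual marginal over a missing category, which weights the likelihood of the observed data under each category by that category's probability:

$$
\log \sum_{k=1}^{3} \pi_k \, P(D \mid y = k).
$$

If nothing else in the model depends on $y$, this reduces to $\log \sum_k \pi_k = 0$, so the missing outcome contributes nothing to the target.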