Model Poll data (Categorical Likelihood with Dirichlet Prior)

I am trying to create a model on the following data (the data provided are just 15 rows out of the whole dataset):

X1 X2 X3 X4 X5 Y X6
3 0 2 2 1 2 1
2 0 3 3 1 3 1
1 0 2 NA 1 2 1
3 1 2 1 1 1 0
3 1 3 3 3 2 NA
1 0 2 1 3 3 NA
3 1 1 3 3 1 1
1 0 1 3 2 1 0
3 1 NA 3 3 2 0
1 0 NA 1 1 2 1
2 1 2 3 1 2 NA
3 1 1 3 2 2 0
2 0 2 1 3 3 0
3 0 NA 3 2 2 0
3 0 3 3 3 3 1

Here I am trying to predict Y from the remaining variables (X1, …, X6). When I omit the NA values, I use a categorical distribution for the likelihood with a Dirichlet prior. But since both Y and the Xs have missing values, I decided to build a second model that includes them: I use a categorical distribution for X1, X3, X4, X5 and a Bernoulli for X2 and X6, so that their missing values are predicted as well. As I am new to Stan, I cannot write the model properly, and I cannot find a similar example to copy and adapt. If anyone could propose a model to start with, or point to a similar example, I would be glad.

Hi, @LittleAstro, and sorry it’s taken us so long to respond.

I take it the NA is the R notation for missing?

It looks like Y takes values in {1, 2, 3}. In this situation, you can just build a multi-logit regression with missing data. See section 1.6, Multi-logit regression, of the Stan User’s Guide.
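To make the multi-logit likelihood concrete, here is a minimal sketch in plain Python of the quantity that Stan's `categorical_logit` evaluates for one complete row. The function names, the coefficient layout `beta[d][k]`, and the zero-valued coefficients are assumptions for illustration only. The sketch also treats the category codes as numeric predictors, which is a simplification; a real model would probably one-hot encode them.

```python
import math

def log_softmax(eta):
    """Return log(softmax(eta)), computed stably by subtracting the max."""
    m = max(eta)
    lse = m + math.log(sum(math.exp(e - m) for e in eta))
    return [e - lse for e in eta]

def multi_logit_log_lik(y, x, beta):
    """Log-probability of outcome y (coded 1..K) given one predictor row x.

    beta is a D x K nested list (hypothetical layout): beta[d][k] is the
    weight of predictor d on the linear predictor of outcome k + 1.
    """
    n_outcomes = len(beta[0])
    eta = [sum(x_d * beta_d[k] for x_d, beta_d in zip(x, beta))
           for k in range(n_outcomes)]
    return log_softmax(eta)[y - 1]

# First row of the posted data: X1..X5 = 3, 0, 2, 2, 1, X6 = 1, Y = 2.
x_row = [3, 0, 2, 2, 1, 1]
beta_zero = [[0.0] * 3 for _ in x_row]  # placeholder coefficients, not fitted

# With all weights at zero, every outcome is equally likely: log(1/3).
print(multi_logit_log_lik(2, x_row, beta_zero))
```

In a Stan program the coefficients would be parameters with priors, and the sum of these per-row terms over all rows would be the log-likelihood.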

In Stan, discrete missing data has to be marginalized out. The question is whether you want to treat X1 to X6 as regular predictors or whether you want to use random effects (the latter is both more standard and more general). Either way, you will need to marginalize out the missing data. See chapter 7, Latent Discrete Parameters, of the Stan User’s Guide.
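For a single row n in which exactly one discrete predictor is missing, the marginalization could be written roughly as below. The notation is introduced here for illustration and is not from the original post: j indexes the missing predictor, V is its number of categories, θ_j holds its category probabilities, β is the regression coefficients, and x_{n,-j} is the observed part of the row. The sketch also assumes the missing predictor is modeled independently of the other predictors.

$$
p(y_n \mid x_{n,-j}, \beta, \theta_j)
  = \sum_{v=1}^{V} \theta_{j,v} \; p(y_n \mid x_{n,j} = v,\, x_{n,-j},\, \beta)
$$

On the log scale this is a `log_sum_exp` over V terms, each of the form log θ_{j,v} plus the multi-logit log-likelihood with x_{n,j} set to v.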

The missing data is the tricky part here, given that it looks like it can come in as many as 2^6 - 1 = 63 patterns (any non-empty subset of the six predictors can be missing). I’m not 100% sure of the easiest way to code this up. If at most one value is missing in each row, then you can just write the cases out explicitly.
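As a rough illustration of writing the cases out explicitly when a row has at most one missing predictor, here is a sketch in plain Python of the arithmetic a Stan model would do with `log_sum_exp`. Everything named here is hypothetical: the function names, the use of `None` for NA, the `theta` dictionary of category probabilities, and the placeholder coefficient values. As above, category codes are treated as numeric predictors for brevity.

```python
import math

def log_sum_exp(terms):
    """Stable log(sum(exp(t) for t in terms))."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def log_lik_complete(y, x, beta):
    """Multi-logit log-probability of y (coded 1..K) for a fully observed row."""
    n_outcomes = len(beta[0])
    eta = [sum(x_d * beta_d[k] for x_d, beta_d in zip(x, beta))
           for k in range(n_outcomes)]
    return eta[y - 1] - log_sum_exp(eta)

def log_lik_one_missing(y, x, beta, theta):
    """Log-likelihood of y with at most one missing predictor (None) in x.

    theta[d] maps each possible value of predictor d to its probability
    (an assumed, independent categorical model for that predictor).
    """
    missing = [d for d, x_d in enumerate(x) if x_d is None]
    if not missing:
        return log_lik_complete(y, x, beta)
    if len(missing) > 1:
        raise ValueError("this sketch handles at most one missing value per row")
    j = missing[0]
    terms = []
    for value, prob in theta[j].items():
        x_filled = x[:j] + [value] + x[j + 1:]  # plug in one candidate value
        terms.append(math.log(prob) + log_lik_complete(y, x_filled, beta))
    return log_sum_exp(terms)  # sum over candidate values on the log scale

# Third row of the posted data: X4 is NA, Y = 2.
x_row = [1, 0, 2, None, 1, 1]
beta = [[0.1 * (d + 1) * (k - 1) for k in range(3)] for d in range(6)]  # placeholder
theta = {3: {1: 0.3, 2: 0.2, 3: 0.5}}  # placeholder probabilities for X4
print(log_lik_one_missing(2, x_row, beta, theta))
```

In a full model, the observed values of each predictor would also contribute their own categorical (or Bernoulli) likelihood terms so that `theta` is informed by the data. Rows with several missing values would need a sum over every combination of candidate values, which is where the number of cases grows quickly.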