Model Poll data (Categorical Likelihood with Dirichlet Prior)

LittleAstro · April 8, 2023, 12:48pm

I am trying to build a model for the following data (only 15 rows of the full dataset are shown):

X1 X2 X3 X4 X5 Y X6
3 0 2 2 1 2 1
2 0 3 3 1 3 1
1 0 2 NA 1 2 1
3 1 2 1 1 1 0
3 1 3 3 3 2 NA
1 0 2 1 3 3 NA
3 1 1 3 3 1 1
1 0 1 3 2 1 0
3 1 NA 3 3 2 0
1 0 NA 1 1 2 1
2 1 2 3 1 2 NA
3 1 1 3 2 2 0
2 0 2 1 3 3 0
3 0 NA 3 2 2 0
3 0 3 3 3 3 1

I am trying to predict Y from the other variables (X1, …, X6). When I omit the NA values, I use a categorical distribution for the likelihood with a Dirichlet prior. However, since both Y and the Xs have missing values, I decided to build a second model that includes them: it uses a categorical distribution for X1, X3, X4, and X5 and a Bernoulli distribution for X2 and X6, so that their missing values are predicted as well. As I am new to Stan, I cannot write the model properly, and I cannot find a similar example to copy and adapt. If anyone could propose a model to start with, or point me to a similar example, I would be glad.

Bob_Carpenter · April 21, 2023, 6:37pm

Hi, @LittleAstro, and sorry it’s taken us so long to respond.

I take it the NA is the R notation for missing?

It looks like Y is in {1, 2, 3}. In this situation, you can just build a multi-logit regression with missing data. See the User’s Guide section: 1.6 Multi-logit regression | Stan User’s Guide.
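To make the multi-logit idea concrete, here is a minimal sketch in plain Python (not Stan, and not the User's Guide code) of how category probabilities for a three-level outcome come out of a linear predictor passed through softmax. The coefficient matrix `beta` and the predictor vector `x` are made-up values for illustration only; in Stan the corresponding likelihood is `categorical_logit`.

```python
import math

def softmax(scores):
    """Turn unconstrained scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def category_probs(x, beta):
    """beta[k][d] is the coefficient of predictor d for outcome category k."""
    scores = [sum(b * xi for b, xi in zip(beta_k, x)) for beta_k in beta]
    return softmax(scores)

# Hypothetical coefficients: 3 outcome categories, 2 predictors.
# The first row is fixed at zero as a reference category.
beta = [[0.0, 0.0],
        [0.5, -1.0],
        [1.0, 0.25]]
x = [1.0, 2.0]  # hypothetical predictor values for one respondent

probs = category_probs(x, beta)
```

Here `probs[k]` plays the role of the probability that Y takes its (k+1)-th level for this respondent.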

In Stan, discrete missing data has to be marginalized out.

The question is whether you want to treat X1 to X6 as regular predictors or whether you want to use random effects (the latter is both more standard and more general). Either way, you will then need to marginalize out the missing data. See the User’s Guide section: 7 Latent Discrete Parameters | Stan User’s Guide
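As an illustration of what marginalizing out means here, the following plain-Python sketch (not Stan) sums a missing binary predictor out of the likelihood on the log scale: log p(y) = log sum over v of p(x = v) * p(y | x = v). All probabilities below are invented for the example, and the outcome is indexed from 0 rather than from 1 as in the data. Stan has a built-in `log_sum_exp` for the same computation.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginal_log_lik(y, log_prior_x, log_lik_given_x):
    """Log-likelihood of y with the missing discrete predictor x summed out.

    log_prior_x[v]        = log p(x = v)
    log_lik_given_x[v][y] = log p(y | x = v)
    """
    terms = [lp + log_lik_given_x[v][y] for v, lp in enumerate(log_prior_x)]
    return log_sum_exp(terms)

# Hypothetical numbers: x is binary with P(x = 1) = 0.3.
log_prior_x = [math.log(0.7), math.log(0.3)]
# Hypothetical outcome probabilities p(y | x) for the three levels of Y.
lik_given_x = [[0.2, 0.5, 0.3],   # x = 0
               [0.6, 0.3, 0.1]]   # x = 1
log_lik_given_x = [[math.log(p) for p in row] for row in lik_given_x]

y = 1  # zero-based index, i.e. the second level of Y
marginal = marginal_log_lik(y, log_prior_x, log_lik_given_x)
```

With these numbers the marginal probability is 0.7 * 0.5 + 0.3 * 0.3 = 0.44, so `marginal` equals log(0.44) up to floating-point error.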

The missing data is the tricky part here, given that the missingness can apparently come in as many as 2^6 - 1 = 63 patterns. I’m not 100% sure of the easiest way to code this up. If at most one of the values is missing in any row, then you can just write the cases out explicitly.
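The count of patterns can be checked with a short Python sketch, which also shows how many terms a marginalization would need for a given pattern. The number of levels per predictor is an assumption read off the 15 rows shown (three levels for X1, X3, X4, X5 and two for X2, X6); the full dataset may differ.

```python
from itertools import product

predictors = ["X1", "X2", "X3", "X4", "X5", "X6"]

# Every way of marking each predictor as missing (True) or observed (False),
# keeping only patterns with at least one missing value.
patterns = [p for p in product([False, True], repeat=len(predictors)) if any(p)]
n_patterns = len(patterns)  # 2**6 - 1

# Patterns with exactly one missing predictor: the case that can be
# written out explicitly.
single_missing = [p for p in patterns if sum(p) == 1]

# Assumed number of levels per predictor, based on the sample rows only.
levels = {"X1": 3, "X2": 2, "X3": 3, "X4": 3, "X5": 3, "X6": 2}

def n_terms(pattern):
    """Number of value combinations to sum over for one missingness pattern."""
    count = 1
    for name, missing in zip(predictors, pattern):
        if missing:
            count *= levels[name]
    return count

worst_case = n_terms((True,) * len(predictors))  # all six predictors missing
```

Under these assumed levels, a row with a single missing predictor needs a sum over only two or three values, while a row with everything missing would need 3 * 2 * 3 * 3 * 3 * 2 = 324 terms.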
