Imagine a fairly standard multinomial outcome, such as the contraceptive method used by women in a survey dataset. For simplicity, let’s say that there are four different methods, dubbed *a*, *b*, *c*, and *d*.

Now suppose that some of the researchers wrote semi-illegible entries on their survey forms. As a result, it is subsequently not possible to discern the exact letter written on the form. Sometimes it is not possible to distinguish between *a* and *d*, for instance. But in that particular case, it’s definitely possible to say that the response is not *b* or *c*.

To model such data, what methods are used by members of the Stan community?

Yeah, you basically have to marginalize over all of the ways the survey form could say Y = y, weighting each by the probability of the true category: Pr(Y = y) = Pr(y | a) Pr(a) + Pr(y | b) Pr(b) + Pr(y | c) Pr(c) + Pr(y | d) Pr(d). See, for example,

https://scholar.google.com/scholar?cluster=1703778821448663685&hl=en&as_sdt=0,33

but ignore all the pre-Stan stuff about how to draw from such posterior distributions.
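
As a minimal sketch of that marginalization (not taken from the linked paper), assume each illegible response is recorded as the set of categories it could be, and that every category in the set is equally likely to produce the unreadable entry while the others cannot produce it at all. Under that assumption the log-likelihood contribution is just the log of the summed category probabilities. The helper names `log_sum_exp` and `ambiguous_loglik` and the probabilities below are hypothetical, chosen only for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def ambiguous_loglik(log_theta, candidates):
    """Log-likelihood of one response known only to lie in `candidates`.

    log_theta: dict mapping category -> log probability of that category.
    candidates: the categories the illegible entry could be.
    """
    return log_sum_exp([log_theta[k] for k in candidates])

# Made-up category probabilities for methods a, b, c, d.
theta = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
log_theta = {k: math.log(p) for k, p in theta.items()}

# An entry that is either a or d, but definitely not b or c.
ll_ambiguous = ambiguous_loglik(log_theta, ["a", "d"])

# A legible entry is the special case of a single candidate.
ll_clear = ambiguous_loglik(log_theta, ["b"])
```

In a Stan program the same idea would presumably be written by adding a `log_sum_exp` of the candidate log probabilities to `target` for each ambiguous observation, with legible observations handled by the usual categorical likelihood.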


Thanks, Ben. I’ll take a look. Meanwhile, am I right in inferring that it’s possible in Stan to attach different probabilities to the candidate outcomes? For example, we might be 90% sure that it’s *a*, with the remaining 10% probability attached to *d*.

That should be fine, but you have to construct the probability vector accordingly. It may be better to just set the hyperpriors so that there is only a very small probability of the answer appearing to be in one category given that the truth was another category.
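
One way to read "construct the probability vector accordingly" is sketched below. It assumes the 90%/10% split is treated as the relative probability of seeing this particular scrawl under each true category, so the likelihood becomes a weighted sum over categories. That interpretation, the function name `weighted_loglik`, and all of the numbers are assumptions made for illustration, not something stated in the thread.

```python
import math

def weighted_loglik(theta, weights):
    """Return log(sum_k weights[k] * theta[k]), skipping zero-weight categories.

    theta: dict mapping category -> model probability of that category.
    weights: dict mapping category -> assumed Pr(observed entry | true category).
    """
    terms = [
        math.log(w) + math.log(theta[k])
        for k, w in weights.items()
        if w > 0
    ]
    m = max(terms)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Made-up category probabilities for methods a, b, c, d.
theta = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# The entry looks like a (weight 0.9) or possibly d (weight 0.1), never b or c.
weights = {"a": 0.9, "b": 0.0, "c": 0.0, "d": 0.1}

ll = weighted_loglik(theta, weights)
```

Setting every nonzero weight to the same value recovers the unweighted case, where the entry is simply known to be one of several categories.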
