Hi. How can I address the following missing data problem? I have a regression model (a meta-regression) that looks like this:
Y and \Sigma are known. The estimation target is \beta.
Each row of X is a categorical distribution, and the categories correspond to the number of times that an event occurred (i.e., each row contains normalized event frequencies, so the categories have labels 0, 1, 2, …, k, where k is less than 10). Because the categories are ordered counts, each categorical distribution has a well-defined median, which matters for the scenarios below.
Here are the possible scenarios:
- Data for a given row are complete. This case is trivial.
- All data for a given row are missing and we do not know anything further. In this case I would put a “flat” Dirichlet prior on the distribution.
- Data for a given row are completely missing, but we do know the median of the distribution. In this case I can construct a Dirichlet that gives categorical distributions that, with high probability, have the known median but are relatively “flat” otherwise. Do you have a better solution though?
- Data for a given row are incomplete (i.e., partially known) and the median is unknown. I’m thinking of using a “flat” Dirichlet for the unknown elements and then scaling the sampled values by the probability mass not accounted for by the known elements, so that the row is a valid distribution (i.e., sums to unity). Do you have a better solution?
- Data for a given row are incomplete but we do know the median. I don’t have any idea how to address this case and, because it’s quite rare in my data, I think it would be a reasonable approximation to treat it like the previous scenario (incomplete data, median unknown).
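To make the scenarios above concrete, here is one possible sketch in Python/NumPy. It rests on several assumptions that are not part of the problem statement: missing entries are encoded as NaN, the median is taken to be the smallest category whose cumulative probability reaches 0.5, and a known median is enforced by simple rejection sampling rather than by a specially tuned Dirichlet. The function names (`categorical_median`, `fill_missing`, `fill_missing_with_median`) are hypothetical.

```python
import numpy as np

def categorical_median(p):
    """Smallest category m whose cumulative probability is at least 0.5."""
    return int(np.searchsorted(np.cumsum(p), 0.5, side="left"))

def fill_missing(row, rng):
    """Fill NaN entries with a flat-Dirichlet draw scaled to the leftover mass.

    Covers a fully missing row (leftover mass is 1) and a partially
    missing row (leftover mass is 1 minus the sum of the known entries).
    """
    row = np.asarray(row, dtype=float).copy()
    missing = np.isnan(row)
    if not missing.any():
        return row  # complete row: nothing to do
    leftover = 1.0 - row[~missing].sum()
    if leftover < 0:
        raise ValueError("known entries sum to more than 1")
    # Flat Dirichlet whose dimension equals the number of missing entries.
    draw = rng.dirichlet(np.ones(int(missing.sum())))
    row[missing] = leftover * draw
    return row

def fill_missing_with_median(row, median, rng, max_tries=10_000):
    """Rejection sampling: redraw until the filled row has the known median."""
    for _ in range(max_tries):
        candidate = fill_missing(row, rng)
        if categorical_median(candidate) == median:
            return candidate
    raise RuntimeError("no draw matched the requested median")

rng = np.random.default_rng(0)
partial = fill_missing([0.2, np.nan, 0.1, np.nan], rng)            # partially known row
constrained = fill_missing_with_median([np.nan] * 4, 1, rng)        # fully missing, median known
```

Rejection sampling conditions on the median exactly, at the cost of wasted draws when the known entries make the target median unlikely; if the known entries rule the median out entirely, the function raises instead of looping forever.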
Modeling and coding suggestions are very welcome. In particular, given the “raggedness” that arises from the missing data pattern (i.e., it may be necessary to use a bunch of Dirichlets of differing dimension), coding suggestions that address this are also appreciated.
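On the raggedness point, one option that avoids building Dirichlets of differing dimension is to use the fact that normalizing independent Gamma(1) draws gives a flat Dirichlet. Drawing Gamma variates for every cell and zeroing out the observed cells then yields, row by row, a flat Dirichlet of exactly the right dimension. The sketch below assumes X is stored as a NumPy array with NaN marking missing cells; `impute_rows` is a hypothetical name, and the sketch ignores any median information.

```python
import numpy as np

def impute_rows(X, rng):
    """Fill NaN cells of every row with a scaled flat-Dirichlet draw.

    Each row may have a different number of missing cells; the mask
    takes care of the differing dimensions without an explicit loop.
    """
    X = np.array(X, dtype=float)          # copy so the input is left untouched
    mask = np.isnan(X)
    # Independent Gamma(1) draws, kept only where data are missing.
    g = rng.gamma(shape=1.0, size=X.shape) * mask
    totals = g.sum(axis=1, keepdims=True)
    # Probability mass not accounted for by the known cells of each row.
    leftover = 1.0 - np.nansum(X, axis=1, keepdims=True)
    # Scale factor per row; rows with nothing missing get a factor of 0.
    scale = np.divide(leftover, totals, out=np.zeros_like(totals), where=totals > 0)
    X[mask] = (g * scale)[mask]
    return X

rng = np.random.default_rng(1)
X_obs = np.array([
    [0.5, 0.5, 0.0],           # complete row
    [np.nan, np.nan, np.nan],  # fully missing row
    [0.2, np.nan, np.nan],     # partially missing row
])
X_filled = impute_rows(X_obs, rng)
```

Calling `impute_rows` repeatedly gives independent completed copies of X, which could then be passed to whatever estimator is used for \beta.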
Thanks in advance.
Chris