Missing data problem: Missing probabilities in categorical distributions

Hi. How can I address the following missing data problem? I have a regression model (a meta-regression) that looks like this:

Y \sim N(X \beta, \Sigma)

Y and \Sigma are known. The estimation target is \beta.

Each row of X is a categorical distribution, and the categories correspond to the number of times that an event occurred (i.e., each row contains normalized event frequencies, so the categories have labels 0, 1, 2, …, k, where k is less than 10). Because the categories are ordered counts, each categorical distribution has a well-defined median, which is relevant below.
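
Written out per trial, the model can be read as follows. This is only a sketch: it assumes, for illustration, that \Sigma is diagonal (the model above allows a general \Sigma), and it indexes the coefficients by event count j = 0, …, k.

Y_i \sim N\left( \sum_{j=0}^{k} X_{i,j} \beta_j, \; \Sigma_{i,i} \right), \qquad X_{i,j} \ge 0, \qquad \sum_{j=0}^{k} X_{i,j} = 1

Under this reading, the mean of Y_i is an X-weighted average of the \beta_j, so any unknown entries of row i enter the likelihood only through that weighted sum.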

Here are the possible scenarios:

  1. Data for a given row are complete. This case is trivial.
  2. All data for a given row are missing and we do not know anything further. In this case I would put a “flat” Dirichlet prior on the distribution.
  3. Data for a given row are completely missing, but we do know the median of the distribution. In this case I can construct a Dirichlet that gives categorical distributions that, with high probability, have the known median but are relatively “flat” otherwise. Do you have a better solution though?
  4. Data for a given row are incomplete (i.e., partially known) but we don’t even know the median. I’m thinking of using a “flat” Dirichlet for the unknown elements and then scaling the sampled values to ensure that the row is a valid distribution (i.e., sums to unity). Do you have a better solution?
  5. Data for a given row are incomplete but we do know the median. I don’t have any idea how to address this issue and, because it’s quite rare in my data, I think it would be a reasonable approximation to treat it as if it’s scenario 4.
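
One simple way to explore scenario 3 outside of a fitted model is to draw from the flat Dirichlet and keep only draws with the known median. This is a rejection-sampling sketch, not the "Dirichlet that favours the median" construction described above. The function names are hypothetical, and the median convention (smallest category whose cumulative probability reaches 0.5) is an assumption.

```python
import random


def sample_flat_dirichlet(dim, rng):
    """Draw from Dirichlet(1, ..., 1) by normalizing Exponential(1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical_median(probs):
    """Smallest category index whose cumulative probability reaches 0.5.

    Assumed convention; a reported sample median may use a different rule.
    """
    cumulative = 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if cumulative >= 0.5 - 1e-12:
            return k
    raise ValueError("probabilities sum to less than 0.5")


def sample_with_median(dim, target_median, rng, max_tries=100_000):
    """Rejection sampler: flat Dirichlet draws conditioned on the median."""
    for _ in range(max_tries):
        probs = sample_flat_dirichlet(dim, rng)
        if categorical_median(probs) == target_median:
            return probs
    raise RuntimeError("no draw matched the target median")


rng = random.Random(0)
# Hypothetical trial: categories 0..4, reported median of 2 events.
row = sample_with_median(dim=5, target_median=2, rng=rng)
```

The accepted draws are uniform over the part of the simplex that has the required median, which is one concrete meaning of "flat otherwise". Rejection is convenient for prior simulation, but it is not something that could be dropped directly into a Stan program.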

Modeling and coding suggestions are very welcome. In particular, given the “raggedness” that arises from the missing data pattern (i.e., it may be necessary to use a bunch of Dirichlets of differing dimension), coding suggestions that address this are also appreciated.

Thanks in advance.



“Each row of X is a categorical distribution … each row contains normalized event frequencies”

I take it this means each row is a simplex, or in other words, X is a stochastic matrix.

A \textrm{Dirichlet}([1 \ 1 \cdots 1]) is uniform over simplexes. Is that what you mean by “flat”?

I’m not sure what you mean in (3) by “median of the distribution”. Medians are intrinsically 1D, so I’m not sure what that means for a multivariate quantity. Is it something like a marginal median of one component? Over what distribution?

What do you mean by flat Dirichlet for the unknown elements? The Dirichlet is a distribution over simplexes. If you know some components of a simplex, you can let the unknown ones be proportional to a simplex, scaled by the probability mass left after subtracting the sum of the known values from 1.
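
The scaling idea can be sketched as follows, as a prior simulation in plain Python rather than a Stan program. The helper names and the use of `None` to mark unknown entries are assumptions made for illustration. Because the helper simplex has whatever dimension the number of missing entries requires, the same function handles rows with different missingness patterns.

```python
import random


def sample_flat_dirichlet(dim, rng):
    """Draw from Dirichlet(1, ..., 1) by normalizing Exponential(1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def fill_missing(row, rng):
    """Fill None entries so the row sums to 1, keeping known entries fixed."""
    missing = [i for i, p in enumerate(row) if p is None]
    if not missing:
        return list(row)
    # Probability mass left over after the known entries are accounted for.
    remaining = 1.0 - sum(p for p in row if p is not None)
    if remaining < -1e-12:
        raise ValueError("known entries already sum to more than 1")
    remaining = max(remaining, 0.0)
    # Helper simplex whose dimension matches the number of unknown entries.
    weights = sample_flat_dirichlet(len(missing), rng)
    filled = list(row)
    for i, w in zip(missing, weights):
        filled[i] = remaining * w
    return filled


rng = random.Random(1)
# Hypothetical partially reported row: two known entries, three unknown.
partial_row = [0.2, 0.2, None, None, None]
completed = fill_missing(partial_row, rng)
```

A fully missing row is the special case in which every entry is `None`, so the remaining mass is 1 and the row is simply a flat Dirichlet draw.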

The over-arching point is that you want the same model for observed and missing data, then you can infer the missing data along with other parameters.

Thanks for taking the time to look at this.

Yes, each row of X is a simplex. Yes, when I said “flat”, I meant \mathrm{Dirichlet}([1\;1 \cdots 1]).

Regarding (3) “median of the distribution”: This is a meta-analysis using trial-level (aggregate) data. The distributions — rows of X — are over number of events (measured at baseline, prior to a treatment). So, whoever did the analysis of trial i (i.e., the source of Y_i, X_{i,\cdot}, and \Sigma_{i,i}) had access to patient-level data and would have been able to compute the (sample) distribution over number of events, the sample median number of events, etc. This is what I meant in my original post when I wrote “it makes sense to talk about the categorical distributions having medians”.

Number of events is believed to be an effect modifier. I.e., Y is believed to be different for patients with zero, one, …, k events, but we don’t have much prior information about how (I’ll come back to this point below). We only have trial-level data. Some trials provide the full distribution (e.g., we know that 20% of patients in the trial had zero events, 20% had one event, and 60% had two events). Some trials provide an incomplete distribution (e.g., they might say that 20% had zero events and 20% had one event, but for unknown reasons do not provide information about the remaining 60% of patients). Some trials provide the median but no further information about the distribution. Some trials provide no information about the distribution.
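
To make "median of a row" concrete, here is a minimal sketch that computes it for the fully reported example above. The function name is hypothetical, and the convention used (smallest event count whose cumulative proportion reaches 0.5) is an assumption; a trial report may define the sample median slightly differently, for instance by averaging the two middle values.

```python
def median_event_count(proportions):
    """Smallest event count whose cumulative proportion reaches 0.5."""
    cumulative = 0.0
    for count, p in enumerate(proportions):
        cumulative += p
        if cumulative >= 0.5 - 1e-12:
            return count
    raise ValueError("proportions sum to less than 0.5")


# 20% with zero events, 20% with one event, 60% with two events.
full_row = [0.2, 0.2, 0.6]
median = median_event_count(full_row)
```

For this row the cumulative proportions are 0.2, 0.4 and 1.0, so the median number of events is 2.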

Coming back to the relationship between number of events and outcome: We have input from a domain expert that it is “obvious” that Y decreases with increasing number of events. However, I think this is an informed assumption, which may be perfectly reasonable, but does not seem to be based on evidence. It looks like it might be possible, in principle, to use the information available to estimate this relationship, but doing so requires addressing the issues in my question. If it is not possible to address these issues — e.g., if X lacks enough known values to fit the model — then my fallback would be to try to adopt the assumption that Y is linear in median number of events. However, this would then require imputing the unknown medians.

Regarding your “over-arching point”: Yes, agreed, this is what I’m trying to do. I’ve implemented models in Stan that use the principle you mentioned. I’m mainly posting to elicit ideas and issues I may have overlooked. Thanks!