Classification/diagnosis via multivariate normal mixtures for n = 1 cases

I’m currently thinking about the next iteration of this model and some tweaks that would generalize its use-case in clinical settings. One issue I immediately noted and want to correct is that the model only works optimally when all testing data can be entered at once. That is not a realistic expectation, though, because clinicians and researchers often use different batteries of tests.

A more reasonable use-case would be chaining chunks of the battery together. For example, a clinician might find reference samples that all use the same 3-4 tests, then other reference samples that use a few different tests, and so on. The data for the battery the clinician actually administered might thus be spread across multiple studies, requiring multiple runs of the model. Updating the post-test probabilities themselves is not hard, since the post-test probability from one set of tests is just the pre-test probability for the next set of tests in the next set of reference samples (see the update formula below). As currently written, though, the model would lose the uncertainty in the post-test probability, because the input is just a point estimate of the pre-test probability.
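To spell out the chaining (assuming two classes, with $f_1$ and $f_0$ the case and control multivariate normal densities for chunk $k$'s scores $\mathbf{x}_k$):

$$
p_k = \frac{p_{k-1}\, f_1(\mathbf{x}_k)}{p_{k-1}\, f_1(\mathbf{x}_k) + (1 - p_{k-1})\, f_0(\mathbf{x}_k)},
$$

so the post-test probability $p_k$ from one chunk becomes the pre-test probability $p_{k-1}$ for the next. The catch is that each run produces a full posterior for $p_k$, while the next run only accepts a point value.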

I’ve seen several posts on the forum raising this question of passing posteriors forward as priors, notably this one: Composing Stan models (posterior as next prior). The general problem is information loss, and to be perfectly fair, I’m OK with some information loss: at least some information about the uncertainty is better than none.

My first thought is to add the variance of the pre-test probability to the data and write a transformed data block that converts those means and variances into beta distributions to use as priors (sketched below). My concern, however, is that there is no obvious way to retain the multivariate normal likelihood from which the post-test probabilities were derived. As currently written, the model assumes that the intercorrelation of all the tests is known, but if intercorrelations are only known within certain chunks of tests, then there is no way to tell the model how performance on one chunk should be updated based on performance from a prior chunk.
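The moment-matching piece itself is straightforward. A minimal sketch of what I have in mind (variable names are placeholders, and I'm assuming the pre-test probability enters the model as the mixing weight):

```stan
data {
  real<lower=0, upper=1> pretest_mean;  // posterior mean of p from the previous run
  real<lower=0> pretest_var;            // posterior variance of p from the previous run
  // ... test scores and reference-sample summaries for the current chunk
}
transformed data {
  // Moment-match Beta(a, b) to the supplied mean and variance;
  // this requires pretest_var < pretest_mean * (1 - pretest_mean).
  real kappa = pretest_mean * (1 - pretest_mean) / pretest_var - 1;
  real a = pretest_mean * kappa;
  real b = (1 - pretest_mean) * kappa;
}
parameters {
  real<lower=0, upper=1> p;  // pre-test probability, now carrying uncertainty
  // ... mixture component parameters for the current chunk
}
model {
  p ~ beta(a, b);
  // ... multivariate normal mixture likelihood with p as the mixing proportion
}
```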

I’m left weighing two different options. The first is to use the moment-matched beta prior to approximate the posterior post-test probability and just continually update post-test beliefs with that approximation at each step, letting go of any attempt to infer the full intercorrelation of tests across batteries and accepting the loss of information. This is fine by me because the post-test probability is really the desired end product here, and it already takes the multivariate normal likelihood of the test data into account.
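Under this option, the rest of the loop would be to compute the post-test probability in generated quantities and feed its posterior mean and variance into the next run's transformed data block above. A sketch, assuming two classes with component parameters mu_case and mu_control and a shared covariance Sigma defined elsewhere in the model:

```stan
model {
  p ~ beta(a, b);
  // single patient's scores x on this chunk of tests
  target += log_mix(p,
                    multi_normal_lpdf(x | mu_case, Sigma),
                    multi_normal_lpdf(x | mu_control, Sigma));
}
generated quantities {
  // post-test probability for this chunk; its posterior mean and variance
  // become pretest_mean and pretest_var for the next chunk's run
  real lp1 = log(p) + multi_normal_lpdf(x | mu_case, Sigma);
  real lp0 = log1m(p) + multi_normal_lpdf(x | mu_control, Sigma);
  real posttest_prob = exp(lp1 - log_sum_exp(lp1, lp0));
}
```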

The second option is to coerce the model into running all the data at once so that no posterior-as-prior approximation is needed: specify a prior over the full intercorrelation matrix and treat the unobserved pieces as missing data (a skeleton is sketched below). I would default to this option if it weren’t being written for the n = 1 case. It would be desirable because (a) one model run is easier and less error-prone than several, and (b) it would allow full posterior inference over the relationships among the tests in the battery rather than treating the post-test probability as the only endpoint. I’m curious whether there are other options I might be neglecting, or other ways of thinking about this approach that could open up some alternatives?
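For reference, the skeleton I'd have in mind for the second option is an LKJ prior over the full correlation matrix, with the patient's unadministered tests declared as parameters. This is only a sketch, and it surfaces exactly the n = 1 problem: with a single patient, nothing below informs Omega beyond its prior unless the reference-sample data also enter the model.

```stan
data {
  int<lower=1> K;                                  // total tests across all batteries
  int<lower=0, upper=K> K_obs;                     // tests actually administered
  array[K_obs] int<lower=1, upper=K> obs_idx;      // indices of observed tests
  array[K - K_obs] int<lower=1, upper=K> mis_idx;  // indices of missing tests
  vector[K_obs] x_obs;                             // patient's observed scores
  vector[K] mu_case;                               // component means from reference samples
  vector[K] mu_control;
  vector<lower=0>[K] sigma;                        // test SDs, taken as known here
  real<lower=0, upper=1> p;                        // pre-test probability
}
parameters {
  corr_matrix[K] Omega;                            // full intercorrelation matrix
  vector[K - K_obs] x_mis;                         // unobserved scores as parameters
}
transformed parameters {
  vector[K] x;                                     // complete score vector
  x[obs_idx] = x_obs;
  x[mis_idx] = x_mis;
}
model {
  Omega ~ lkj_corr(2);                             // weakly informative prior
  matrix[K, K] Sigma = quad_form_diag(Omega, sigma);
  target += log_mix(p,
                    multi_normal_lpdf(x | mu_case, Sigma),
                    multi_normal_lpdf(x | mu_control, Sigma));
}
```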
