Classification/diagnosis via multivariate normal mixtures for n = 1 cases

I’m working on a model for examining the probability that an observed score on a cognitive assessment comes from a particular normative or reference sample (i.e., whether a score is likely an observation from the population of scores expected from those with condition X). In application, the model would be used to examine whether a score is more typical of the population of scores for condition X vs. Y vs. Z, etc. Toward this end, I want to generate a Bayes factor and post-test probability for group membership, as these are relatively easy and straightforward for clinicians to understand. The model I’ve tested so far is below:

data {
  real y;                       // obtained score
  real<lower=0> mn;             // prior sample average
  real<lower=0> stdv;           // prior sample standard deviation
  real<lower=1> N;              // sample size
  real<lower=0, upper=1> rxx;   // prior reliability
  real<lower=1> n;              // reliability sample size
}

transformed data {
  real<lower=0> alpha;          // define alpha
  real<lower=0> beta;           // define beta
  alpha = n/2;                  // compute alpha
  beta = n/fabs(2*(1-rxx));     // compute beta
}

parameters {
  real<lower=0> mu;             // population average
  real<lower=0> sigma;          // population standard deviation
  real<lower=0> tau;            // reliability
  real theta;                   // true score
}

transformed parameters {
  real<lower=0, upper=1> r;     // define reliability estimate
  real<lower=0> SEM;            // define standard error of measurement
  r = (1-tau)/-tau;             // convert tau back to reliability
  SEM = sigma*sqrt(1-r);        // compute standard error of measurement
}

model {
  mu ~ normal(mn, stdv);                                                   // estimation of population average
  tau ~ inv_gamma(alpha, beta);                                            // estimation of reliability
  sigma ~ student_t(N-1, sqrt(((stdv^2)*N)/(N-1)), 0.71*(stdv/sqrt(N)));   // estimation of population standard deviation
  theta ~ normal(y, SEM);                                                  // estimation of true score
  target += normal_lpdf(theta | mu, sigma);                                // likelihood
}

As the code probably betrays, I’m coming from a purely frequentist educational background. I had a pretty good idea about how I would tackle this problem with bootstrapping, so I just adapted the code to work with Stan based on the hierarchical normal example. The model generally works OK, but I have some issues with it.

First, does this model make sense for what I’m trying to do? I’ve looked for resources on single-case Bayesian methods, but they are generally about a single case measured over a period of time. I have just a single observation from a single individual to compare to known sample descriptive statistics. I haven’t been able to find examples of this kind of analysis in Stan.

Second, how can I get the priors to be just a little less informative? I’ve tried some code adjustments based around sample size (thinking that larger samples should produce more informative priors), but I just end up with more divergent-chain and missing-information warnings.

Third, how can I reparameterize things to stay on unit scaling in the model block? While the current code does work most of the time, I’ve run into some issues with divergent chains (note: changing adapt_delta hasn’t fixed this), and I think that has to do with the scales of the parameters.

Finally, are there any suggestions for generalizing the model for a single case vs. reference across a set of assessments? In other words, if one were administered 6 tests, then how could we look at whether the scores obtained from these 6 tests are likely to have come from the same population of scores as those obtained from a reference sample (assuming the reference sample provides sufficient data to compute the covariance matrix)? I guess it’s essentially a kind of pattern analysis comparing a single case measured on multiple variables to a sample.

Thank you in advance for any and all help/resources/advice!

Looks like you are in for a steep learning curve :-) I wish you the best of luck. I am not an expert on this, but since no one else has replied, I will give it a try.

If I understand you correctly, your current approach is to run one fit of the model for each reference class you have and then somehow compare the fits? If this is the case, I would actually advise moving this to a single model and using the probability that the measurement comes from each of the classes as an explicit parameter in the model - I can elaborate a bit more on this if it looks like something you are interested in. You can also look up “mixture models” here or in the manual to get some idea of what this is about. This approach would IMHO also neatly scale to the multiple-measurement case (you have different mixture components for each assessment, but the mixture probabilities are shared across assessments).
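To give a rough idea of what I mean, here is a very small, untested sketch for a single score y and K reference classes whose means and SDs are passed in as data (all names here are made up, not from your model):

data {
  int<lower=2> K;                  // number of reference classes
  real y;                          // the single obtained score
  vector[K] ref_mn;                // reference sample means
  vector<lower=0>[K] ref_sd;       // reference sample SDs
}

parameters {
  simplex[K] theta;                // probability of membership in each class
}

model {
  vector[K] lps = log(theta);
  for (k in 1:K)
    lps[k] += normal_lpdf(y | ref_mn[k], ref_sd[k]);
  target += log_sum_exp(lps);      // marginalize over class membership
}

This treats the reference means and SDs as known; you could of course layer your measurement-error structure on top of it.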

Some notes about the model (which may or may not be related to your issues):

  • theta is simultaneously given two different distributions - this is probably not correct. I am not sure I understand the reasoning behind this, so I am not sure how to make it right.
  • It is always a good idea to have a simulator that creates fake data in exactly the way the model assumes; then you can test whether you can recover the simulated unobserved parameters. If you cannot write a simulator, it is a strong hint that there is something fishy in the model (such as the two likelihood terms for theta in the current model).
  • You can diagnose divergences using visualisations such as mcmc_parcoord and pairs. The shinystan package creates a bunch of those easily for you. Some starting points might be (shameless plug, partly written by me): https://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html and https://www.martinmodrak.cz/2018/02/19/taming-divergences-in-stan-models/
  • Since mu is constrained to be positive (and I presume this means the test scores are constrained to be positive), are you sure normal is a good distribution? Maybe a log-normal or gamma would model the measurement error better.
  • For location-scale families it often makes sense to parametrize as
parameters {
 real mu_raw;
} 

transformed parameters {
 real mu = mu_raw * stdv + mn;
}

model {
 mu_raw ~ normal(0,1); 
}

This keeps the parameters Stan works with on the unit scale, which tends to be a good thing.

Hope at least some of this helps.

After quite a while, I’ve returned to this particular project. I appreciate your recommendations on scaling and mixture models, @martinmodrak; I spent some time trying to make those work before I had to switch to some other projects. The most recent version of the model is shown below:

data {
  int<lower=1> pop;                                   // number of populations
  int<lower=0> dim;                                   // number of dimensions
  vector[dim] raw_x;                                  // obtained scores
  row_vector[dim] ref_mn[pop];                        // sample means
  row_vector<lower=0>[dim] ref_stdv[pop];             // sample SDs
  matrix[dim, dim] cor_mat[pop];                      // intercorrelation matrix
  real<lower=1> ref_N[pop];                           // input sample size
  real<lower=0> eta[pop];                             // LKJ prior eta
}

transformed data{
  real<lower=1> nu[pop];                              // degrees of freedom
  cholesky_factor_corr[dim] L[pop];                   // Cholesky factor of the correlation matrix
  
  for(i in 1:pop)
    nu[i] = ref_N[i] - 1;                             // compute degrees of freedom
  
  for(i in 1:pop)
    L[i] = cholesky_decompose(cor_mat[i]);            // define the correlation matrix from input
}

parameters {
  row_vector[dim] mu[pop];                            // population mean
  row_vector<lower=0>[dim] sigma2[pop];               // population variance
  simplex[pop] theta;                                 // model mixing value
}

transformed parameters {
  row_vector[dim] sigma[pop];                         // population standard deviation
  row_vector[dim] SEmn[pop];                          // standard error of mean
  row_vector[dim] means[pop];                         // compute scaled means
  row_vector[dim] Zscore[pop];                        // scaled score
  
  for(p in 1:pop)
    for(j in 1:dim)
      sigma[p, j] = sqrt(((ref_N[p]-1)*(ref_stdv[p, j]^2))/sigma2[p, j]);     // compute standard deviation
  
  for(p in 1:pop)
    for(j in 1:dim)
      SEmn[p, j] = sigma[p, j]/sqrt(ref_N[p]);        // compute standard error of the mean
  
  for(p in 1:pop)  
    for(j in 1:dim)
      means[p, j] = (mu[p, j]*SEmn[p, j]) + ref_mn[p, j];     // rescale means vector
  
  for(p in 1:pop)
    for(j in 1:dim)
      Zscore[p, j] = (raw_x[j] - means[p, j]) / sigma[p, j];    // compute scaled scores
}

model {
  vector[pop] log_theta = log(theta);                 // convert to a log scale
  
  for(p in 1:pop)
    target += chi_square_lpdf(sigma2[p, ] | nu[p]);   // chi-square prior on sigma2, used to derive the population SD
  for(p in 1:pop)
    target += normal_lpdf(mu[p, ] | 0, 1);            // estimated population means
  for(p in 1:pop)
    target += lkj_corr_cholesky_lpdf(L[p] | eta[p]);  // correlation matrix
  
  for(p in 1:pop)
    log_theta[p] += multi_normal_cholesky_lpdf(Zscore[p, ] | to_vector(rep_array(0.0, dim)), L[p]);   // likelihood
  
  target += log_sum_exp(log_theta);
}

generated quantities {
  vector[dim] rScore[pop];
  
  for(o in 1:pop)
    rScore[o] = multi_normal_rng(means[o, ], quad_form_diag(multiply_lower_tri_self_transpose(L[o]), sigma[o, ]));    // simulate raw scores
}

The model now works very reliably and efficiently, taking almost no time to run. My simulation data suggest that the model is also doing what I expect it to do. I’m still working with single-case data (i.e., we obtain scores on specific tests for just one individual, whom we treat as a random observation from a mixture of possible populations whose parameters we informatively guess from sample summary statistics). The “diagnosis” or classification of the individual into a particular population comes from whichever population corresponds to the greatest theta parameter.

There are a couple of questions I was hoping to get clarity on for extending the model. First, I’m not sure that I’m interpreting the mixing ratio correctly. I want the result to be essentially a post-test probability of group membership, but I don’t know that the theta parameter that gets estimated is really that.

I’d also like to extend the model to incorporate prior beliefs about group membership (i.e., a pre-test probability). I had tried to do this naively with something like the following:

data {
  ...
  simplex[pop] lambda;     // pre-test probabilities
}

...

model {
  ...
  for(p in 1:pop)
    log_theta[p] += log(lambda[p]) + multi_normal_cholesky_lpdf(Zscore[p, ] | to_vector(rep_array(0.0, dim)), L[p]);   // likelihood
}

This approach “worked,” but it caused the mixing ratios to just come out to essentially whatever the pre-test probabilities were. In retrospect, that makes total sense because I’m using just a single observation that won’t overpower that kind of prior information. Is there perhaps an alternative way of specifying a pre-test probability in the model or at least a post-hoc way of adjusting the theta estimate to account for different pre-test probabilities?

You are right - if I understand the model correctly, posterior theta is not doing that. It is the posterior distribution of the probability that a new case would come from each of the groups you have. Since all of your data come from a single individual (a single draw from the mixture), you correctly note that you are in some sense adding at most a “single observation” worth of data to inform theta beyond your prior.

I think that part of my advice from before was actually a bit misguided - since you are marginalizing out just a single discrete parameter (the group membership), and the hypothetical “population of populations” (which theta would describe) is not an entity you care about, you don’t even need theta as a parameter in your model. You can just marginalize out the group membership (as described in 7 Latent Discrete Parameters | Stan User’s Guide, and maybe an even better intro is given in https://arxiv.org/abs/2010.09335).

The main point would be that theta is removed; you’ll initialize log_theta as

 vector[pop] log_theta = your_prior_for_membership;

and it will usually be useful to have log_theta computed in transformed parameters, because then you can easily extract post-test probabilities of class membership in generated quantities as

post_probs = softmax(log_theta);

The math behind marginalization can look difficult, but it is in the end conceptually pretty simple, so I encourage you to take the time to make sure you understand it (I admit I spent a long time not really understanding it until late last year).
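To be concrete, here is a rough, untested sketch of how the relevant blocks of your model above might change (lambda here would be the pre-test probability simplex you already tried passing in as data):

transformed parameters {
  // ... sigma, SEmn, means, Zscore computed as before ...
  vector[pop] log_theta;
  for (p in 1:pop)
    log_theta[p] = log(lambda[p])
      + multi_normal_cholesky_lpdf(Zscore[p, ] | to_vector(rep_array(0.0, dim)), L[p]);
}

model {
  // ... priors on sigma2, mu, and L as before ...
  target += log_sum_exp(log_theta);              // marginalize out group membership
}

generated quantities {
  vector[pop] post_probs = softmax(log_theta);   // post-test probabilities of membership
}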

Best of luck with your model!

Thank you for the advice! I’ll certainly read up on the marginalization problem, and my hope is that the preprint you linked is more helpful. I’ve read through the Stan manual’s sections on latent discrete parameters and cluster analysis several times over the years, but I don’t understand them well enough to proactively generalize them to applications other than those presented in the manual. I’ve found increasingly that some of the problems I’m trying to address with Stan are, or require, some form of marginalization.

I’m currently thinking about the next iteration of this model and some tweaks that I can make to generalize the use case in clinical settings. One of the issues that I immediately noted and wanted to correct is that the model only works optimally when all testing data can be entered at once; however, this is not really a reasonable expectation because clinicians and researchers often use different batteries of tests.

What is more reasonable as a use case, however, would be chaining chunks of the battery together. For example, a clinician might find reference samples that all use the same 3-4 tests and then some other reference samples that use a few different tests, etc. In this way, the data for the testing battery that the clinician used might be found across multiple different studies, requiring multiple runs of the model. Updating post-test probabilities is not hard, since the post-test probability from one set of tests across reference samples is just the new pre-test probability for the next set of tests in the next set of reference samples; however, as currently written, the model would lose information about the uncertainty of the post-test probability because the input data would be just point estimates of the pre-test probability.
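Just to spell out the updating step I have in mind: if the chunks of tests were conditionally independent given group membership, then for each group $k$,

$$P(\text{group } k \mid \text{chunk}_1, \text{chunk}_2) \propto P(\text{chunk}_2 \mid \text{group } k)\, P(\text{group } k \mid \text{chunk}_1),$$

so the post-test probability from the first chunk simply plays the role of the pre-test probability for the second. The catches, as noted above and discussed further below, are that the conditional independence assumption ignores the intercorrelations between chunks and that passing along only a point estimate discards the uncertainty.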

I’ve seen several posts on the forum raising this question of passing posterior distributions to priors, notably this one: Composing Stan models (posterior as next prior). The problem generally is some information loss, and to be perfectly fair, I’m OK with some information loss, since at least some information about uncertainty is better than none.

My first thought would be to add the variance of the pre-test probability to the data and write a transformed data block to convert those means and variances into beta distributions that can be used as priors. My concern, however, is that there’s not really anything I can do to carry forward the multivariate normal likelihood from which the post-test probabilities are derived. As currently written, the model assumes that the intercorrelation of all the tests is known, but if intercorrelations between only certain tests are known, then there’s no way to tell the model how performance on one chunk of tests should be updated based on performance from a prior chunk of tests.
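To make that first thought concrete, the conversion I have in mind is a simple method-of-moments match (untested sketch; pre_mn and pre_var are hypothetical new data inputs giving the mean and variance of each pre-test probability):

data {
  // ... existing inputs ...
  vector<lower=0, upper=1>[pop] pre_mn;   // pre-test probability means
  vector<lower=0>[pop] pre_var;           // pre-test probability variances
}

transformed data {
  vector[pop] a;                          // beta shape 1
  vector[pop] b;                          // beta shape 2
  for (p in 1:pop) {
    real k = pre_mn[p] * (1 - pre_mn[p]) / pre_var[p] - 1;   // method-of-moments scale
    a[p] = pre_mn[p] * k;                 // alpha = mean * k
    b[p] = (1 - pre_mn[p]) * k;           // beta = (1 - mean) * k
  }
}

(A Dirichlet over the whole simplex might be a more natural fit than independent betas, but the moment-matching idea would be the same.)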

I’m left weighing two different options. The first is to use a beta prior to approximate the posterior post-test probability estimate and just continually update post-test beliefs using that approximation each time, letting go of any desire to infer the full intercorrelation of tests across batteries and accepting the loss of information. This is fine with me because the generated post-test probability is really the desired goal here and already takes the multivariate normal likelihood of the test data into account.

The second option is to try to coerce the model into running all the data at once so that no posterior-as-prior approximations are needed. I would default to this option, specifying a prior over the full intercorrelation matrix and treating some of that information as missing, if this weren’t being written for the case of n = 1. The reasons this would be a desirable option are that (a) it’s just easier and less error-prone to have one model run versus several, and (b) it would allow full posterior inference to explore the relationships between the tests in the battery rather than focusing just on post-test probabilities as the endpoint. I’m curious whether there are other options that I might be neglecting, or other ways of thinking about this approach that could open up some alternatives.

I think that the “one big model” approach is definitely better whenever it is practical. There are some approaches to summarising a posterior while maintaining some information about correlations (e.g. fitting a multivariate normal distribution to suitably transformed posterior samples), but they are all approximate and it is kind of hard to verify that they do what you think they do.
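For what it’s worth, the Stan side of that multivariate-normal summary idea could look roughly like this (hypothetical sketch only; the mean and covariance would be fit outside Stan to, e.g., log-ratio-transformed draws of the previous fit’s post-test probabilities):

data {
  int<lower=2> pop;
  vector[pop - 1] prior_mu;            // MVN mean of log-ratio-transformed probabilities
  cov_matrix[pop - 1] prior_Sigma;     // MVN covariance of the same
}

parameters {
  vector[pop - 1] alr_lambda;          // pre-test probabilities on the additive log-ratio scale
}

transformed parameters {
  vector[pop] lambda = softmax(append_row(alr_lambda, 0));   // back-transform to the simplex
}

model {
  alr_lambda ~ multi_normal(prior_mu, prior_Sigma);   // approximate carried-over posterior
  // ... then use log(lambda) in place of the fixed pre-test probabilities ...
}

But as said above, this is only an approximation of the previous posterior, so it would need checking against the “one big model” fit wherever that is feasible.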

Best of luck with the project!