A mixture model with skewed mixing distributions


#1

Hi,

I have developed the following mixture model. My application is to infer the class distribution from the thresholds created by some upstream binary classification models. The problem I am having is that the model seems to work for populations where the mixing proportions are around 20%-80%, but not as well when they are 2%-98% (i.e., skewed mixtures).

I am looking for some direction on modeling with Stan when the mixing distribution is skewed. Thanks.

Here is my model code:
data {
  int<lower=0> J;              // number of cases
  real scores[J];              // score of each transaction
  real mu_s[2];                // prior locations for the component means
  real<lower=0> sigma_s[2];    // rate parameters for the gamma priors on the scales
  vector<lower=0>[2] alpha;    // Dirichlet prior on the mixing proportions
}
parameters {
  simplex[2] theta;                // mixing proportions
  real conj_mu_s[2];               // component means
  real<lower=0> conj_sigma_s[2];   // component scales, must be positive
}
model {
  conj_sigma_s[1] ~ gamma(.5, sigma_s[1]);
  conj_sigma_s[2] ~ gamma(.5, sigma_s[2]);

  conj_mu_s[1] ~ normal(mu_s[1], conj_sigma_s[1]);
  conj_mu_s[2] ~ normal(mu_s[2], conj_sigma_s[2]);

  theta ~ dirichlet(alpha);

  for (n in 1:J) {
    real gamma[2];
    for (k in 1:2) {
      gamma[k] = log(theta[k]) + normal_lpdf(scores[n] | conj_mu_s[k], conj_sigma_s[k]);
    }
    target += log_sum_exp(gamma);  // likelihood, marginalized over the component label
  }
}

Here is a figure where I think it is not working (overestimating). Gray is the Stan output, red is the truth, and blue is an alternative method.


#2

You want to do the fit with uncertainty and see if the intervals have the right coverage.

Is the data you simulated consistent with the priors? What are you passing in as data for alpha?
(I’m assuming you simulated, given that you plotted something you called the truth.)

You can check the ability of the model to fit by simulating from the prior and checking the coverage of posterior intervals.
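As a rough illustration of that check, here is a minimal Python sketch rather than anything taken from the model above. It assumes the component means and scales are known and fixed, so the posterior over a single mixing proportion `lam` can be computed on a grid instead of by running Stan. All function names, the Beta(2, 30) prior, and the component values are hypothetical choices for the example. With a real Stan fit you would replace `grid_posterior` with the posterior draws for the mixing proportion.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def simulate_scores(lam, mus, sigmas, n, rng):
    """Draw n scores; each comes from component 0 with probability lam."""
    scores = []
    for _ in range(n):
        k = 0 if rng.random() < lam else 1
        scores.append(rng.gauss(mus[k], sigmas[k]))
    return scores

def grid_posterior(scores, mus, sigmas, a, b, grid):
    """Normalized posterior weights for lam on a grid under a Beta(a, b) prior."""
    dens = [(math.exp(normal_logpdf(x, mus[0], sigmas[0])),
             math.exp(normal_logpdf(x, mus[1], sigmas[1]))) for x in scores]
    logp = []
    for lam in grid:
        lp = (a - 1.0) * math.log(lam) + (b - 1.0) * math.log1p(-lam)
        lp += sum(math.log(lam * d0 + (1.0 - lam) * d1) for d0, d1 in dens)
        logp.append(lp)
    m = max(logp)
    w = [math.exp(v - m) for v in logp]
    total = sum(w)
    return [v / total for v in w]

def central_interval(grid, weights, level=0.9):
    """Central interval read off the cumulative grid weights."""
    lo_p = (1.0 - level) / 2.0
    hi_p = 1.0 - lo_p
    cum, lo, hi, found_lo = 0.0, grid[0], grid[-1], False
    for g, w in zip(grid, weights):
        cum += w
        if not found_lo and cum >= lo_p:
            lo, found_lo = g, True
        if cum >= hi_p:
            hi = g
            break
    return lo, hi

def coverage(a, b, mus, sigmas, n_scores=50, n_reps=200, level=0.9, seed=1):
    """Fraction of replicates whose interval contains the lam drawn from the prior."""
    rng = random.Random(seed)
    grid = [(i + 0.5) / 200 for i in range(200)]
    hits = 0
    for _ in range(n_reps):
        lam = rng.betavariate(a, b)  # simulate the parameter from the prior
        scores = simulate_scores(lam, mus, sigmas, n_scores, rng)
        weights = grid_posterior(scores, mus, sigmas, a, b, grid)
        lo, hi = central_interval(grid, weights, level)
        hits += lo <= lam <= hi
    return hits / n_reps

if __name__ == "__main__":
    # Skewed prior: the first component is rare (prior mean about 6%).
    print(coverage(a=2.0, b=30.0, mus=(3.0, 0.0), sigmas=(1.0, 1.0)))
```

If the data really come from the prior and the model, the printed fraction should land close to the nominal level (0.9 here). Coverage well below that for your own fits points to a mismatch between the model and the data rather than to the skew itself.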

I couldn’t tell how p50 and event_dt linked to the model you posted.

You can do all this a little more efficiently with a beta and a single parameter constrained to lie in (0, 1) than with the Dirichlet, but it won’t change the fit.
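To see why the two parameterizations give the same fit, here is a small Python sketch (not Stan code, and the function names are made up for the example). It checks numerically that, for two components, a Dirichlet prior on the simplex (lam, 1 - lam) has the same density as a beta prior on lam, and that the per-observation mixture log likelihood is identical either way. The values of `alpha`, `mus`, and `sigmas` are arbitrary assumptions.

```python
import math

def normal_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def dirichlet_logpdf(theta, alpha):
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def beta_logpdf(lam, a, b):
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(lam) + (b - 1.0) * math.log1p(-lam)

def loglik_simplex(theta, score, mus, sigmas):
    """Mixture log likelihood with a length-2 simplex, as in the posted model."""
    return log_sum_exp(
        math.log(theta[0]) + normal_logpdf(score, mus[0], sigmas[0]),
        math.log(theta[1]) + normal_logpdf(score, mus[1], sigmas[1]),
    )

def loglik_scalar(lam, score, mus, sigmas):
    """Same quantity written with a single proportion lam in (0, 1)."""
    return log_sum_exp(
        math.log(lam) + normal_logpdf(score, mus[0], sigmas[0]),
        math.log1p(-lam) + normal_logpdf(score, mus[1], sigmas[1]),
    )

if __name__ == "__main__":
    lam, alpha = 0.02, (2.0, 5.0)  # a skewed proportion, arbitrary prior
    mus, sigmas = (3.0, 0.0), (1.0, 1.0)
    print(dirichlet_logpdf((lam, 1.0 - lam), alpha), beta_logpdf(lam, *alpha))
    print(loglik_simplex((lam, 1.0 - lam), 1.2, mus, sigmas),
          loglik_scalar(lam, 1.2, mus, sigmas))
```

In Stan, the scalar form corresponds to declaring one parameter with bounds 0 and 1, giving it a beta prior, and using the built-in `log_mix` function for the likelihood term.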

If you’re thinking about mixtures, there’s a lot of useful info in Michael’s case study:

http://mc-stan.org/documentation/case-studies/identifying_mixture_models.html


#3

@Bob_Carpenter, thank you for your response. It has been a long time, but I am picking this back up just now.

This is all real data based on the output of a classifier. The truth comes from human labelers, and there can also be some distribution shift in this domain.

I might be asking a little too much from a mixture modeling framework here.

I ran this model daily to infer class proportions, and I have been plotting p50 with some interval over time to see how sensitive it is to potential daily distributional shifts.

My eventual goal is to be able to say something about class proportions just by looking at classifier outputs. I hope I was able to answer your questions.

Let me see if I can figure out how to fit with uncertainty.


#4

You might be interested in a measurement error model for your human labelers. I implement the Dawid and Skene model for the multinomial case in the manual chapter on latent discrete parameters.