Summing out discrete paramters in a mixture. How can I derive that it works for individual observations?

flosch · May 22, 2020, 3:28pm

Hi,

I have some problems with specifying a model and the models listed in the user guide weren’t quite applicable or at least I could not see that they were. Here is my problem:

I observe discrete timepoints of death of some test organisms. I assume that the observed survival times (or. time of observed death - ToD) follow a Gamma distribution.
Additionally I observe that the distribution has three modes, which I can explain by three predictors which produce the different effects.
Furthermore it is censored after the end of the experiment, although some individuals are still alive.

So far I have been working with a finite mixturemodel of Gamma distributions. Which I adapted mainly from the finite mixtures section in the user’s guide and the censoring section.

data {
  int<lower=1> K;                // number of mixture components
  int<lower=1> N;                // number of data points
  real y[N];                     // observations (death counts at day)
  vector[N] UL;
  matrix[N,K] weights;           // matrix of K predictor weights (N K-simplexes)
  
}
parameters {    
  vector<lower=0>[K] mu;         // locations of mixture components
  vector<lower=0>[K] sigma;      // scales of mixture components
}
transformed parameters {
  vector[K] shape;
  vector[K] rate;
  shape = square(mu) ./ square(sigma);
  rate = mu ./ square(sigma); 
}
model {
  matrix[N,K] log_weight;        // cache log-weights calculation

  mu ~ normal(0, 10);
  sigma ~ normal(0, 10);
  
  log_weight = log(weights);
  for (n in 1:N) {  
    // weighing                                    
    row_vector[K] lps = log_weight[n];                       
    for (k in 1:K) 
      // censoring
      if(y[n] < UL[n]){
        lps[k] += gamma_lpdf(y[n] | shape[k], rate[k]);
      } else {
        lps[k] += gamma_lccdf(UL[n]  | shape[k], rate[k]); 
      }

    target += log_sum_exp(lps);
  }
}

This has been working satisfactory until I realized that that each individual time of death is better modeled if it would be caused by one single predictor and not a weighed mixture.

After some more reading in the forum I realized that this is a general issue that discrete parameters can’t (yet) be sampled in Stan and are instead better marginalized out, which is more efficient. So far so good. What I don’t understand at this point is the following:

5.2 Latent Discrete Parameterization
…
This model is not directly supported by Stan because it involves discrete parameters zn, but Stan can sample μ and σ by summing out the z parameter as described in the next section.
.
5.3 Summing out the Responsibility Parameter
To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If Y is a mixture of K normal distributions with locations μk and scales σk with mixing proportions λ in the unit K-simplex, then
p_Y(y~ |~ \lambda, \mu, \sigma) = \sum_{k=1}^K \lambda_k~ normal(y~|~\mu_k, \sigma_k)

The way this is coded in stan appears to model each observation y_n as the sum of all K weighed probabilities. Is this an approximation, which works well for mixtures? I can see that all Y observations can be represented as the quoted formula but I seem to be missing how this also applies to any single observation. Could someone explain this to me? I’m sure I missed something.

At the moment I am trying to implement this model with latent discrete paramters like in the change point model (7.2 - Stan User’s Guide). Is this generally recommendable for my problem or is there a better alternative?

Thank you, Florian

flosch · May 25, 2020, 5:55am

Meanwhile, I have tried to recover the posterior probability of the latent responsibility paramter as described in the corresponding section in the user guide.

generated quantities {
  for ( n in 1:N){
    for (k in 1:K) {
      z[n,k] = gamma_rcensed(y[n], shape[k], rate[k], UL[n]) + 
               categorical_lpmf(k | to_vector(weights[n]));
    }
    // normalize
    z[n] = z[n] - log_sum_exp(z[n]);
  }
}

where gamma_rcensed is just gamma_lpdf or gamma_lccdf depending on whether y_n < upper~limit (the same as the probability function above)

The values I get out of it seem plausible to me. Could someone help me and check whether I coded this correctly?

martinmodrak · May 29, 2020, 9:05pm

I am not sure I understand your inquiry completely (don’t worry, I also had trouble wrapping my head around marginalization - and I still do). But in this specific case, the actual density of a random variable that is a mixture of two others is just a weighted sum of the two component densities. So we can either think of this as three random variables with simple densities (two continous - the components, and one discrete to choose component) or just as one variable with density being the weighted average. So the mixture density is in principle no different from a normal distribution - it just has a different formula.

One can check this via a simulation, here a two component gamma mixture:

library(tidyverse)

N <- 100000
a <- c(3, 6)
b <- c(20, 11)

weights <- c(0.3, 0.7)

k <- sample.int(2, N, prob = weights, replace = TRUE)

samples <- rgamma(N, a[k], b[k])


computed_density <- data.frame(x = seq(0, max(samples) * 1.01, length.out = 200)) %>%
  mutate(y = weights[1] * dgamma(x, a[1], b[1]) + weights[2] * dgamma(x, a[2], b[2]))

data.frame(x = samples) %>% 
  ggplot(aes(x = x)) + geom_density() + geom_line(aes(x = x, y = y), data = computed_density, color = "blue")

Here, black is empirical density of samples built by simulating the discrete parameters directly while blue is the weighted average of the two analytical densitites. They line-up pretty well (there is some mismatch around zero due to sampling and the way kernel estimates are computed but I promise it is not important)

I don’t see any obvious bugs, but it is often hard to judge a model “by eye”. But you can check for yourself if it is correct! The best way is to try with simulated data - draw all the parameters from their priors (the same priors you have in your Stan model) and simulate observations, including censoring etc. - here working with discrete parameters is easy. Than see if you can recover the parameters from this data. Try this a few times and if the model generally recovers the parameters it is a strong indication you are not messing up (this can be formalized and made more robust as Simulation-Based Calibration which you’ll find a bit of posts about around the forum)

Hope that answers your question.

Best of luck with your model.

mike-lawrence · May 29, 2020, 10:51pm

You don’t literally mean that you have a second measured (or manipulated) variable that you’re saying you think predicts the outcome, right? Because that would make this a case of needing simple regression. I’m guessing you mean that you theorize that there’s some latent (unmeasured) 3-level variable that explains the three modes , hence the mixture model.

flosch · June 2, 2020, 8:00am

Thank you for the coded example and visualization! It was very helpful.

Thanks for the advice. I will try that.

flosch · June 2, 2020, 8:45am

I do have actual predictors, but they they are only partially helpful in predicting the outcome. To make this a little more plastic I’ll give an example:

I have a population of N individuals which are subject to different combinations of k predictor variables. Each combination is replicated 20 times. After the experiment is over, different modes of survival time in the outcome are observed. I assume the predictor variables are responsible for the the weights of the mixed distribution (however, in some cases, more components to the mixture may emerge (interactions?) for which I don’t have predictors), but they do not predict the shape of the component’s curves, which (I assume, is roughly similar for all levels of the predictors).

Of course I also see parallels to regression models, but it does not quite seem to fit to the problem (I think I can imagine it best as a mixture nested in a regression frame). Is there a simpler model for such a problem than using mixtures? I have been struggling with the analysis of this for quite a bit and it might also be that I don’t see the forest for all the trees. So I am thankful for any ideas or different perspectives on the problem.

mike-lawrence · June 2, 2020, 12:05pm

Are you saying you performed a standard regression, inspected the residuals and still felt like there were three modes? Because that’s the only scenario where you should bother attempting mixture modelling (absent strong prior information motivating a latent mixture over-and-above what can be accounted for by a regression from the observed variables). If instead you’re saying that the regression didn’t achieve perfect prediction, well, that’s science in a noisy world. Only if there’s still structure in what’s left over should you move to adding (in addition to the regression using measured variables) more complex latent structure.

flosch · June 2, 2020, 3:07pm

This was not my aim at the mixture. If a draw the residuals of a simple regression model, I get residuals like that. Each row represents one subset, where predictors are the same. I end up with two modes which are relatively close together and one mode very far away (censored values).

So, depending on the combination of predictor’s I have a different story, but there is, at least in my opinion, more structure to the model than a standard regression would allow.

I’m quite confident that I have such information. The pyhsical process behind the data, could be imagined like this.

A group of patients is exposed to disease XYZ and ABC, 10% of the patients catch disease XYZ and die very quickly because of it. Another 30% catch ABC, and die after a moderate time, the rest recovers again and eventually dies because of high age. I only know the different exposure times to the diseases and use this as predictors for the weights of the distribution. I have no information to the exact cause of death, only I observe a distribution of three modes which I suppose are characteristic for the survival time after infection. With a regression, I would get an average survival time, for each of the exposures to the disease, but this would be misleading. Hence, with the mixture model, I can work out the latent discrete variable cause of death.

Could I convince you that I need a mixture for my data? At least, I feel more secure now.

mike-lawrence · June 2, 2020, 4:00pm

Ok, yup, looks like you have indeed done the work to verify that the latent mixture is a warranted addition to the model. Sorry if I came off as overly skeptical, just wanted to make sure you weren’t frustrating yourself with a harder-to-use addition if the simpler regression could suffice.

flosch · June 2, 2020, 4:02pm

Thanks. I’m glad that I didn’t already spent some weeks worth into working out this model in vain. But I feel much more confident now than I did before. So thanks for your well placed skepticism.

Topic		Replies	Views
Fitting a discrete integer mixture model within the Stan paradigm Modeling	10	2417	May 11, 2017
Estimation, not marginalisation, of discrete parameters: a solution? General discrete-parameters	3	728	May 18, 2022
Marginalization of latent discrete states Modeling techniques , ecology	17	5361	September 15, 2017
Mixture formula of discrete and continuous distributions Modeling	9	1332	July 20, 2018
Discrete parameter in Stan Developers	14	5625	February 7, 2019

Summing out discrete paramters in a mixture. How can I derive that it works for individual observations?

Related topics