Summing out discrete parameters in a mixture. How can I derive that it works for individual observations?


I have some problems specifying a model, and the models listed in the user's guide weren't quite applicable, or at least I could not see how they were. Here is my problem:

  1. I observe discrete timepoints of death of some test organisms. I assume that the observed survival times (or time of death, ToD) follow a Gamma distribution.

  2. Additionally I observe that the distribution has three modes, which I can explain by three predictors which produce the different effects.

  3. Furthermore, the data are right-censored at the end of the experiment, since some individuals are still alive.

So far I have been working with a finite mixture model of Gamma distributions, which I adapted mainly from the finite mixtures and censoring sections of the user's guide.

data {
  int<lower=1> K;                // number of mixture components
  int<lower=1> N;                // number of data points
  real y[N];                     // observations (death counts at day)
  vector[N] UL;                  // upper limit (censoring time)
  matrix[N,K] weights;           // matrix of K predictor weights (N K-simplexes)
}
parameters {
  vector<lower=0>[K] mu;         // locations of mixture components
  vector<lower=0>[K] sigma;      // scales of mixture components
}
transformed parameters {
  vector[K] shape;
  vector[K] rate;
  shape = square(mu) ./ square(sigma);
  rate = mu ./ square(sigma);
}
model {
  matrix[N,K] log_weight;        // cache log-weights calculation

  mu ~ normal(0, 10);
  sigma ~ normal(0, 10);
  log_weight = log(weights);
  for (n in 1:N) {
    // weighting
    row_vector[K] lps = log_weight[n];
    for (k in 1:K) {
      // censoring
      if (y[n] < UL[n]) {
        lps[k] += gamma_lpdf(y[n] | shape[k], rate[k]);
      } else {
        lps[k] += gamma_lccdf(UL[n] | shape[k], rate[k]);
      }
    }
    target += log_sum_exp(lps);
  }
}

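For reference, the per-observation likelihood the model above computes can be mirrored outside Stan. Below is a minimal Python sketch of the same marginalized, censored-mixture log-likelihood; the values of mu, sigma, the weights, and the censoring limit are made up for illustration.

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import logsumexp

mu = np.array([5.0, 20.0, 60.0])       # component means (illustrative)
sigma = np.array([2.0, 5.0, 10.0])     # component standard deviations

# Same reparameterization as the transformed parameters block:
# a Gamma(shape, rate) has mean shape/rate and variance shape/rate^2.
shape = mu**2 / sigma**2
rate = mu / sigma**2

def log_lik(y, UL, log_w):
    """Log-likelihood of one observation, marginalizing the component."""
    if y < UL:   # observed death: use the density
        lp = gamma.logpdf(y, shape, scale=1.0 / rate)
    else:        # right-censored: use the complementary CDF
        lp = gamma.logsf(UL, shape, scale=1.0 / rate)
    return logsumexp(log_w + lp)   # = log sum_k w_k * p_k(y)

log_w = np.log(np.array([0.2, 0.5, 0.3]))
print(log_lik(12.0, 90.0, log_w))   # uncensored observation
print(log_lik(90.0, 90.0, log_w))   # censored at the experiment's end
```

The `logsumexp` call is the numerically stable version of summing the weighted component densities, exactly what `target += log_sum_exp(lps)` does in the Stan model.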
This worked satisfactorily until I realized that each individual time of death is better modeled as being caused by one single predictor rather than a weighted mixture.

After some more reading in the forum I realized that this is a general issue that discrete parameters can’t (yet) be sampled in Stan and are instead better marginalized out, which is more efficient. So far so good. What I don’t understand at this point is the following:

5.2 Latent Discrete Parameterization

This model is not directly supported by Stan because it involves discrete parameters z_n, but Stan can sample μ and σ by summing out the z parameter as described in the next section.
5.3 Summing out the Responsibility Parameter
To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If Y is a mixture of K normal distributions with locations μk and scales σk with mixing proportions λ in the unit K-simplex, then
p_Y(y \mid \lambda, \mu, \sigma) = \sum_{k=1}^K \lambda_k \,\text{normal}(y \mid \mu_k, \sigma_k)

The way this is coded in Stan appears to model each observation y_n as the sum of all K weighted probabilities. Is this an approximation that works well for mixtures? I can see that the full set of Y observations can be represented by the quoted formula, but I seem to be missing how it also applies to any single observation. Could someone explain this to me? I'm sure I missed something.

At the moment I am trying to implement this model with latent discrete parameters as in the change point model (7.2 in the Stan User's Guide). Is this generally advisable for my problem, or is there a better alternative?

Thank you, Florian


Meanwhile, I have tried to recover the posterior probability of the latent responsibility parameter as described in the corresponding section of the user's guide.

generated quantities {
  matrix[N, K] z;                // log responsibilities
  for (n in 1:N) {
    for (k in 1:K) {
      z[n, k] = gamma_rcensed(y[n], shape[k], rate[k], UL[n]) +
                categorical_lpmf(k | to_vector(weights[n]));
    }
    // normalize
    z[n] = z[n] - log_sum_exp(z[n]);
  }
}

where gamma_rcensed is just gamma_lpdf or gamma_lccdf depending on whether y_n < UL_n (the same branching as in the model block above).
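The same responsibility calculation can be sanity-checked in Python: by Bayes' rule, z_{n,k} is proportional to w_{n,k} times the (possibly censored) component likelihood, and the normalized values must sum to one. A sketch with made-up values:

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import logsumexp

shape = np.array([4.0, 16.0, 36.0])   # illustrative component parameters
rate = np.array([0.8, 0.8, 0.6])
w_n = np.array([0.2, 0.5, 0.3])       # this observation's simplex of weights
y_n, UL_n = 12.0, 90.0

# censored gamma log-likelihood, as in gamma_rcensed
if y_n < UL_n:
    lp = gamma.logpdf(y_n, shape, scale=1.0 / rate)
else:
    lp = gamma.logsf(UL_n, shape, scale=1.0 / rate)

log_z = np.log(w_n) + lp
log_z -= logsumexp(log_z)             # normalize, as in generated quantities
print(np.exp(log_z))                  # posterior membership probabilities
```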

The values I get out of it seem plausible to me. Could someone help me and check whether I coded this correctly?

I am not sure I understand your inquiry completely (don't worry, I also had trouble wrapping my head around marginalization - and I still do). But in this specific case, the actual density of a random variable that is a mixture of two others is just a weighted sum of the two component densities. So we can either think of this as three random variables with simple densities (two continuous - the components - and one discrete to choose the component) or just as one variable whose density is the weighted average. So the mixture density is in principle no different from a normal distribution - it just has a different formula.

One can check this via a simulation, here a two component gamma mixture:


library(dplyr)
library(ggplot2)

N <- 100000
a <- c(3, 6)
b <- c(20, 11)

weights <- c(0.3, 0.7)

k <- sample(1:2, N, prob = weights, replace = TRUE)

samples <- rgamma(N, a[k], b[k])

computed_density <- data.frame(x = seq(0, max(samples) * 1.01, length.out = 200)) %>%
  mutate(y = weights[1] * dgamma(x, a[1], b[1]) + weights[2] * dgamma(x, a[2], b[2]))

data.frame(x = samples) %>%
  ggplot(aes(x = x)) + geom_density() + geom_line(aes(x = x, y = y), data = computed_density, color = "blue")

Here, black is the empirical density of samples built by simulating the discrete parameters directly, while blue is the weighted average of the two analytical densities. They line up pretty well (there is some mismatch around zero due to sampling and the way kernel estimates are computed, but I promise it is not important).

I don’t see any obvious bugs, but it is often hard to judge a model “by eye”. But you can check for yourself whether it is correct! The best way is to try with simulated data - draw all the parameters from their priors (the same priors you have in your Stan model) and simulate observations, including censoring etc. - here working with discrete parameters is easy. Then see if you can recover the parameters from this data. Try this a few times, and if the model generally recovers the parameters it is a strong indication you are not messing up (this can be formalized and made more robust as Simulation-Based Calibration, which you’ll find a few posts about around the forum).
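A minimal fake-data generator along these lines, assuming the priors from the Stan model above (half-normal(0, 10) on mu and sigma), one shared weight simplex, and an illustrative censoring limit; all names and values are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, UL = 3, 500, 90.0

# draw "true" parameters from the priors
mu = np.abs(rng.normal(0, 10, size=K))
sigma = np.abs(rng.normal(0, 10, size=K))
shape = mu**2 / sigma**2
rate = mu / sigma**2

# known per-observation mixture weights (here one shared simplex)
w = rng.dirichlet(np.ones(K))

# simulate with the discrete parameter, then censor
z = rng.choice(K, size=N, p=w)
t = rng.gamma(shape[z], 1.0 / rate[z])
y = np.minimum(t, UL)             # observed times, censored at UL

print((y == UL).mean())           # fraction censored
# next step: fit the Stan model to (y, UL, w) and check it recovers mu, sigma
```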

Hope that answers your question.

Best of luck with your model.


You don’t literally mean that you have a second measured (or manipulated) variable that you’re saying you think predicts the outcome, right? Because that would make this a case of needing simple regression. I’m guessing you mean that you theorize that there’s some latent (unmeasured) 3-level variable that explains the three modes, hence the mixture model.


Thank you for the coded example and visualization! It was very helpful.

Thanks for the advice. I will try that.

I do have actual predictors, but they are only partially helpful in predicting the outcome. To make this a little more concrete, I'll give an example:

I have a population of N individuals which are subject to different combinations of k predictor variables. Each combination is replicated 20 times. After the experiment is over, different modes of survival time are observed in the outcome. I assume the predictor variables are responsible for the weights of the mixture distribution (however, in some cases more mixture components may emerge - interactions? - for which I don't have predictors), but they do not predict the shape of the components' curves, which I assume is roughly similar across all levels of the predictors.

Of course I also see parallels to regression models, but it does not quite seem to fit the problem (I can imagine it best as a mixture nested in a regression framework). Is there a simpler model for such a problem than using mixtures? I have been struggling with this analysis for quite a while, and it might be that I can't see the forest for the trees. So I am thankful for any ideas or different perspectives on the problem.

Are you saying you performed a standard regression, inspected the residuals and still felt like there were three modes? Because that’s the only scenario where you should bother attempting mixture modelling (absent strong prior information motivating a latent mixture over-and-above what can be accounted for by a regression from the observed variables). If instead you’re saying that the regression didn’t achieve perfect prediction, well, that’s science in a noisy world. Only if there’s still structure in what’s left over should you move to adding (in addition to the regression using measured variables) more complex latent structure.


This was not my aim with the mixture. If I draw the residuals of a simple regression model, I get residuals like that. Each row represents one subset where the predictors are the same. I end up with two modes which are relatively close together and one mode very far away (censored values).

So, depending on the combination of predictors I have a different story, but there is, at least in my opinion, more structure to the data than a standard regression would allow.

I’m quite confident that I have such information. The physical process behind the data could be imagined like this:

A group of patients is exposed to diseases XYZ and ABC. 10% of the patients catch disease XYZ and die very quickly because of it. Another 30% catch ABC and die after a moderate time; the rest recover and eventually die of old age. I only know the different exposure times to the diseases and use these as predictors for the weights of the distribution. I have no information on the exact cause of death; I only observe a distribution with three modes, which I suppose are characteristic of the survival time after infection. With a regression, I would get an average survival time for each of the exposures to the diseases, but this would be misleading. Hence, with the mixture model, I can work out the latent discrete variable: cause of death.

Could I convince you that I need a mixture for my data? At least, I feel more secure now.


Ok, yup, looks like you have indeed done the work to verify that the latent mixture is a warranted addition to the model. Sorry if I came off as overly skeptical, just wanted to make sure you weren’t frustrating yourself with a harder-to-use addition if the simpler regression could suffice.


Thanks. I’m glad that I haven’t already spent weeks of work on this model in vain. I feel much more confident now than I did before. So thanks for your well-placed skepticism.
