Mixture model: Different perspective to the schools example

Thank you.
Well, the point is that you don’t know how many students you will have and and how many exams they will be able to take. That’s why Y_{jm} is a mixture where the weights are random proportions of exams taken in each of the subjects for a class (group of students).
I think I’ll just model the means, avoiding the variation at individual level, and then I’ll try to do data fusion in order to add this lower level in the hierarchy.

Hi, the thread is quite long and I didn’t read it whole, but it appears that your model isn’t a mixture, but what in the literature is called convolution. If you have

Y_i \sim F(\theta_i) \\ Y_{all} = \sum_i w_i Y_i

where F is some family of distributions (normal, gamma, Poisson, …), \theta_i parameters of the distrubution you are interested in and w_i are weights, then the distribution of Y_{all} is a convolution of F.

Convolutions are challenging. It might be impossible to retrieve \theta_i from only observations of Y_{all} even if w_i are known. For example, when F is the normal distribution or the Poisson distribution, all information about Y_i is erased by the summation and you can only infer the sum of the respective means, but not even a hint on how they were split among the i elements.

Even when the information is not completely erased (e.g. when Y_i are binomially distributed with different success rates), it is usually still heavily obscured. The intuition for this comes from the Central limit theorem: when you sum multiple random variables they often get closer to a normal distribution, and we’ve just established that the task is impossible for the normal distribution.

It appears that your case is slightly better, because in fact you have a matrix w_{ij}Y_{ij} and know allthe row and column sums (A_j = \sum_i w_{ij} Y_{ij} and B_i = \sum_j w_{ij}Y_{ij}). Still, row and column sums don’t uniquely determine the elements of a matrix with more than two rows or more than two columns.

So if I understand your problem correctly, the information about the individual scores is most likely erased from your data and not accessible, unless you can make some very strong assumptions about the individual scores (e.g. that the parameters \theta_i are the same for all students).

Hope that answers your question or is at least relevant to your inquiry.


A useful way to think about this IMHO is: “If I assumed there is no random variation in the data, would I be able to solve for the variables of interest from the data I observe?” if the answer to this is no, the problem with randomness added is almost always impossible as well.


Yes. I will start building a model with the data I have, as you said. Although, I can’t answer to the same questions with this model.
As a second step, I am willing to assume strong assumptions about individual scores, reasonable assumptions that will lead to a hypothetical scenario, with the main goal of showing the methodology that can be used in a similar environment (real data).
About the convolution, yes, usually this problem is modeled as a sum of individual rvs. But can it be also modelled as a mixture for a single score? Isn’t it the same if I sample n values from this distribution m times and I sum them up in each m=1,…,M? In order to get the distribution of the sum. It sounds computationally expensive because we would have to do it for different n but is it correct?

Thanks for jumping in, @martinmodrak!

I guess you could model a single score as a mixture, but that is mostly useful when you don’t know in advance for which subject the next exam will be. In practice, I would expect that for a given exam, you know the subject, so it doesn’t make sense (to me) to model that as a mixture.

As far as your final question, I think it’s usually a good idea to simulate fake data from this process and see if it behaves similarly to your actual data. If it doesn’t, then your process is “missing something”.

Yes, the point is we don’t know in which subject the next exam will be, cause we have a group of students and not all of them will take every exam. The number of students is known in advance but the proportions of exams taken are not.
A convolution is useful to model the total score for a class and school, which considers each subject performance and the total number of exams as random variables. Then, for a fixed school m and class j, we get S=\sum_{j=1}^4\sum_{i=1}^{n_j} Y_{jm}^{(i)}. where Y_{jm}^{(i)}....Y_{jm}^{(n_j)} are i.i.d.
I was thinking if we can reach the same answers modeling S than simulating n individual scores from f_{Y_{jm}}, summing them up, and then vary n.

1 Like

@martinmodrak I’ve been re reading your answer. I have a question:
You think it is possible to model a Compound Poisson and then a Hierarchical model over the parameters of the components?
We have J observations of S_j = \sum_{k=1}^{n_j} X_k, X_k's are not observed but we have \bar{X} and n_j. The variance of X, although, it is hidden in the variation of S.
Can we write in Stan a hierarchical model for N_j and similarly for X_k and recover their parameters fitting a distribution of S_j to the data?

I don’t think so. If each X_k is a Poisson, then S_j is a Poisson which means you can only learn about its mean E(S_j). You then have no way to learn how the mean E(S_j) is partitioned across E(X_k). The best you can do is IMHO to aggregate the predictors you would use to predict X_k and use them all together to predict S_j. If some predictors are unique to a component (say a unique ID of the test taken) and it ends up being associated with S_j you can reluctantly assume this says something about the component, but for predictors shared among multiple components (say the colour of the paper the test was printed on) you IMHO can’t really know if it played a huge role in one test and almost no in others or whether its effect is equally distributed (or anything in between).

Does that make sense?

Hi @martinmodrak,
Thank you for answering. What I meant was modeling S_j (the sum of N_j individual scores X_k for a fixed subject in class j) as a compound Poisson, where N_j is Poisson and X_k is gamma. Assuming that outcomes of S from different classes have same distribution of X_k, then the variance of S should show somehow the behavior of X_k.

Sorry, I don’t think I get it. Could you provide the simplest possible code that would simulate synthetic data from such a model? This might make things clearer.


Hi martin, I am not able to code that example cause I still don’t know how to do it. But I can clarify it:

S=\sum_{i=1}^N X_i, where N follows a Poisson distribution and let’s say X_i are i.i.d Exponential distributed with mean 1/\theta. If I am not wrong S\sim gamma(n_i,\theta). Is it possible to write this model in Stan marginalizing out n_i?

Sorry, I am still not sure I get it completely, in particular, this looks like quite a different model I thought you were using earlier (and also sorry for slow response, it is unfortunately possible my response time will not improve for a while).

If I understand your model correctly, than it is likely to be challenging/impossible as you can’t have N as a parameter. The density f_S of S would be:

f_S(y) = \sum_{n=0}^\infty Poisson(n | \lambda)Gamma(y | n, \theta) = \frac{\lambda^n e^{-n}}{n!} \frac{\theta^n}{(n-1)!}y^{n - 1}e^{-\theta y}

where \lambda is the mean for the Poisson that generates N. It is possible this sum can be analytically computed or well approximated with an integral, but it might not work. My calculus is not strong enough to claim anything with certainty about this formule.

But there are some contradictions - e.g. the value above is not defined for N = 0 (which is possible with Poisson), so I am not sure this is actually the model you had in mind.

What I meant is a code in R or Python or anything that would simulate synthetic dataset, capturing the data generating process as you imagine it. Having such code will also come in handy for testing the model, so it tends to be a good investment anyway.

Best of luck!


Thank you @martinmodrak,
I am now generating data myself to understand the model, specifically the impact of the individual amounts over S, reflected on \theta.
This is actually the so-called Tweedie distribution, which represents the Compound Poisson Gamma that we are referring and it is split for N=0. Someone programmed it already in Stan with an approximation.
However, I am not sure I am using all the information since I have observed N. I was thinking a simpler approach would be to model N with a Poisson and, in parallel, fit the Gamma with the observed n. By the end, when I generate yrep, I have to plug in the posterior of n inside the Gamma(y/n, \theta). Not sure if this is statistically correct.
PD: I’ll try to write the likelihood function.

Just in case it’s useful still, here’s a derivation for f_S(y):

\begin{align} f_s(y) &= \sum_{n=0}^\infty \operatorname{Poisson}(n | \lambda)\operatorname{Gamma}(y | n, \theta),\\ &= \sum_{n=0}^\infty \frac{\lambda^n e^{-\lambda}}{n!} \frac{\theta^n y^{n-1} e^{-\theta y}}{\Gamma(n)},\\ & = \frac{e^{-\lambda}e^{-\theta y}}{y} \sum_{n=0}^\infty \frac{(\lambda\theta y)^n}{n!\Gamma(n)}, \\ & = \frac{e^{-\lambda}e^{-\theta y}}{y} \sqrt{\lambda\theta y} I_1(2\sqrt{\lambda\theta y}), \end{align}

where I_1 is the modified Bessel function of the first kind. The last line was gifted to me by Wolfram Alpha.

@Juan_Ignacio_de_Oyarbide, it’d be helpful if you could link to the implementation(s) you saw that helped you with your problem.


Thank you @maxbiostat,
I still have problems to write down the likelihood function of this problem, cause I don’t get how to include n. In the case of observing only y, the posterior is reduced to:

\begin{align} P(\theta,\lambda, n /y) & \propto p(y/\theta,\lambda, n) f(\theta/n) P(n/\lambda) P(\lambda) \end{align}

where f_S(y) is Tweedie distributed and its density can be approximated.

In my case, I have a bivariate rv (y_i,n_i), i=1,...,m where y_i is stochastically dependent of n_i.
Then, I don’t know if the posterior is just:

\begin{align} P(\theta,\lambda /y, n) & \propto p(y, n/\theta,\lambda) f(\theta) P(\lambda) \end{align}

But this leads me to f_S(y,n), not f_S(y). Is it correct to get the marginal of f_S(y) from here? I think that this model is equivalent to fit two models in parallel: one for N, and another for Y (or \sum X_i, \theta is associated to the mean of X_i under the exponential assumption)

Is there anything similar to the following :

\begin{align} P(\theta,\lambda, n^* /y,n) & \propto \sum_{n^*} p(y/n^*) p(n^* /\lambda) p(n/\lambda)p(\lambda)f(\theta) \\ \end{align}
where n^* should affect the estimation of \theta.

I am sorry if I am not being very clear!

I’m afraid I still don’t get what you’re trying to do. You could take a step back, describe the scientific problem you’re trying to solve and then write down the formulation of the statistical model, like so:

\begin{align} X & \sim \operatorname{Normal}(\mu, \sigma),\\ Y\mid X & \sim \operatorname{Normal}(X, \tau). \end{align}

This would help us figure out which computations need to be done and how to translate the model into Stan, if possible.