Mixture model: a different perspective on the schools example

Hi everyone,

I would like to know if I can implement this model in a single Stan program. The aim is to build a mixture whose components come from one hierarchical model and whose weights come from another. The density should look like this:

f_{Y_{jm}}(y)=\sum_{i=1}^{4}\pi_{jm}(k=i)\, f_{Y_{jm[k=i]}}\left(y \mid k=i\right)

Let’s say Y_{jm[k=i]} is one individual score in subject k, class j and school m. I have n=1,...,N scores, k=1,...,4 subjects and j=1,...,J classes, and the classes are in m=1,...,M schools.
The data are aggregated, so the outcomes at the student level Y_{jm[k=i]} are unobserved. I have the average for each subject within a school, plus the total number of students and exams for each subject.
The aim is to model the distribution of individual scores in class j and school m, with partial pooling of each subject component towards the overall mean of that subject across all schools, plus different weights given by the proportion of exams allocated to each subject (again with partial pooling towards the overall mean of the weights).
My idea was to recover the individual scores by imputing \sigma_{jm[k=i]} and a distribution for Y_{jm[k=i]} (both based on external research) and then simulating data. Finally, I’d fit the mixture model to the simulated dataset (or datasets) without the need to run two models.
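To make the target density concrete, here is a small Python sketch of the mixture above for one class j in school m, using lognormal components (as I'd assume later in the thread). All weights and parameters below are made-up illustrative numbers, not estimates.

```python
import math

def lognormal_pdf(y, mu, sigma):
    """Density of a lognormal with log-scale location mu and log-scale sd sigma."""
    return math.exp(-(math.log(y) - mu) ** 2 / (2 * sigma ** 2)) / (y * sigma * math.sqrt(2 * math.pi))

def mixture_density(y, weights, mus, sigmas):
    """f_{Y_jm}(y) = sum_i pi_jm(k=i) * f_{Y_jm[k=i]}(y | k=i)."""
    return sum(w * lognormal_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas))

# Four subjects: weights are the proportions of exams per subject (sum to 1).
# These numbers are purely hypothetical.
weights = [0.4, 0.3, 0.2, 0.1]
mus = [3.9, 4.1, 3.7, 4.3]       # log-scale locations, one per subject
sigmas = [0.2, 0.25, 0.3, 0.15]  # log-scale sds, one per subject

print(mixture_density(60.0, weights, mus, sigmas))
```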

Any help will be appreciated :)



Yes, it’s possible to fit mixtures with Stan. In fact, there’s a chapter in the user guide on mixtures: https://mc-stan.org/docs/2_22/stan-users-guide/mixture-modeling-chapter.html

However, with mixture models things can go wrong (see the chapter on problematic posteriors: https://mc-stan.org/docs/2_22/stan-users-guide/problematic-posteriors-chapter.html)
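As background for the linked chapter: Stan can't sample discrete component indicators, so mixtures are fit by marginalizing the indicator with a log-sum-exp over components. Here is a plain-Python sketch of that computation (in Stan you'd use `log_sum_exp` or `log_mix`); the normal components and weights are made-up numbers for illustration only.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_lpdf(y, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def mixture_lpdf(y, log_weights, mus, sigmas):
    # log f(y) = log_sum_exp_i [ log pi_i + log f_i(y) ]
    return log_sum_exp([lw + normal_lpdf(y, m, s)
                        for lw, m, s in zip(log_weights, mus, sigmas)])

# Hypothetical 4-component mixture with equal weights.
log_w = [math.log(0.25)] * 4
mus = [50.0, 60.0, 70.0, 80.0]
sigmas = [5.0, 5.0, 5.0, 5.0]

print(mixture_lpdf(65.0, log_w, mus, sigmas))
```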

In your case, I don’t really understand the notation (although the \mu and \sigma sort of imply you’re using a Gaussian density at some level). Can you please elaborate a bit more on the exact specification of your model?


In this case I can’t fit the mixture to the data because the observations are not at the individual level. So I’ll need two models in the Stan code, or one program for the unobserved Y and another for the weights, and then build up the mixture from the posteriors (I am not sure about this).
I edited the notation, tell me if you understand.

Thanks! I’m not an expert at this, but what’s most confusing for me is the notation of the indices. You have i, j, k, m, and it’s not immediately clear which is which. Furthermore, you have Y, which you refer to as a ‘component’ - is this in the hierarchical-model sense? Can you tell me a bit more about the nature of your data? What exactly are the Ys? What are the Ns?

At this point I’m just trying to understand the structure of your problem, even before trying to help with any Stan-specific advice :)

Yes, the indices are just indicator variables on the observations; they let me build a hierarchical structure for E(Y). Let’s say j and m are class and school, while k indexes the subject (k=1,...,4). Then E(Y_{jm[k=i]}) is the mean of the outcomes for subject k in class j and school m, and N is the total number of outcomes/exams.
My final goal is to simulate individual scores from the distribution of class j and school m, adding the uncertainty of the lowest level of the hierarchy (the student level) and the corresponding weights according to the number of outcomes in each subject.

Ok, so let me unpack that a little. You have n = 1,\ldots, N exams, taken by i = 1, \ldots, I students, who are in j = 1, \ldots, J classes, and the classes are in m = 1, \ldots, M schools. That is a pretty clear hierarchical structure there, although exams are of course not part of this hierarchy. And then you have subjects k = 1, \ldots, K where you fixed K = 4.

And finally you mentioned you only have aggregate data. So what data do you have, exactly?


I have the observed means for class j, school m and subject k (some schools have fewer classes and fewer exams/students), and additionally the number of students and exams for each of them. I want the distribution of individual scores for class j and school m. I don’t care about the outcomes of specific students (there can be imbalances across students in the number of exams taken); I only care about the variation of these individual scores and the variation at the higher levels of the hierarchy.
I am approaching this with a mixture because the distribution of Y_{jm} is clearly multimodal, driven by the different subjects.


I definitely should write the post again. You are right, I wasn’t clear at all haha. Thank you


No worries. It helps my thinking (and hopefully yours as well :-) ) to get these details written down explicitly!

So you’re interested in getting the distribution of scores from the means. Is there a specific ‘prior’ insight you have into what the distribution of a given subject and/or class might be? Honestly, I still have a bit of trouble trying to envision exactly how you’ll arrive at this distribution from the data that you have…

Yes. I have, let’s say, domain expertise, and I can draw on external studies about the distribution and possible variation of scores within a subject. I will probably assume a lognormal distribution for the individual scores and then just normal distributions (or generalized normals, I have to look into this) for the means at the higher levels.


Are there any limits on what the scores might be (e.g. say between 1 and 100)?

I could put reasonable limits. But why is that useful?

I was just thinking that unconstrained distributions (like the normal) or semi-constrained distributions (e.g. the log-normal) might not be best in such cases. On the other hand, they may be much more convenient approximations, so maybe you should go ahead and start with the log-normal / normal model you mentioned.

Yes, I don’t think there is a distributional problem, or at least it’s not the main concern. Now the question is: should I simulate the data and then fit a mixture, or run two different models as I mentioned and then plug them into the mixture? I don’t know which is more appropriate/convenient for this case.
Recalling the objective of my model: I want the distribution of class j and school m, with partial pooling of each subject towards the mean of that subject across all schools, plus different probability masses allocated to the components of the mixture given the proportion of exams in each subject (again with partial pooling towards the overall mean of the weights).

I think it’s relatively typical for mixture models that the components and the weights are recovered simultaneously, presumably because they are jointly distributed and depend on each other. For that reason I’d imagine it isn’t appropriate to source them from separate, independent models. IMHO.


So just to be sure, you have the mean score of class j and of subject k but NOT of how class j specifically did on subject k?

Yes, I have the mean score on subject k for class j, number of students, number of exams. But not the individual scores within the class.


Thinking about this some more, I’m not sure if this is actually a mixture model, given that you have the mean score at the intersection of class and subject. You also know how many exams were taken by each class in each subject, so you don’t have to infer the mixture weights - you already have them. Then the mean class score is a weighted average of the class scores on each subject (and you know these weights). You can then use the hierarchical structure of classes within schools for partial pooling. You might add a parameter for difficulty of subject and an interaction between class (or school) and subject, for ‘how much a school pays attention to a specific subject’. I hope any of this is helpful - these are just some general ideas that I came up with while thinking about your question.
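To make that concrete, here is a tiny Python sketch (with made-up numbers) of why the weights need not be inferred: given the exam counts, the class-level mean is fully determined as a weighted average of the class-by-subject means.

```python
# Hypothetical data for one class j: mean score per subject and the
# known number of exams taken in each subject.
subject_means = [55.0, 62.0, 48.0, 70.0]
exam_counts = [40, 30, 20, 10]

# The 'mixture weights' are just the observed exam proportions.
weights = [n / sum(exam_counts) for n in exam_counts]

# The overall class mean is a weighted average - nothing to infer here.
class_mean = sum(w * m for w, m in zip(weights, subject_means))
print(class_mean)
```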

Yes, I understand. But if I had those individual scores, I could apply a mixture model directly. The mixture weights are the proportions of expected exams in a subject over the expected total number of exams. I don’t want to fix this number, because the number of exams varies over time and depends on 1) the number of students and 2) other factors related to the students.
Once we introduce the missing variation at the individual level, it can become a mixture model, and this is the part I am not sure how to approach: either impute this variation, simulate data, and then fit a mixture model; or take a “modular approach” and reconstruct the mixture density from the outputs of the different models.
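A minimal Python sketch of the simulate-then-fit idea: impute a within-subject spread (from external research), simulate individual lognormal scores for each subject so their means match the observed ones, and pool them; the pooled sample is what a mixture model would then be fit to. Every number here is hypothetical.

```python
import math
import random

random.seed(1)

subject_means = [50.0, 62.0, 75.0, 88.0]  # observed class-by-subject mean scores
assumed_sigma = 0.15                      # imputed log-scale sd (from external research)
exam_counts = [40, 30, 20, 10]            # known number of exams per subject

pooled = []
for mean, n in zip(subject_means, exam_counts):
    # Choose the lognormal log-scale location so that E[Y] matches the observed mean:
    # E[Y] = exp(mu + sigma^2 / 2)  =>  mu = log(mean) - sigma^2 / 2
    mu = math.log(mean) - assumed_sigma ** 2 / 2
    pooled.extend(random.lognormvariate(mu, assumed_sigma) for _ in range(n))

# 'pooled' is the simulated individual-level dataset for this class.
print(len(pooled), sum(pooled) / len(pooled))
```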

Hello! I think I finally understand what you mean. Usually in the context of a mixture, you model an observation as coming from a collection of distributions - you just don’t know which one. This would apply to the result of a single exam, if you didn’t know for which subject the exam was taken. In the case of a ‘final grade’ I wouldn’t call this a mixture, but rather a weighted average of the scores for the individual subjects - a weighted average of individual observations.
To answer your question - honestly I don’t know how to approach this problem well. I get the feeling that if you simulate the individual level data and then use that data for inference, you may just get back your simulation parameters in the first place. Given the data that you have it may be impossible to reasonably estimate the distribution of individual scores - or I may just not be the right person to help you :-)

Perhaps @martinmodrak may be able to point out knowledgeable people with more experience in these kinds of questions?