# Hierarchical mixture

Take the simple hierarchical model:

```stan
data {
  int n_id;
  int n_obs;
  matrix[n_obs, n_id] obs;
}
parameters {
  vector[n_id] mu;
  real mu_mean;
  real<lower=0> mu_sd;
  real<lower=0> obs_noise;
}
model {
  mu ~ normal(mu_mean, mu_sd);
  mu_mean ~ std_normal();
  mu_sd ~ weibull(2, 1);
  obs_noise ~ weibull(2, 1);
  for (i_id in 1:n_id) {
    obs[, i_id] ~ normal(mu[i_id], obs_noise);
  }
}
```
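For reference, data consistent with this model can be simulated in Python; this is a minimal sketch where the shapes and names mirror the Stan program above, and all the "true" parameter values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_id, n_obs = 20, 50
mu_mean, mu_sd, obs_noise = 0.3, 1.0, 0.5  # hypothetical true values

# one latent mean per individual, drawn from the population distribution
mu = rng.normal(mu_mean, mu_sd, size=n_id)

# n_obs observations per individual, laid out as in the Stan data block:
# obs has shape (n_obs, n_id), so obs[:, i] are individual i's observations
obs = rng.normal(mu[None, :], obs_noise, size=(n_obs, n_id))
```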


I’d like to modify this to express a model in which there are two latent groups: one group where mu_mean is positive and one where it is negative. However, I’m uncertain whether I should have a single mixture-probability parameter, as in:

```stan
...
parameters {
  vector[n_id] mu_neg;
  vector[n_id] mu_pos;
  real<upper=0> mu_mean_neg;
  real<lower=0> mu_mean_pos;
  real<lower=0> mu_sd;
  real<lower=0> obs_noise;
  real<lower=0, upper=1> group_prob;
}
model {
  mu_neg ~ normal(mu_mean_neg, mu_sd);
  mu_pos ~ normal(mu_mean_pos, mu_sd);
  mu_mean_neg ~ std_normal();
  mu_mean_pos ~ std_normal();
  mu_sd ~ weibull(2, 1);
  obs_noise ~ weibull(2, 1);
  for (i_id in 1:n_id) {
    target += log_mix(
      group_prob,
      normal_lupdf(obs[, i_id] | mu_neg[i_id], obs_noise),
      normal_lupdf(obs[, i_id] | mu_pos[i_id], obs_noise)
    );
  }
}
```
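As a sanity check on what that `target +=` statement is accumulating, here is a small Python sketch of the same computation. `log_mix(theta, lp1, lp2)` in Stan evaluates `log(theta * exp(lp1) + (1 - theta) * exp(lp2))`; the helper below reproduces that in a numerically stable way, and `normal_lpdf` mimics Stan's vectorized normal log-density (all values in the toy check are made up).

```python
import numpy as np

def log_mix(p, lp1, lp2):
    # numerically stable log( p*exp(lp1) + (1-p)*exp(lp2) ),
    # matching the definition of Stan's log_mix(p, lp1, lp2)
    return np.logaddexp(np.log(p) + lp1, np.log1p(-p) + lp2)

def normal_lpdf(y, mu, sigma):
    # sum of independent normal log-densities, like Stan's vectorized normal_lpdf
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - 0.5 * ((y - mu) / sigma) ** 2)

# toy check: one individual's observations scored under both components
y = np.array([0.9, 1.1, 0.8])
ll = log_mix(0.3, normal_lpdf(y, -1.0, 0.5), normal_lpdf(y, 1.0, 0.5))
```

Note that the whole vector of an individual's observations goes into each component's log-density before mixing: the individual, not each observation, is assigned to a group.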


or a mixture probability parameter for each individual:

```stan
...
parameters {
  ...
  vector<lower=0, upper=1>[n_id] group_prob;
}
model {
  ...
  for (i_id in 1:n_id) {
    target += log_mix(
      group_prob[i_id],
      ...
    );
  }
}
```


Thoughts?


I think it’s more common to have a single mixture-probability parameter. I should add that I’m not an expert, but as far as I understand mixtures, to sample from a mixture distribution you first sample group membership from the common group-membership distribution (say, Bernoulli in your case), and then sample each observation according to its group’s distribution.

Thinking about this another way: if each individual had its own mixture probability, but you knew the prior distribution over those individual probabilities, wouldn’t that prior ultimately control group membership through a single parameter anyway?

What I’m trying to say is that, if your individual group membership probabilities were drawn from (e.g.) a Beta(1,1) distribution, then you could marginalize out the intermediate step and conclude that the observations had an equal probability of belonging to each group, right?
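This marginalization argument is easy to check numerically. The Python sketch below samples each individual's probability from Beta(1,1), then samples membership from it; marginally, membership is Bernoulli with probability E[p_i] = 0.5, i.e. equal probability for each group (seed and sample size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# two-stage sampling: individual probability from Beta(1,1), then membership
p_i = rng.beta(1.0, 1.0, size=n)
z = rng.random(n) < p_i

# marginally, membership is Bernoulli(E[p_i]) = Bernoulli(0.5)
print(z.mean())  # close to 0.5
```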

I hope this helps :-)

Thanks for your comment, MauritsM! I also think the hierarchical setup in the original post is a good idea.


I had a somewhat different thought last night. Ultimately, I think there is something about each individual that determines their latent group membership, and as you say, at present the only constraint on each is a common prior. But suppose we measure other variables about each individual that might help predict their latent group membership: only the parameterization with a probability parameter per individual lets me start incorporating that information into the model. That in turn leaves me thinking that, even absent such information, the per-individual parameterization at least makes sense, even if it can be marginalized to an equivalent single-probability model with an appropriate prior.


Ah, that makes sense. I believe that is a slightly different model than the “plain vanilla” mixture model, though. In that one, you only observe the values without knowing anything about group memberships, and the groups are purely a latent variable that helps explain the data better than a single distribution would. If you have additional information “prior to observing the outcome” then it makes a lot of sense to have individual-level group membership probabilities.

Just as a thought experiment, the most useful information you could have is the actual group memberships - knowing that it would obviously not make sense to disregard that information :-)

Sometimes it is easiest to think about these models in their unmarginalized form, where the latent state is a parameter. The mixture probability is a prior on the group membership, and in practice we generally use a hierarchical prior since the mixture probability is typically a fitted parameter with a prior of its own.

Let’s consider the class of priors that have the form of a logistic regression; i.e. m_i = L(\alpha + X_i\theta) where m is the mixture probability, L is the inverse logit, X is covariates, and \alpha and \theta are parameters with priors of their own. The intercept-only regression is the case of just one mixture probability, but it’s also clear that we can add covariates.
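A minimal Python sketch of this class of priors, assuming made-up values for \alpha, \theta, and the covariates X:

```python
import numpy as np

def inv_logit(x):
    # L in the formula above: maps the real line to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n_id, n_cov = 20, 3

alpha = -0.4                        # hypothetical intercept
theta = rng.normal(size=n_cov)      # hypothetical covariate effects
X = rng.normal(size=(n_id, n_cov))  # per-individual covariates

# per-individual mixture probability m_i = L(alpha + X_i theta)
m = inv_logit(alpha + X @ theta)

# intercept-only regression: every individual shares one mixture probability
m_common = inv_logit(alpha)
```

Dropping the covariate term recovers the single-probability model, so the two parameterizations in the original post sit at the two ends of this regression formulation.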

This is all a restatement of what @mike-lawrence and @MauritsM have already said, but there’s an important twist that gets revealed by viewing the problem this way. Namely, observation-level random effects are not well identified in logistic regression except via the prior. Thus, I think it is highly unlikely that fitting observation-specific latents will work well unless you have a highly restrictive hierarchical prior (like a logistic regression), and not a prior that induces observation-level flexibility to adjust the mixture probabilities for observations one-at-a-time.


This is a paper that may be similar to what you are trying to do.