Mixture modelling with a reference class (i.e. K-1 latent binary logistic regressions)

I’m attempting to implement a mixture model where the latent class allocations are essentially a multinomial logistic regression: an individual’s log-odds of belonging to cluster k are defined relative to the reference (last) cluster K:

In terms of implementing this in Stan, is the following snippet of code correct?


theta ~ dirichlet(rep_vector(10, K));

for (n in 1:N) {
  vector[K] tmp;
  for (k in 1:K) {
    tmp[k] = log(theta[k]) + normal_lpdf(eta[n] | mu[k], sigma);
    tmp[k] = tmp[k] - (log(theta[K]) + normal_lpdf(eta[n] | mu[K], sigma));
  }
  target += (tmp - log_sum_exp(tmp));
}

I’m essentially attempting to marginalise over the K-1 independent binary logistic regressions. My logarithm algebra is rusty at best, so I’m not sure if this is the correct approach.


That’s just the “prior” on the mixture probabilities. The posterior class memberships will depend on the data and multiply in the likelihood (and then renormalize).

Not quite. For a start you need to replace the Dirichlet with the logistic regression!

The other thing you should read in the manual is the discussion of K vs. K - 1 coefficient vector parameterizations of multi-logit. You need to use the softmax output of that and then add the likelihoods for the mixture components. All in, it’ll look like this:

data {
  vector[J] x[N];   // predictors for component membership
}
parameters {
  matrix[K - 1, J] beta;  // mixture regression coeffs
}
model {
  for (n in 1:N) {
    vector[K] lp = softmax(append_row(beta * x[n], 0));
    for (k in 1:K)
      lp[k] += normal_lpdf(eta[n] | mu[k], sigma);
    target += log_sum_exp(lp);
  }
}

I’m going to update the manual in the next release to make this all a bit clearer (or maybe the one after as I probably won’t have time before 2.17 ships):

That’s extremely helpful, thanks for the detailed response!

Quick question, should the softmax() function above actually be log_softmax()?

Suppose we have N observed realizations, i = 1, …, N, each over the same K clusters, and Y[i] represents one observation; e.g., Y[i] is a vector with a single 1 and the rest 0’s (a one-hot encoding).
I’d code it in Stan as:

Y[i] ~ categorical_logit(append_row(beta * x[i], 0));

Note that the last element, K, is the reference: its coefficient is fixed to 0 (equivalently, the coefficients for elements 1, …, K-1 are measured relative to it) for identifiability. This is one way of encoding it. Here you’ll find more ways:
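Filled out as a complete program, that sampling statement might look like the sketch below. The declarations for K, N, and J are assumptions, and note that categorical_logit expects an integer category in 1:K rather than a one-hot vector, so the one-hot Y described above would be stored as its index:

```stan
data {
  int<lower=2> K;                     // number of categories
  int<lower=1> N;                     // number of observations
  int<lower=1> J;                     // number of predictors
  vector[J] x[N];                     // predictors
  int<lower=1, upper=K> Y[N];         // observed category (index form of the one-hot vector)
}
parameters {
  matrix[K - 1, J] beta;              // coefficients; category K is the reference
}
model {
  to_vector(beta) ~ normal(0, 5);     // assumed weakly informative prior
  for (i in 1:N)
    Y[i] ~ categorical_logit(append_row(beta * x[i], 0));
}
```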

Yes, thanks for the correction.
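For reference, a sketch of the model block with that correction applied (using append_row so the result is a column vector, as log_softmax expects):

```stan
model {
  for (n in 1:N) {
    // log class-membership probabilities; reference class K pinned to 0
    vector[K] lp = log_softmax(append_row(beta * x[n], 0));
    for (k in 1:K)
      lp[k] += normal_lpdf(eta[n] | mu[k], sigma);
    // marginalize over the latent class assignment
    target += log_sum_exp(lp);
  }
}
```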

Thanks for the link. Why do they call the standard identified version “dummy coding”?

I didn’t understand all the other things that were somehow convolved with regression. Then they further muddy the waters by contrasting p-values.

And they left out the common machine-learning “one hot coding” (unless it’s in there by a different name and I didn’t recognize it).

“Dummy coding”, I think, is because it’s based on dummy-variable coding. It’s not uncommon.
People use different wording; it confuses me too, sometimes. “Random variables” and “fixed variables” is another example.
Matlab also uses the term “dummy variables”:

A non-scientific interpretation:
The number of variables in a regression is limited. Imagine we are in a court and we know N-1 people are surely not guilty,
and one must be guilty; then we know person N is guilty. The important point is that the number of individuals is fixed.
Now it comes down to how we model the dependency. We could use a sum-to-zero constraint: say K is the reference, and we set
element K to -sum(x[1:(K-1)]). If instead we set element K to 0, the loading of the regression goes into the intercept.
We could do something more fancy, as long as the rank of our design matrix is K-1.
This modelling sometimes bites us: the variances of elements 1:K-1 and element K differ. By controlling the variance we can carefully sidestep the non-identifiability, leading to better sampling. There is a post by Andrew Gelman around where he gives advice on doing so.
This comes at the risk that, if we don’t put a prior on the betas, we may run into problems, and then we have these folks running around and pointing out: “look, what you say is not working”.
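The sum-to-zero version can be sketched in Stan like this (beta_raw is a made-up name for the K-1 free parameters):

```stan
parameters {
  vector[K - 1] beta_raw;   // K-1 free coefficients
}
transformed parameters {
  // element K is pinned to minus the sum of the others,
  // so beta sums to zero and the coding is identified
  vector[K] beta = append_row(beta_raw, -sum(beta_raw));
}
```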

Andrew’s point is that the K-ary coding lets you use symmetric priors, whereas the K-1 coding is asymmetric from the get-go. You can identify the K-ary coding with priors, though it’s only a “weak identification” of the kind I discuss in the problematic-posteriors chapter of the manual in a simpler case (and a bit in the K vs. K-1 encoding). There’s a lot of discussion in the Gelman and Hill regression book around the IRT/ideal-point models, where this comes up so much that there are different remediation strategies for non-identifiability.
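A sketch of that K-ary coding with a symmetric prior, intercept-only for brevity (the names y and alpha and the normal(0, 1) scale are assumptions):

```stan
parameters {
  vector[K] alpha;   // one intercept per category, all K of them
}
model {
  // only differences alpha[j] - alpha[k] enter the likelihood;
  // the symmetric prior weakly identifies the overall location
  alpha ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ categorical_logit(alpha);
}
```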

I do understand what they were getting at with all these codings, just that a lot of it would’ve been easier to understand from first principles rather than a finite list. At least for me. They also missed the one-hot encodings that are so popular in machine learning.