Mixture modelling with a reference class (i.e. K-1 latent binary logistic regressions)

I’m attempting to implement a mixture model where the latent class allocations are essentially a multinomial logistic regression - an individual’s probability of belonging to cluster k is defined via the odds of belonging to cluster k relative to the reference (last) cluster K:
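That is, something like the standard multinomial-logit link (writing z_n for the latent class of individual n, x_n for their predictors, and beta_k for the class-k coefficients), with the coefficients of the reference class K fixed to zero:

$$
\log \frac{\Pr(z_n = k \mid x_n)}{\Pr(z_n = K \mid x_n)} = x_n^\top \beta_k,
\qquad \beta_K = 0,
\quad\text{so}\quad
\Pr(z_n = k \mid x_n) = \frac{\exp(x_n^\top \beta_k)}{\sum_{j=1}^{K} \exp(x_n^\top \beta_j)} .
$$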

In terms of implementing this in Stan, is the following snippet of code correct?

...

theta ~ dirichlet(rep_vector(10,K));

for (n in 1:N) {
  vector[K] tmp;
  for (k in 1:K) {
    tmp[k] = log(theta[k]) + normal_lpdf(eta[n] | mu[k], sigma);
    tmp[k] = tmp[k] - (log(theta[K]) + normal_lpdf(eta[n] | mu[K], sigma));
  }
  target += (tmp - log_sum_exp(tmp));
}
...

I’m essentially attempting to marginalise over the K-1 independent binary logistic regressions. My logarithm algebra is rusty at best, so I’m not sure if this is the correct approach.
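In other words, I think the per-observation marginal likelihood I’m after is something like

$$
\log p(\eta_n) \;=\; \log \sum_{k=1}^{K} \Pr(z_n = k)\, \mathrm{normal}(\eta_n \mid \mu_k, \sigma),
$$

with the class probabilities Pr(z_n = k) coming from the K-1 logistic regressions.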

Cheers,
Andrew

That’s just the “prior” on the mixture probabilities. The posterior class memberships will depend on the data and multiply in the likelihood (and then renormalize).
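In symbols, roughly:

$$
\Pr(z_n = k \mid \eta_n) = \frac{\theta_k \, \mathrm{normal}(\eta_n \mid \mu_k, \sigma)}{\sum_{j=1}^{K} \theta_j \, \mathrm{normal}(\eta_n \mid \mu_j, \sigma)} .
$$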

Not quite. For a start you need to replace the Dirichlet with the logistic regression!

The other thing you should read in the manual is the discussion of K vs. K - 1 coefficient vector parameterizations of multi-logit. You need to use the softmax output of that and then add the likelihoods for the mixture components. All in, it’ll look like this:

data {
  vector[J] x[N];   // predictors for component membership
  ...
parameters {
  matrix[K - 1, J] beta;  // mixture regression coeffs
  ...
model {
  for (n in 1:N) {
    vector[K] lp = softmax(append_row(beta * x[n], 0));
    for (k in 1:K)
      lp[k] += normal_lpdf(eta[n] | mu[k], sigma);
    target += log_sum_exp(lp);
  }
  ...

I’m going to update the manual in the next release to make this all a bit clearer (or maybe the one after, as I probably won’t have time before 2.17 ships).

That’s extremely helpful, thanks for the detailed response!

Quick question, should the softmax() function above actually be log_softmax()?

Suppose we have N observed realizations, i = 1, …, N, each falling into one of the same K clusters, and Y[i] represents one observation, e.g. as an indicator with a single 1 and the rest 0’s.
I’d code it in Stan as:

Y[i] ~ categorical_logit(append_row(beta * x[i], 0));

Note that the last element, K, is the reference: its coefficient is either set to 0 or subtracted from the 1, …, K-1 others for identifiability. This is one way of encoding. Here you’ll find more ways:
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

Yes, thanks for the correction.
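Putting the pieces together with that fix, a minimal end-to-end sketch might look like the following. The data names, the shared sigma, and the weakly informative priors are just assumptions filled in around the snippets above, and the ordered means are one common way to deal with label switching:

data {
  int<lower=1> N;               // number of observations
  int<lower=1> J;               // number of predictors
  int<lower=2> K;               // number of mixture components
  vector[J] x[N];               // predictors for component membership
  vector[N] eta;                // observed outcomes
}
parameters {
  matrix[K - 1, J] beta;        // mixture regression coeffs; class K is the reference (0)
  ordered[K] mu;                // component means (ordered to break label switching)
  real<lower=0> sigma;          // shared component scale
}
model {
  to_vector(beta) ~ normal(0, 5);   // placeholder weakly informative priors
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);
  for (n in 1:N) {
    // log class-membership probabilities from the multi-logit
    vector[K] lp = log_softmax(append_row(beta * x[n], 0));
    for (k in 1:K)
      lp[k] += normal_lpdf(eta[n] | mu[k], sigma);
    target += log_sum_exp(lp);  // marginalize over the latent class
  }
}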

Thanks for the link. Why do they call the standard identified version “dummy coding”?

I didn’t understand all the other things that were somehow convolved with regression. Then they further muddy the waters by contrasting p-values.

And they left out the common machine-learning “one hot coding” (unless it’s in there by a different name and I didn’t recognize it).

“Dummy coding”, I think, is because it’s based on dummy-variable coding. It’s not uncommon.
People use different wording; it confuses me too sometimes. “Random variables” and “fixed variables” is another example.
Matlab also uses the term “dummy variables”:
https://www.mathworks.com/help/stats/dummy-indicator-variables.html

A non-scientific interpretation:
The number of free quantities in a regression is limited. Imagine we are in court and we know N-1 people are surely not guilty, and one must be guilty; then we know person N is guilty. The important point is that the number of individuals is fixed.
So it comes down to how we model the dependency. We could use a sum-to-zero constraint: say K is the reference, then coefficient K is -sum(x[1:(K-1)]). If we instead set coefficient K to 0, its loading goes into the intercept.
We could do something fancier, as long as the rank of the design matrix is K-1.
This modelling bites us sometimes: the variances of elements 1:K-1 and of element K are different. By controlling the variance we can carefully skip the hard identifiability constraint, which can lead to better sampling. There is a post by Andrew Gelman around where he advises doing so.
This comes at the risk that, if we don’t put a prior on the betas, we may run into problems, and then we have folks running around pointing out: “look, what you say is not working”.
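For example, a sum-to-zero coding of a single K-vector of coefficients could be sketched in Stan like this (the variable names here are just made up for illustration):

data {
  int<lower=2> K;
}
parameters {
  vector[K - 1] beta_free;       // the K - 1 free coefficients
}
transformed parameters {
  // pin the last coefficient so the full vector sums to zero
  vector[K] beta = append_row(beta_free, -sum(beta_free));
}
model {
  beta_free ~ normal(0, 1);      // prior on the free elements only;
                                 // note beta[K] then has larger prior variance
                                 // than beta[1:(K - 1)], as mentioned above
}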

Andrew’s point is that the K-ary coding lets you use symmetric priors, whereas the K-1 coding is asymmetric from the get-go. You can identify the K-ary coding with priors, though it’s only a “weak identification” of the kind I discuss in the problematic posteriors chapter of the manual in a simpler case (and a bit in the K vs. K-1 encoding). There’s a lot of discussion in the Gelman and Hill regression book around the IRT/ideal-point models, where this comes up so much, and there are different remediation strategies for non-identifiability.
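A minimal sketch of the K-ary coding with a symmetric prior (the data names are assumed here, not taken from the thread above):

data {
  int<lower=1> N;
  int<lower=1> J;
  int<lower=2> K;
  vector[J] x[N];
  int<lower=1, upper=K> y[N];
}
parameters {
  matrix[K, J] beta;               // a full coefficient vector for every class; no reference class
}
model {
  to_vector(beta) ~ normal(0, 1);  // symmetric prior; only weakly identifies the common location
  for (n in 1:N)
    y[n] ~ categorical_logit(beta * x[n]);
}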

I do understand what they were getting at with all these codings, just that a lot of it would’ve been easier to understand from first principles rather than a finite list. At least for me. They also missed the one-hot encodings that are so popular in machine learning.