Degeneracy in a hierarchical mixture model

My GMM models interaction times. A number of people work on items (so there is a bit of hierarchy). The time spent, Tau, between consecutive items is normally distributed. Each Tau arises in one of two ways: either the person quits working after an item, or they are in a “session” and (typically) do one or more items. There are no interactions between people and items, and there are no time-based components in the data generating process (DGP).

The two components are:

  1. When people are working, the time spent is a function of a person’s (\mu^r_{[s]}) and an item’s (\mu^r_{curr.item[s, i]}) parameters.
    Tau[s, i] \sim Normal(\mu^r_{[s]} + \mu^r_{curr.item[s, i]}, \sigma_{work})
  2. When people are not working, the time depends only on a person’s (\mu^{nonwork}_{[s]}) parameter and a common \sigma_{nonwork}.
    Tau[s, i] \sim Normal(\mu^{nonwork}_{[s]}, \sigma_{nonwork})

The probability of switching between these two components is:

P(person\ s\ quits\ working\ after\ item\ i) = \frac{1}{1+\exp(\mu^q_{[s]} + \mu^q_{next.item[s, i]})}
where s is the index of the person and i is the index of the event.
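To make the likelihood concrete, here is a minimal Python sketch of the log density of one event under the two components and the quit probability above. It is only an illustration of the formulas, not the Stan program itself; the function names and the argument `eta_q` (standing for \mu^q_{[s]} + \mu^q_{next.item[s, i]}) are names I made up for this sketch.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def event_log_lik(tau, mu_work, sigma_work, mu_nonwork, sigma_nonwork, eta_q):
    """Log likelihood of one Tau under the two-component mixture.

    eta_q is a hypothetical name for mu^q_[s] + mu^q_next.item[s, i].
    mu_work stands for mu^r_[s] + mu^r_curr.item[s, i].
    """
    log_p_quit = -math.log1p(math.exp(eta_q))   # log(1 / (1 + exp(eta_q)))
    log_p_work = -math.log1p(math.exp(-eta_q))  # log(1 - P(quit))
    return log_sum_exp(
        log_p_work + normal_lpdf(tau, mu_work, sigma_work),
        log_p_quit + normal_lpdf(tau, mu_nonwork, sigma_nonwork),
    )

# Example call with made-up parameter values.
print(event_log_lik(12.3, 10.0, 2.0, 60.0, 20.0, 1.5))
```

Note that with this definition of the quit probability, the weight on the working component is 1 - P(quit) = inv_logit(eta_q).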

The data table per person looks like:

  Tau   curr.item  next.item
  12.3  15         2
  18.2  1          4
  56.2  13         18

I read Michael’s excellent post on degeneracy. However, I am not 100% clear on how the labeling degeneracy arises in Stan in the case of exchangeable priors:

  1. My Components 1 and 2 have different parameter names (mu_s_r_xx + mu_i_r_yy and mu_s_nowork_xx). My intuition is that this alone cannot definitively break the degeneracy, because the priors are exchangeable. Am I right?
  2. Assuming I do need to break the degeneracy in (1), I assume the right thing to do is to combine mu_s_r, mu_i_r, and mu_s_nowork into one vector, declare it as ordered, and slice it later into the individual parts. Right?

data {
    int<lower=0> Ni;
    int <lower=0> Ns;
    int<lower=0> Ne;
    real taus[Ne, Ns]; //Ne * Ns person histories
    int<lower=1, upper=Ni> curr_items[Ne, Ns]; // current item indexes that a person works on
    int<lower=1, upper=Ni> next_items[Ne, Ns]; // next item indexes that a person works on
}

parameters {
    real mu_s_r[Ns];
    real mu_i_r[Ni];
    real mu_s_q[Ns];
    real mu_i_q[Ni];
    real mu_s_nowork[Ns];
    real<lower=0> sigma_work;
    real<lower=0> sigma_nowork;
}

model {
    mu_s_r ~ normal(0, 1);
    mu_s_q ~ normal(0, 1);
    mu_i_r ~ normal(0, 1);
    mu_i_q ~ normal(0, 10);
    // Weakly informative because when a person is not working he can be away for a large amount of time.
    mu_s_nowork ~ normal(0, 10); 
    sigma_work ~ normal(0, 10);
    sigma_nowork ~ student_t(1, 0, 1);
    for(s in 1:Ns) {
        // For each person
        for (ev in 1:Ne) {
            // For each event
            real lambda;
            // So the mu_i_q and mu_s_q update correctly?
            lambda = inv_logit(mu_s_q[s] + mu_i_q[curr_items[ev, s]]);
            target += log_mix(lambda,
                normal_lpdf(taus[ev, s]| mu_s_r[s] + mu_i_r[next_items[ev, s]], sigma_work),  // ----> Component 1
                normal_lpdf(taus[ev, s]| mu_s_nowork[s], sigma_nowork)); // -----------> Component 2
        }
    }
}

I don’t fully understand the problem you are working on, but since no one else has responded, I’ll give it a try :-).

Your problem is different from Michael’s case study in that your components are not identical. Thus whether the problem is identifiable depends on the data: if there is a very strong signal of the item’s \mu^r, the posterior might be identified well, while if the distinction is very blurry and most breaks take about the same time, you might get the labelling degeneracy.

However, imposing an ordering seems problematic in this case: can you really claim that e.g. \mu^{nonwork}_{[s]} > \mu^r_{[s]} for all s? That sounds like a strong assumption about the human psyche :-). But if you are OK with that, just use an array of ordered pairs to represent it. However, I don’t see how to fit the item-specific effect into the ordering. It might be OK to ignore it for the ordering (especially if it should be small). Or you may want to claim something like:

E(\mu^{nonwork}_{[s]} + \mu^r_{curr.item[s,i]}) > \mu^r_{[s]}, where the expectation is over all events for the person.

If you can make the latter claim, you could probably reparametrize with \nu_{[s]} as the expected break per person to something like:

Tau[s,i] \sim N(\nu_{[s]} + \mu^r_{curr.item[s,i]},\sigma_{work}) (case 1)
Tau[s,i] \sim N(\nu_{[s]} - \mu^{nonwork}_{[s]},\sigma_{nonwork}) (case 2)
and force \mu^{nonwork}_{[s]} > 0

But that’s just a wild guess.

Also note that you are not really modelling any “switching”. The model you have is memoryless and does not favor consecutive work or consecutive non-work. The probability of work/non-work at each event seems to be independent of the previous events. This might be OK, but I was not sure from your description whether this is what you intended.
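To illustrate what memoryless means here, the following Python sketch simulates one person's events with the work/quit state drawn afresh at every event. It is a simplification under my own assumptions: item effects are folded into constant means, and all function names and parameter values are hypothetical. In such a simulation, the quit rate after a quit should be about the same as the quit rate after a work event.

```python
import math
import random

def simulate_events(n_events, eta_q, mu_work, sigma_work,
                    mu_nonwork, sigma_nonwork, rng):
    """Simulate (quit_now, tau) pairs; each state is drawn independently."""
    p_quit = 1.0 / (1.0 + math.exp(eta_q))
    events = []
    for _ in range(n_events):
        quit_now = rng.random() < p_quit  # ignores the previous event entirely
        if quit_now:
            tau = rng.gauss(mu_nonwork, sigma_nonwork)
        else:
            tau = rng.gauss(mu_work, sigma_work)
        events.append((quit_now, tau))
    return events

def quit_rate_after(events, previous_state):
    """Fraction of quits among events whose predecessor had previous_state."""
    following = [curr[0] for prev, curr in zip(events, events[1:])
                 if prev[0] == previous_state]
    return sum(following) / len(following)

rng = random.Random(42)
events = simulate_events(20000, 1.0, 10.0, 2.0, 60.0, 20.0, rng)
rate_after_quit = quit_rate_after(events, True)
rate_after_work = quit_rate_after(events, False)
print(rate_after_quit, rate_after_work)
```

If real data showed a clear gap between those two rates, that would hint at a dependence between consecutive events that this model cannot express.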

@martinmodrak Thanks for replying. It seems I misunderstood Michael’s article: either the components must be well separated in the data, OR the priors must do the separating for you (i.e. using ordered cannot work if neither is the case). In light of that, an ordering over item and person means does not seem obvious or correct to assume. Also, the data generating process is memoryless, as you guessed; I did not intend to model anything related to time, as I noted in the description of my DGP in the question.

I have one more question. Can Stan be claimed to estimate the means as well as other estimation techniques, for example EM? In other words, if Stan cannot disambiguate these means, can other techniques claim to do any better?

I do not claim to understand a lot of techniques, but this type of labelling ambiguity usually means multiple local maxima of the posterior (and also the likelihood). In this case, most algorithms will just give you one of the maxima. My best guess is that EM will do exactly that.
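As a small illustration of why the maxima come in pairs, here is a Python sketch for the simplest case: a two-component mixture with identical component families and a shared sigma. This is an assumption for illustration only and is not the model from this thread, where the components differ in structure. Swapping the labels (and the mixing weight) leaves the likelihood unchanged, so any mode has a mirror image.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def mixture_log_lik(data, w, mu_a, mu_b, sigma):
    """Log likelihood of a two-component normal mixture with weight w on a."""
    total = 0.0
    for y in data:
        a = math.log(w) + normal_lpdf(y, mu_a, sigma)
        b = math.log(1.0 - w) + normal_lpdf(y, mu_b, sigma)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

# Made-up data with two clearly separated clusters.
data = [-2.1, -1.9, -2.3, 2.0, 2.2, 1.8, 2.1]

ll_original = mixture_log_lik(data, 0.4, -2.0, 2.0, 0.5)
ll_swapped = mixture_log_lik(data, 0.6, 2.0, -2.0, 0.5)   # labels exchanged
ll_between = mixture_log_lik(data, 0.5, 0.0, 0.0, 0.5)    # point between the two
print(ll_original, ll_swapped, ll_between)
```

The first two values agree, while the point between them has a much lower likelihood, which is what two separated modes look like. An optimizer such as EM would typically end up in whichever one is closer to its starting point.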

Expectation maximization (EM) is usually applied to posterior marginal mode finding problems. This is not a mean. Posterior modes and max likelihood estimates only have nice properties asymptotically in well-behaved models, where they converge to the true value.

Bayesian posterior estimation typically uses means, which minimize expected squared error, a strong statistical property in finite samples. Using the posterior median gives an alternative point estimate that minimizes expected absolute error.
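A quick numerical check of these two properties, as a sketch: the draws below come from a made-up skewed distribution standing in for posterior draws (they are not output from any Stan fit), and the expectations are taken over those draws.

```python
import random
import statistics

random.seed(0)
# Hypothetical stand-in for posterior draws of one parameter (skewed on purpose).
draws = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

post_mean = statistics.fmean(draws)
post_median = statistics.median(draws)

def expected_squared_error(estimate):
    """Average squared distance between the estimate and the draws."""
    return statistics.fmean([(d - estimate) ** 2 for d in draws])

def expected_absolute_error(estimate):
    """Average absolute distance between the estimate and the draws."""
    return statistics.fmean([abs(d - estimate) for d in draws])

# The mean is at least as good as the median under squared error ...
print(expected_squared_error(post_mean), expected_squared_error(post_median))
# ... and the median is at least as good as the mean under absolute error.
print(expected_absolute_error(post_median), expected_absolute_error(post_mean))
```

For a symmetric posterior the two estimates nearly coincide; the difference only becomes visible when the posterior is skewed or multimodal, as it can be in mixture models.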

Stan doesn’t disambiguate means—only Stan plus a model does something. If the model is continuous and differentiable, Stan’s likely your best bet for fitting.