How to deal with "total ordering" in practice under label switching

In ecological models, say mark-recapture for survival, we use the forward algorithm to compute the log likelihood of each individual's capture history, finally incrementing the target with the log_sum_exp of the two state probabilities (alive and dead) at the last survey occasion. These models have more than one individual, but the likelihood only factorises at the level of the individual, so the log_lik of this model is the log likelihood of each individual in the dataset. In other words, for each individual we run the forward algorithm and then increment the log likelihood at the final occasion.
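For concreteness, here is a minimal sketch of that pattern for a two-state (alive/dead) survival model in the spirit of CJS. The names phi, p, and first are placeholders and the structure is deliberately simplified, so treat it as an illustration rather than a working analysis:

data {
  int<lower=1> n_ind;                              // number of individuals
  int<lower=2> n_surv;                             // number of survey occasions
  array[n_ind, n_surv] int<lower=1, upper=2> y;    // 1 = detected, 2 = not detected
  array[n_ind] int<lower=1, upper=n_surv> first;   // first capture occasion
}
parameters {
  real<lower=0, upper=1> phi;   // survival probability (implicit uniform prior)
  real<lower=0, upper=1> p;     // detection probability (implicit uniform prior)
}
model {
  // forward algorithm per individual over states 1 = alive, 2 = dead
  for (i in 1:n_ind) {
    vector[2] lmps;
    lmps[1] = 0;                     // alive (and detected) at first capture
    lmps[2] = negative_infinity();   // cannot be dead at first capture
    for (t in (first[i] + 1):n_surv) {
      vector[2] acc;
      // alive at t: survived the interval, then detected or not
      acc[1] = lmps[1] + log(phi)
               + (y[i, t] == 1 ? log(p) : log1m(p));
      // dead at t: died this interval or was already dead; never detected
      acc[2] = log_sum_exp(lmps[1] + log1m(phi), lmps[2])
               + (y[i, t] == 1 ? negative_infinity() : 0);
      lmps = acc;
    }
    // one target increment per individual, at the final occasion
    target += log_sum_exp(lmps);
  }
}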

In your model, the log_lik is accumulated at each time point, which would be analogous to having a separate log_lik for each survey occasion for each individual in a mark-recapture model.

I think I’ve said this before in this long thread, but

Stan has an efficient built-in C++ implementation of the forward-backward algorithm, which improves on the forward algorithm by generating sufficient statistics. You can find the doc here:

https://mc-stan.org/docs/functions-reference/hidden_markov_models.html
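For the homogeneous case, a call to the built-in typically looks something like the hedged sketch below of a K-state HMM with normal emissions. The argument names log_omegas, Gamma, and rho follow the doc; everything else (N, Gamma_rows, the priors) is illustrative rather than taken from this thread:

data {
  int<lower=1> N;      // number of time points
  int<lower=1> K;      // number of hidden states
  vector[N] y;
}
parameters {
  ordered[K] mu;                     // ordered emission means help with label switching
  vector<lower=0>[K] sigma;
  array[K] simplex[K] Gamma_rows;    // rows of the transition matrix
  simplex[K] rho;                    // initial state distribution
}
transformed parameters {
  matrix[K, K] Gamma;
  matrix[K, N] log_omegas;           // log emission density for each state and time
  for (k in 1:K) Gamma[k] = Gamma_rows[k]';
  for (n in 1:N)
    for (k in 1:K)
      log_omegas[k, n] = normal_lpdf(y[n] | mu[k], sigma[k]);
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  target += hmm_marginal(log_omegas, Gamma, rho);   // marginalizes the hidden states
}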

People largely use expectation-maximization to find max marginal likelihood point estimates (perhaps with penalties). You’ll find that Gibbs sampling the discrete responsibilities usually won’t get you anywhere with sampling because of the heavy correlation among the discrete parameters and the general lack of gradients to guide sampling. Using HMC will be way way faster than using Gibbs for this. It’s essentially doing the same marginalization as EM does for HMMs, but then it’s sampling the continuous parameters rather than optimizing.

As to the label switching problem, aside from convergence monitoring, the posterior predictive inferences you want to do with HMMs will marginalize out the states and thus label switching won’t be an issue. I tried to get this across in the User’s Guide chapter, but it’s a pretty subtle point.
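For instance, continuing the sketch above, posterior predictive replicates can be generated without ever committing to a labeling: the latent draw, mu, and sigma get relabeled together, so y_rep is label-invariant, and averaging over draws marginalizes the states. (z_rep and y_rep are my names; hmm_latent_rng is the built-in sampler that goes with hmm_marginal.)

generated quantities {
  // posterior draw of the latent state sequence (labels may differ draw to draw)
  array[N] int z_rep = hmm_latent_rng(log_omegas, Gamma, rho);
  // posterior predictive replicate of the data; label switching cancels out here
  vector[N] y_rep;
  for (n in 1:N)
    y_rep[n] = normal_rng(mu[z_rep[n]], sigma[z_rep[n]]);
}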

I think for me and most ecologists it still makes sense to implement the algorithm by hand to incorporate time or unit (site/individual) level variation.

Good point—we only have an implementation of the homogeneous case. Maybe we should add something more general like in the moveHMM package in R.
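To make that concrete, here is a hedged sketch of a hand-rolled 2-state normal HMM where the transition probabilities depend on a time-varying covariate. The covariate x and the coefficients a_tr and b_tr are made-up names for illustration; this is not moveHMM's exact parameterization:

data {
  int<lower=2> T;      // number of time points
  vector[T] y;         // observations
  vector[T] x;         // covariate driving the transition probabilities
}
parameters {
  ordered[2] mu;
  vector<lower=0>[2] sigma;
  simplex[2] rho;      // initial state distribution
  vector[2] a_tr;      // intercepts for the staying probabilities
  vector[2] b_tr;      // covariate effects on the staying probabilities
}
model {
  matrix[T, 2] logalpha;
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  a_tr ~ normal(0, 2);
  b_tr ~ normal(0, 2);
  for (j in 1:2)
    logalpha[1, j] = log(rho[j]) + normal_lpdf(y[1] | mu[j], sigma[j]);
  for (t in 2:T) {
    // transition matrix rebuilt at every time step from the covariate
    matrix[2, 2] A;
    A[1, 1] = inv_logit(a_tr[1] + b_tr[1] * x[t]);
    A[1, 2] = 1 - A[1, 1];
    A[2, 2] = inv_logit(a_tr[2] + b_tr[2] * x[t]);
    A[2, 1] = 1 - A[2, 2];
    for (j in 1:2) {
      vector[2] acc;
      for (i in 1:2)
        acc[i] = logalpha[t - 1, i] + log(A[i, j])
                 + normal_lpdf(y[t] | mu[j], sigma[j]);
      logalpha[t, j] = log_sum_exp(acc);
    }
  }
  target += log_sum_exp(logalpha[T]);   // marginal likelihood of the whole series
}

Unit-level variation (e.g. site- or individual-specific parameters) would follow the same pattern, with one forward recursion per unit.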

I think I see where the confusion is. You are saying (focusing on the target increment in the forward probability):

  1. in my code above, I increment for regimes at the final T (where I have only 1 individual)
  2. in your code, you increment for regimes for all individuals at the final occasion (n_surv).

How about we think of it this way, working directly from our code: your log likelihood is
log_lik[i] = log_sum_exp(lmps[:, n_surv]); where i in 1:n_ind
my log likelihood is
log_likelihood[t] = log_sum_exp(logalpha[t, ]); where t in 1:T

From the point of view of "observations", yours is multiple individuals each evaluated at one final time, and mine is multiple time points for one individual; the computation is essentially the same if we identify lmps == logalpha, n_ind == T, and i == t.

Or which log likelihood do you think I should compute, given that I only have one series of length T in the above model? Should every observation unit, in my case the single series with T time points, have its own log likelihood?

What you’re looking for is conditional independence of the observed data given the parameters.

These give you different mixture models, as would just mixing the whole thing at a high level across multiple individuals and times.

Correct. I’m not sure which log likelihood you should compute. Maybe @jsocolar can chime in, because I didn’t know about having to compute the log likelihoods at the lowest level where the likelihood factorises until I read the introductory paper of his excellent flocker package.

Here you include only the last logalpha.

Here you include all elements of logalpha, so there is a mismatch, and the resulting loo computation is wrong.
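Just to make the contrast concrete, the analogue of the per-individual log_lik above for a single series would be one entry covering the whole series. This is only a hedged sketch of one internally consistent option (it assumes logalpha is computed in transformed parameters so it is visible here), not a claim that it is the right factorization for loo:

generated quantities {
  // one pointwise log likelihood per independent unit; with a single series,
  // that unit is the whole series, matching target += log_sum_exp(logalpha[T])
  vector[1] log_lik;
  log_lik[1] = log_sum_exp(logalpha[T]);
}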

I don’t understand why, in your code, each observation is used K times in the terms normal_lpdf(y[t] | mu[j], sigma[j]), so I don’t know how to compute the likelihood contribution for a single y[t].

The different usages of the last logalpha in the model block and of log_sum_exp() in the generated quantities block come from the forward-backward algorithm.

logalpha in target += log_sum_exp(logalpha[T]); is the log forward probability in FFBS, i.e. log(\alpha_T(j)) = log(p(Z_T = j, X_{1:T})), where Z_t is the latent state variable.
The forward probability logalpha[t, k] represents the log of the joint probability of being in state k at time t and observing everything up to time t (a joint probability, not a probability conditional on the data).
Since logalpha is computed recursively across all time steps, summing over the possible previous states at each step, we only need the last time T to get the likelihood of the whole series.
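Written out in that notation (X for the observations, Z for the states, A for the transition matrix), the forward recursion and the final marginalization are:

\alpha_1(j) = p(Z_1 = j) p(x_1 | Z_1 = j)                                (initialization)
\alpha_t(j) = p(x_t | Z_t = j) \sum_{i=1}^{K} \alpha_{t-1}(i) A[i, j]    (recursion, t > 1)
p(X_{1:T}) = \sum_{j=1}^{K} \alpha_T(j)                                  (marginal likelihood)

so target += log_sum_exp(logalpha[T]); adds log p(X_{1:T}), the log marginal likelihood of the whole series.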

This part is because K is the number of regimes in the HMM, and for each current state we need the weighted sum (a log_sum_exp on the log scale) over all possible previous states: reaching state j at t either from state i at t-1 or from state j at t-1, and likewise reaching state i at t either from state i at t-1 or from state j at t-1. This is why you often see loops like

for (j in 1:K) { // j = current (t)
    for (i in 1:K) { // i = previous (t-1)
         .....
    }
}

My code for logalpha_1, which is exactly the same as logalpha, illustrates this idea well:

        logalpha_1[t, j] = log_sum_exp({
          logalpha_1[t-1, 1] + log(A[1, j]) + normal_lpdf(y[t] | mu[j], sigma[j]),
          logalpha_1[t-1, 2] + log(A[2, j]) + normal_lpdf(y[t] | mu[j], sigma[j])
        });

logalpha_1[t, j] (the likelihood of being in state j at time t, jointly with the data up to t) is the log_sum_exp over the previous states of

  1. the previous logalpha, logalpha_1[t-1, 1] or logalpha_1[t-1, 2], for being in state 1 or 2 at t-1 (the recursive part)
  2. the log of the transition matrix entry, such as A[1, j], the probability of having been in state 1 and transitioning to state j
  3. the log observation likelihood

So the logalpha terms are cumulative? And then in generated quantities you sum together terms which are already cumulative, so that log_likelihood[t] includes the likelihood contribution from y[1:t]?