Question about the Viterbi algorithm

Hi everyone,

In the User Guide about recovering the most likely sequence of states in a hidden Markov model, the hidden state at the final time point is assigned to be the state with the highest probability:

log_p_y_star = max(best_logp[T_unsup]);
for (k in 1:K) {
  if (best_logp[T_unsup, k] == log_p_y_star) {
    y_star[T_unsup] = k;
  }
}

I am wondering if this should actually be a categorical draw from those log probabilities:

y_star[T_unsup] = categorical_logit_rng(best_logp[T_unsup]);

Can someone (presumably @Bob_Carpenter) confirm?

I am running into an issue with a Jolly-Seber model where all augmented individuals have the same log-likelihood, and therefore they all end up with exactly the same latent state sequence. I believe that categorically sampling from the final log probabilities and then following the backpointers is what actually produces draws from the respective posteriors.

Cheers,

Matt

The output of the Viterbi algorithm is literally the single highest-probability path, not a random sample from the posterior distribution of the state trajectories. To get a random sample of a trajectory, instead use the forward-backward recursion, but rather than smoothing during the backward pass to compute the marginal probabilities of the states, sample during the backward pass to get a draw from the joint distribution of the states.

Here's an example for use in dynamic occupancy models. Contrast this forward-filtering-backward-sampling function with the previous forward-filtering-backward-smoothing function, which is what is most commonly meant when we say "forward-backward algorithm".

https://github.com/jsocolar/flocker/blob/main/R/get_Z.R/#L500

Hey Jacob, thanks a lot, that makes perfect sense. Here I was thinking the Viterbi and forward-backward algorithms were the same thing.

Is the forward-backward algorithm, together with EM, used for model fitting, while Viterbi is used for computing the highest-probability trajectory based on the outcome of model fitting? Thanks.

Fitting is generally based on the forward algorithm, which is nothing more or less than an efficient technique for marginalizing over all possible hidden state sequences. When you say 'em' do you mean expectation maximization? More commonly with Stan we might use full MCMC. In either case, the forward algorithm computes the likelihood.
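To make that concrete, here is a minimal pure-Python sketch (not Stan code; the 2-state HMM and all its numbers are invented for illustration) of the forward algorithm, checked against brute-force enumeration of every hidden state path:

```python
import itertools

# Toy 2-state HMM with hypothetical parameters (illustration only).
PI = [0.6, 0.4]               # initial state probabilities
A = [[0.7, 0.3], [0.2, 0.8]]  # A[i][j] = P(z_t = j | z_{t-1} = i)
B = [[0.9, 0.1], [0.3, 0.7]]  # B[k][y] = P(y_t = y | z_t = k)
Y = [0, 1, 1, 0]              # observed sequence
K, T = len(PI), len(Y)

def forward(pi, a, b, y):
    """Forward recursion: alpha[t][k] = P(z_t = k, y_1..t).
    Summing the final row marginalizes over all hidden state paths."""
    alpha = [[pi[k] * b[k][y[0]] for k in range(K)]]
    for t in range(1, len(y)):
        alpha.append([b[k][y[t]] * sum(alpha[-1][j] * a[j][k] for j in range(K))
                      for k in range(K)])
    return alpha, sum(alpha[-1])

alpha, lik = forward(PI, A, B, Y)

# Brute force: sum the joint probability over all K**T hidden paths.
brute = 0.0
for z in itertools.product(range(K), repeat=T):
    p = PI[z[0]] * B[z[0]][Y[0]]
    for t in range(1, T):
        p *= A[z[t - 1]][z[t]] * B[z[t]][Y[t]]
    brute += p
```

A real implementation (including Stan's) works on the log scale to avoid underflow on long sequences; plain probabilities are used here only to keep the recursion easy to read.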

The forward-backward algorithm in its usual sense of forward-filtering-backward-smoothing computes (conditional on the fitted parameters) the marginal probabilities of each hidden state at each timestep. In an MCMC context, this would be applied iteration-wise over the posterior samples to get a posterior distribution for the marginal state probabilities.
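A sketch of that smoothing step, again in pure Python with invented toy numbers: combine the forward quantities alpha with backward quantities beta to get the per-timestep marginal state probabilities gamma.

```python
# Toy 2-state HMM with hypothetical parameters (illustration only).
PI = [0.6, 0.4]               # initial state probabilities
A = [[0.7, 0.3], [0.2, 0.8]]  # A[i][j] = P(z_t = j | z_{t-1} = i)
B = [[0.9, 0.1], [0.3, 0.7]]  # B[k][y] = P(y_t = y | z_t = k)
Y = [0, 1, 1, 0]              # observed sequence
K, T = len(PI), len(Y)

# Forward pass: alpha[t][k] = P(z_t = k, y_1..t).
alpha = [[PI[k] * B[k][Y[0]] for k in range(K)]]
for t in range(1, T):
    alpha.append([B[k][Y[t]] * sum(alpha[-1][j] * A[j][k] for j in range(K))
                  for k in range(K)])
lik = sum(alpha[-1])

# Backward pass: beta[t][k] = P(y_{t+1}..T | z_t = k).
beta = [[1.0] * K]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[k][j] * B[j][Y[t + 1]] * beta[0][j] for j in range(K))
                    for k in range(K)])

# Smoothed marginals: gamma[t][k] = P(z_t = k | y_1..T).
gamma = [[alpha[t][k] * beta[t][k] / lik for k in range(K)] for t in range(T)]
```

Each row of gamma is a proper probability distribution over the K states at that timestep; in an MCMC workflow this whole computation would be repeated per posterior draw of the parameters.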

The forward-filtering-backward-sampling algorithm takes a random draw from the joint distribution of hidden states (conditional on the fitted parameters). Again in an MCMC context, this would be applied iteration-wise over the posterior samples to get samples from the posterior distribution of the joint state sequences.
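A hedged sketch of forward-filtering-backward-sampling (pure Python, same invented toy HMM): sample the final state from its filtered distribution, then walk backwards, reweighting each filtered distribution by the transition into the already-sampled successor. The empirical frequency of the final state should match its filtered marginal.

```python
import random

# Toy 2-state HMM with hypothetical parameters (illustration only).
PI = [0.6, 0.4]               # initial state probabilities
A = [[0.7, 0.3], [0.2, 0.8]]  # A[i][j] = P(z_t = j | z_{t-1} = i)
B = [[0.9, 0.1], [0.3, 0.7]]  # B[k][y] = P(y_t = y | z_t = k)
Y = [0, 1, 1, 0]              # observed sequence
K, T = len(PI), len(Y)

# Forward filtering: alpha[t][k] = P(z_t = k, y_1..t).
alpha = [[PI[k] * B[k][Y[0]] for k in range(K)]]
for t in range(1, T):
    alpha.append([B[k][Y[t]] * sum(alpha[-1][j] * A[j][k] for j in range(K))
                  for k in range(K)])

def ffbs():
    """One joint draw z_1..T from P(z | y): sample z_T from the filtered
    distribution, then backwards, reweighting by the transition into the
    already-sampled successor state."""
    z = [0] * T
    z[-1] = random.choices(range(K), weights=alpha[-1])[0]
    for t in range(T - 2, -1, -1):
        w = [alpha[t][k] * A[k][z[t + 1]] for k in range(K)]
        z[t] = random.choices(range(K), weights=w)[0]
    return z

# Empirical check: final-state frequency should match the filtered marginal.
random.seed(1)
draws = [ffbs() for _ in range(20000)]
freq0 = sum(z[-1] == 0 for z in draws) / len(draws)
marg0 = alpha[-1][0] / sum(alpha[-1])
```

This is the same structure as the forward-filtering-backward-sampling function linked above, just stripped to its essentials; each call to ffbs() returns one sample from the joint distribution of the hidden states.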

The Viterbi algorithm yields the single highest-probability trajectory conditional on the fitted parameters. In an MCMC context, if the Viterbi algorithm is applied iteration-wise over the posterior samples, it is hard to put one's finger on what conceptual quantity the resulting posterior samples represent, and in general I don't think the Viterbi algorithm is widely used in MCMC contexts. If one were to fit the model via maximum likelihood, then conditional on the maximum likelihood parameter estimates the Viterbi algorithm would yield the most likely state sequence. However, in a Bayesian context I don't think that the Viterbi algorithm output at the maximum a posteriori parameter estimate necessarily yields the maximum a posteriori state sequence. It seems possible that some other state sequence could have more posterior mass overall when integrating over the posterior distribution of the parameters, despite having lower mass conditional on the maximum a posteriori parameter values.
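For completeness, a minimal Viterbi sketch in the same style (pure Python, invented toy numbers), verified against brute-force search over all paths:

```python
import itertools

# Toy 2-state HMM with hypothetical parameters (illustration only).
PI = [0.6, 0.4]               # initial state probabilities
A = [[0.7, 0.3], [0.2, 0.8]]  # A[i][j] = P(z_t = j | z_{t-1} = i)
B = [[0.9, 0.1], [0.3, 0.7]]  # B[k][y] = P(y_t = y | z_t = k)
Y = [0, 1, 1, 0]              # observed sequence
K, T = len(PI), len(Y)

def viterbi():
    """Dynamic program over max-probability prefixes, with backpointers."""
    best = [[PI[k] * B[k][Y[0]] for k in range(K)]]
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            j = max(range(K), key=lambda j: best[-1][j] * A[j][k])
            ptr.append(j)
            row.append(best[-1][j] * A[j][k] * B[k][Y[t]])
        best.append(row)
        back.append(ptr)
    # Trace backpointers from the best final state.
    path = [max(range(K), key=lambda k: best[-1][k])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

def path_prob(z):
    """Joint probability of a full hidden path with the observations."""
    p = PI[z[0]] * B[z[0]][Y[0]]
    for t in range(1, T):
        p *= A[z[t - 1]][z[t]] * B[z[t]][Y[t]]
    return p

# Brute-force check over all K**T paths.
brute_best = max(itertools.product(range(K), repeat=T), key=path_prob)
```

Note the contrast with the sampling version: Viterbi replaces the sums of the forward recursion with maxes and follows deterministic backpointers, so it always returns the same single path.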


Thanks for the detailed explanation. So, in Stan, since we don't care about the marginal distribution of the hidden state at each time, only about the probability of the observations given the parameters, we don't necessarily need the forward-backward algorithm (we can marginalize over all the hidden states with the forward algorithm alone to get the likelihood)?

We might or might not care about the marginal state probabilities, but we don't need to compute them in order to compute the likelihood and fit the model. We compute the likelihood via the forward algorithm, and optionally we use the forward-backward algorithm post-hoc to recover the marginal state probabilities, if we happen to care about them.


As @jsocolar pointed out, not for recovering the best state. But you can use that idea to simulate draws from the posterior to do things like posterior predictive checks.

This isn't what actually happens. What we do is just calculate the log density, calculate gradients, and then follow those using either optimization or Hamiltonian Monte Carlo or variational inference.

The marginalization calculation is identical for Bayes or optimization. When you do the marginalization in Stan and run optimization, you get what is called a generalized EM algorithm (GEM), where the M step just makes progress without going all the way to the optimum; that's enough to guarantee you eventually reach a local optimum.

Having said that, the way people often fit HMMs is using the forward-backward algorithm, where the backward pass pulls out sufficient statistics that can be more easily and efficiently used for updating.

This is the key insight. If you look at the latent discrete parameters chapter of the User's Guide, we go into some more detail.