Getting access to "internal" log probs in HMM code (& how they behave across time)

Thanks to encouragement on this forum to look into HMMs, I’m digging in. (Thanks again, @rfc!)

Ultimately, I’m interested in continuous state spaces, but am taking things a step at a time, so am starting with discrete HMMs. The case study Tagging Basketball Events with HMM in Stan has been super-helpful. (And also 2.6 Hidden Markov models | Stan User’s Guide, which the case study is based on.) But in working it, I’ve got two questions. (Really, I just want to make sure I’m on the right track.)

First, the two questions (summarized; more details below):

  1. I was able “access” the values of posterior log-probabilties local to the model block by recreating them in the generated quantities block. Is there a better way?
  2. The estimated log probabilities decrease over time (as shown in a plot below). I assume (?) this is because I’m computing the probability of y_{1:t}, vs just y_t. So the sample space increases with t, and thus the probability of any particular sequence, y_{1:t} decreases with t. At the “hand waving” level, is this (at least somewhat) correct? (This is more of a conceptual/math question than a Stan question; hope it’s appropriate here!)

Background:

I wanted to get a handle on the response of the HMM model over time, and at each t, vs. just the stationary transition and emission probabilities across t = 1:N, as in the case study. (The case study does use the Viterbi algorithm to estimate the most likely sequence of x at each t. But that’s not my focus.)

So, I wanted to get access to the gamma matrix (of posterior log probabilities for the two states) – that is, gamma[N observations, K states] in the Tagging Basketball… code.

I tried all sorts of things, but the only way I could get “access” was just to replicate the model code in the generated quantities block, thus using the sampled parameters mu and theta to “recreate” the gamma values, via my model:

data {
  int<lower=0> N;
  int<lower=0> K;
  real y[N];
}

parameters {
  simplex[K] theta[K];
  positive_ordered[K] mu;
}

model {
  // priors
  target+= normal_lpdf(mu[1] | 3, 1);
  target+= normal_lpdf(mu[2] | 10, 1);
  // forward algorithm
  {
  real acc[K];
  real gamma[N, K];
  for (k in 1:K)
    gamma[1, k] = normal_lpdf(y[1] | mu[k], 1);
  for (t in 2:N) {
    for (k in 1:K) {
      for (j in 1:K)
        acc[j] = gamma[t-1, j] + log(theta[j, k]) + normal_lpdf(y[t] | mu[k], 1);
      gamma[t, k] = log_sum_exp(acc);
    }
  }
  target += log_sum_exp(gamma[N]);
  }
}

// My added code, so I can access gamma
// Note: I just reused the model block's variable names, acc and gamma
generated quantities {
    real acc[K];
    real gamma[N, K];
    for (k in 1:K)
    gamma[1, k] = normal_lpdf(y[1] | mu[k], 1);
  for (t in 2:N) {
    for (k in 1:K) {
      for (j in 1:K)
        acc[j] = gamma[t-1, j] + log(theta[j, k]) + normal_lpdf(y[t] | mu[k], 1);
      gamma[t, k] = log_sum_exp(acc);
    }
  }
}

So, question 1: is there some other way get access to these “internal” variables (specifically, gamma), vs. “recreating” them in the generated quantities block?

And then question 2: why do the posterior probabilities decrease over time? (As noted above, I assume it’s because I’m computing the probability of y_{1:t}, vs just y_t – ?)

Here’s a plot, showing the log probability of the two states, which both decrease exponentially. And which nicely reflects the posterior distribution of the two states at each time step, t. The data (from Tagging Basketball…) is in the bottom panel, and the log-posteriors are in the top panel. (Actually, I’m plotting just 100 gamma “samples.”) It shows how, at each point in time, the probability of the data’s current state is quite a bit larger – which is good!

Picture1

(One thing that still confuses me about Stan is how it computes and then summarizes normalized probabilities from its samples. So I want to make sure I’m not mixing up unnormalized log probabilities and actual probabilities, by somehow not accounting for normalization and/or its absence!)

If anyone is interested, here’s the code I used to create the above plot, which is a modification of the Tagging Basketball… case study code. Note that I’ve saved the .Stan code above as models/hmm_example_G.stan. (Sorry, my code is a bit messy!)

library(ggplot2)
library(patchwork)
library(rstan)

hmm_data <- readRDS("data/hmm_example.RDS")

stan_data <- list(N = length(hmm_data$y),
                  K = 2,
                  y = hmm_data$y)

hmm_fit <- stan("models/hmm_example_G.stan", data = stan_data, iter = 1e3, chains = 4)

samples <- as.matrix(hmm_fit)
# Remove the non-Gamma columns (including the last column, _lp)
samplesG = samples[, -c(1:8, dim(samples)[2])]

nSims = 100 # number of sims to plot
samplesGTrunc = samplesG[1:nSims, ]

dfTrunc = data.frame( sim=rep(1:nSims, 100*2),
                      time=rep( rep(seq(1:100), 2), each=nSims ), 
                      state=factor(rep( rep(1:2, each=100), each=nSims )),
                      logProb=c(samplesGTrunc) ) 

pltProbs = ggplot( data=dfTrunc, aes(x=time, y=logProb, color=state)) +
  geom_point(alpha=0.1) + 
  ylab("logProb(y)") +
  theme_light()

dataDf = data.frame(time=1:100, y=stan_data$y)
pltData = ggplot( data=dataDf, aes(x=time, y=y)) +
  geom_point(color="steelblue") +
  theme_light()

plt = pltProbs + pltData +  plot_layout(ncol=1, heights=c(2, 1))
plot(plt)
2 Likes

Excellent! Glad to see that you’re making progress!

I think I can help with one point:

If I follow the model code, I believe that gamma[t, k] is the unnormalised log-probability of the latent state k conditioned on the observation sequence y_{1:t}. This isn’t a normalised log-probability as the usual HMM forward algorithm doesn’t bother calculating Pr(y_{1:t}) – the prior probability of observing the specific observation sequence. You can correct for this by renormalising gamma[t, 1:K] after each HMM forward algorithm time step by doing something like the following:

z[t] = log_sum_exp(gamma[t, 1:K])
gamma[t, 1:K] = gamma[t, 1:K] - z[t]

where by subtracting the log-sum-exp of the log probabilities in log-probability space we’re doing the same thing as dividing by the sum of the probabilities in probability space.

To get a little intuition of why these are unnormalised, consider a simpler situation with only one timestep where a single observation Y is generated by a latent state X. We want to estimate \Pr(X|Y) and we can do it by applying Bayes theorem: \Pr(X|Y) = \Pr(Y|X) \Pr(X) / \Pr(Y). The \Pr(Y|X) term is your HMM observation model and \Pr(X) is your prior belief (before this time steps’ observation – over many time steps this also involves a transition from the prior state). You can see terms corresponding to those in the HMM forward calculation. The HMM forward algorithm calculation that you have implemented omits the 1/\Pr(Y=y_{1:t}) factor as it is constant with respect to the hidden state X, so you can just normalise your posterior probabilities afterwards – either per timestep (more compute but might be more numerically stable, especially for long chains of timesteps), or once after the final timestep, and save computational effort.

To get a better understanding, work through how to derive the equations for the HMM forward algorithm from \Pr(X_T | Y_{1:T}) and the Markov modelling assumptions . There’s one derivation in the Russell & Norvig AI textbook.

2 Likes

Many thanks, @rfc, for the thought and care you put into your response; super-helpful! (You’re great at explaining things! :-))

I think my intuition was in the ballpark. I actually had computed normalized probabilities, although I didn’t mention it in my post to keep it a bit shorter… I wasn’t sure, though, that I had the right idea. And, I had computed them in probability space, which I know is not in general numerically stable. (Also, I did it in R, vs. directly in the generated quantities block, which is much more convenient.) So your response and suggested code is a really big help.

After thinking about it a bit (and (re)reading a bit of Russell & Norvig – what a great book!), I see why the unnormalized “posteriors” get smaller at each time step: at each step, the priors get multiplied by likelihoods, all the elements of which are < 1.0. So the unnormalized posteriors always get smaller at each update. I had mistakenly thought it was maybe something to do with the forward algorithm, but it’s actually just a property of “basic” Bayesian updating, if you leave off the normalization step at each update.

Anyway, here’s a plot that uses your suggested code (and still showing log-probabilities). The corresponding probabilities are all quite close to 0 and 1, as they should be.

And here’s Stan code (an updated generated quantities block) that includes your normalization suggestion (which I used for the above plot):

generated quantities {
    real gamma[N, K];
    {
    real acc[K];
    real z[N];
    for (k in 1:K)
    gamma[1, k] = normal_lpdf(y[1] | mu[k], 1);
    for (t in 2:N) {
      for (k in 1:K) {
        for (j in 1:K)
          acc[j] = gamma[t-1, j] + log(theta[j, k]) + normal_lpdf(y[t] | mu[k], 1);
        gamma[t, k] = log_sum_exp(acc);
      }
      z[t] = log_sum_exp(gamma[t, 1:K]);
      for (k in 1:K) {
        gamma[t, k] = gamma[t, k] - z[t];
      }
    }
    }
}

Many thanks!

1 Like