Fitting a Hidden Markov Model

#1

Hello,

I am trying to fit a custom Hidden Markov Model with an unknown number of hidden states. Each state has 1 or more normal emissions as illustrated in the plot below (the data is faceted by the hidden states). HMC is very slow so I am using optimizing with multiple restarts; however, on my real data set I am getting a different local minimum each time I fit the model even when I take the best of 100 restarts. My questions are: is there something I could change about my model to make fitting easier? and is there a better approach than random restarts that I could try?

My simulated data is attached:
simulated_data.csv (21.3 KB)

R code:

library(rstan)

df <- read.csv('simulated_data.csv')
df <- df[order(df$date_index), ]  # sort by date before extracting anything
y <- df$norm_value
model <- stan_model(file='hmm.stan')
K <- 5

# run-length encoding gives the start offset and length of each date's block
date_rle <- rle(df$date_index)
date_index <- cumsum(date_rle$lengths) - date_rle$lengths

stan.data <- list(
  K = K,
  y = y,
  N = length(y),
  W = length(unique(df$date_index)),
  V = length(unique(df$compound_index)),
  date_index = date_index,
  date_length = date_rle$lengths,
  cpd_index = df$compound_index,
  l = min(y),
  u = max(y)
)

# random restarts: keep the fit with the highest log posterior (fit$value)
fit <- optimizing(model, data=stan.data, as_vector=FALSE)
for (i in 1:10){
  candidate_fit <- optimizing(model, data=stan.data, as_vector=FALSE)
  if (candidate_fit$value > fit$value){
    fit <- candidate_fit
  }
}

Thanks,
Ben


data {
  int<lower=2> K;                 // number of categories
  int<lower=1> W;                 // number of unique dates
  int<lower=1> V;                 // number of compounds
  int<lower=1> N;                 // number of observations
  real y[N];                      // observations
  int date_index[W];              // start offset (0-based) of each date's block in y
  int<lower=1> date_length[W];    // number of observations on each date
  int<lower=1> cpd_index[N];      // compound of each observation
  real l;                         // lower bound for the emission means (min of y)
  real u;                         // upper bound for the emission means (max of y)
}
parameters {
  simplex[K] theta[K];
  matrix<lower=0>[V, K] sigma_unordered;
  matrix<lower=l, upper=u>[V, K] mu_unordered;
}
transformed parameters {
  matrix[V, K] mu;
  matrix[V, K] sigma;
  { // order states by the mean of compound 1 to break label switching
    int idx[K];
    idx = sort_indices_asc(mu_unordered[1]);
    for (v in 1:V){
      mu[v,] = mu_unordered[v, idx];
      sigma[v,] = sigma_unordered[v, idx];
    }
  }
}
model {
  for (k in 1:K){
    vector[K] alpha;
    alpha = rep_vector(1, K);
    alpha[k] = 100;               // favor self-transitions (sticky states)
    theta[k] ~ dirichlet(alpha);
    sigma_unordered[, k] ~ cauchy(100, 10);
  }
  { // Forward algorithm
    real accumulator[K];
    real gamma[W, K];
    for (k in 1:K) {
      gamma[1, k] = 0;
      for (t in 1:date_length[1]) {
        int cpd;
        cpd = cpd_index[t];
        gamma[1, k] += normal_lpdf(y[t] | mu[cpd, k], sigma[cpd, k]);
      }
    }
    for (w in 2:W) {
      for (j in 1:K) { // j=current date (w)
        for (i in 1:K) { // i=previous date (w-1)
          int date;
          date = date_index[w];
          accumulator[i] = gamma[w-1, i] + log(theta[i, j]);
          for (t in 1:date_length[w]) {
            int cpd;
            cpd = cpd_index[date + t];
            accumulator[i] += normal_lpdf(y[date + t] | mu[cpd, j], sigma[cpd, j]);
          }
        }
        gamma[w, j] = log_sum_exp(accumulator);
      }
    }
    target += log_sum_exp(gamma[W]);
  }
}
generated quantities {
  int<lower=1, upper=K> zstar[W];
  real logp_zstar;
  { // Viterbi algorithm
    int back_pointer[W, K];   // backpointer to most likely previous state
    real delta[W, K];     // max prob for the sequence up to t
    
    for (k in 1:K) {
      delta[1, k] = 0;
      for (t in 1:date_length[1]) {
        int cpd;
        cpd = cpd_index[t];
        delta[1, k] += normal_lpdf(y[t] | mu[cpd, k], sigma[cpd, k]);
      }
    }
    for (w in 2:W) {
      for (j in 1:K) { // j=current (w)
        delta[w, j] = negative_infinity();
        for (i in 1:K) { // i=previous (w-1)
          real logp;
          int date;
          logp = delta[w-1, i] + log(theta[i, j]);
          date = date_index[w];
          for (t in 1:date_length[w]) {
            int cpd;
            cpd = cpd_index[date + t];
            logp += normal_lpdf(y[date + t] | mu[cpd, j], sigma[cpd, j]);
          }
          if (logp > delta[w, j]) {
            back_pointer[w, j] = i;
            delta[w, j] = logp;
          }
        }
      }
    }
    logp_zstar = max(delta[W]);
    for (k in 1:K) {
      if (delta[W, k] == logp_zstar) {
        zstar[W] = k;
      }
    }
    for (w in 1:(W-1)) {
      zstar[W - w] = back_pointer[W - w + 1, zstar[W - w + 1]];
    }
  }
}


#2

Can you run HMC if you fix the sigma parameters? Is it possible to do this without too much trouble?

This prior:

sigma_unordered[, k] ~ cauchy(100, 10)

seems really wide.
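
To be concrete, by fixing the sigma parameters I mean moving sigma out of the parameters block and passing it in as data, so HMC only has to explore theta and mu. A rough sketch of the two blocks that would change (everything else stays as in your program, except that the cauchy prior on sigma goes away and sigma either comes in with its columns already ordered or gets reordered alongside mu in transformed parameters):

data {
  int<lower=2> K;                 // number of categories
  int<lower=1> W;                 // number of unique dates
  int<lower=1> V;                 // number of compounds
  int<lower=1> N;                 // number of observations
  real y[N];                      // observations
  int date_index[W];
  int<lower=1> date_length[W];
  int<lower=1> cpd_index[N];
  real l;
  real u;
  matrix<lower=0>[V, K] sigma;    // fixed emission scales, now supplied as data
}
parameters {
  simplex[K] theta[K];
  matrix<lower=l, upper=u>[V, K] mu_unordered;   // sigma_unordered removed
}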


#3

The prior for sigma is quite wide. I set the prior for sigma this way because I do not know ahead of time how many states there are. When fitting 5 states to the simulated data that only has 3 states I get a sigma that looks like:

            [,1]     [,2]       [,3]      [,4]      [,5]
[1,]   0.2462307 100.0000 99.9999820 1.0566073  99.99992
[2,]   0.2715795 100.0002  0.1195531 0.6511171 100.00006
[3,]  99.9999119 100.0000  0.2436370 0.1743294 100.00004
[4,]  99.9999999 100.0001  0.1557341 0.6845725 100.00018

Essentially the model is not using states 2 or 5 to describe the data. Decreasing the location in the prior causes the model to tend towards using more states than it should, while decreasing the scale of the prior does not leave enough probability density near the data.

I did try changing the prior to cauchy(0, 1), but the HMC sampling was not any faster.

Ben


#4

Oh yeah, I’m not sure what to do about the unknown number of hidden states. That seems like a model selection problem which is outta my league. And parameters not informed by the data, especially heavy tailed ones, aren’t something you wanna have in your model. If you end up having to use Cauchies, there’s a reparameterization that might help – check https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html and https://betanalpha.github.io/assets/case_studies/fitting_the_cauchy.html.
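
For reference, the trick in those links is to sample a bounded variable and push it through tan(), so HMC never has to wander into the heavy tails. A toy, self-contained sketch for a single half-Cauchy(0, 1) scale (the data and names here are made up for the example, it’s not your HMM):

data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real mu;
  real<lower=0, upper=1> sigma_raw;     // bounded "angle", implicit uniform(0, 1) prior
}
transformed parameters {
  real<lower=0> sigma;
  sigma = tan(sigma_raw * pi() / 2);    // implies sigma ~ cauchy(0, 1) truncated to positive
}
model {
  mu ~ normal(0, 10);
  y ~ normal(mu, sigma);                // sigma gets a half-Cauchy prior via the transform
}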

What I was suggesting was more exploratory than anything, I guess: if you had the right number of states and good estimates of sigma, does the model sample fast then?

If so, then the follow-up questions are: is there a way to get a rough estimate of the noise of each emission process? Is there a reasonable assumption under which you only have a single noise parameter to estimate?
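
For the single-noise-parameter version, the simplest sketch would be replacing the whole sigma matrix with one scalar, so every normal_lpdf(... | mu[cpd, k], sigma[cpd, k]) call becomes normal_lpdf(... | mu[cpd, k], sigma). Roughly, only the parameters block changes (plus dropping the sigma reordering and the cauchy prior):

parameters {
  simplex[K] theta[K];
  matrix<lower=l, upper=u>[V, K] mu_unordered;
  real<lower=0> sigma;            // one noise scale shared by every compound and state
}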


#5

Let me know if the following is not what you had in mind.

Since I am using simulated data I know there are 3 states, and I know the true standard deviation (sigma) of each emission in each state. That is:

      [,1]      [,2]      [,3]
[1,] 0.2486687 1.0673293        NA
[2,] 0.2748552 0.6577700 0.1206325
[3,]        NA 0.1764864 0.2475144
[4,]        NA 0.6929653 0.1571392

There are NAs because not all emissions exist in all states. I replaced the NAs with 100 and used the resulting matrix to initialize the HMC sampling. Without initialization, 200 iterations took 1095 seconds; with initialization, sampling took 2947 seconds (i.e. longer). I also got a warning from the initialized model that all of the post-warmup transitions exceeded the maximum tree depth, which I had already increased to 15.


#6

This means the sampler is having trouble exploring the posterior. Are the Neff values on any parameters particularly low? Like is there one that stands out? Or are they all really bad together?


#7

When I initialize sigma as before, all of the n_eff values are bad:

Without initialization (still bad, but better):


#8

It looks like Neff is lowest for the 3,1 and 4,1 parameters, which are the ones that aren’t being used. So these uninformative parameters are the ones that are probably limiting the performance of HMC.

It’ll be a little awkward, but you can declare an array of 14 parameters, fill up a 4x4 matrix in transformed parameters, and then use that matrix. For the two entries you aren’t using, you can put whatever in there. I’m not sure I said that clearly – does that make sense?


#9

If I’m guessing right, you want me to initialize all the parameters of the model. To make that less awkward I just used the optimizing fit, which looks like:

I checked these values and they looked pretty good. Initializing the sampling with the fit$par gives:

The sampling time was better (1527 seconds); however, there were still 63 transitions that exceeded the max tree depth of 15. The n_effs look better.


#10

Sorry for the delay. I wasn’t looking at my unread messages, so that’s on me :P.

Nah, I want to remove the parameters that are not being used. I do not want to sample them at all. Unused parameters are really hard on the sampler.

For instance for a 2x2 matrix, say we’re using all but the parameter at the 2,2 index, we could do:

parameters {
  real params[3];                 // only the three entries we actually use
}

transformed parameters {
  matrix[2, 2] mparams;

  mparams[1, 1] = params[1];
  mparams[1, 2] = params[2];
  mparams[2, 1] = params[3];
  mparams[2, 2] = 0;              // unused entry; fill with any constant
}

model {
  params ~ normal(0, 1);
}

This way we’re only sampling three things instead of four.

0 Likes