Simple HMM does not converge

Hi all, I am learning to use HMMs and starting by estimating a model on simulated data. However, even with the simplest model I get divergences. I have checked my code and approached it in several different ways, either implementing the forward algorithm myself or using the hmm_marginal() function, but neither worked. Could anyone kindly help me take a look?

Here is the code for the simulated data. I simply assume there are 3 hidden states, and individuals can only transition between adjacent states.

import numpy as np

np.random.seed(42)

# Parameters
N = 500          # number of customers
T = 20           # time periods
K = 3            # number of hidden states

# Transition matrix
Gamma = np.array([
    [0.75, 0.25, 0.0],
    [0.25, 0.50, 0.25],
    [0.0,  0.75, 0.25]
])

# Emission probabilities for each state
emission_probs = np.array([0.1, 0.5, 0.9])

# Simulate data
states = np.zeros((N, T), dtype=int)
observations = np.zeros((N, T), dtype=int)

# Initial state: uniform random across states
states[:, 0] = np.random.choice(K, size=N)

# Generate first observation
for i in range(N):
    s0 = states[i, 0]
    observations[i, 0] = np.random.binomial(1, emission_probs[s0])

# Simulate transitions and emissions over time
for t in range(1, T):
    for i in range(N):
        prev_s = states[i, t-1]
        curr_s = np.random.choice(K, p=Gamma[prev_s])
        states[i, t] = curr_s
        observations[i, t] = np.random.binomial(1, emission_probs[curr_s])

# Collect the data into a dict for Stan
stan_data = {
    'N': N,
    'T': T,
    'K': K,
    'y': observations.tolist()
}

Here is the Stan code, approach 1:

data {
  int<lower=1> N; // number of individuals
  int<lower=1> T; // number of observations per individual
  array[N, T] int<lower=0, upper=1> y; // binary observations
}


parameters {
  // Parameters of measurement model
  ordered[3] mu;
  
  // Initial state
  simplex[3] rho;
  
  // Rows of the transition matrix
  simplex[2] t1;
  simplex[3] t2;
  simplex[2] t3;
}
transformed parameters {
  matrix[3, 3] Gamma = rep_matrix(0, 3, 3);
  array[N] matrix[3,T] log_omega;

  vector<lower=0, upper=1>[3] theta;       // emission probs: P(y_t=1 | z_t=k)
  for (k in 1:3) theta[k] = inv_logit(mu[k]);   // preserves ordering
  
  // Build the transition matrix
  Gamma[1, 1:2] = t1';
  Gamma[2,  : ] = t2';
  Gamma[3, 2:3] = t3';

  for (n in 1:N){
    for (t in 1:T){
      log_omega[n][1,t] = bernoulli_lpmf(y[n,t]|theta[1]);
      log_omega[n][2,t] = bernoulli_lpmf(y[n,t]|theta[2]);
      log_omega[n][3,t] = bernoulli_lpmf(y[n,t]|theta[3]);
    }
  }
  
}
model {
  mu ~ normal(0, 4);
  rho ~ dirichlet([1, 1, 1]);
  
  t1 ~ dirichlet([1, 1]);
  t2 ~ dirichlet([1, 1, 1]);
  t3 ~ dirichlet([1, 1]);

  for (n in 1:N){
   target += hmm_marginal(log_omega[n],Gamma,rho);
  }
}

Here is the Stan code, approach 2:

//simple example

data {
  int<lower=1> N; // number of individuals
  int<lower=1> T; // number of observations per individual
  array[N, T] int<lower=0, upper=1> y; // binary observations
}

transformed data {
  row_vector[3] rho = [1.0/3.0,1.0/3.0,1.0/3.0];
}
parameters {
  // Parameters of measurement model
  ordered[3] mu;
  
  // Initial state
  //simplex[3] rho;
  
  // Rows of the transition matrix
  simplex[2] t1;
  simplex[3] t2;
  simplex[2] t3;
}
transformed parameters {
  matrix[3, 3] Gamma = rep_matrix(0, 3, 3);
  
  vector<lower=0, upper=1>[3] theta;       // emission probs: P(y_t=1 | z_t=k)
  for (k in 1:3) theta[k] = inv_logit(mu[k]);   // preserves ordering
  
  // Build the transition matrix
  Gamma[1, 1:2] = t1';
  Gamma[2,  : ] = t2';
  Gamma[3, 2:3] = t3';
  
}
model {
  mu ~ normal(0, 4);
  //rho ~ dirichlet([1, 1, 1]);
  
  t1 ~ dirichlet([1, 1]);
  t2 ~ dirichlet([1, 1, 1]);
  t3 ~ dirichlet([1, 1]);

  for (n in 1:N) {
    vector[3] log_alpha;

    // t = 1: p(z, y) = p(z) * p(y | z)
    for (k in 1:3)
      log_alpha[k] = log(rho[k]) + bernoulli_lpmf(y[n, 1] | theta[k]);

    // forward recursion for t = 2, ..., T
    for (t in 2:T) {
      vector[3] log_alpha_new;
      for (k in 1:3) {
        vector[3] temp;
        for (j in 1:3)
          temp[j] = log_alpha[j] + log(Gamma[j, k]);
        log_alpha_new[k] = bernoulli_lpmf(y[n, t] | theta[k]) + log_sum_exp(temp);
      }
      log_alpha = log_alpha_new;
    }

    // marginal likelihood of individual n's sequence
    target += log_sum_exp(log_alpha);
  }
}

The trace plot looks like the following:

Hopefully someone with more experience in Hidden Markov Models will respond soon, but I took your detailed post as an opportunity to learn a little more about divergences. It’s particularly helpful that you included the code to simulate data.

I was able to fit your model without divergences by (1) choosing a more informative prior for \mu, and (2) setting adapt_delta=0.95. I also used fewer customers (N=100 instead of 500) to aid rapid testing.

Your prior for \mu puts most of the prior probability mass near 0 and 1 after the inv_logit transform. Here’s what I mean…
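Roughly, a quick push-forward check along these lines shows the effect (this is just a sketch, not the exact script behind my plot):

import numpy as np

rng = np.random.default_rng(0)
for sd in (4.0, 1.0):
    mu = rng.normal(0.0, sd, size=100_000)
    theta = 1.0 / (1.0 + np.exp(-mu))  # inv_logit
    extreme = np.mean((theta < 0.05) | (theta > 0.95))
    print(f"mu ~ N(0, {sd}): share of prior theta outside (0.05, 0.95) = {extreme:.2f}")

With sd = 4 a large share of the prior mass on theta piles up near 0 and 1, while sd = 1 keeps it spread over the middle of (0, 1).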

I couldn’t really explain why the first prior is problematic, but I’d argue that \mu \sim N(0,1) is actually less “informative” after the transform than \mu \sim N(0,4), so it’d be a better default. At the least, the fact that this change reduces or eliminates the divergences should help you investigate the root cause.

PS: I used CmdStanPy to call stan and check for divergences:

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="hmm.stan")  # path to wherever you saved the Stan program
fit = model.sample(
    data=stan_data, chains=4, parallel_chains=4, show_console=True, adapt_delta=0.95
)
print(fit.diagnose())

Hi, thanks for taking a look! I followed your suggestions, and they indeed reduced the divergences. When I checked the traceplot, though, it still did not seem well behaved. For instance, for Gamma[0,0] below, the parameter has a long tail on the left side and does not seem to be well mixed. Do you think this is acceptable/normal? Maybe I should use a distribution other than the normal for the prior?

I am also wondering if there is anything in particular I need to pay attention to when estimating an HMM, such as assuming the initial state distribution to be the equilibrium (stationary) distribution of the transition matrix. This is more of an identification question.
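Just to illustrate what I mean by the equilibrium state, this is how I would compute the stationary distribution of the true Gamma from my simulation (a sketch for reference only, not part of the model):

import numpy as np

# Stationary distribution pi satisfies pi @ Gamma = pi.
Gamma = np.array([
    [0.75, 0.25, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.75, 0.25],
])
eigvals, eigvecs = np.linalg.eig(Gamma.T)   # left eigenvectors of Gamma
idx = np.argmin(np.abs(eigvals - 1.0))      # eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()
print(pi)  # equilibrium probabilities of the three states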

I personally don’t use traceplots much for diagnosing issues, but I don’t see anything particularly concerning, especially if the R-hat values look good.

All the chains seem to have found the same mode, and the differences in the exact densities could just be a result of finite samples and kde settings. Try running longer chains and see if the plots look better. The long tail to the left reads, to me, like a reflection of true uncertainty in the parameter.
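For example, with CmdStanPy you can lengthen the chains along these lines (the numbers are only placeholders, reusing the model and data from above):

fit_long = model.sample(
    data=stan_data,
    chains=4,
    parallel_chains=4,
    iter_warmup=2000,      # placeholder; pick whatever your time budget allows
    iter_sampling=2000,
    adapt_delta=0.95,
)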

In short, looks good to me, for what that’s worth. I would try to test the model using simulated data. Does the model recover the true parameters of the data-generating process? Does uncertainty decrease with more and more data? Etc. This can be made more rigorous: Simulation-Based Calibration
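A quick version of that check, reusing the fit object from the sampling call above and the true values from your simulation script, might look like this (variable names match approach 1; adjust if yours differ):

import numpy as np

theta_draws = fit.stan_variable("theta")   # shape: (draws, 3)
Gamma_draws = fit.stan_variable("Gamma")   # shape: (draws, 3, 3)

print("posterior mean of theta:", theta_draws.mean(axis=0))
print("true emission probs:    ", np.array([0.1, 0.5, 0.9]))
print("posterior mean of Gamma:")
print(Gamma_draws.mean(axis=0))
print("true Gamma:")
print(np.array([[0.75, 0.25, 0.0], [0.25, 0.5, 0.25], [0.0, 0.75, 0.25]]))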

As for the prior, I don’t think there’s anything necessarily wrong with your choice of distribution. You could always put the prior directly on theta and simplify things a little. If you can capture your true prior knowledge with a distribution that behaves better numerically, that’s a win in my book.

I can't answer your question about the initial state. I found this previous post: HMM : how to specify the initial distribution - #2 by martinmodrak, and it makes sense to me that the influence of the initial state should decay over time, so you could test your hypothesis by increasing T.