# HMM with GP

Hello Everyone,

I am trying to implement a model that combines a Gaussian Process with HMM. Specifically, it assumes that the observations are generated through a process as follows:

Obs ~ GP + \alpha_s

where the \alpha_s is a state dependent intercept is emitted by a Gaussian HMM process and GP is a Gaussian process with SE kernel.
I am trying to implement it in Stan. While checking the model with simulated data, the estimated HMM means are equally shifted by a random amount. For example, in the following sample output each estimated mean is shifted approximately by an amount -0.74:

Warning messages:

1: There were 1000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See

http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded

True Transition Matrix

      [,1]       [,2]      [,3]


[1,] 0.64 0.05 0.30

[2,] 0.10 0.06 0.83

[3,] 0.14 0.20 0.64

Estimated Transition Matrix

      [,1]           [,2]                 [,3]


[1,] 0.67 0.06 0.26

[2,] 0.11 0.22 0.65

[3,] 0.24 0.15 0.59

True State Means

[1] 1.30 1.94 2.05

Estimated State Means

[1] 0.56 1.16 1.28

The amount of shift randomly differs every time I run the code with different test conditions. The mixing in the trace-plots seem to be very poor. It seems that there is an identification issue and I am not sure how to resolve that.

I have imposed ordering constraint on the HMM means. I have also tried Non-centered parameterizations for both GP and HMM parameters. Even set the initial value to zero for the latent function in GP. But, nothing seems to work.

data {
int<lower=1> N;
real x[N];
vector[N] y;
int<lower=1> S;
}

transformed data {
real delta = 1e-9;
}

parameters {
vector[N] eta;

real rho_tilde;
real<lower=0> rho_m;
real<lower=0> rho_s;

real alpha_tilde;
real<lower=0> alpha_m;
real<lower=0> alpha_s;

real sigma_tilde;
real<lower=0> sigma_m;
real<lower=0> sigma_s;

vector[S] muH_tilde;
vector<lower=0>[S] muH_s;
vector<lower=0>[S] muH_m;

simplex[S] A[S];
simplex[S] pi1;

real<lower=0> sigmaH[S];
real alphaH[N];
vector[N-1] free_pnum;
}

transformed parameters {
vector[N] f;
vector[1] z;
matrix[N, N] L_K;
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
vector[S] muH;
vector[S] muTemp;

rho = exp(log(rho_m) + rho_s * rho_tilde);
alpha = exp(log(alpha_m) + alpha_s * alpha_tilde);
sigma = exp(log(sigma_m) + sigma_s * sigma_tilde);

for (i in 1:S)
muTemp[i] = exp(log(muH_m[i]) + muH_s[i] * muH_tilde[i]);

muH = sort_asc(muTemp);

z[1] = 0;
f = append_row(z,free_pnum);
{
matrix[N, N] K = cov_exp_quad(x, alpha, rho);
for (n in 1:N)
K[n, n] = K[n, n] + delta;

L_K = cholesky_decompose(K);
f = L_K * eta;
}
}

model {
vector[N] theta;
vector[S] logalpha[N];

target += inv_gamma_lpdf(rho_m    | 2, 0.5);
target += normal_lpdf(alpha_m     | 0, 2.0);
target += normal_lpdf(sigma_m     | 0, 1.0);
target += normal_lpdf(muH_m       | 0, 1.0);

target += normal_lpdf(rho_s       | 0, 0.5);
target += normal_lpdf(alpha_s     | 0, 0.5);
target += normal_lpdf(sigma_s     | 0, 0.5);
target += normal_lpdf(muH_s       | 0, 0.5);

target += normal_lpdf(rho_tilde   | 0, 1.0);
target += normal_lpdf(alpha_tilde | 0, 1.0);
target += normal_lpdf(sigma_tilde | 0, 1.0);
target += normal_lpdf(muH_tilde   | 0, 1.0);

target += normal_lpdf(eta         | 0, 1.0);
target += gamma_lpdf(sigmaH       | 2, 2.0);

// Forward algorithm
{
real accumulator[S];

logalpha[1] = log(pi1) + normal_lpdf(alphaH[1] | muH, sigmaH);

for (t in 2:N) {
for (j in 1:S) {
for (i in 1:S) {
accumulator[i] = logalpha[t-1, i] + log(A[i, j]) + normal_lpdf(alphaH[t] | muH[j], sigmaH[j]);
}
logalpha[t, j] = log_sum_exp(accumulator);
}
}
}

target += log_sum_exp(logalpha[N]);

for (n in 1:N)
theta[n] = f[n] + alphaH[n];

target += normal_lpdf(y | theta, sigma);

}



Data generation Code in R

library(rstan)
library(ggplot2)
library(shinystan)

set.seed(123456789)

logit <- function(x){1/(1+exp(-x))}

runif_simplex <- function(T) {
x <- -log(runif(T))
x / sum(x)
}

hmm_generate <- function(K, T) {
# 1. Parameters
pi1   <- runif_simplex(K)
A     <- t(replicate(K, runif_simplex(K)))
mu    <- sort(rnorm(K, 1.0 * 1:K, 1))
sigma <- 0.01

# 2. Hidden path
z <- vector("numeric", T)

z[1] <- sample(1:K, size = 1, prob = pi1)
for (t in 2:T)
z[t] <- sample(1:K, size = 1, prob = A[z[t - 1], ])

# 3. Observations
y <- vector("numeric", T)
for (t in 1:T)
y[t] <- rnorm(1, mu[z[t]], sigma)

list(y = y, z = z,
theta = list(pi1 = pi1, A = A, mu = mu, sigma = sigma))
}

K <- 3
N <- 100
dataset  <- hmm_generate(K, N)
dataset
f <- function(x) x^2 + x
x <- rnorm(N, mean = 0, sd = 0.5)
yhmm <-  rnorm(length(x), mean=f(x)+ dataset$y, sd=0.1) plot(x,yhmm) stan_data = list(N=N, x=x, y=yhmm, S=K) fit_hmm <- stan(file = "HMMGp.stan", data=stan_data, verbose = F, chains = 1, control = list(adapt_delta = 0.99)) muH <- extract(fit_hmm, pars = 'muH')[[1]] A <- extract(fit_hmm, pars = 'A')[[1]] sigmaH <- extract(fit_hmm, pars = 'sigmaH')[[1]] hmat = c() for(i in 1:3){ for(j in 1:3)hmat=c(hmat,mean(A[][,i,][,j])) } A = t(matrix(hmat, nrow=3,ncol=3)) dataset$theta$A A mH = c() for(i in 1:3){mH=c(mH,mean(muH[,i]))} mH dataset$theta$mu sH = c() for(i in 1:3){sH=c(sH,mean(sigmaH[,i]))} sH dataset$theta$sigma  Is it even possible to estimate this model? Please have a look at the code and kindly guide me. Thank you so much. 1 Like Are there correlations between corresponding elements of alphaH and muH? If that leads nowhere, have you tried a simpler version of this model with the various standard deviation parameters fixed? Hi Ben, Thank you for responding! I did check the correlations between elements of alphaH and muH, but they are hardly correlated. Yes, I did run the simpler versions with fixed SD first. I am attaching the stan code for the simplest version (fixed SD and just the ordering constraint on muH) here data { int<lower=1> N; real x[N]; vector[N] y; int<lower=1> S; } transformed data { real delta = 1e-9; } parameters { real<lower=0> rho; real<lower=0> alpha; real<lower=0> sigma; vector[N] eta; simplex[S] A[S]; simplex[S] pi1; ordered[S] muH; real<lower=0> sigmaH[S]; real alphaH[N]; } transformed parameters { vector[N] f; matrix[N, N] L_K; { matrix[N, N] K = cov_exp_quad(x, alpha, rho); for (n in 1:N) K[n, n] = K[n, n] + delta; L_K = cholesky_decompose(K); f = L_K * eta; } } model { vector[N] theta; vector[S] logalpha[N]; target += inv_gamma_lpdf(rho | 2, 0.5); target += normal_lpdf(alpha | 0, 2); target += normal_lpdf(sigma | 0, 1); target += normal_lpdf(eta | 0, 1); { // Forward algorithm real accumulator[S]; logalpha[1] = log(pi1) + normal_lpdf(alphaH[1] | muH, sigmaH); for (t in 2:N) { for (j in 1:S) { for (i in 1:S) { accumulator[i] = logalpha[t-1, i] + log(A[i, j]) + normal_lpdf(alphaH[t] | muH[j], sigmaH[j]); } logalpha[t, j] = log_sum_exp(accumulator); } } } // Forward target += log_sum_exp(logalpha[N]); for (n in 1:N) theta[n] = f[n] + alphaH[n]; target += normal_lpdf(y | theta, sigma); }  But the problems were there also â€¦ Hi Ben, Once again I checked the correlations between muH and alphaH, this time for the most basic model (with just the ordering constraint on muH and fixed SD). In this case you are absolutely right â€¦ they are highly correlated !!! This is what the correlations look like â€¦ [1] 0.9933211 [1] 0.9942825 [1] 0.9894899 [1] 0.9941154 [1] 0.9900316 [1] 0.9933434 [1] 0.9901557 [1] 0.9712893 [1] 0.9895402 [1] 0.9900569 [1] 0.9890224 â€¦ I am attaching the replication code here (the corresponding Stan code is already shared in my previous reply) library(rstan) library(ggplot2) library(shinystan) set.seed(900) logit <- function(x){1/(1+exp(-x))} runif_simplex <- function(T) { x <- -log(runif(T)) x / sum(x) } hmm_generate <- function(K, T) { # 1. Parameters pi1 <- runif_simplex(K) A <- t(replicate(K, runif_simplex(K))) mu <- sort(rnorm(K, 1.0 * 1:K, 1)) sigma <- 0.01 # 2. Hidden path z <- vector("numeric", T) z[1] <- sample(1:K, size = 1, prob = pi1) for (t in 2:T) z[t] <- sample(1:K, size = 1, prob = A[z[t - 1], ]) # 3. Observations y <- vector("numeric", T) for (t in 1:T) y[t] <- rnorm(1, mu[z[t]], sigma) list(y = y, z = z, theta = list(pi1 = pi1, A = A, mu = mu, sigma = sigma)) } K <- 3 N <- 100 dataset <- hmm_generate(K, N) f <- function(x) x^2 + x x <- rnorm(N, mean = 0, sd = 0.5) yhmm <- rnorm(length(x), mean=f(x)+ dataset$y, sd=0.1)
plot(x,yhmm)

stan_data = list(N=N, x=x, y=yhmm, S=K)
fit_hmm <- stan(file = "stan/HMMGp3.stan", data=stan_data, verbose = T, chains = 1)
muH   <- extract(fit_hmm, pars = 'muH')[[1]]
A   <- extract(fit_hmm, pars = 'A')[[1]]

hmat = c()
for(i in 1:3){for(j in 1:3)hmat=c(hmat,mean(A[][,i,][,j]))}
A = t(matrix(hmat, nrow=3,ncol=3))
dataset$theta$A
A
mH = c()
for(i in 1:3){mH=c(mH,mean(muH[,i]))}
mH
dataset$theta$mu

alphaH   <- extract(fit_hmm, pars = 'alphaH')[[1]]
aH = c()
for(i in 1:N){aH=c(aH,mean(alphaH[,i]))}
aH

for(j in 1:K){
for(i in 1:N){
print(cor(muH[,j],alphaH[,i]))
}
}


This is what the muH[1] trace looks like

The autocorrelation plot:

So what should be the next step ?

This probably means that, while youâ€™re estimating alphaH and muH, you donâ€™t have the data to estimate them both.

Back to the model, youâ€™re saying:

\mu_1 \sim GP(t)\\ \mu_2 \sim HMM(t)\\ y \sim \text{normal}(\mu_1 + \mu_2, \sigma)

Something like this? And you wanna estimate \mu_1, \mu_2, and all the parameters of the GP and HMM?

Thatâ€™s correct â€¦ I want to estimate the parameters of GP and HMM â€¦ Is there a way to do that ?

The variables \mu_1 and \mu_2 are probly gonna be super unidentified. Either the GP or HMM can explain the output, and both of those models are pretty flexible.

Without heavily constraining the problem, youâ€™re probably gonna have lots of trouble.

Did this start as an HMM and add the GP? Or did this start as a GP and add an HMM? Either way, what was the motivation for the second model? Maybe thereâ€™s a different assumption to be made there.

I understand â€¦ It started as a GP and HMM came later. The idea is as follows: suppose we are looking at a flexible pattern, e.g. we read book for a certain random period of time every day and the duration is generally higher on the weekends. This is a regular pattern, but sometimes our reading time significantly goes up for a few consecutive days (e.g. when we are approaching the climax of the current book and just canâ€™t leave it whether its weekend or not). After this heightened period of action we again go back to the original pattern till we hit the next climax. So the idea was to model the regular pattern with a GP but add a HMM component (say normal vs climax state) to capture the periods of heightened activity.

Oooh, okay. Do you have this much structure in your actual problem? If so, then maybe just encoding things with covariates is the way to go.

Something like:

\mu_1 \sim GP \\ y \sim \text{normal}(\mu_1 + a x)

where x is a 0 and 1 indicator (0 on weekdays, 1 on weekends) and then a represents the weekend effect

Thisâ€™ll be a much simpler model to interpret/fit.

Facebook prophet does this sorta thing if youâ€™re actually dealing with data that has weekly, seasonal, and holiday trends or whatnot: Prophet | Forecasting at scale.

Thatâ€™s right â€¦ the temporal periodicities can be accounted for by the GP â€¦ but the focus is on these heightened activity periods that occur randomly