Markov-Switching Model with time-varying transition probabilities in Stan

I’ve been working on the model below and recently talked to Ben about a couple of problems that arose. Andrew suggested simplifying the model to isolate the problem, but I didn’t find it that way either (see below for some things I tried previously). I figured it can’t hurt to ask here as well.

The model is a Markov-switching model with time-varying transition probabilities (TVTPs): I assume the parameters of the time series vary according to a latent variable that follows a first-order Markov process, whose transition probabilities are a function of an exogenous covariate (z) and an intercept (gamma).
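Concretely, with $S_t \in \{1, 2\}$ denoting the latent state, the model in the code below is

$$
\begin{aligned}
y_t &= \alpha_{S_t} + \phi_{S_t}\, y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{N}(0, \sigma^2),\\
p_{11,t} &= P(S_t = 1 \mid S_{t-1} = 1) = \Phi(\gamma_1 + \lambda_1 z_t),\\
p_{22,t} &= P(S_t = 2 \mid S_{t-1} = 2) = \Phi(\gamma_2 + \lambda_2 z_t),
\end{aligned}
$$

where $\Phi$ is the standard normal CDF and the chain is initialized at its stationary distribution, $P(S_1 = 1) = (1 - p_{22,1})/(2 - p_{11,1} - p_{22,1})$ (Piger eq. 14).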

Without TVTPs, the model can identify gamma. When both the intercept and the exogenous covariate are included, the model can identify neither gamma nor lambda. Alpha and phi are on average within the 50% HDI but still biased for N(simulations) = 50. Sometimes gamma[1]/gamma[2] and lambda[1]/lambda[2] have the same standard deviation or the same absolute mean value even though the true values diverge widely. Here is one model output from running the model on simulated data with N = 1 and T = 300. The true values are alpha = (2, 6), phi = (0.2, 0.7), gamma = (-2, 2), lambda = (1, 0.2).

N  sim  name        mean          sd          q25          q50           q75
1  1    alpha[1]    1.023479004   0.52433014   0.68470262   1.033403681   1.37288909
1  1    alpha[2]    5.836420795   0.24794741   5.67351637   5.834325272   5.99803102
1  1    phi[1]      0.225243761   0.03070192   0.20518439   0.224699188   0.24562940
1  1    phi[2]      0.715086473   0.01346963   0.70643604   0.715159144   0.72381077
1  1    gamma[1]   -0.664868689   0.31815327  -0.87492839  -0.646834929  -0.45570295
1  1    gamma[2]    1.670333349   0.12871636   1.58098185   1.668498695   1.75781119
1  1    lambda[1]  -0.010225889   1.04397066  -0.68672621   0.003543484   0.66562922
1  1    lambda[2]  -0.002082519   0.10865527  -0.07782131  -0.000808923   0.07186009

My problem is that I don’t understand why this happens, and that it is unclear to me how the model can provide reasonable inference for alpha and phi in the linear part if it does not know which state it is currently in.

This problem does not disappear when I include lambda for only one state, remove the cross-sectional part, or vary the size of the parameters or the number of time periods. I also ran versions that removed phi and/or alpha, to no avail. I am currently running a simulation varying N between 1 and 100, but as the model is again very slow I don’t yet know whether that improves the situation. I can provide the R code for the simulated data if necessary.

Thanks in advance for any ideas regarding this (or how to speed the model up).

Model

data {
  int<lower = 0> T; // number of observed time periods - 1
  int<lower = 0> N; // number of observed units
  matrix[N, T] y_head; // head(y, T - 1)
  matrix[N, T] y_tail; // tail(y, T - 1)
  matrix[N, T] z; // exogenous random variable affecting transition probabilities
 //  real<lower = 0> sigma; // sd; fixed for now
}
parameters {
  ordered[2] alpha;
  vector<lower = 0, upper = 1>[2] phi; // AR parameters
  real gamma[2]; // intercepts in TPs
  real lambda[2]; // slopes on z in TPs
  real<lower = 0> sigma; // scale (sd) of the observation noise
}
transformed parameters {
  real p11[N,T];
  real p12[N,T];
  real p21[N,T];
  real p22[N,T];
  real p_sprev_1_givenprev[N, T]; // P(S[t - 1] = 1 | omega[t - 1], theta)
  real p_scur_1_givenprev[N, T];  // P(S[t] = 1 | omega[t - 1], theta)
  real p_scur_1_givencur[N, T];   // P(S[t] = 1 | omega[t], theta)
  vector[2] s[N, T];
  real fy[N, T];
  for (i in 1:N) {// looping over all observations
    // 1st inner loop for initial values
    for (t in 1:T) {
      p11[i,t] = normal_cdf(gamma[1] + lambda[1] * z[i, t], 0, 1);
      p12[i,t] = 1 - p11[i,t];
      p22[i,t] = normal_cdf(gamma[2] + lambda[2] * z[i, t], 0, 1);
      p21[i,t] = 1 - p22[i,t];
    }
    // t = 1
    // Initial transition probabilities (Piger eq. 14)
    // Can also treat initial transition probabilities as a parameter to be estimated
    p_sprev_1_givenprev[i,1] = (1 - p22[i, 1]) / (2 - p11[i, 1] - p22[i,1]);
    p_scur_1_givenprev[i, 1] =  p11[i, 1] *      p_sprev_1_givenprev[i, 1] + 
                                p21[i, 1] * (1 - p_sprev_1_givenprev[i, 1]);
    for (j in 1:2){
      s[i, 1, j] = normal_lpdf(y_tail[i,1] | alpha[j] + phi[j] * y_head[i,1], sigma);
    }
    // Piger eq. 11
    fy[i,1] = log_mix(p_scur_1_givenprev[i,1], s[i,1, 1], s[i,1, 2]);
    // Piger eq. 13
    p_scur_1_givencur[i,1] = exp(s[i,1, 1] + log(p_scur_1_givenprev[i,1]) - fy[i,1]);
    // second inner loop for subsequent values
    for (t in 2:T){
      p_sprev_1_givenprev[i,t] = p_scur_1_givencur[i,(t-1)];
      // Piger eq. 10
      p_scur_1_givenprev[i,t] = p11[i, t] * p_sprev_1_givenprev[i ,t] + 
                       p21[i, t] * (1 - p_sprev_1_givenprev[i,t]);
      for (j in 1:2){
        s[i,t,j] = normal_lpdf(y_tail[i, t] | alpha[j] + phi[j] * y_head[i,t], sigma);
      }
      // Piger eq. 11
      fy[i,t] = log_mix(p_scur_1_givenprev[i,t], s[i,t, 1], s[i,t, 2]);
      // Piger eq. 13
      p_scur_1_givencur[i, t] = exp(s[i,t, 1] + log(p_scur_1_givenprev[i,t]) - fy[i,t]);
    }
  }
}
model {
  // likelihood
  for (i in 1:N) {   
    for (t in 1:T) {
      target += fy[i,t];
    }
  }
  // priors
  gamma ~ normal(0, 1);
  phi ~ uniform(0, 1);
  alpha ~ normal(0,1);
  lambda ~ normal(0, 1);
}

Honestly, this model is (for me) very hard to understand as it is quite complex. Simplifying might be necessary to get an answer here. I would suggest checking that you can recover gamma and lambda when you model only the TVTPs (i.e. when you have direct, possibly noisy, observations of the states of the time series). This would make sure that the code for the Markov process is OK.
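For example, something like this (a minimal sketch, not your model; true values taken from the ones you report above) would do that check: with directly observed states, each row of the transition matrix is just a probit regression that glm() can fit.

## Minimal sketch: simulate a 2-state chain with TVTPs and check that
## gamma and lambda are recoverable when the states are observed directly.
set.seed(1)
T <- 10000
gamma <- c(-2, 2); lambda <- c(1, 0.2)
z <- rbinom(T, 1, 0.3)
s <- integer(T); s[1] <- 1
for (t in 2:T) {
  p_stay <- pnorm(gamma[s[t - 1]] + lambda[s[t - 1]] * z[t])
  s[t] <- if (rbinom(1, 1, p_stay) == 1) s[t - 1] else 3 - s[t - 1]
}
# transitions out of state 1: P(stay) = Phi(gamma[1] + lambda[1] * z[t])
# (with these values state 1 is visited only briefly, so this fit is noisier)
to_t <- which(s[-T] == 1) + 1
fit_1 <- glm((s[to_t] == 1) ~ z[to_t], family = binomial(link = "probit"))
coef(fit_1)  # should be roughly gamma[1], lambda[1]
# transitions out of state 2: P(stay) = Phi(gamma[2] + lambda[2] * z[t])
to_t <- which(s[-T] == 2) + 1
fit_2 <- glm((s[to_t] == 2) ~ z[to_t], family = binomial(link = "probit"))
coef(fit_2)  # should be roughly gamma[2], lambda[2]

If glm() recovers the truth here but the Stan model does not, the problem likely lies in the interplay with the regression part rather than in the TVTP code itself.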

There are some things that have caught my attention which might (and might not) be problematic:

  1. real p_sprev_1_givenprev[N, T]; — if those are probabilities, shouldn’t they be declared with bounds, i.e. real<lower = 0, upper = 1> p_sprev_1_givenprev[N, T];?

  2. Label switching: you use ordered[2] alpha; to enforce ordering on the two states. But are you sure this is enough? I.e. can you be sure that the solution does not involve alpha[1] roughly equal to alpha[2] with just phi differing between the two states? If so, this would induce multimodality in your posterior.

Hope that helps!

Two minor formal nitpicks:

“identify” is usually used on this forum in a technical sense (broadly, that the data cannot in principle support inferences about a parameter). If you mean (as I presume) that the model posterior does not include the true values, we usually use the term “recover”, as in “the model cannot recover gamma from simulated data”. If you meant that gamma is not identifiable in the technical sense, could you be more specific about why you think this is the case?

Providing code almost always makes answering easier, so please provide the R code as a default :-).


First, thanks a lot for these thoughts.

The tricky part is that the algorithm from Hamilton (1989), which this model is based on, operates recursively. In the Stan code, it first computes the probability of being in either state at t, then evaluates y as a mixture of the two states, and then predicts the state at t + 1 conditional on all the information up to t. There is not much more happening in the model itself, even if it “looks” complex.
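To make the recursion concrete, here is the same filter written out in R for a single unit with known parameters (just a sketch mirroring the transformed parameters block in the Stan model):

## Hamilton filter for one unit, mirroring the Stan transformed parameters
## block: y_head = y[1:(T-1)], y_tail = y[2:T], z the covariate series.
hamilton_filter <- function(y_head, y_tail, z, alpha, phi, gamma, lambda, sigma) {
  T <- length(y_tail)
  p11 <- pnorm(gamma[1] + lambda[1] * z)  # P(S[t] = 1 | S[t-1] = 1)
  p22 <- pnorm(gamma[2] + lambda[2] * z)  # P(S[t] = 2 | S[t-1] = 2)
  fy <- numeric(T)                        # log f(y[t] | omega[t-1])
  # initial P(S = 1): stationary distribution at t = 1 (Piger eq. 14)
  p_prev <- (1 - p22[1]) / (2 - p11[1] - p22[1])
  for (t in 1:T) {
    # predict: P(S[t] = 1 | omega[t-1]) (Piger eq. 10)
    p_pred <- p11[t] * p_prev + (1 - p22[t]) * (1 - p_prev)
    # log densities of y[t] under each state
    s1 <- dnorm(y_tail[t], alpha[1] + phi[1] * y_head[t], sigma, log = TRUE)
    s2 <- dnorm(y_tail[t], alpha[2] + phi[2] * y_head[t], sigma, log = TRUE)
    # mix (Piger eq. 11; log_mix in Stan, done naively here),
    # then update P(S[t] = 1 | omega[t]) (Piger eq. 13)
    fy[t] <- log(p_pred * exp(s1) + (1 - p_pred) * exp(s2))
    p_prev <- exp(s1 + log(p_pred) - fy[t])
  }
  fy  # sum(fy) is the log likelihood
}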

Yes, those quantities are probabilities bounded between 0 and 1; adding the bounds doesn’t remove the problem, though. On label switching, I will have to check whether that’s an issue. In the simulation I set alpha[1] = 2 and alpha[2] = 6, assuming they are sufficiently different, but maybe the model indeed needs more structure.

I meant recover. I am sorry for the confusion. The R code for the simulation is below.


## libraries and options
library(rstan)
#options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

################################################################# 

## working directory

## seed
set.seed(12223345)


## setup
n_sim <- 1
T <- 300  # number of time periods
min_N <- 1
max_N <- 1
step_N <- 1

n_chains <- 2 # number of chains
n_iter <- 2000 # number of iterations (first half is warmup)

## parameters
alpha <- c(2, 6)
gamma <- c(0.5, 1) # intercept in TPs
phi <- c(0.2, 0.7) # AR coefficients
lambda <- c(1, 0.2)
sigma <- 1


## container
out_par <- matrix(ncol = 8, nrow = 0)
colnames(out_par) <- c("N", "sim", "name", "mean", "sd", "q25", "q50", "q75")

for (N in seq(min_N, max_N, step_N)){
  ## simulation
  for (i_sim in 1:n_sim){
    
    ## covariates
    z <- matrix(rbinom(N * T, 1, 0.3), ncol = N, nrow = T) #exogenous
    
    # transition probabilities: a 2 x 2 x T x N array whose 2 x 2 slices hold
    # the transition probabilities for each time period t and each unit i
    p <- array(NA, dim = c(2, 2, T, N))
    
    # state matrix: a T by N matrix whose cell [t, i] gives the state unit i is in at time t
    s <- matrix(NA, nrow = T, ncol = N)
    
    for (i in 1:N){
      for (t in 1:T){
        s_star <- gamma + lambda * z[t, i] 
        p_s <- matrix(NA, ncol = 2, nrow = 2)
        diag(p_s) <- pnorm(s_star) # variance of 1
        p_s[1, 2] <- 1 - p_s[1, 1]
        p_s[2, 1] <- 1 - p_s[2, 2]
        # p_11 + p_12 = 1 and p_21 + p_22 = 1
        p[ , , t, i] <- p_s                   
      }
      # states
      # initial state from the stationary distribution; rbinom() returns 1
      # when we land in state 2, so the probability must be
      # P(state 2) = (1 - p11) / (2 - p11 - p22), and we add 1 so that
      # the matrix codes the states as 1 and 2
      s[1, i] <- rbinom(
        1, 
        1, 
        (1 - p[1, 1, 1, i]) / 
          (2 - p[1, 1, 1, i] - p[2, 2, 1, i])
      ) + 1
      for (t in 2:T){
        if (s[t - 1, i] == 1){
          s[t, i] <- rbinom(
            n = 1,
            size = 1,
            prob = p[1, 2, t - 1, i] 
          ) + 1
        } else if (s[t - 1, i] == 2){
          s[t, i] <- rbinom(
            n = 1, 
            size = 1, 
            prob = p[2, 2, t - 1, i]
          ) + 1
        }
      }
    }
    
    # outcome
    y <- matrix(NA, nrow = T, ncol = N)
    for (i in 1:N){
      y[1, i] <- rnorm(1, alpha[s[1, i]], sigma)
      for (t in 2:T){
        y[t, i] <- rnorm(1, alpha[s[t, i]] + phi[s[t, i]] * y[t - 1, i], sigma)
      }
    }
    
    # df
    df <- matrix(ncol = 4, nrow = 0)
    colnames(df) <- c("t", "n", "y", "z")
    for (n in 1:N){
      df <- rbind(df, cbind(seq(1, T), n, y[,n], z[,n]))
    }
    df <- as.data.frame(df)
    colnames(df) <- c("t", "n", "y", "z")
    
    # preparation
    times <- max(df$t)
    units <- max(df$n)
    
    y_head <- y_tail <- z_tail <- matrix(nrow = 0, ncol = times - 1)
    for (i in 1:units){
      y_head <- rbind(y_head, head(df[df$n == i, c("y")], times - 1))
      y_tail <- rbind(y_tail, tail(df[df$n == i, c("y")], times - 1))
      z_tail <- rbind(z_tail, tail(df[df$n == i, c("z")], times - 1))
    }
    ## set up data for stan
    data_list <- list(T = times - 1,
                      N = units,
                      y_head = y_head,
                      y_tail = y_tail,
                      z = z_tail)
    ## model
    fit.1 <- stan("18_msm_coefficients_sim_v1.stan", 
                  data = data_list, 
                  chains = n_chains, 
                  iter = n_iter, 
                  init_r = 0.5, 
                  control = list(adapt_delta = .99))
    ## output
    fit.ext <- as.matrix(fit.1)
    for (ii in 1:8){
      i_mean <- mean(fit.ext[, ii])
      i_sd <- sd(fit.ext[, ii])
      i_quan <- quantile(fit.ext[, ii], probs = c(0.25, 0.50, 0.75), names = FALSE) 
      out_par <- rbind(out_par, c(N, i_sim, colnames(fit.ext)[ii], i_mean, i_sd, i_quan ))
    }
  }
}
out_par <- as.data.frame(out_par)

i_seq <- c(1, 2, 4, 5, 6, 7, 8)
for (i in i_seq){
  out_par[, i] <- as.numeric(as.character(out_par[, i]))
}
print(out_par)

Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2), 357–384.

So I was able to run the code and the model, and there seem to be a lot of problems: the effective sample size (n_eff) is 1 for many parameters (it should be at least 50, and larger is better), and the Rhat is way too high (it should be very close to 1; over 1.1 is definitely bad). Since your code does not currently display those, are you sure all the diagnostics were OK for the simplified models you’ve tried?
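For reference, rstan will display those for all parameters with, e.g.:

# n_eff and Rhat for the main parameters
print(fit.1, pars = c("alpha", "phi", "gamma", "lambda", "sigma"))
# or pull them out of the summary directly
summary(fit.1)$summary[, c("n_eff", "Rhat")]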

I would also still suggest you try to separate the Markov part from the regression part and make sure it works separately first.

I had looked at the diagnostics for the full model before and didn’t see anything suspicious. I just ran it again: Rhat is reported as 1 and the effective sample size varies between 1300 and 2000 (out of 2000 post-warmup draws) for all parameters, transformed or otherwise.

I am looking into this currently. I removed the parameters of the linear model and am feeding the model the correct intercepts through the transformed data block, so that it can distinguish between the two states:

transformed data {
  // fix the intercepts at their true values from the simulation
  vector[2] alpha;
  alpha[1] = 2;
  alpha[2] = 6;
}

Again, thank you for taking the time. It’s really helpful to have someone else take a look at this.

edit:

Something I continue to find bizarre is that despite not being able to estimate the TVTP parameters, the model seems to have no problem recovering the correct sequence of states.
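Here is roughly how I checked this (a sketch for N = 1; it assumes the simulated state matrix s and the fit fit.1 from the code above are still in the workspace, and it uses the fact that Stan’s t corresponds to original time t + 1 because of the head/tail construction):

# posterior mean of the filtered probability P(S[t] = 1 | omega[t]) for unit 1
p_filt <- colMeans(as.matrix(fit.1, pars = "p_scur_1_givencur"))
true_s1 <- as.numeric(s[-1, 1] == 1) # true states, dropping the first period
plot(p_filt, type = "l", ylab = "P(S[t] = 1 | data)")
points(true_s1, col = "red", pch = 20)
mean((p_filt > 0.5) == (true_s1 == 1)) # share of correctly classified periods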