MSM-TVTP returns different parameter pairs across state-specific equations under repeated sampling

After solving the previous problem, the MSM-TVTP model below is now able to ‘recover’ parameters with some bias. With some help during Andrew’s method’s playroom, I believe that this bias relates to a problem during the sampling stage. When using the Stan code with data from the same DGP with fixed T (time periods, 300) and S (states, 2) and varying N (separate time series; 1,3,5,7,9,11), it shows that the model generally estimates parameters across the two states that sum to the sum of the true parameters but in three distinct ‘clumps’ of combinations (see pngs). For example, if alpha1 = 2 and alpha = 6, it would frequently estimate alpha1 as 3 and alpha2 as 4.5. The same is true for the other parameters but not for gamma (the intercept in the equation determining the transition probabilities).

Now this also happens when the model is run repeatedly on the same data which led me to the conclusion above that it is not a problem of the DGP.

Any advice on things to try would be greatly appreciated!

The Stan code and the R with the DGP and simulation code are below.

// MSM-TVTP, AR(1), 2 REGIMES, KNOWN VARIANCE ACROSS BOTH REGIMES

data {
  int<lower = 0> T; // number of observed time periods
  int<lower = 0> N; // number of observed units
  matrix[N, T] y; // outcome
  int<lower = 0> M;
  row_vector[M] z[N, T]; // exogenous random variable affecting transition probabilities
}
transformed data{
  int S;
  S = 2;
}
parameters {
  ordered[2] alpha;
  real gamma[2];
  vector[M] lambda[2];
  real<lower = 0, upper = 1> phi[2];
  real<lower = 0> sigma;
}
transformed parameters {
  real fy[N, T];
  for (i in 1:N) {// looping over all observations
    vector[2] s[T];
    real p[T, S, S];
    real p_s1[T, 2];
    // [,1] P(S[t - 1] = 1 | omega[t - 1], theta) / P(S[t] = 1 | omega[t], theta)[t - 1]
    // [,2] P(S[t] = 1 | omega[t - 1], theta)
    for (t in 1:T) {
      p[t, 1, 1] = normal_cdf(gamma[1] +  z[i, t] * lambda[1], 0, 1);
      p[t, 2, 2] = normal_cdf(gamma[2] +  z[i, t] * lambda[2], 0, 1);
      p[t, 1, 2] = 1 - p[t, 1, 1];
      p[t, 2, 1] = 1 - p[t, 2, 2];
    }
    // t = 1
    p_s1[1, 2] = (1 - p[1, 2, 2]) / (2 - p[1, 1, 1] - p[1, 2, 2]); // initial
    p_s1[1, 1] =  p[1, 1, 1] *      p_s1[1, 2] +
                     p[1, 2, 1] * (1 - p_s1[1, 2]);
    for (j in 1:2){
      s[1, j] = normal_lpdf(y[i, 1] | alpha[j], sigma);
    }
    fy[i,1] = log_mix(p_s1[1, 1], s[1, 1], s[1, 2]);
    p_s1[1, 2] = exp(s[1, 1] + log(p_s1[1,1]) - fy[i,1]);
    // t = 2:T
    for (t in 2:T){
      p_s1[t, 1] = p[(t - 1), 1, 1] *      p_s1[(t - 1), 2] +
                      p[(t - 1), 2, 1] * (1 - p_s1[(t - 1), 2]); // predict
      for (j in 1:2){
        s[t,j] = normal_lpdf(y[i, t] | alpha[j] + phi[j] * y[i, (t - 1)], sigma);
      }
      fy[i,t] = log_mix(p_s1[t, 1], s[t, 1], s[t, 2]); // signal
      p_s1[t, 2] = exp(s[t, 1] + log(p_s1[t, 1]) - fy[i,t]); // update
    }
  }
}
model {
  // likelihood
  target += fy;
  // priors
  alpha ~ normal(0, 1);
}

R code

#####################################
### MC simulation                 ###
### for MSM-TVTP                  ###
#####################################

## setup
{
## libraries and options
library(rstan)
#options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

#################################################################

## seed
set.seed(12345)


## Global parameters
n_chains <- 2L # number of chains
n_iter <- 2000L # number of iterations (thinning half)
}

## simulation
dgp <- function(N, T, M){

  ## covariates
  z <- array(rnorm(N * M * T, 0, 1), dim = c(N, T, M))

  # state matrix
  s <- matrix(NA, nrow = N, ncol = T) # a T by N matrix indicating the state the individual i is in at point t in cell [t, i]

  # transition probabilities
  p <- array(NA, dim = c(2, 2, T, N))

  for (i in 1:N){
    for (t in 1:T){
      s_star <- gamma + t(z[i, t, ]) %*% lambda  # [t] because the TP are lagged when determining y
      # s_star <- gamma[1] + t(z[i, t, ]) %*% lambda[, 1]
      p_s <- matrix(NA, ncol = 2, nrow = 2)
      diag(p_s) <- pnorm(s_star) # variance of 1
      p_s[1, 2] <- 1 - p_s[1, 1]
      p_s[2, 1] <- 1 - p_s[2, 2]
      # p_11 + p_12 = 1 and p_21 + p_22 = 1
      p[ , , t, i] <- p_s                    # takes into consideration that lambda should have the same effect
      # in each state i.e. increasing the probability of staying
    }
    # states
    s[i, 1] <- rbinom(1, 1, (1 - p[2, 2, 1, i]) / (2 - p[2, 2, 1, i] - p[1, 1, 1, i])) + 1
    for (t in 2:T){
      if        (s[i, t - 1] == 1){
        s[i, t] <- rbinom(n = 1, size = 1,prob = p[1, 2, t - 1, i]) + 1
      } else if (s[i, t - 1] == 2){
        s[i, t] <- rbinom(n = 1, size = 1,prob = p[2, 2, t - 1, i]) + 1
      }
    }
  }

  # outcome
  y <- matrix(NA, nrow = N, ncol = T)
  for (i in 1:N){
    y[i, 1] <- rnorm(1, alpha[s[i, 1]], sigma)
    for (t in 2:T){
      y[i, t] <- rnorm(1, alpha[s[i, t]] + phi[s[i, t]] * y[i, t - 1], sigma)
    }
  }

  ## set up data for stan
  data_list <- list(T = T,
                    N = N,
                    M = M,
                    y = y,
                    z = z)
  ## out
  out <- list(data = data_list)
  return(out)
}

## parameters
alpha <- c(2, 6)
gamma <- c(0.5, 0.5) # intercept in TPs
lambda <- matrix(c(-0.5, 0.3, 0.9, -0.2), byrow = T, ncol = 2, nrow = M)
phi <- c(0.2, 0.8)
n_parameters <- length(alpha) + length(gamma) + length(lambda) + length(phi)
sigma <- 1


## Start time
orig_time <- Sys.time()
file <- file("progress.txt")


## simulations over
T.seq <- c(300)
N.seq <- c(2)
N.sim <- seq(1, 50)
total.sim <- length(T.seq) * length(N.seq) * length(N.sim)


## container
out_par <- matrix(ncol = 9, nrow = 0)
colnames(out_par) <- c("N", "T", "sim", "name", "mean", "sd", "q025", "q50", "q975")
data.fit.list <- list()


## Simulation
sim_counter <- 0
for (t1 in T.seq){
  for (n1 in N.seq){
    out <- dgp(n1, t1, 2)
    for (i_sim in N.sim){
      sim_counter <- sim_counter + 1
      fit <- stan("02_msm_simulation_N.stan",
                data = out[["data"]],
                chains = n_chains,
                iter = n_iter,
                init_r = 0.5,
                control = list(adapt_delta = .9))
      ## output
      fit.ext <- as.matrix(fit)
      for (ii in 1:n_parameters){
        i_mean <- mean(fit.ext[, ii])
        i_sd <- sd(fit.ext[, ii])
        i_quan <- quantile(fit.ext[, ii], probs = c(0.025, 0.50, 0.975), names = F)
        out_par <- rbind(out_par, c(n1, t1, sim_counter, colnames(fit.ext)[ii], i_mean, i_sd, i_quan ))
      }
      data.fit.list[[sim_counter]] <- list(fit.mat = fit.ext[, 1:n_parameters],
                                           data    = out[["data"]])
    }
  }
}

## clean output
out_par <- as.data.frame(out_par)

i_seq <- c(1, 2, 3, 5, 6, 7, 8, 9)
for (i in i_seq){
  out_par[, i] <- as.numeric(as.character(out_par[, i]))
}

## save output
save(out_par, file = "02_msm_simulation_NTSim_output.RData")
save(data.fit.list, file = "02_msm_simulation_NTSim_data_fit.RData")

02_msm_simulation_N.R (4.4 KB)!
00_msm_N1_11_T300_NSIM40_alpha|582x499


02_msm_simulation_N.stan (2.6 KB)

Hi,
so no one else is picking this so I will try to answer, despite (still) not undestanding your model (sorry, can’t wrap my head around the math), but there are a few things that are fishy and might be related to your problems.

First, the values of alpha you are trying to recover are not very consistent with your prior alpha ~ normal(0, 1);. It is usually a good idea to simulate data using your prior distributions instead of fixed values as this lets you discover when your priors are problematic (lead to problematic simulated data).

AFAIK, you can use sort(rnorm(...) to simulate ordered data and it seems to match what Stan does with the ordered data type.

Similarly, you are not providing (unless I am missing something) a prior for the other parameters.

Further, it looks like phi[1] + phi[2] should sum to 1? If that is so, you probably want to have simplex[2] phi;

Hope that helps at least a bit.

Thank you for taking the time. I figure that this question is quite specific and involved.
I gave alpha more generous priors in other attempts, say alpha ~ normal(4, 2) for alpha[1] = 2 and alpha[2] = 6. That did not solve the problem.

The point on simulating data with the prior distributions that are implemented in the model is well taken. I approached this from a standpoint informed by MC simulations in a Frequentist framework which is inapplicable here.

For the other parameters, I had other versions with more generous priors to the extent of the example above or more generous, i.e. straight on the value of the simulated data. That only speeds up the sampling slightly.

phi[1] and phi[2] should/could be in a simplex but I assumed that having one be 1 - the other should suffice in this basic version without being the issue that prevents this model from consistently recovering the parameters.

Again thank you. Much appreciated!

I think @martinmodrak is right that phi should be declared as a simplex. As far as I can tell nothing in the model code ensures that phi[2] = 1- phi[1], but using the simplex type would give you that. But if there are only two probabilities then you can just use a scalar phi in the parameters block and then use phi and 1 - phi in the model block.

This is IMHO a strong indication that something is very wrong. If you have priors straight on the values and the sampler still struggless, there is probably some fundamental non-identifiability AND/OR a bug in your code/math. Just went over the code a second time and still can’t see any specific issue, sorry.

@jonah
I misread the first question: phi[1] and phi[2]are the autoregressive parameters. So they don’t have to sum to 1. For the transition probabilities they sum to 1 for the two states by construction, one coming from the normal_cdf and the other being 1 - that value.

@martinmodrak
Yes, that I was also wondering about. At this point, I am pretty sure that the model is correct but that there is a problem in terms of consistently identifying the right mode rather than the two sub-optimal alternatives on which the model converges sometimes. I will update here in case that I manage to solve this.