How to deal with "total ordering" in practice under label switching

Suppose I have a simple MSVAR model with 1 lag, 2 variables, and 2 regimes, where the VAR term is regime-independent:

Regime 1:

y^1_t = y^1_{t-1} \cdot \gamma_{0,1} + y^2_{t-1} \cdot \gamma_{0,2} + \gamma_{1,1}

y^2_t = y^1_{t-1} \cdot \gamma_{0,1} + y^2_{t-1} \cdot \gamma_{0,2} + \gamma_{1,2}

Covariance 1: \Sigma^1= \begin{pmatrix} \Sigma^1_{11} & \Sigma^1_{12} \\ \Sigma^1_{21} & \Sigma^1_{22} \end{pmatrix}

Regime 2:

y^1_t = y^1_{t-1} \cdot \gamma_{0,1} + y^2_{t-1} \cdot \gamma_{0,2} + \gamma_{2,1}

y^2_t = y^1_{t-1} \cdot \gamma_{0,1} + y^2_{t-1} \cdot \gamma_{0,2} + \gamma_{2,2}

Covariance 2: \Sigma^2= \begin{pmatrix} \Sigma^2_{11} & \Sigma^2_{12} \\ \Sigma^2_{21} & \Sigma^2_{22} \end{pmatrix}

Since I assume regime 1 is a recession and regime 2 is an expansion, by definition the restriction should be \gamma_{1,1} < \gamma_{2,1} and \gamma_{1,2} < \gamma_{2,2}; I think this is called “total ordering” in the paper “Turning point detection with Bayesian panel Markov-switching VAR”.

After estimation, when label switching happens, the pairs (\gamma_{1,1}, \gamma_{2,1}) and (\gamma_{1,2}, \gamma_{2,2}) may switch within one iteration.

My question: if both pairs (\gamma_{1,1}, \gamma_{2,1}) and (\gamma_{1,2}, \gamma_{2,2}) are switched, I should swap \Sigma^1 and \Sigma^2 completely; this is easy.

However, I found that it is also possible that only one pair is switched, such as:

Regime 1:

\begin{align} y^1 & = \text{lag term} + \boldsymbol{\gamma_{2,1}} \\ y^2 & = \text{lag term} + \gamma_{1,2} \end{align}

Regime 2:

\begin{align} y^1 & = \text{lag term} + \boldsymbol{\gamma_{1,1}} \\ y^2 & = \text{lag term} + \gamma_{2,2} \end{align}

For example, \boldsymbol{\gamma_{2,1}} = \gamma_{1,1} > \boldsymbol{\gamma_{1,1}} = \gamma_{2,1} while \gamma_{1,2} < \gamma_{2,2}. In this case, I think I should swap the elements \Sigma^1_{11} and \Sigma^2_{11}, but I’m not sure about the off-diagonal elements of the covariance matrices. What should I do with \Sigma^1_{12}, \Sigma^1_{21}, \Sigma^2_{12}, \Sigma^2_{21}?

By the way, in practice, could I simply drop all iterations in any chain that violate my total ordering? Or I could discard iterations where the two variables conflict. Doing so would make it easier to implement DIC, which relies on posterior means.
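For concreteness, here is a minimal R sketch of what I have in mind (assuming the intercept and covariance draws have already been extracted into arrays named gamma and Sigma; these names and shapes are only for illustration). It swaps the whole regime block, intercepts and full covariance matrix together, when both variables indicate a label switch, and only flags the “partial” draws I am unsure about:

# gamma: draws x 2 regimes x 2 variables (intercepts)
# Sigma: draws x 2 regimes x 2 x 2 (covariance matrices)
relabel_draws <- function(gamma, Sigma) {
  n_draws <- dim(gamma)[1]
  conflict <- logical(n_draws)
  for (s in 1:n_draws) {
    lower1 <- gamma[s, 1, ] < gamma[s, 2, ]  # is regime 1 below regime 2, per variable?
    if (all(!lower1)) {
      # full label switch: swap intercepts and covariances as one block
      gamma[s, , ] <- gamma[s, 2:1, ]
      Sigma[s, , , ] <- Sigma[s, 2:1, , ]
    } else if (!all(lower1)) {
      # partial switch: the two variables disagree, flag for inspection or dropping
      conflict[s] <- TRUE
    }
  }
  list(gamma = gamma, Sigma = Sigma, conflict = conflict)
}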

Hi @streakdream and welcome to the Stan forums.

One thing you might want to do is think about downstream inference that does not depend on label ID. Then you don’t need to worry about it! For example, posterior predictive inference almost never relies on chain ID.
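As a toy illustration (a two-component normal mixture, not your MSVAR): the predictive density computed from a draw is unchanged when that draw’s labels are swapped, so it does not matter which labelling a chain settles into.

# swapping the component labels within a draw leaves the mixture density unchanged
mix_dens <- function(y, mu, p) p[1] * dnorm(y, mu[1]) + p[2] * dnorm(y, mu[2])
mix_dens(0.3, mu = c(-1, 2), p = c(0.3, 0.7))  # one labelling
mix_dens(0.3, mu = c(2, -1), p = c(0.7, 0.3))  # labels swapped: same value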

The main problem with label switching is in diagnosing convergence and effective sample size estimation.

You can introduce all kinds of bias by applying these heuristics, they can become exponentially expensive depending on the constraints, and they almost never address the root cause, which is an exchangeable model.

One thing you can do is order the parameters to identify the model. That is, use the ordered vector type or use

real a;
real<lower=a> b;

P.S. You may also be interested in the work of Sarah Heaps on parameterizing VAR for stationarity (that is, constraining to stationary solutions).

Thanks for your reply!
Yes, I agree with your point that label switching is more about convergence and ESS.
However, I’ve also tried the one-step-ahead predictive density, and I think it focuses more on prediction. In the paper “Interactions between Eurozone and US Booms and Busts: A Bayesian Panel Markov-Switching VAR Model”, they tested 1-4 lags and 2-3 regimes for the panel MSVAR and concluded that lag = 4 with 3 regimes is best. I suspect that even higher lags and more regimes might give better predictive ability.
So, I wish to use DIC, since it penalizes complexity. (By the way, WAIC and LOO-CV from the loo package seem unsuitable for MS-related models, because I got output like

> loo(fit, save_psis = TRUE)
> 
> Computed from 8000 by 1 log-likelihood matrix.
> 
>          Estimate SE
> elpd_loo  -3202.9 NA
> p_loo      3511.7 NA
> looic      6405.8 NA
> ------
> MCSE of elpd_loo is NA.
> MCSE and ESS estimates assume MCMC draws (r_eff in [1.0, 1.0]).
> 
> Pareto k diagnostic values:
>                          Count Pct.    Min. ESS
> (-Inf, 0.7]   (good)     0       0.0%  <NA>    
>    (0.7, 1]   (bad)      0       0.0%  <NA>    
>    (1, Inf)   (very bad) 1     100.0%  <NA>    
> See help('pareto-k-diagnostic') for details.
> Warning message:
> Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

namely those NAs and the “Some Pareto k diagnostic values are too high” warning. I even got the same warnings for my DGP of a simple MSVAR, on which I impose the <lower=a> restriction and whose posterior means match the true values; for the panel MSVAR I could not get rid of divergences or bad Rhat.) So I guess the loo package is not suitable for MS-related models?
Can you suggest some method to convince readers about the lag selection? From the paper I mentioned above, I think their model is not stable; “longer lags are better” does not seem like a good sign for a model.

I’ll defer to @Jonah and @avehtari and @andrewgelman on this one.

The loo package is completely agnostic about where the log likelihoods come from. Are you sure there are not NaN values in the log likelihood vector you send in?

I don’t see why this is a bad sign. Usually we want to fit as rich of a model as we can based on the data we have. So if there are long-range effects, it just may be that you need a lot of data to fit them reliably.

If you get warnings from WAIC or PSIS-LOO, then DIC is unreliable, too. I assume the DIC computation you are using is missing diagnostics warning about the unreliability.

Computed from 8000 by 1 log-likelihood matrix.

Here the number 1 indicates that you are summing all the likelihood terms together and you are doing leave-all-observations-out. I guess you have a mistake in your generated quantities block.
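A quick way to check (a sketch in R, using loo’s extract_log_lik and whatever name you gave your pointwise log-likelihood variable) is to inspect the matrix before passing it to loo:

# the matrix should be (posterior draws) x (number of time points);
# a single column means the terms were summed together in generated quantities
log_lik <- loo::extract_log_lik(fit, parameter_name = "log_lik")  # substitute your variable name
dim(log_lik)              # expect draws x T, not draws x 1
any(!is.finite(log_lik))  # should be FALSE (no NaN or Inf entries)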

I accidentally added up all time points for log_lik; I think that is the main cause of the loo warnings.

I’m using annual data. I could try lag = 1 to 4 and conclude that lag = 4 is best; however, first, it is hard to convince people that data from 4 years ago influence today. Also, it reminds me of the situation when I tried lag-length selection for a VAR and found that longer lags gave lower AIC/BIC, which may be a sign of an unstable system (even if all variables pass the ADF test).

Yes, you are right, my log_lik summed over all time points, which is why I got this “1”. Thank you!

I got the SE of elpd_loo, p_loo, and looic back. For this kind of result, are such Pareto k diagnostic values normal for an MSVAR model?

ChatGPT says: “The latent state sequence [s_1, s_2, …, s_T] is unknown. If you remove y_t, it affects not just the likelihood for y_t but also the probability of being in each regime at time t (as well as at neighboring times). This is a major issue for all Hidden Markov Models (HMMs), not just MSVARs.” Is that why I get the “all Pareto k diagnostic values are too high” warnings? Is it true? And does it mean that none of the results from the loo package are reliable at all?

> print(loo_1)

Computed from 6000 by 99 log-likelihood matrix.

         Estimate    SE
elpd_loo  -8170.8 339.3
p_loo      5773.3 285.3
looic     16341.6 678.5
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.0]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)      0      0.0%  <NA>    
   (0.7, 1]   (bad)       0      0.0%  <NA>    
   (1, Inf)   (very bad) 99    100.0%  <NA>    
See help('pareto-k-diagnostic') for details.

Overall, it’s a hard problem to take a more general model and design priors that will work with low data and high data regimes. So often simpler models fit better, even though the more general model should, in theory, be able to do anything the less general model can.

I suspect that you still have an error in the computation of the log-likelihood matrix, as p_loo is almost 60 times bigger than the number of observations (99). Alternatively, you have really bad model misspecification, as discussed in the LOO glossary (see all the sections discussing p_loo and Pareto k).

ChatGPT answers are likely to be unreliable for Pareto k, as there is not much material on Pareto k in the training data, and it mixes in facts from related topics. Furthermore, the recommendations for Pareto k have been updated after the training-data cutoff, at least for ChatGPT 3 and 4, so it’s better to read the loo documentation and the Pareto smoothed importance sampling paper for up-to-date information on loo and Pareto k.

Thanks for your reply!
And to avoid coding errors on my side, I tried the HMM code from “A tutorial on Hidden Markov Models using Stan” by Luis Damiano, Brian Peterson, and Michael Weylandt, from the 2018 Helsinki contributed talks in GitHub - stan-dev/stancon_talks: Materials from Stan conferences.

I’ve tried their HMM Gaussian code and modified it a little:

  1. Modify logbeta[T, j] = 1; to logbeta[T, j] = 0;, since on the log scale the original code violates \sum_k \alpha_T(k) = \sum_k \alpha_t(k) \beta_t(k) for every t (see the identity written out after this list).
  2. Add the calculation of the log likelihood to pass to the loo package:
vector[T] log_likelihood;
  {
    for (t in 1:T) {
      log_likelihood[t] = log_sum_exp(logalpha[t, ]);
    }
  }
  3. Modify the ordered mu to be a plain vector: vector[K] mu;
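For reference, the identity behind modification 1, written in the usual forward-backward notation, is

\alpha_t(j) = p(y_{1:t}, z_t = j), \qquad \beta_t(j) = p(y_{t+1:T} \mid z_t = j),

p(y_{1:T}) = \sum_j \alpha_T(j) = \sum_j \alpha_t(j) \, \beta_t(j) \quad \text{for every } t,

which requires the boundary condition \beta_T(j) = 1, i.e. \log \beta_T(j) = 0.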

The R code to replicate is:

rm(list = ls(all = TRUE))
gc()

setwd("C:/Users/Respawn/Desktop/RCode/R code example/stancon_talks-master/2018/Contributed-Talks/04_damiano")

runif_simplex <- function(T) {
  x <- -log(runif(T))
  x / sum(x)
}
hmm_generate <- function(K, T) {
  # 1. Parameters
  pi1 <- runif_simplex(K)
  A <- t(replicate(K, runif_simplex(K)))
  mu <- sort(rnorm(K, 10 * 1:K, 1))
  sigma <- abs(rnorm(K))
  # 2. Hidden path
  z <- vector("numeric", T)
  z[1] <- sample(1:K, size = 1, prob = pi1)
  for (t in 2:T)
    z[t] <- sample(1:K, size = 1, prob = A[z[t - 1], ])
  
  # 3. Observations
  y <- vector("numeric", T)
  for (t in 1:T)
    y[t] <- rnorm(1, mu[z[t]], sigma[z[t]])
  list(y = y, z = z,
       theta = list(pi1 = pi1, A = A,
                    mu = mu, sigma = sigma))
}


hmm_init <- function(K, y) {
  clasif <- kmeans(y, K)
  init.mu <- by(y, clasif$cluster, mean)
  init.sigma <- by(y, clasif$cluster, sd)
  init.order <- order(init.mu)
  list(
    mu = init.mu[init.order],
    sigma = init.sigma[init.order]
  )
}

set.seed(900)
T_length <- 100
K = 2
dataset <- hmm_generate(K=K, T_length)


library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())



# hmm_fit <- function(K, y) {
# stan_file = 'stan/hmm_gaussian.stan'
stan_file = 'stan/hmm_gaussian_verify.stan'
stan_model <- stan_model(stan_file, auto_write = TRUE)
stan.data = list(
  T = length(dataset$y),
  K = K,
  y = dataset$y
)
fit <- sampling(
  stan_model,
  data = stan.data,
  iter = 500, warmup = 200,
  thin = 1, chains = 2,
  # control = list(adapt_delta = 0.82, max_treedepth = 10),
  control = list(adapt_delta = 0.96),
  seed = 9002) #,
  # init = function(){hmm_init(K, y)})
# }

print(sapply(get_sampler_params(fit, inc_warmup = FALSE), function(x) mean(x[, "accept_stat__"])))

# K <- 2

# 
# fit <- hmm_fit(K=2, dataset$y)

print(fit, pars = c("pi1","A","mu","lp__"), digits = 2)
check_hmc_diagnostics(fit)
dataset$theta


print(rstan::extract(fit, pars = "logbeta", permuted = TRUE)$logbeta[1,,])
print(rstan::extract(fit, pars = "logalpha", permuted = TRUE)$logalpha[1,,])

logbeta = rstan::extract(fit, pars = "logbeta", permuted = TRUE)$logbeta[1,,]
logalpha = rstan::extract(fit, pars = "logalpha", permuted = TRUE)$logalpha[1,,]
sumlogalphalogbeta = matrix(nrow = length(dataset$y), ncol = 1)
for (t in 1:length(dataset$y)) {
  sumlogalphalogbeta[t,1] = sum(exp(logalpha[t, 1] + logbeta[t, 1]), exp(logalpha[t, 2] + logbeta[t, 2]))
}

print(sumlogalphalogbeta)
print(log(sumlogalphalogbeta))

print(sum(exp(logalpha[length(dataset$y),])))
print(log(sum(exp(logalpha[length(dataset$y),]))))

print(sum(exp(logbeta[1,])))
print(log(sum(exp(logbeta[1,]))))




print(rstan::extract(fit, pars = "logalpha_1", permuted = TRUE)$logalpha_1[1,,] 
      - 
      rstan::extract(fit, pars = "logalpha", permuted = TRUE)$logalpha[1,,] )





library(loo)
# # log_lik <- rstan::extract(fit, pars = "log_likelihood", permuted = TRUE)$log_likelihood
# # log_lik <- fit$draws("total_log_likelihood", format = "matrix")  # This extracts log_lik as a matrix
log_lik <- extract_log_lik(fit,parameter_name = "log_likelihood", merge_chains = FALSE)
# # Compute the LOO-CV (Leave-One-Out Cross-Validation) using the log-likelihoods
# loo_result <- loo(log_lik)
# # Print the result
# print(loo_result)
#
r_eff <- relative_eff(exp(log_lik), cores = 2)
loo_1 <- loo(log_lik, r_eff = r_eff, cores = 2)
print(loo_1)

library(bridgesampling)
bridge <- bridge_sampler(fit, silent=TRUE  ,maxiter = 1000)
bridge


library(shinystan)
launch_shinystan(fit)

And the Stan file I tried is:

functions {
  vector normalize(vector x) {
    return x / sum(x);
  }
}

data {
  int<lower=1> T;                   // number of observations (length)
  int<lower=1> K;                   // number of hidden states
  real y[T];                        // observations
}

parameters {
  // Discrete state model
  simplex[K] pi1;                   // initial state probabilities
  simplex[K] A[K];                  // transition probabilities
                                    // A[i][j] = p(z_t = j | z_{t-1} = i)

  // Continuous observation model
  vector[K] mu;                    // observation means
  real<lower=0> sigma[K];           // observation standard deviations
}

transformed parameters {
  vector[K] logalpha[T];
  vector[K] logalpha_1[T];

  // Forward algorithm: logalpha[t, j] = log p(y[1:t], z[t] = j)
  {
    real accumulator[K];

    logalpha[1] = log(pi1) + normal_lpdf(y[1] | mu, sigma);
    logalpha_1[1] = log(pi1) + normal_lpdf(y[1] | mu, sigma);

    for (t in 2:T) {
      for (j in 1:K) { // j = current (t)
        for (i in 1:K) { // i = previous (t-1)
                         // Murphy (2012) Eq. 17.48
                         // belief state + transition prob + local evidence at t
          accumulator[i] = logalpha[t-1, i] + log(A[i, j]) + normal_lpdf(y[t] | mu[j], sigma[j]);
        }
        logalpha[t, j] = log_sum_exp(accumulator);
        
        logalpha_1[t, j] = log_sum_exp({
          logalpha_1[t-1, 1] + log(A[1, j]) + normal_lpdf(y[t] | mu[j], sigma[j]),
          logalpha_1[t-1, 2] + log(A[2, j]) + normal_lpdf(y[t] | mu[j], sigma[j])
        });
      }
    }
  }
}

model {
  // Log-likelihood for forward probabilities
  target += log_sum_exp(logalpha[T]); // Update based only on last logalpha
}

generated quantities {
  vector[K] logbeta[T];
  vector[K] loggamma[T];
  vector[K] alpha[T];
  vector[K] beta[T];
  vector[K] gamma[T];

  int<lower=1, upper=K> zstar[T];
  real logp_zstar;

  // Forward algorithm
  for (t in 1:T)
    alpha[t] = softmax(logalpha[t]);

  // Backward algorithm log p(x_{t+1:T} | z_t = j)
  real accumulator[K];

  for (j in 1:K)
    logbeta[T, j] = 0;

  for (tforward in 0:(T-2)) {
    int t;
    t = T - tforward;

    for (j in 1:K) { // j = previous (t-1)
      for (i in 1:K) { // i = next (t)
                         // Murphy (2012) Eq. 17.58
                         // backwards t + transition prob + local evidence at t
        accumulator[i] = logbeta[t, i] + log(A[j, i]) + normal_lpdf(y[t] | mu[i], sigma[i]);
      }
      logbeta[t-1, j] = log_sum_exp(accumulator);
    }
  }

  for (t in 1:T)
    beta[t] = softmax(logbeta[t]);

  // Forwards-backwards algorithm log p(z_t = j | x_{1:T})
  for (t in 1:T) {
    loggamma[t] = alpha[t] .* beta[t];
  }

  for (t in 1:T)
    gamma[t] = normalize(loggamma[t]);

  // Viterbi algorithm
  int bpointer[T, K];             // backpointer to the most likely previous state on the most probable path
  real delta[T, K];               // max prob for the seq up to t with final output from state k for time t

  for (j in 1:K)
    delta[1, j] = normal_lpdf(y[1] | mu[j], sigma[j]);

  for (t in 2:T) {
    for (j in 1:K) { // j = current (t)
      delta[t, j] = negative_infinity();
      for (i in 1:K) { // i = previous (t-1)
        real logp;
        logp = delta[t-1, i] + log(A[i, j]) + normal_lpdf(y[t] | mu[j], sigma[j]);
        if (logp > delta[t, j]) {
          bpointer[t, j] = i;
          delta[t, j] = logp;
        }
      }
    }
  }

  logp_zstar = max(delta[T]);

  for (j in 1:K)
    if (delta[T, j] == logp_zstar)
      zstar[T] = j;

  for (t in 1:(T - 1)) {
    zstar[T - t] = bpointer[T - t + 1, zstar[T - t + 1]];
  }


  vector[T] log_likelihood;
  {
    for (t in 1:T) {
      log_likelihood[t] = log_sum_exp(logalpha[t, ]);
    }
  }
}

I fit the model with T = 100, iter = 500, warmup = 200, adapt_delta = 0.96, and 2 chains, and selected the seed so that no label switching occurs; the result seems reliable:

> print(fit, pars = c("pi1","A","mu","lp__"), digits = 2)
Inference for Stan model: anon_model.
2 chains, each with iter=500; warmup=200; thin=1; 
post-warmup draws per chain=300, total post-warmup draws=600.

          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
pi1[1]    0.46    0.01 0.28    0.02    0.22    0.45    0.69    0.96   799    1
pi1[2]    0.54    0.01 0.28    0.04    0.31    0.55    0.78    0.98   799    1
A[1,1]    0.48    0.01 0.18    0.15    0.35    0.48    0.61    0.83   605    1
A[1,2]    0.52    0.01 0.18    0.17    0.39    0.52    0.65    0.85   605    1
A[2,1]    0.03    0.00 0.02    0.01    0.02    0.03    0.04    0.07   620    1
A[2,2]    0.97    0.00 0.02    0.93    0.96    0.97    0.98    0.99   620    1
mu[1]    19.95    0.04 0.54   19.05   19.65   19.89   20.16   21.39   209    1
mu[2]     9.00    0.01 0.16    8.69    8.88    8.99    9.11    9.31   556    1
lp__   -202.27    0.15 2.02 -207.00 -203.45 -201.92 -200.73 -199.33   187    1

Samples were drawn using NUTS(diag_e) at Mon Jan  6 00:22:13 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

And they are close to the true values:

> dataset$theta
$pi1
[1] 0.271007 0.728993

$A
          [,1]       [,2]
[1,] 0.9711612 0.02883883
[2,] 0.5549811 0.44501885

$mu
[1]  8.831898 20.286436

$sigma
[1] 1.280083 1.060025

However, when I pass the log likelihood to the loo package, p_loo is 5 times bigger than T (100), and all Pareto k values are very bad:

> log_lik <- extract_log_lik(fit,parameter_name = "log_likelihood", merge_chains = FALSE)
> r_eff <- relative_eff(exp(log_lik), cores = 2)
> loo_1 <- loo(log_lik, r_eff = r_eff, cores = 2)
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
 
> print(loo_1)

Computed from 600 by 100 log-likelihood matrix.

         Estimate     SE
elpd_loo -11305.9  500.6
p_loo       511.7   14.0
looic     22611.8 1001.1
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 0.7]).

Pareto k diagnostic values:
                          Count Pct.    Min. ESS
(-Inf, 0.64]   (good)       0     0.0%  <NA>    
   (0.64, 1]   (bad)        0     0.0%  <NA>    
    (1, Inf)   (very bad) 100   100.0%  <NA>    
See help('pareto-k-diagnostic') for details.

Since the 2 chains show some evidence of good mixing, I also tried 1 chain:

> print(fit, pars = c("pi1","A","mu","lp__"), digits = 2)
Inference for Stan model: anon_model.
1 chains, each with iter=500; warmup=200; thin=1; 
post-warmup draws per chain=300, total post-warmup draws=300.

          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
pi1[1]    0.46    0.01 0.29    0.02    0.20    0.45    0.69    0.96   389 1.01
pi1[2]    0.54    0.01 0.29    0.04    0.31    0.55    0.80    0.98   389 1.01
A[1,1]    0.49    0.01 0.19    0.16    0.35    0.49    0.63    0.86   320 1.00
A[1,2]    0.51    0.01 0.19    0.14    0.37    0.51    0.65    0.84   320 1.00
A[2,1]    0.03    0.00 0.02    0.01    0.02    0.03    0.04    0.07   291 1.00
A[2,2]    0.97    0.00 0.02    0.93    0.96    0.97    0.98    0.99   291 1.00
mu[1]    19.94    0.04 0.53   19.05   19.64   19.88   20.14   21.25   146 1.00
mu[2]     9.00    0.01 0.16    8.71    8.89    8.99    9.10    9.32   374 1.00
lp__   -202.38    0.16 1.96 -206.80 -203.46 -202.16 -200.88 -199.53   144 1.00

Samples were drawn using NUTS(diag_e) at Mon Jan  6 00:50:54 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> r_eff <- relative_eff(exp(log_lik), cores = 2)
> loo_1 <- loo(log_lik, r_eff = r_eff, cores = 2)
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
 
> print(loo_1)

Computed from 300 by 100 log-likelihood matrix.

         Estimate     SE
elpd_loo -11267.7  501.9
p_loo       472.2   12.9
looic     22535.4 1003.8
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 0.7]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.6]   (good)      0     0.0%   <NA>    
   (0.6, 1]   (bad)       2     2.0%   <NA>    
   (1, Inf)   (very bad) 98    98.0%   <NA>    
See help('pareto-k-diagnostic') for details.

I haven’t read the loo documentation carefully yet, especially the parts about Pareto k, since my assignment is due soon (which is the first priority…). Also, I’m a newbie to Bayesian methods and Stan, and there isn’t much Stan + HMM material (people seem to use Gibbs more often for HMMs, I suppose?), which often makes me question my coding.

Besides, I will post another reply below summarizing what I have learned in practice about estimating panel MSVAR models, and I hope it can help other newbies save time with panel MSVAR + Stan. If you have the time, please feel free to critique any point of my summary.

If you googled MSVAR and landed on this post, I hope this summary of my experience with Stan + panel MSVAR can help you.

  1. Stan is indeed more efficient than Gibbs sampling; however, this “efficiency” is offset by the actual time spent resolving divergences and multimodality, adjusting adapt_delta, etc., especially for complex panel models.

  2. adapt_delta needs to be chosen carefully. Especially for an uninformative dataset, there may be only a narrow range of adapt_delta that gives good Rhat and ESS without multimodality. A higher adapt_delta can usually solve the divergence issue, but it may then suffer from multimodality, depending on how informative your dataset is. For example, I have seen cases with no divergences when adapt_delta > 0.6, but multimodality when adapt_delta > 0.85 or even 0.99. Sometimes, setting adapt_delta lower than rstan’s default of 0.8 solves the divergences and the multimodality at the same time.

  3. Choosing between centered and non-centered parameterization is a problem, for example in hierarchical prior settings like y ~ N(mu, S), mu ~ N(a, s). In my observation, for some datasets from my DGP, making both y and mu non-centered may increase the ESS of the VAR coefficients (they are regime-independent, so I do not need to worry about label switching if I use multiple chains), but I’m not sure about this.

  4. About the loo package. Even when my posterior matches the true values and there are no Rhat/ESS/multimodality problems, I still cannot get all “good” Pareto k diagnostic values; it is possible that the log-likelihood code in my last reply is wrong. However, the predictive density is basically reliable. Regarding my earlier replies arguing that a higher lag length in MSVAR tends to give a higher predictive density: after re-reading the mathematical expression of FFBS, both my DGP and the real data now converge to lag = 1 being best. My original concern was that I was trying to compare an MS model with an MSVAR(1), but I found that the posterior means of the covariance of the MS model, especially the SDs, can exceed 2 and even reach 10; I think this blow-up happens because the model lacks important dynamics and needs a larger SD to compensate, and I also did not realize that MS may not be nested within MSVAR. Finally, I think bridge sampling is good, since it can really distinguish my true MSVAR(1) from the MS model, from an MSVAR(2) with 2 regimes, and from an MSVAR(1) with 3 regimes.
    My workflow for bridge sampling: once lp__ and all regime-independent parameters have good ESS and Rhat with 4 chains, re-estimate with 1 chain and apply bridge sampling; in practice I think this can often reveal the “true” model (a minimal sketch is given after this list).

  5. To deal with the label switching issue, the Stan documentation suggests hacks such as (1) running only 1 chain, (2) setting restrictions on the regime means, or (3) post-hoc re-ordering.
    (1) One chain: convergence and mixing still have to be confirmed with multiple chains; once the chains behave well, from my DGP experience this seems to be OK.
    (2) Restrictions on the regime means: for a simple model like MS, or even an MSVAR, this is a good starting point, but it can still suffer from multimodality and divergences. From my DGP experience, an informative dataset (clear evidence of separated regime means) can survive, and it usually needs a higher adapt_delta (> 0.95) and a much longer warmup (> 2k). Techniques I have used are real<lower=mu_1> mu_2; or ordered[K] mu; // observation means, as used in GitHub - stan-dev/stancon_talks: Materials from Stan conferences. But for panel MSVAR, I do not think these restrictions can get rid of the multimodality and divergences, and they cannot recover the true values of my DGP.
    (3) Post-hoc re-ordering: sometimes, as I observe, even within 1 chain some iterations may switch to the other regime; if the draws are wide and the “true” values are close, you may not notice it in the trace plots, so post-sampling re-ordering should be done. However, as in my first post in this topic, “total ordering” may not be feasible. For “partial” ordering, such as the first variable of mu_1 being strictly less than that of mu_2 (the recession vs. expansion setting), I do not think it is correct to compare only the first iteration’s mu_1 variable 1 with mu_2’s variable 1 and conclude there is no label switching; label switching can happen at an arbitrary iteration within a chain. Besides, even if we re-order the labels, almost all packages (bridge sampling, Rhat, loo, etc.) will no longer work properly (I guess maybe I did not modify the rstan fit object correctly). I think the best way (for now) is to set no restrictions on the parameters, run multiple chains to verify convergence and mixing, and then use 1 chain for inference. Hopefully Stan can some day introduce a method to discard the iterations that violate a label ordering without affecting the posterior.

  6. An important thing to mention for anyone who tries to implement “K-state switching models with time-varying transition distributions – Does credit growth signal stronger effects of variables on inflation?” by Sylvia Kaufmann: once non-centered parameterization and hierarchical priors are applied there will be many parameters, and it is very important to look at the Rhat of lp__, since the covariates’ parameters often suffer from multimodality depending on the quality of your covariates.
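Below is the minimal sketch of the bridge-sampling comparison mentioned in point 4. The objects fit_msvar1 and fit_msvar2 are hypothetical single-chain rstan fits of two candidate specifications, each re-estimated after checking convergence with 4 chains:

library(bridgesampling)

# marginal likelihood estimate for each candidate specification
bridge_m1 <- bridge_sampler(fit_msvar1, silent = TRUE)
bridge_m2 <- bridge_sampler(fit_msvar2, silent = TRUE)

bf(bridge_m1, bridge_m2)          # Bayes factor of model 1 vs model 2
post_prob(bridge_m1, bridge_m2)   # posterior model probabilities (equal prior odds)
error_measures(bridge_m1)         # approximate error of the marginal likelihood estimate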

We’re updating the user’s guide chapter on HMMs. We now have a built-in HMM solver that uses a more efficient method than writing out the likelihood with the forward algorithm, which is what we had before. Here are the reference manual pages:

https://mc-stan.org/docs/functions-reference/hidden_markov_models.html

This will be much more efficient in wall time, not in mixing time—it’s just a more efficient likelihood calculation.

By multimodalities, do you mean genuine ones or ones resulting from label switching? That’s mainly a problem for convergence monitoring because just about any kind of posterior predictive inference is going to marginalize them out.

The divergence problem and the hierarchical-model centered/non-centered issue are real problems for which we don’t have a good solution. Cranking up adapt_delta usually just means you don’t see divergences in short runs, but were you to run the chains long enough to get into the extreme tail regions, you’d get stuck again. This is a fundamental problem of HMC when faced with varying curvature/varying scale. Gibbs automatically self-tunes on scale, but has the opposite problem of having a hard time moving when the parameters are correlated.

The other essential problem here that we don’t have a solution for is that mixture models are hard to fit, and HMMs can make the problem worse by making it a time-series of mixtures. Sometimes it can make it better if the underlying component is well predicted by the underlying Markov model. Similarly, some of these problems can go away or get worse as data sizes change or how well the data generating process matches the data changes. Usually better models in terms of matching the data generating process are easier to fit.

Thanks for summarizing!


I’m pretty confused by the model code here. I don’t fully understand what you’re trying to do, but if you have T observations, you’re only incrementing the target once?

It might help you to take a look at my HMM in Stan blog post where I focus on ecological models. In ecological models such as mark-recapture or occupancy, the likelihood factorises at the level of the individuals and sites, respectively, so that’s where you collect your log_lik. Usually the target and the log_lik are thus incremented inside a loop over these units. FWIW, I’ve never had to change adapt_delta with these models, but I think the way a lot of these ecological models are parameterised means they don’t really struggle with multi-modality. At least loo doesn’t show any problems with simulated data, as long as there are no random effects at the level of the individual/site unit.


I feel like we’re just not being creative enough. I keep plugging away to see how far we can take HMC. Although this isn’t as sexy as a new sampler, I enjoy working under the imposed constraints.

For normal hierarchical models you can try partial non-centering. It takes a weight (or weights) between 0 and 1 that is pre-derived from some data-driven heuristic. I have an example in my StanCon 24 hierarchical models tutorial at

The relevant slides are the one below and the one that follows it in the tutorial.

There you can find the Stan models and R code to fit this, along with the data-driven heuristic I used for the weights.

data {
  int<lower=0> N; // Number of observations
  vector[N] y; // Observations
  real<lower=0> sigma; // Measurement variability
  int<lower=0> K;
  array[N] int<lower=1, upper=K> indiv_idx;
  vector<lower=0, upper=1>[K] w;
}
transformed data {
 // real tau = 3;
}
parameters {
  real mu;            // Population location
  real<lower=0> tau;  // Population scale
  vector[K] eta;      // Non-centered individual parameters
}
transformed parameters {
 vector[K] tau_wt = tau - w * (tau - 1);
 vector[K] theta = (eta * tau + mu * w) ./ tau_wt;
}
model {
  mu ~ normal(0, 5);                   // Prior model
  tau ~ normal(0, 5);                  // Prior model
  eta ~ normal(mu * (1 - w), tau_wt);
  y ~ normal(theta[indiv_idx], sigma); // Observational model
}
generated quantities {
  array[N] real y_post_pred = normal_rng(theta[indiv_idx], sigma);
  real theta_p = normal_rng(mu, tau);
}

Partial non-centering seems to be an interesting method; I will test it later.

I was thinking that the hierarchical prior setting I followed from “Interactions between Eurozone and US Booms and Busts: A Bayesian Panel Markov-Switching VAR Model” by Monica Billio, Roberto Casarin, Francesco Ravazzolo, and Herman K. van Dijk is pretty simple. Their hierarchy is only on the means, where the VAR coefficients \gamma_{i0} are regime-independent and i indexes the countries:
\gamma_{i0} \sim \text{MVN}(\lambda, I_{K}), \quad \lambda \sim \text{MVN}(0, 10 \cdot I_{K}),
and I was thinking that I can make \gamma_{i0}, or \lambda, or both non-centered.
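To be explicit, the standard non-centered versions I have in mind are (treating 10 \cdot I_K as the covariance matrix of \lambda):

\gamma_{i0} = \lambda + \eta_i, \quad \eta_i \sim \text{MVN}(0, I_K),

\lambda = \sqrt{10} \, \xi, \quad \xi \sim \text{MVN}(0, I_K).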

Some tutorials suggest reparameterizing only \gamma_{i0}, some suggest both, and Stan’s documentation shows both options. In general this means testing each combination of reparameterized parameters against the range of plausible adapt_delta values, which is indeed extremely hard work.

The reason is that the Stan code follows the forward part of the forward-backward algorithm, and logalpha is accumulated recursively; mathematically, this alpha is \alpha_t(k) = p(Z_t = k, X_{1:t}), where Z_t is the latent state and X are the observations.
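Written out, this is the standard forward recursion:

\alpha_1(k) = \pi_k \, p(X_1 \mid Z_1 = k), \qquad \alpha_t(k) = p(X_t \mid Z_t = k) \sum_j \alpha_{t-1}(j) \, A_{jk},

and the full-data likelihood is p(X_{1:T}) = \sum_k \alpha_T(k), which is why the model block increments the target only once, via log_sum_exp(logalpha[T]).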

About your code: my major is economics and I do not know ecological models, but I guess your “mps” is similar to “logalpha” in this code, with dimensions (regimes (alive or dead in your code, I suppose?) by times (“n_surv”)), and your
target += log(sum(mps[:, n_surv]));
works just like target += log_sum_exp(logalpha[T]); // Update based only on last logalpha here: they both log-sum-exp over the states at the last observation, since everything from 1:T has already been accumulated at that stage?

Oh, do you mean why this code has only one target += statement, whereas your code also adds the priors to the target?
If that is what you mean, it is because you write the priors as explicit target increments, like

  // priors
  target += beta_lupdf(phi | 1, 1);
  target += beta_lupdf(p | 1, 1);

In my code, they would be written as:

phi~beta(1,1);
p~beta(1,1);

Since the code I posted is from stancon_talks, I’d like to believe the coding is OK at the 99% confidence level. Your log-likelihood calculation is log_lik[i] = log(sum(mps[:, n_surv]));, which seems to be the same as what I added to the code:

for (t in 1:T) {
      log_likelihood[t] = log_sum_exp(logalpha[t, ]);
}

where logalpha works similarly to your mps. I guess the main reason you do not usually need to change adapt_delta is the difference between the ecological model and a model with continuous y? Maybe the likelihood function is simpler with discrete y.


Thanks for sharing. I noticed these before, but since I use a time-varying transition matrix I thought the built-in functions might throw errors, so I have not tried them yet. But I guess I could remove the time-varying transition matrix part and try them, to see whether they make the loo package work properly.

I meant genuine ones, such as the VAR part of the panel MSVAR, which I set to be regime-independent. I found that multimodality depends heavily on how good the dataset is; if there are extreme values or outliers (my economic dataset is short, about 30 years of annual data, which I think makes it harder for the sampler to explore), it is much easier to see multimodality even after standard non-centered reparameterization.

And you are correct that a suitable model specification is easier to fit; in other words, a badly specified HMM is harder to fit in Stan than with Gibbs. Especially in empirical work, papers usually say “based on the BF, we choose lag length xxxx with regime yyyy”; however, those bad models (I feel) may not even be estimable in Stan (e.g., FFBS + non-centered methods) with good Rhat, ESS, etc. This situation is very frustrating, for example when facing a covariance SD of 15.

I think where I got confused is that you seem to compute log_likelihood for each observation, whereas in ecological models the log_lik is collected from all of the observations of a given individual or site, because that’s where the likelihood factorises. I’m not sure whether the observations in your model are independent, which I think is what you need for loo, but I’m not sure. Apologies for the misunderstanding.

Oh, I see, you mean this part?

  vector[T] log_likelihood;
  {
    for (t in 1:T) {
      log_likelihood[t] = log_sum_exp(logalpha[t, ]);
    }
  }

I do not add up the log likelihood over all the observations, or at least I did not intend to.
This forward probability, logalpha, has dimension T by the number of regimes.

In your code, they are very similar:
log_lik[i] = log(sum(mps[:, n_surv]));
where i is the individual, n_surv seems to mean the number of surveys over which the individual is alive or dead, and the : sums over alive/dead.
Please correct me if I’m wrong; in my understanding, your log_lik[i] is just like my log_likelihood[t]. The definition of an “observation” in ecology seems to be an individual i, and each individual is either alive or dead; in economics, an observation is one individual at each time t, and each time t is either an expansion or a recession. So eventually we are doing the same thing, and your t works differently from my t: your t is more like another layer describing after how many time steps the individual dies.
Please correct me if I’m wrong, in my understanding, your log_lik[i] is just the same as my log_likelihood[t], the definition of “observation” in ecological seems to be individual, i, and each individual will be either alive or death; For like economics, observation means for ONE individual’s each time t, and each time t is either in expansion or recession. So eventually we are doing the same thing, and your t works different in my t, where your t is more likely a probability or another layer that how many time steps the individual will die.