Divergent transitions in TVP-VAR with stochastic volatility

I’m running into issues with divergent transitions in a modified version of the time-varying VAR model with stochastic volatility from Cogley and Sargent (2005). The problem occurs when the number of variables (K) in the VAR gets relatively large. Increasing adapt_delta does help reduce divergences, but even with adapt_delta=0.99999, I still get some chains with a few divergences when K is high. The chains with divergences tend to have relatively high step sizes.

The model is described by the following equations:

\begin{equation}
\begin{aligned}
z_t &= \Theta_t z_{t-1} + u_t \\
\theta_t &= \theta_{t-1} + \nu_t \\
\log h_{it} &= \log h_{i,t-1} + \eta_{it}
\end{aligned}
\end{equation}

where:

\begin{equation}
\begin{aligned}
u_t &\sim N(0, R_t) \\
\nu_t &\sim N(0, Q) \\
\eta_{it} &\sim N(0, \sigma^2_{\eta,i}) \\
R_t &= \text{diag}\left(\exp\left(\frac{h_t}{2}\right)\right) \Omega_R \, \text{diag}\left(\exp\left(\frac{h_t}{2}\right)\right) \\
Q &= \text{diag}\left(\sigma_Q\right) \Omega_Q \, \text{diag}\left(\sigma_Q\right)
\end{aligned}
\end{equation}

and $\Theta_t$ denotes the $K \times K$ coefficient matrix obtained by reshaping the vector $\theta_t$.

Below is the relevant Stan code I used:

data {
  int<lower=1> T;        // Number of time periods
  int<lower=1> K;        // Number of variables
  matrix[T, K] y;        // Observed data
}

parameters {
  vector[K * K] theta_0_raw;               // Raw initial state
  vector<lower=0>[K * K] sigma_q_raw;      // Raw state innovation standard deviations
  cholesky_factor_corr[K * K] L_Omega_q;   // Cholesky factor of state innovation correlation matrix
  vector[K] h_0_raw;                       // Raw initial log volatilities
  vector<lower=0>[K] sigma_eta_raw;        // Raw std dev of log volatility innovations
  cholesky_factor_corr[K] L_Omega_y;       // Cholesky factor of observation correlation matrix
  matrix[T, K * K] theta_std;              // Non-centered theta innovations
  matrix[T, K] h_std;                      // Non-centered h innovations
}

transformed parameters {
  vector[K * K] theta_0 = 2 * theta_0_raw;          // More conservative scaling
  vector<lower=0>[K * K] sigma_q = 0.1 * sigma_q_raw;   // Tighter prior scaling
  vector[K] h_0 = -2 + 3 * h_0_raw;                     // More conservative range
  vector<lower=0>[K] sigma_eta = 0.5 * sigma_eta_raw;  
  matrix[K * K, K * K] L_Q = diag_pre_multiply(sigma_q, L_Omega_q); // Cholesky factor of state innovation covariance

  // Preallocate arrays for states and transformations
  array[T] vector[K * K] theta;     // States of theta over time
  array[T] vector[K] h;             // Log volatilities over time
  array[T] vector[K] exp_half_h;    // Exponential of half h for each time
  array[T] matrix[K, K] L_R;        // Cholesky factors of observation covariance
  array[T] matrix[K, K] Theta_mat;  // Theta matrices over time
  matrix[T, K] y_hat;               // Fitted values

  // Initial states
  theta[1] = theta_0;
  h[1] = h_0;
  exp_half_h[1] = exp(0.5 * h[1]);
  L_R[1] = diag_pre_multiply(exp_half_h[1], L_Omega_y);
  Theta_mat[1] = to_matrix(theta[1], K, K);
  y_hat[1] = rep_row_vector(0, K);  // Initialize y_hat[1] (not used in likelihood)

  // Time evolution and precomputations
  for (t in 2:T) {
    // Update states
    theta[t] = theta[t - 1] + L_Q * theta_std[t]';
    h[t] = h[t - 1] + sigma_eta .* h_std[t]';

    // Precompute transformations
    exp_half_h[t] = exp(0.5 * h[t]);
    L_R[t] = diag_pre_multiply(exp_half_h[t], L_Omega_y);
    Theta_mat[t] = to_matrix(theta[t], K, K);

    // Compute fitted values
    y_hat[t] = y[t - 1] * Theta_mat[t]';
  }
}

model {
  // Priors
  theta_0_raw ~ std_normal();
  h_0_raw ~ std_normal();
  sigma_q_raw ~ std_normal();
  sigma_eta_raw ~ std_normal();
  L_Omega_q ~ lkj_corr_cholesky(1);
  L_Omega_y ~ lkj_corr_cholesky(1);

  // Non-centered parameterization priors
  to_vector(theta_std) ~ std_normal();
  to_vector(h_std) ~ std_normal();

  // Likelihood
  for (t in 2:T) {
    target += multi_normal_cholesky_lpdf(y[t] | y_hat[t], L_R[t]);
  }
}

generated quantities {
  matrix[K, K] Omega_y = multiply_lower_tri_self_transpose(L_Omega_y);
  matrix[K * K, K * K] Omega_q = multiply_lower_tri_self_transpose(L_Omega_q);
  matrix[K * K, K * K] Q = multiply_lower_tri_self_transpose(L_Q);
}

Should I just keep increasing adapt_delta until I eliminate divergences, or are there tricks or adjustments I should implement to deal with them more effectively?

Hi, @qbatista, and welcome to the Stan forums.

Despite what our warning message says, cranking up adapt_delta rarely works. It can get rid of some divergences, but if you run long enough, they tend to still show up in the tails.

There’s been some interesting work on formulating priors on the coefficients of a VAR model to maintain stationarity:

Enforcing Stationarity through the Prior in Vector Autoregressions

Sarah E. Heaps
https://www.tandfonline.com/doi/full/10.1080/10618600.2022.2079648
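
Heaps's construction maps unconstrained parameters to coefficient matrices that are stationary by construction; I won't reproduce it here. Just to illustrate the basic idea on a toy case (this is the scalar AR(1) analogue, not the paper's construction), constraining the coefficient to the stationary region looks like this:

data {
  int<lower=2> T;
  vector[T] z;
}
parameters {
  real<lower=-1, upper=1> phi;  // stationary region for AR(1): |phi| < 1
  real<lower=0> sigma;
}
model {
  // implicit uniform prior on phi over (-1, 1); every draw is stationary
  sigma ~ normal(0, 1);
  z[2:T] ~ normal(phi * z[1:T - 1], sigma);
}

With a time-varying theta_t, divergences often come from draws wandering into the explosive region, which is why this line of work is relevant to your problem.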

P.S. You can simplify those non-centered parameterizations using affine transforms. For example, rather than

real h_0_raw;
...
h_0_raw ~ std_normal();
...
h_0 = -2 + 3 * h_0_raw;

you can just use

real<offset=-2, multiplier=3> h_0;
...
h_0 ~ normal(-2, 3);

and similarly when the location and scale are themselves parameters rather than the constants -2 and 3 used here.
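
For instance, here's a minimal sketch with hypothetical parameters mu and tau as the offset and multiplier (alpha is just a stand-in name, not something from your model):

parameters {
  real mu;                                      // location, itself a parameter
  real<lower=0> tau;                            // scale, itself a parameter
  vector<offset=mu, multiplier=tau>[K] alpha;   // non-centered under the hood
}
model {
  mu ~ normal(0, 1);
  tau ~ normal(0, 1);
  alpha ~ normal(mu, tau);  // sampled on the standardized scale internally
}

One caveat: offset/multiplier can't be combined with lower/upper bounds, so for positive scales like your sigma_q you'd instead keep the <lower=0> declaration and put the scaled prior on it directly, e.g., sigma_q ~ normal(0, 0.1).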

The top-level variables in transformed parameters all get saved. You can make them local by declaring them inside braces.

transformed parameters {
  real x;
  {
    real y;  // variable local to this block
    x = ...;
  } // y goes out of scope and is not saved
}
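
In your model, for example, exp_half_h is only used to construct L_R, so it doesn't need to be saved. A sketch of the relevant fragment, assuming h has already been computed:

transformed parameters {
  array[T] matrix[K, K] L_R;  // still top-level: the model block needs it
  // ... other declarations and state updates ...
  {
    // local to this block, not written to the output
    array[T] vector[K] exp_half_h;
    for (t in 1:T) {
      exp_half_h[t] = exp(0.5 * h[t]);
      L_R[t] = diag_pre_multiply(exp_half_h[t], L_Omega_y);
    }
  }
}

Or you could skip the intermediate variable entirely and inline it: L_R[t] = diag_pre_multiply(exp(0.5 * h[t]), L_Omega_y);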

lkj_corr_cholesky(1) is constant, so you can drop those two statements: Stan already starts with a uniform distribution over Cholesky factors of correlation matrices. Upping the argument might stabilize things, though, since it regularizes toward zero correlations.
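
For example (the shape value 4 here is just an illustration, not a recommendation):

L_Omega_q ~ lkj_corr_cholesky(4);  // shape > 1 concentrates mass near zero correlations
L_Omega_y ~ lkj_corr_cholesky(4);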


Thanks for the suggestions; they're really helpful! I'll implement them and see if they help.
