Help on making my Stan code for the Steady State BVAR(p) model more efficient/faster

Hi,

I have code for the Steady State BVAR(p) model by Mattias Villani (Villani, 2009), which is one of the main macroeconomic forecasting models used by the Swedish central bank (Ankargren, Unosson, and Yang, 2020). Unfortunately, it takes quite a long time to estimate in Stan (compared to a Gibbs sampler), and I am wondering if it is possible to make it faster. Any ideas are very welcome! Thanks :-)

The model is

y_t = \Psi x_t + A_1(y_{t-1}-\Psi x_{t-1})+\dots+A_p(y_{t-p}-\Psi x_{t-p})+u_t

where y_t is a k-dimensional vector of time series at time t, and x_t is a q-dimensional vector of deterministic variables at time t (for example a constant and/or a dummy), and u_t \sim N_k(0,\Sigma_u) with independence between time periods. Here A_\ell for \ell=1,\dots,p is (k \times k), and \Psi is (k \times q).
Now

E(y_t)=\mu_t=\Psi x_t

is the ā€œsteady stateā€. With this parametrization of a VAR model, we can specify our prior beliefs about the long-run forecasts of the variables in the system, since the forecasts converge to the steady state (unconditional mean). As Villani points out: ā€œPrior information on the steady state is typically available, and quite often in strong form. The forecasts of inflation undertaken at central banks operating under an explicit inflation target is an apparent example.ā€ (Villani, 2009).

We can stack the A matrices in the (kp \times k) matrix \beta

\beta= \begin{bmatrix} A'_1 \\ \vdots \\ A'_p \end{bmatrix}

We can then rewrite the model as a nonlinear regression (Karlsson, 2013)

y_t' =x_t'\Psi' + \left[w_t'-q_t'(I_p \otimes \Psi') \right]\beta +u_t'

where w_t'=(y_{t-1}',\dots,y_{t-p}') is a kp-dimensional vector of lagged endogenous variables, and q_t'=(x_{t-1}',\dots,x_{t-p}') is a qp-dimensional vector of lagged exogenous (deterministic) variables, I_p is the (p \times p) identity matrix and \otimes denotes the Kronecker product.
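Spelling out the Kronecker product (this is implicit in the definitions above), the bracketed regressor is just the stacked demeaned lags:

w_t'-q_t'(I_p \otimes \Psi') = \left[(y_{t-1}-\Psi x_{t-1})',\dots,(y_{t-p}-\Psi x_{t-p})'\right]

so the nonlinear regression is exactly the original model written row-wise.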

Following Villani (2009), starting with \beta, the usual ā€œMinnesotaā€ prior is used

\textrm{vec}(\beta) \sim N_{kpk}\left[\theta_\beta,\Omega_\beta\right]

I’m leaving out the specifics here, but \Omega_\beta is a diagonal matrix. For the steady-state coefficients, we then use

\textrm{vec}(\Psi) \sim N_{kq}\left[\theta_\Psi,\Omega_\Psi\right]

where \Omega_\Psi is also a diagonal matrix. Finally, for \Sigma_u the Jeffreys prior is used

p(\Sigma_u) \propto\left|\Sigma_u \right|^{-(k+1)/2}
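Equivalently, on the log scale this prior contributes

\log p(\Sigma_u) = -\tfrac{k+1}{2}\log\left|\Sigma_u\right| + \text{const}

which is the term added to the target in the Stan code below.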

References

Ankargren, S., Unosson, M., and Yang, Y. (2020). A Flexible Mixed-Frequency Vector Autoregression with a Steady-State Prior. Journal of Time Series Econometrics. 12(2), Article 20180034.

Karlsson, S. (2013). Forecasting with Bayesian Vector Autoregression. In: Elliott, G. and Timmermann, A. (eds) Handbook of Economic Forecasting, Vol. 2, Part B. Elsevier, pp. 791-897.

Villani, M. (2009). Steady-state priors for vector autoregressions. Journal of Applied Econometrics. 24(4), pp. 630-650.


The Stan code is:

functions {
  matrix kron(matrix A, matrix B) { //kronecker product
    int m = rows(A);
    int n = cols(A);
    int p = rows(B);
    int q = cols(B);
    matrix[m * p, n * q] C;
    for (i in 1:m) {
      for (j in 1:n) {
        C[((i - 1) * p + 1):(i * p), ((j - 1) * q + 1):(j * q)] = A[i, j] * B;
      }
    }
    return C;
  }
}

data {
  int<lower=0> N; //number of observations
  int<lower=0> k; //number of variables
  int<lower=0> p; //lag order
  int<lower=0> q; //number of exogenous variables
  matrix[N, k] Y; //endogenous variables (y_t)
  matrix[N, q] X; //exogenous variables (x_t)
  matrix[N, k*p] W; //lagged endogenous variables
  matrix[N, q*p] Q; //lagged exogenous variables
  vector[k*p*k] theta_beta; //vec_beta prior mean
  matrix[k*p*k, k*p*k] Omega_beta; //vec_beta prior covariance matrix
  vector[k*q] theta_Psi; //vec_Psi prior mean
  matrix[k*q, k*q] Omega_Psi; //vec_Psi prior covariance matrix
  int<lower=0> H; // Forecast horizon
  matrix[H, q] X_pred; //future exogenous variables
}

transformed data {
    matrix[p, p] I_p = diag_matrix(rep_vector(1, p)); // Identity matrix
}

parameters {
  matrix[k*p, k] beta; //beta' = (A_1,...,A_p)
  matrix[k, q] Psi; //Psi * x_t = steady state
  cov_matrix[k] Sigma_u;
}

model {
  for(t in 1:N){
      vector[k] u_t = (Y[t] - (X[t]*Psi' + (W[t]-Q[t]*(kron(I_p,Psi')))*beta))';
      u_t ~ multi_normal(rep_vector(0,k), Sigma_u);
  }
  to_vector(beta) ~ multi_normal(theta_beta, Omega_beta);
  to_vector(Psi) ~ multi_normal(theta_Psi, Omega_Psi);
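  // Jeffreys prior: p(Sigma_u) propto |Sigma_u|^(-(k+1)/2)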
  target += -0.5 * (k + 1) * log_determinant(Sigma_u);
}

generated quantities {

  matrix[k, k] A[p];
  for (i in 1:p) {
    A[i] = (beta[((i - 1) * k + 1):(i * k), :])'; //extract A_1, ..., A_p
  }

  matrix[H, k] Y_pred;

  for (h in 1:H) {

    vector[k] u_t = multi_normal_rng(rep_vector(0, k), Sigma_u);
    vector[k] yhat_t = (X_pred[h]*Psi')';

    if (h > 1) {
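      // lags that fall within the forecast window use the already simulated forecasts (Y_pred)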
      for (i in 1:min(h-1, p)) {
        yhat_t += to_vector((Y_pred[h-i] - X_pred[h-i]*Psi') * A[i]');
      }
    }

    if (h <= p) {
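      // lags that reach back before the forecast origin use the observed data (Y)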
      for (i in h:p) {
        yhat_t += to_vector((Y[N + h - i] - X[N + h - i]*Psi') * A[i]');
      }
    }
    Y_pred[h] = (yhat_t + u_t)'; //forecasts
  }
}

What’s the size of the vectors/matrices? This doesn’t seem like an awfully large model, but the Kronecker product may generate a large number of calculations if the matrices become large. Sometimes its properties can be used to avoid some of those calculations and save time, but I’m not sure that will dramatically speed up the model.

Did you time how long a single simulation takes, and/or check the message at the beginning of the MCMC run that estimates the time for 1000 (I think) gradient evaluations, to get a ballpark estimate of how long a whole chain should take?

Some models are just intense: lots of data points and/or parameters make gradient calculations expensive, and sometimes the solution is just to make sure everything is working correctly and let it run for a day, a week, a month; that can happen.

You can also try a small version of the model by reducing the values of k, p, and q, get an idea of how the computational cost scales with size, and see whether it’s worth trying to radically optimize the model or just let it run longer.


One recommendation I have when working with Stan (or statistical methods generally) is to start with simpler models and make them more complex. This helps pin down where the issues are.

In my experience with economic and financial market data, I started running into problems when I went from univariate normal models to models with a multivariate normal. It really was the correlation structure that caused the problems.

My solution to this was to convert the multivariate normal to a series of univariate regressions. The first variable would just be a univariate normal. The second would regress the second variable on the first, with a univariate error. The third was a regression of variable 3 on variables 1 and 2. And so on.
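To make the idea concrete, here is a minimal Stan sketch for three series (just an illustration of the triangular factorization, with made-up variable names and placeholder priors, not code from this thread):

data {
  int<lower=1> N;
  matrix[N, 3] Y; // three jointly observed series
}
parameters {
  vector[3] mu;             // marginal means
  vector<lower=0>[3] sigma; // conditional error scales
  real b21;                 // loading of series 2 on series 1
  real b31;                 // loading of series 3 on series 1
  real b32;                 // loading of series 3 on series 2
}
model {
  // triangular factorization of a trivariate normal into univariate regressions
  col(Y, 1) ~ normal(mu[1], sigma[1]);
  col(Y, 2) ~ normal(mu[2] + b21 * (col(Y, 1) - mu[1]), sigma[2]);
  col(Y, 3) ~ normal(mu[3] + b31 * (col(Y, 1) - mu[1]) + b32 * (col(Y, 2) - mu[2]), sigma[3]);
  // placeholder weakly informative priors
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  b21 ~ normal(0, 1);
  b31 ~ normal(0, 1);
  b32 ~ normal(0, 1);
}

Each line of the likelihood is univariate, but together they imply one joint trivariate normal, so nothing is lost relative to multi_normal; the gain is that there is no dense covariance matrix to factorize inside the sampler.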

It turns out that this is the same general approach used in the VAR case. There’s a paper by a Purdue econometrician on large Bayesian VARs that basically does this by converting a normal VAR to an SVAR. My approach above is a less sophisticated version of that, where you assume the VAR coefficients are an identity matrix (for a VAR(1)).

So that would be my advice. I assume you’ve already done the usual things like checking for stationarity. Sarah Heaps has a great paper on priors enforcing stationarity in VARs. That might also be useful.


Thanks for the reply!

I am using the same data as in Villani (2009), where k=7, q=2, p=4, and there are N=100 observations available for estimation. So there are k^2 p = 196 plus kq = 14 regression coefficients in the model.

Running 1 chain with 10 000 iterations and 5000 as warmup gives this:

SAMPLING FOR MODEL ā€˜anon_model’ NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.00289 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 28.9 seconds.
Chain 1: Adjust your expectations accordingly!

I think I may have a distorted view of what ā€œa long time to estimateā€ is… it takes about 17 minutes for Stan to run 1 chain with 10 000 iterations and 5 000 as warmup:

Chain 1: Elapsed Time: 611.261 seconds (Warm-up)
Chain 1: 428.049 seconds (Sampling)
Chain 1: 1039.31 seconds (Total)

And my Gibbs sampler chain takes about 9 minutes, so around 50% faster. But in absolute terms, perhaps the difference is not really something to worry about… it’s not a big deal to let it run for 17 minutes :-D

Thanks for the reply.

Interesting, I will have to try it out! So you are essentially not including lags of other variables in each equation? Only ā€œown lagsā€?

I use the data from Villani (2009), where it is assumed that the data are stationary. But, for example, the interest rate variables are clearly non-stationary, so I guess an issue would be that some samples of \beta are equivalent to a non-stationary VAR. The Minnesota prior (with prior means for the first own lags set to 0.9 for level variables, and 0 for growth-rate variables) corresponds to a stationary VAR, but perhaps not strongly enough. I will check out the paper by Heaps :-)

It’s totally understandable. One of the first adjustments when moving from ā€œclassicā€ samplers to HMC is the relationship between time and number of samples. HMC will almost invariably take longer, because the proposal method involves far more steps and computation (of which the multiple gradient evaluations, one per leapfrog step, should be the most costly). That is compensated by the ā€œqualityā€ of the samples (i.e. lower correlation, faster convergence to high-probability regions, more efficient sampling of the typical set), which is why fewer HMC samples usually still outperform severalfold more samples from methods like MH or Gibbs. (Nevertheless, and without objective support for it, I still believe that in some particular cases simpler methods may be better suited than HMC, but only because HMC would be overkill, not because it is more likely to fail.)

Sometimes this improvement shows up in the common MCMC diagnostics, but it may also be more subtle. If the samplers are not implemented under the same framework, some of the difference can be explained by that (e.g. a simpler sampler may be implemented in Fortran or C, while a modern sampler may use a higher-level language for the method or interface). That is also why there should be a (I’d guess small) speed difference between CmdStan and the other interfaces, so CmdStan could be the better benchmark reference.



I see! Thanks for the detailed and interesting response!


The single biggest speedup here will very likely be not taking the Kronecker product. Carrying around that huge sparse matrix is very costly, and all you need are the block-diagonal elements. The indexing might be a bit more painful, but if you want this to go fast, just carry around the needed quantities in a non-sparse, non-redundant format.
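For concreteness, here is a rough sketch of what the model block could look like without forming the Kronecker product (the names Psi_t, QPsi and u are mine; I have also stacked the residuals into a single vectorized multi_normal statement, which avoids factorizing Sigma_u once per observation; untested):

model {
  vector[k] u[N];            // residuals for all time points
  matrix[q, k] Psi_t = Psi'; // transpose once, reuse below
  for (t in 1:N) {
    // q_t'(I_p (x) Psi') built block by block instead of via kron()
    row_vector[k*p] QPsi;
    for (i in 1:p)
      QPsi[((i - 1) * k + 1):(i * k)] = Q[t, ((i - 1) * q + 1):(i * q)] * Psi_t;
    u[t] = (Y[t] - (X[t] * Psi_t + (W[t] - QPsi) * beta))';
  }
  u ~ multi_normal(rep_vector(0, k), Sigma_u); // one vectorized call instead of N
  to_vector(beta) ~ multi_normal(theta_beta, Omega_beta);
  to_vector(Psi) ~ multi_normal(theta_Psi, Omega_Psi);
  target += -0.5 * (k + 1) * log_determinant(Sigma_u);
}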