Help on making my Stan code for the Steady State BVAR(p) model more efficient/faster

Hi,

I have code for the Steady State BVAR(p) model by Mattias Villani (Villani, 2009), which is one of the main macroeconomic forecasting models used by the Swedish central bank (Ankargren, Unosson, and Yang, 2020) . Unfortunately, it takes quite a long time to estimate in Stan (compared to a Gibbs sampler) and I am wondering if it is possible to make it faster. Any ideas are very welcome! Thanks :-)

The model is

y_t = \Psi x_t + A_1(y_{t-1}-\Psi x_{t-1})+\dots+A_p(y_{t-p}-\Psi x_{t-p})+u_t

where y_t is a k-dimensional vector of time series at time t, and x_t is a q-dimensional vector of deterministic variables at time t (for example a constant and/or a dummy), and u_t \sim N_k(0,\Sigma_u) with independence between time periods. Here A_\ell for \ell=1,\dots,p is (k \times k), and \Psi is (k \times q).
Now

E(y_t)=\mu_t=\Psi x_t

is the “steady state”. With this parametrization of a VAR model, we can specify our priors beliefs about the long run forecasts for the variables in the system, since they converge to the steady state (unconditional mean). As Villani points out: “Prior information on the steady state is typically available, and quite often in strong form. The forecasts of inflation undertaken at central banks operating under an explicit inflation target is an apparent example.” (Villani, 2009).

We can stack the A matrices in the (kp \times k) matrix \beta

\beta= \begin{bmatrix} A'_1 \\ \vdots \\ A'_p \end{bmatrix}

We can then rewrite the model as a nonlinear regression (Karlsson, 2013)

y_t' =x_t'\Psi' + \left[w_t'-q_t'(I_p \otimes \Psi') \right]\beta +u_t'

where w_t'=(y_{t-1}',\dots,y_{t-p}') is a kp-dimensional vector of lagged endogenous variables, and q_t'=(x_{t-1}',\dots,x_{t-p}') is a qp-dimensional vector of lagged exogenous (deterministic) variables, I_p is the (p \times p) identity matrix and \otimes denotes the Kronecker product.

Following Villani (2009), starting with \beta the usual “Minnesota” prior is used

\textrm{vec}(\beta) \sim N_{kpk}\left[\theta_\beta,\Omega_\beta\right]

I’m leaving out the specifics here, but \Omega_\beta is a diagonal matrix. And then for the steady state coefficients, we use

\textrm{vec}(\Psi) \sim N_{kq}\left[\theta_\Psi,\Omega_\Psi\right]

where \Omega_\Psi also is a diagonal matrix. At last for \Sigma_u Jeffreys prior is used

p(\Sigma_u) \propto\left|\Sigma_u \right|^{-(k+1)/2}

References

Ankargren, S., Unosson, M., and Yang, Y. (2020). A Flexible Mixed-Frequency Vector Autoregression with a Steady-State Prior. Journal of Time Series Econometrics. 12(2), Article 20180034.

Karlsson, S. (2013). Forecasting with Bayesian Vector Autoregression. In: Elliott, G. and Timmerman, A. (eds) Handbook of Economic Forecasting. Elsevier B.V. Vol 2, Part B., pp. 791-897.

Villani, M. (2009). Steady-state priors for vector autoregressions.Journal of Applied Econometrics. 24(4), pp. 630-650.


The stan code is

functions {
  matrix kron(matrix A, matrix B) { //kronecker product
  matrix[rows(A) * rows(B), cols(A) * cols(B)] C;
  int m;
  int n;
  int p;
  int q;
  m = rows(A);
  n = cols(A);
  p = rows(B);
  q = cols(B);
  for (i in 1:m) {
    for (j in 1:n) {
      int row_start;
      int row_end;
      int col_start;
      int col_end;
      row_start = (i - 1) * p + 1;
      row_end = (i - 1) * p + p;
      col_start = (j - 1) * q + 1;
      col_end = (j - 1) * q + q;
      C[row_start:row_end, col_start:col_end] = A[i, j] * B;
    }
  }
  return C;
  }
}

data {
  int<lower=0> N; //number of observations
  int<lower=0> k; //number of variables
  int<lower=0> p; //lag order
  int<lower=0> q; //number of exogenous variables
  matrix[N, k] Y; //endogenous variables (y_t)
  matrix[N, q] X; //exogenous variables (x_t)
  matrix[N, k*p] W; //lagged endogenous variables
  matrix[N, q*p] Q; //lagged exogenous variables
  vector[k*p*k] theta_beta; //vec_beta prior mean
  matrix[k*p*k, k*p*k] Omega_beta; //vec_beta prior covariance matrix
  vector[k*q] theta_Psi; //vec_Psi prior mean
  matrix[k*q, k*q] Omega_Psi; //vec_Psi prior covariance matrix
  int<lower=0> H; // Forecast horizon
  matrix[H, q] X_pred; //future exogenous variables
}

transformed data {
    matrix[p, p] I_p = diag_matrix(rep_vector(1, p)); // Identity matrix
}

parameters {
  matrix[k*p, k] beta; //beta' = (A_1,...,A_p)
  matrix[k, q] Psi; //Psi * x_t = steady state
  cov_matrix[k] Sigma_u;
}

model {
  for(t in 1:N){
      vector[k] u_t = (Y[t] - (X[t]*Psi' + (W[t]-Q[t]*(kron(I_p,Psi')))*beta))';
      u_t ~ multi_normal(rep_vector(0,k), Sigma_u);
  }
  to_vector(beta) ~ multi_normal(theta_beta, Omega_beta);
  to_vector(Psi) ~ multi_normal(theta_Psi, Omega_Psi);
  target += -0.5 * (k + 1) * log_determinant(Sigma_u);
}

generated quantities {

  matrix[k, k] A[p];
  for (i in 1:p) {
    A[i] = (beta[((i - 1) * k + 1):(i * k), :])'; //extract A_1, ..., A_p
  }

  matrix[H, k] Y_pred;

  for (h in 1:H) {

    vector[k] u_t = multi_normal_rng(rep_vector(0, k), Sigma_u);
    vector[k] yhat_t = (X_pred[h]*Psi')';

    if (h > 1) {
      for (i in 1:min(h-1, p)) {
        yhat_t += to_vector((Y_pred[h-i] - X_pred[h-i]*Psi') * A[i]');
      }
    }

    if (h <= p) {
      for (i in h:p) {
        yhat_t += to_vector((Y[N + h - i] - X[N + h - i]*Psi') * A[i]');
      }
    }
    Y_pred[h] = (yhat_t + u_t)'; //forecasts
  }
}

What’s the size of the vectors/matrices? This doesn’t seem like an awfully large model, but the kronecker product may generate a large number of calculations if the matrices become large. Sometimes its properties can be used to avoid some of the calculations and save some time, but I’m not sure that will dramatically speed up the model.

Did you time how long a single simulation would take and/or check the message at the beginning of the MCMC run with the estimate of the time for 1000 (I think) of gradient computations to get a ballpark estimate of how long a whole chain should take?

Some models are just intense, lots of data points and/or of parameters which makes gradient calculations expensive, and sometimes the solution is just to make sure everything is working correctly and let it run for a day, a week, a month, that can happen.

You can also try a small version of the model by reducing the value of k , p, q parameters and get an idea of how the computational cost scales with size, and see if it’s worth trying to radically optimize the model, or just let it run longer.

1 Like

One recommendation I have when working with Stan (or statistical methods generally) is to start with simpler models and make them more complex. This helps pin down where the issues are.

In my experience with economic and financial market data, I started running in to problems when I went from univariate normal models to models with a multivariate normal. It really was the correlation structure that caused the problems.

My solution to this was to convert the multivariate normal to a series of univariate regressions. The first variable would just be a univariate normal. The second would regress the second on the first with univariate error. The third was a regression of 3 on 1 on 2. And so on.

It turns out that this is the same general approach used in the VAR case. There’s a paper by a Purdue econometrician on large Bayesian VARs that basically does this by converting a normal VAR to an SVAR. My approach above is basically a less sophisticated version of that where you assume VAR coefficients are an identity matrix (for a VAR(1)).

So that would be my advice. I assume you’ve already done the usual things like check for stationary. Sarah Heaps has a great paper on priors enforcing stationarity in VARs. That might also be useful.

1 Like

Thanks for the reply!

I am using the same data as in Villani (2009), and there k=7, q=2, p=4, and N=100 observations available for estimation. So 196 + 14 regression coefficients in the model.

Running 1 chain with 10 000 iterations and 5000 as warmup gives this:

SAMPLING FOR MODEL ‘anon_model’ NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.00289 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 28.9 seconds.
Chain 1: Adjust your expectations accordingly!

I think I may have a distorted view on what “a long time to estimate” is… it takes about 17 mins for stan to run 1 chain with 10 000 iterations and 5000 as warmup

Chain 1: Elapsed Time: 611.261 seconds (Warm-up)
Chain 1: 428.049 seconds (Sampling)
Chain 1: 1039.31 seconds (Total)

And my Gibbs sampler chain takes about 9 minutes. So around 50% faster, but in absolute terms perhaps the difference is not really something to worry about… It’s not a big deal to let it run for 17 mins :-D

Thanks for the reply.

Interesting, I will have to try it out! So you are essentially not including lags of other variables in each equation? Only “own lags”?

I use the data from Villani (2009) where it is assumed that the data is stationary. But for example, the interest rate variables are clearly non-stationary, so I guess an issue would be that some samples of \beta are equivivalent to a non-stationary VAR. The minnesota prior (with prior means for first own lags set to 0.9 for level variables, and 0 for growth rate variables) correspond to a stationary VAR, but perhaps not strongly enough. I wll check out the paper by Heaps :-)

It’s totally understandable. One of the first adjustments when moving from “classic” samplers to HMC is the relationship between time and number of samples. HMC will almost invariably take longer because the proposal method involves way more steps and computation (of which multiple gradient computations, one for each leapfrog should be the most costly), but that is compensated by the “quality” of the samples (i.e. lower correlation, faster convergence to high probability regions, more efficient sampling of this typical set), which is why fewer HMC samples usually still outperforms several fold more of methods like MH or Gibbs (nevertheless, and without objective support for it, I still believe in some particular cases simpler methods may be better suited than HMC, but just because HMC would be overkill, not because it is more likely to fail).

Sometimes this improvement can be seen through the common MCMC diagnostics, but they may also be more subtle. If the samplers are not implemented under the same framework, some of the differences can be explained by that (e.g. a simpler sampler may be implemented in Fortran or C, while a modern sampler may use some higher level language in the method or interface). That’s why there should be a (I’d guess small, though) speed difference between CmdStan and other interfaces (that could be the better benchmark reference).

2 Likes

I see! Thanks for the detailed and interesting response!

1 Like

The single biggest speedup here will very likely be not taking the Kronecker product. Carrying around that huge sparse matrix is very costly, and all you need are the block-diagonal elements. The indexing might be a bit more painful, but if you want this to go fast, just carry around the needed quantities in a non-sparse, non-redundant format.

2 Likes

Alright, will have a look thx!

You ask about not including lags of other variables in each equation. What I’m saying isn’t the same per se. Think analogous to a SVAR with the structural errors driven by a cholesky decomposition. If you have 3 variables under considation, the first regression is the first variable against its own lags and other variables’ lags, the second regression is the second variable against the contemporaneous first variable, its lags, and other variables’ lags, the third regression is the third variable against the contemporaneous first and second variables, its lags, and the other variables’ lags. If I left out the other lags, it’s just out of saving on typing due to writing on iPhone.

BTW, I think this is the paper I was referring to with respect to the structural VAR-type approach: https://joshuachan.org/papers/BVAR-ACP-R1.pdf

The author has a number of other good materials on Bayesian VARs.

With respect to non-stationarity, the biggest problem is if you want to forecast, particularly over long horizons with the model. You can do the usual tests like check the eigenvalues of the companion matrix to get a sense as well. If you get a small number of violations, then you can just toss those collections of parameters. Interest rates can be kind of tricky because they can blow up when inflation expectations increase. So if you only look over the history from 1985 to present, you would get very different results compared to looking from 1960 to present. I think it makes sense to impose some kind of structure or economic intuition on short-term policy rates to help account for this.

1 Like

Pinging @spinkney to check my linear algebra here

if you are just multiplying \Psi' by a diagonal matrix in the kroneker product then the result of I_p \otimes \Psi' is just a block diagonal matrix where each block is \Psi'.

So I think you can do some clever indexing to never have to generate the actual kronecker product. Eigen used to have a block diagonal matrix type, but when I went to put it in Stan I realized it was super broken. So we have to do some weird indexing for this to work.

You can use the cholesky versions of multi_normal which should be faster

If Omega_beta and Omega_Psi are diagonal you can use just use normal instead of multi_normal.

I have not compiled the below but it should look something like this (ymmv)

transformed parameters {
  vector[k*p*k] Omega_beta_diag = sqrt(diagonal(Omega_beta));
  vector[k*p] Omega_psi_diag = sqrt(diagonal(Omega_Psi));
}

parameters {
  matrix[k*p, k] beta; //beta' = (A_1,...,A_p)
  //Psi * x_t = steady state
  matrix[k, q] Psi; 
  // Replaces `Sigma_u`
  // Sigma_u = L_u * L_u'
  cholesky_factor_cov[k] L_u;
}
model {
  for (t in 1:N) {
   // idk what to call this temp
    vector[k*p] WmQPsi_t;
    // Iterate like we going over block elements of a block diagonal matrix
    for (i in 1:p) {
      int qlo = (i - 1) * q + 1;
      int qhi = i * q;
      int klo = (i - 1) * k + 1;
      int khi = i * k;
      WmQPsi_t[klo:khi] = (W[t, klo:khi] - Q[t, qlo:qhi] * Psi')';    // (k, 1) block
    }
    vector[k] mu_t = (X[t] * Psi' + WmQPsi_t' * beta)';  // (k,1)
    Y[t] ~ multi_normal_cholesky(mu_t, L_u);
  }
  // You could just pass `Omega_beta/Omega_psi` as vectors into the model
  to_vector(beta) ~ normal(theta_beta, Omega_beta_diag);
  to_vector(Psi)  ~ normal(theta_Psi,  Omega_psi_diag);
  target += -(k + 1) * sum(log(diagonal(L_u)));
}

The general rule of thumb I have learned from @bgoodri is that if you see a kronecker product you should go to great lengths to try to do some linear algebra to try to not actually compute it.

1 Like

Thanks I see! I am actually already a bit familiar with some of Chan’s work, he has also worked with Bayesian VARMAs, which is the theoretically more attractive model compared to a Bayesian VAR. Also interesting idea with tossing the beta draws which are equivivalent with a non-stationary process… of course that will not speed up anything since i will have to do it after, but still nonetheless a good idea for a more stable model.

WRT interest rates, the structure or economic intuition on short-term policy rates here is actually that we can put a “tight” prior on the steady state/unconditional mean of the interest rate/policy rate, since we have the central bank forecasts and we know the policy rate “probably” won’t deviate super much from those forecasts (or else the central bank would lose credibility)