Speeding up hierarchical VAR

I’m fitting a hierarchical vector autoregression (VAR), with units nested within C groups (countries). Currently, the model takes about 6 hours to get through 1500 samples with only 1 lag, and I’d like to incorporate additional lags in the future.

data{
  
  int N; //# observations
  int K; //# dimensions of Y
  int C; //# of countries
  int T; //# of time periods in the panel
  
  int<lower = 1, upper=C> country[N]; //country id for each obs
  int<lower = 1, upper=T> time[N]; //time period id for each obs

  matrix[N,K] Y; //the outcome matrix - each variable's time series stacked
  
}

parameters{
  
  //individual level
  corr_matrix[K] Omega_local[C]; //correlation of residuals across the K outcomes, per country
  vector<lower = 0>[K] tau[C]; //scale for residuals
  matrix[K, K] z_beta[C]; //untransformed betas 
  vector[K] z_alpha[C]; //untransformed intercepts
  
  //hierarchical priors
  real<lower = 0, upper = 1> rho; //pooling coefficient
  corr_matrix[K] Omega_global;
  vector[K] tau_loc; //location for the prior on tau
  vector<lower=0>[K] tau_scale; //scale for tau
  matrix[K, K] bhat_location; //mean for prior on beta
  matrix<lower = 0>[K, K] bhat_scale; //scale for prior on beta
  vector[K] ahat_location; //means for prior on intercepts
  vector<lower = 0>[K] ahat_scale; //scale for intercept prior
  
  
}

transformed parameters{
  
  matrix[K, K] beta[C]; //VAR(1) coefficients, country specific
  vector[K] alpha[C]; //country specific intercepts
  corr_matrix[K] Omega[C];

  
  for(c in 1:C){
    //recentering random effects
    alpha[c] = ahat_location + ahat_scale .* z_alpha[c];
    beta[c] = bhat_location + bhat_scale .* z_beta[c];
    Omega[c] = rho*Omega_global + (1-rho)*Omega_local[c];
  }
  
}

model{

  //hyperpriors
  rho ~ beta(2,2);
  tau_loc ~ cauchy(0,1);
  tau_scale ~ cauchy(0,1);
  ahat_location ~ normal(0,1);
  ahat_scale ~ cauchy(0, 1); 
  to_vector(bhat_location) ~ normal(0, 0.5);
  to_vector(bhat_scale) ~ cauchy(0, 0.5);
  Omega_global ~ lkj_corr(1);
  
  //hierarchical priors
  for(c in 1:C){
    //non-centered parameterization to avoid convergence issues
    z_alpha[c] ~ normal(0, 1);
    to_vector(z_beta[c]) ~ normal(0, 1);
    tau[c] ~ normal(tau_loc, tau_scale);
    Omega_local[c] ~ lkj_corr(10);
  }
  
  //likelihood
  for(n in 1:N){
    if(time[n] > 1){
      Y[n] ~ multi_normal(alpha[country[n]] + beta[country[n]]*Y[n-1]',
      quad_form_diag(Omega[country[n]], tau[country[n]]));
    }
  }
  
  
}

My initial guess is that the loop in the likelihood is probably what’s slowing things down, but I’m not sure whether it can actually be replaced with anything more efficient.

Any suggestions on performance improvements are much appreciated!


First you should check out the optimizations in the Multivariate Priors for Hierarchical Models section of the Stan User’s Guide, specifically the optimizations that are presented later in the chapter. Then take a look at the reduced-computation-through-recognizing-redundancy trick I present here.
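To give a flavor of the first of those, here is a minimal sketch of the Cholesky-factor version of your likelihood. It assumes you build a per-country Cholesky factor L_Omega[c] in transformed parameters by cholesky_decompose-ing the same rho-weighted mixture you currently use for Omega[c]; everything else stays as in your model:

parameters {
  cholesky_factor_corr[K] L_Omega_local[C]; // replaces corr_matrix[K] Omega_local[C]
  cholesky_factor_corr[K] L_Omega_global;   // replaces corr_matrix[K] Omega_global
  // ... remaining parameters unchanged ...
}
model {
  matrix[K, K] L_Sigma[C]; // per-country Cholesky factor of the covariance

  // ... other priors unchanged, but now on the Cholesky-factor parameters ...
  L_Omega_global ~ lkj_corr_cholesky(1);
  for (c in 1:C)
    L_Omega_local[c] ~ lkj_corr_cholesky(10);

  // build each country's covariance factor once, not once per observation
  for (c in 1:C)
    L_Sigma[c] = diag_pre_multiply(tau[c], L_Omega[c]);

  for (n in 1:N) {
    if (time[n] > 1)
      // multi_normal_cholesky avoids re-factoring the covariance at every evaluation
      Y[n] ~ multi_normal_cholesky(
        alpha[country[n]] + beta[country[n]] * Y[n - 1]',
        L_Sigma[country[n]]);
  }
}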

Oh, and I’m not super familiar with VAR models, but you seem to have T observations in each country, once per timepoint, and you don’t seem to use time as a predictor at all. That suggests you can restructure your model to be more similar to the one presented in the manual chapter, where you end up with a likelihood of the form y ~ normal(something, measurement_noise). Framing it that way might help by itself, but you’d then also be prepared to use the new reduce_sum function coming in Stan 2.23.
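Just to sketch where reduce_sum would slot in once 2.23 lands (rough and untested; the names Y_obs, mu, L_Sigma, and ctry are placeholders for the time > 1 outcomes, their conditional means, per-country Cholesky factors, and country ids):

functions {
  // partial sum over a slice of the stacked time > 1 observations
  real partial_sum(vector[] Y_slice, int start, int end,
                   vector[] mu, matrix[] L_Sigma, int[] ctry) {
    real lp = 0;
    for (n in start:end)
      lp += multi_normal_cholesky_lpdf(Y_slice[n - start + 1] | mu[n], L_Sigma[ctry[n]]);
    return lp;
  }
}
model {
  // ... priors and construction of mu and L_Sigma ...
  target += reduce_sum(partial_sum, Y_obs, 1, mu, L_Sigma, ctry); // grainsize of 1 lets Stan choose
}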

Thanks! The Cholesky factorization trick seems likely to work, though there’s no way to avoid calculating the covariance for each observation since it depends on the observation’s country (unless I’m misunderstanding something in the example). I’ll give it a go.

After looking over the manual page again, I thought I’d figured out a way to avoid calculating the country-specific covariance N times, but I’m getting a “variable corr_matrix does not exist” error. Am I misunderstanding how to define the local scope?

Revised model block:

model{

  //hyperpriors
  rho ~ beta(2,2);
  tau_loc ~ cauchy(0,1);
  tau_scale ~ cauchy(0,1);
  ahat_location ~ normal(0,1);
  ahat_scale ~ cauchy(0, 1); 
  to_vector(bhat_location) ~ normal(0, 0.5);
  to_vector(bhat_scale) ~ cauchy(0, 0.5);
  Omega_global ~ lkj_corr(1);
  
  //hierarchical priors
  for(c in 1:C){
    //non-centered parameterization to avoid convergence issues
    z_alpha[c] ~ normal(0, 1);
    to_vector(z_beta[c]) ~ normal(0, 1);
    tau[c] ~ normal(tau_loc, tau_scale);
    Omega_local[c] ~ lkj_corr(10);
  }
  
  //likelihood
  
  {
    corr_matrix[K] omega_all[C];

    for(c in 1:C){
      omega_all[1:K, c] = to_vector(quad_form_diag(Omega[c], tau[c]));
    }

    for(n in 1:N){
      if(time[n] > 1){
        Y[n] ~ multi_normal(alpha[country[n]] + beta[country[n]]*Y[n-1]',
                            omega_all[country[n]]);
      }
    }
  }
}

Whoops - sorry for all the notifications; I just realized I need to define the covariance as a transformed parameter. It does in fact save a decent bit of time!
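In case it helps anyone else, the change amounts to something like this (a sketch; the other blocks stay as posted above):

transformed parameters {
  // ... alpha, beta, Omega as before ...
  matrix[K, K] Sigma[C]; // per-country covariance, built once instead of once per observation
  for (c in 1:C)
    Sigma[c] = quad_form_diag(Omega[c], tau[c]);
}
model {
  // ... priors as before ...
  for (n in 1:N) {
    if (time[n] > 1)
      Y[n] ~ multi_normal(alpha[country[n]] + beta[country[n]]*Y[n-1]', Sigma[country[n]]);
  }
}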


Greetings,

Would you please share your final solution? That would be a big help for me!

If you’re following @James_Savage’s implementation at RPubs - An introduction to hierarchical VAR modeling, I’ve coded up a version that is about 30% faster (if I remember correctly). It pre-allocates mu and uses segment to vectorize the calls to multi_normal_cholesky. Hope this helps!

data {
  int N; // number of observations (number of rows in the panel)
  int K; // number of dimensions of the outcome variable
  int I; // number of individuals
  int T; // The greatest number of time periods for any individual
  int<lower = 1, upper = I> individual[N]; // integer vector the same length 
  // as the panel, indicating which individual each row corresponds to
  int<lower = 1, upper = T> time[N]; // integer vector with individual-time 
  // (not absolute time). Each individual starts at 1
  matrix[N, K] Y; // the outcome matrix, each individual stacked on each
  // other, observations ordered earliest to latest
  vector[K] Y_[N - I]; // Y with each individual's first observation dropped
  int grp_size[I]; // number of modeled (time > 1) observations per individual
}
parameters {
  // individual-level parameters
  cholesky_factor_corr[K] L_Omega_local[I]; // cholesky factor of correlation matrix of 
  // residuals each individual (across K outcomes)
  vector<lower = 0>[K] tau[I]; // scale vector for residuals
  matrix[K, K] z_beta[I];
  vector[K] z_alpha[I];
  
  // hierarchical priors (individual parameters distributed around these)
  real<lower = 0, upper = 1> rho;
  cholesky_factor_corr[K] L_Omega_global;
  vector[K] tau_location;
  vector<lower =0>[K] tau_scale;
  matrix[K,K] beta_hat_location;
  matrix<lower = 0>[K,K] beta_hat_scale;
  vector[K] alpha_hat_location;
  vector<lower = 0>[K] alpha_hat_scale;
}
model {
  // hyperpriors
  rho ~ beta(2, 2);
  tau_location ~ std_normal();
  tau_scale ~ exponential(1);
  alpha_hat_location ~ std_normal();
  alpha_hat_scale ~ exponential(1);
  to_vector(beta_hat_location) ~ normal(0, .5);
  to_vector(beta_hat_scale) ~ exponential(1);
  L_Omega_global ~ lkj_corr_cholesky(1);
  
  {
    int pos_ = 1;
    int pos = 1;
    vector[K] mu[N - I];
    matrix[K, K] beta[I]; // VAR(1) coefficient matrix
    vector[K] alpha[I]; // intercept matrix
    matrix[K, K] L_Omega[I];
    matrix[K, K] Omega_global = multiply_lower_tri_self_transpose(L_Omega_global);
    
    for (i in 1:I) {
      alpha[i] = alpha_hat_location + alpha_hat_scale .* z_alpha[i];
      beta[i] = beta_hat_location + beta_hat_scale .* z_beta[i];
      L_Omega[i] = cholesky_decompose(rho * Omega_global + (1 - rho) * multiply_lower_tri_self_transpose(L_Omega_local[i]));
    }

    // pre-compute the conditional mean for every modeled (time > 1) observation
    for (n in 1:N) {
      if (time[n] > 1) {
        mu[pos] = alpha[individual[n]] + beta[individual[n]] * Y[n - 1]';
        pos += 1;
      }
    }
    
    // hierarchical priors and the vectorized likelihood, one individual at a time
    for (i in 1:I) {
      matrix[K, K] varcov = diag_pre_multiply(tau[i], L_Omega[i]); // Cholesky factor of the covariance
      // non-centered parameterization
      z_alpha[i] ~ std_normal();
      to_vector(z_beta[i]) ~ std_normal();
      tau[i] ~ normal(tau_location, tau_scale);
      L_Omega_local[i] ~ lkj_corr_cholesky(5);

      segment(Y_, pos_, grp_size[i]) ~ multi_normal_cholesky(segment(mu, pos_, grp_size[i]), varcov);
      pos_ += grp_size[i];
    }
  
  }
}
generated quantities {
  matrix[K, K] Omega_global = multiply_lower_tri_self_transpose(L_Omega_global);
}


Thank you for your response. I’m not very familiar with your variant implementation when it comes to Y_[N - I] and the segment function, but your code is a great starting point!

Here are the relevant updates to the R code to get each group’s size and to make Y_, which is just Y with the first observation removed from each time series.

# find the row index of each individual's first observation
# (assumes the panel is sorted by individual, then time)
grp <- 1
count <- 1
pos <- vector()
for (n in 1:length(individual)) {
  if (individual[n] == grp) {
    pos[grp] <- count
    grp <- grp + 1
  }
  count <- count + 1
}

# drop each individual's first observation to get Y_
Y_ <- Y[-pos, ]

gdp_dt <- data.table::as.data.table(gdp_cons_inv_2)

mod_data <- list(
  individual = as.numeric(as.factor(gdp_cons_inv_2$country)),
  time = gdp_cons_inv_2$time,
  T = max(gdp_cons_inv_2$time),
  I = max(as.numeric(as.factor(gdp_cons_inv_2$country))),
  N = nrow(gdp_cons_inv_2),
  Y = gdp_cons_inv_2 %>% 
    ungroup %>%
    select(dl_gdp, dl_cons, dl_gfcf),
  K = 3,
  Y_ = Y_,
  grp_size = gdp_dt[, .N, by = country][, N] - 1
)
