(Another) Slow Hierarchical Model in Stan

Hello,

I am attempting to estimate a hierarchical model using Stan with code based on the example in the manual (Section 9.13 in manual 2.16.0-2). I have ~564 units with ~160 observations each (total of ~90,000 observations), none missing. Within each unit, the observations are sequential (time series) and there are no unvarying “group level” predictors, i.e. all of my variables vary over time. Thus I have removed the portion of the code that refers to group-level predictors. I have a total of 17 variables and a constant that I input as x.

I have tried the code below on my laptop and estimate that the model will take 7+ days to get through 1000 iterations on each of three chains. I have never run the model to completion; every time I try, the chains seem to stall around 3-4% completion. I transferred the model to a much more powerful computer (~20 cores, 200+ GB of RAM), but the model seems just as slow, if not slower, even when I specify cores=3 and chains=3.

NonCenteredStanCode <- "
data {
int<lower=1> N;    // number of observations, in-sample, 90000
int<lower=1> K;    // number of variables in the model, 18
int<lower=1> nID;                    // number of unique IDs (unique groups), 564
int<lower=1, upper=nID> id[N];       // group ID variable for each observation
matrix[N, K] x;       // inputting the in-sample data matrix, 90000 x 18 matrix
matrix[nID, 1] u;  // this is a vestige of the manual's programming, there are no group-level predictors here
vector[N] loutcome; // logged outcome for each observation
}
parameters {
matrix[K, nID] z;
cholesky_factor_corr[K] L_Omega;
vector<lower=0,upper=pi()/2>[K] tau_unif;
matrix[1, K] gamma;
real<lower=0> sigma;
}
transformed parameters {
matrix[nID, K] beta;
vector<lower=0>[K] tau; // prior scale
for (k in 1:K) {tau[k] = 2.5 * tan(tau_unif[k]);}
beta = u * gamma + (diag_pre_multiply(tau, L_Omega) * z)';
}
model {
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(1);
to_vector(gamma) ~ normal(0, 2);

target += normal_lpdf(loutcome | rows_dot_product(beta[id], x), sigma);
}
"

Also, following the advice here:

https://groups.google.com/forum/#!msg/stan-users/HWiUtQLRCfE/tPPN3bAeDQAJ

I have tried a centered parameterization as well. (Code below.) This seems only marginally faster; I estimate it will still take roughly the same amount of time to reach completion.

CenteredStanCode <- "
data {
  int<lower=0> N; // num individuals
  int<lower=1> K; // num ind predictors
  int<lower=1> nID; // was called J, num groups
  int<lower=1,upper=nID> id[N]; // group for individual
  matrix[N, K] x; // individual predictors
  vector[N] loutcome; // outcomes
}
parameters {
  corr_matrix[K] Omega; // prior correlation
  vector<lower=0>[K] tau; // prior scale
  row_vector[K] gamma; // group coeffs
  vector[K] beta[nID]; // indiv coeffs by group
  real<lower=0> sigma; // prediction error scale
}
model {
  tau ~ cauchy(0, 2.5);
  Omega ~ lkj_corr(2);
  to_vector(gamma) ~ normal(0, 5);
  {
    matrix[K, K] Sigma_beta;
    Sigma_beta = quad_form_diag(Omega, tau);
    for (j in 1:nID)
      beta[j] ~ multi_normal(gamma, Sigma_beta);
  }
  
  {
    vector[N] x_beta_id;
    for (n in 1:N)
      x_beta_id[n] = x[n] * beta[id[n]];
    loutcome ~ normal(x_beta_id, sigma);
  }
}
"

Any ideas about what’s happening or ways to get this moving faster? I can’t determine if this is a coding issue, or if I’m asking too much from the model with varying intercepts and slopes over 564 cross-sections, or something else entirely. I’d appreciate any insights, thanks everyone!

You need the QR reparameterization of x, for which there is a case study
http://mc-stan.org/users/documentation/case-studies/qr_regression.html
in addition to the explanation in the manual.

Also, the timings that come out before the first iteration are completely bogus.
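To see what the reparameterization buys you, here is a minimal pure-Python sketch with hypothetical toy numbers (in practice you would compute the decomposition with qr() in R or numpy.linalg.qr before passing Q and R to Stan):

```python
# Toy sketch of the QR reparameterization: with the thin decomposition
# x = Q R, the linear predictor x * beta equals Q * beta_tilde where
# beta_tilde = R * beta, and the columns of Q are orthonormal, which is
# much friendlier geometry for HMC. All numbers here are made up.

def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def thin_qr(X):
    """Thin QR via classical Gram-Schmidt (fine for tiny, well-conditioned X)."""
    n, k = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(k)]
    q_cols, R = [], [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(q[t] * cols[j][t] for t in range(n))
            v = [v[t] - R[i][j] * q[t] for t in range(n)]
        R[j][j] = sum(t * t for t in v) ** 0.5
        q_cols.append([t / R[j][j] for t in v])
    Q = [[q_cols[j][i] for j in range(k)] for i in range(n)]
    return Q, R

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy 3 x 2 design matrix
beta = [[0.5], [-1.5]]                    # toy coefficients (column vector)
Q, R = thin_qr(X)
beta_tilde = matmul(R, beta)              # coefficients on the Q scale
lhs = matmul(X, beta)                     # x * beta
rhs = matmul(Q, beta_tilde)               # Q * beta_tilde; matches lhs
```

Sampling on the beta_tilde scale and recovering beta = R_inv * beta_tilde afterwards is the idea the case study walks through.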

Thank you @bgoodri for the quick suggestion. I have looked into the QR reparameterization, but I would appreciate it if someone could double-check my code (below), since I am unsure how to implement the reverse transformation from beta_tilde to beta in the transformed parameters block when beta is a matrix.

I saw here
https://groups.google.com/forum/#!category-topic/stan-users/Ozu7kwnyS_Y
that @betanalpha suggested code that works for a vector beta, but I am unsure about applying this to a matrix beta. I wrote the code below as my attempt. I suspect Version 2 would be faster but am unsure whether it is mathematically correct.

// Reverse the transform to get original beta

// Version 1: row by row
for (i in 1:nID) {
  beta[i, ] = mdivide_right_tri_low(beta_tilde[i, ], R'); // indiv coeffs by group
}

// Version 2: whole matrix at once
beta = mdivide_right_tri_low(beta_tilde, R'); // indiv coeffs by group
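To convince myself the two versions should agree, I sketched the triangular solve in pure Python with toy numbers (a hypothetical stand-in for Stan's mdivide_right_tri_low, not its actual implementation): right-division by a triangular matrix acts on each row of beta_tilde independently, so the per-row loop and the single whole-matrix call coincide.

```python
# Toy stand-in for mdivide_right_tri_low(B, L), which solves X * L = B
# for X with L lower triangular. Each row of X depends only on the
# matching row of B, so Version 1 (looping over rows) and Version 2
# (one call on the whole matrix) must give identical results.

def mdivide_right_tri_low(B, L):
    """Solve X @ L = B for X, with L lower triangular, row by row."""
    k = len(L)
    X = []
    for b in B:
        x = [0.0] * k
        for j in range(k - 1, -1, -1):  # back-substitute from the last column
            s = sum(x[i] * L[i][j] for i in range(j + 1, k))
            x[j] = (b[j] - s) / L[j][j]
        X.append(x)
    return X

Rt = [[2.0, 0.0], [1.0, 3.0]]    # stands in for R' (lower triangular)
beta = [[1.0, 2.0], [3.0, 4.0]]  # the "true" per-group coefficients
# Forward transform: beta_tilde = beta @ R'
beta_tilde = [[sum(b[i] * Rt[i][j] for i in range(2)) for j in range(2)]
              for b in beta]

v1 = [mdivide_right_tri_low([row], Rt)[0] for row in beta_tilde]  # Version 1
v2 = mdivide_right_tri_low(beta_tilde, Rt)                        # Version 2
```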

Full model code:

QRNonCenteredStanCode <- "

data {
// Number of observations and variables
int<lower=1> N;    // number of observations, in-sample
int<lower=1> K;    // number of variables in the model

// Identifiers
int<lower=1> nID;                    // number of unique IDs (unique state, channel, product)
int<lower=1, upper=nID> id[N];       // group ID variable for each observation

// (Individual-level) input data, QR decomposition
matrix[N, K] Q;       // inputting the in-sample data matrix
matrix[K, K] R;
matrix[K, K] R_inv;

// (Group-level) input data
matrix[nID, 1] u;  // a vestige of the manual's programming

// Outcomes
vector[N] loutcome; // outcome for each observation

}

parameters {

// Disturbances (standard normal) to create non-centered parameterization
matrix[K, nID] z;

// Cholesky decomposition of Omega (the correlation matrix)
cholesky_factor_corr[K] L_Omega;

// The scale factor
vector<lower=0,upper=pi()/2>[K] tau_unif;

// Group coefficients, gamma (unvarying for this model)
matrix[1, K] gamma;

real alpha;

// Prediction error scale (standard deviation), for final outcome, loutcome
real<lower=0> sigma;


}

transformed parameters {

// Beta, the original non-QR form
matrix[nID, K] beta;

// Beta tilde, the coefficients as modeled
matrix[nID, K] beta_tilde;

// The prior scale, tau
vector<lower=0>[K] tau; // prior scale

// Input values for tau based on the parameter tau_unif
for (k in 1:K) {tau[k] = 2.5 * tan(tau_unif[k]);}

// Create the beta matrix (vectorized form)
beta_tilde = u * gamma + (diag_pre_multiply(tau, L_Omega) * z)';

// Reverse the transform to get original beta
beta = mdivide_right_tri_low(beta_tilde, R'); // indiv coeffs by group

}

model {
// These are the priors
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(2);
to_vector(gamma) ~ normal(0, 0.1);
to_vector(beta[nID,]) ~ normal(0, .1);

// This is the sampling statement, vectorized for efficiency
target += normal_lpdf(loutcome | (rows_dot_product(beta[id], Q) + alpha), sigma);

// Note that this implies that XB is the logged (additive) value
// thus the outcome MUST be the logged version of quotes/binds

// Note that sigma is the standard deviation, not the variance

}
"

A matrix is just a collection of vectors glued together columnwise, so it should be the same. You can always check that X * beta = Q * beta_tilde up to a very small numerical tolerance.
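A minimal sketch of that check in pure Python, using hypothetical toy numbers and a QR decomposition computed by hand for a 3 x 2 design:

```python
import math

# Toy check: with x = Q R and beta_tilde_j = beta_j * R' for every group j,
# the per-observation predictors rows_dot_product(beta[id], x) and
# rows_dot_product(beta_tilde[id], Q) must agree to floating-point tolerance.

s3, s6 = math.sqrt(3.0), math.sqrt(6.0)
X = [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]         # N = 3 obs, K = 2
Q = [[1/s3, 1/s6], [1/s3, -2/s6], [1/s3, 1/s6]]  # thin QR of X, by hand
R = [[s3, 2/s3], [0.0, s6/3]]

beta = [[0.5, -1.0], [2.0, 0.25]]                # nID = 2 groups of coefficients
id_ = [0, 1, 0]                                  # group of each obs (0-based here)

# beta_tilde_j = beta_j @ R' (row-vector convention, as in the Stan code)
beta_tilde = [[sum(b[k] * R[j][k] for k in range(2)) for j in range(2)]
              for b in beta]

pred_x = [sum(X[n][k] * beta[id_[n]][k] for k in range(2)) for n in range(3)]
pred_q = [sum(Q[n][k] * beta_tilde[id_[n]][k] for k in range(2)) for n in range(3)]
```

If pred_x and pred_q ever diverge beyond rounding error, the reverse transform is wrong.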