Slow Stan Code for Hierarchical Multivariate Model


#1

Hi everyone! I am trying to fit a hierarchical multivariate model. However, the model takes long long hours to run. Below are the model, Rstan code and simulation data. I hope to get some help. Thank you!

  • The model I am trying to fit is the following:

\left( \begin{array}{c} Y_i \\ \log s_i \\ \end{array} \right) \sim N_2 \biggl[ \left( \begin{array}{c} \theta_i \\log \sigma_i \\ \end{array} \right), \left( \begin{array}{cc} \sigma_i^2 & \rho_1\sigma_i \\\rho_1\sigma_i & 1 \\ \end{array} \right) \biggr]
Hierarchical model for \theta_i and \sigma^2_i,
\left( \begin{array}{c} \theta_i \\ \log \sigma_i \\ \end{array} \right) \sim N_2 \biggl[ \left( \begin{array}{c} \mu_\theta \\\mu_\sigma \\ \end{array} \right), \left( \begin{array}{cc} r_\theta^2 & \rho_2 r_{\theta}r_{\sigma} \\\rho_2 r_{\theta}r_{\sigma} & r_\sigma^2 \\ \end{array} \right) \biggr]

  • Rstan code
data {
      int<lower=1> N;
      vector[2] x[N];
    }
    parameters {
      vector[2] mu;
      vector<lower=0>[2] lambda;
      vector[2] theta[N];
      real<lower=-1,upper=1> r1;
      real<lower=-1,upper=1> r2;
    }
    transformed parameters {
      vector<lower=0>[2] sigma;
      cov_matrix[2] T;
      cov_matrix[2] M[N];
      //Reparameterization
      sigma[1] = inv_sqrt(lambda[1]);
      sigma[2] = inv_sqrt(lambda[2]);
      T[1,1] = square(sigma[1]);
      T[1,2] = r2 * sigma[1] * sigma[2];
      T[2,1] = r2 * sigma[1] * sigma[2];
      T[2,2] = square(sigma[2]);
      //Reparameterization
      for (i in 1:N){
      M[i][1,1] = square(exp(theta[i,2]));
      M[i][1,2] = r1 * exp(theta[i,2]);
      M[i][2,1] = r1 * exp(theta[i,2]);
      M[i][2,2] = 1;
      }
    }
    model {
    lambda[1] ~ gamma(1,1);
    lambda[2] ~ gamma(1,1);
    for (i in 1:N){
    theta[i] ~ multi_normal(mu, T);
    x[i] ~ multi_normal(theta[i], M[i]);}
    }
  • Data generation
i <- 999
n <- 50
mu <-100
tau <- 50

set.seed(i)
y=vector()
theta=vector()
s2=vector()
sigma2=vector()

theta=rnorm(n=n,mean=mu,sd=tau)
sigma2=rlnorm(n=n,mean=5+0.1*theta,sd=1)
y=rnorm(n,theta,sqrt(sigma2))
s2=rlnorm(n=n,mean=log(sigma2),sd=1)
data_simu=cbind(y,theta,sigma2, s2)
data <- data.frame(data_simu)[,c("theta","y","s2")]
x <- cbind(data$y,log(sqrt(data$s2)))
dat=list(N=n,x=x)

#2

Not completely understanding your code, but here are some quick thoughts:

  • It seems like the lambda parameters represent precision of the normal distribution, with a gamma prior. Andrew and others seem to be of the opinion that this is not a great parametrization - working with sigma directly and putting a Half-normal prior on sigma might improve things.
  • exp(theta[i,2]) can be computed once and then reused
  • multi_normal_cholesky tends to be a faster version. For the T matrix, which is reused, you should get a speed up by calling cholesky_decompose once and then using the result. You could be able to do some algebra and compute the Cholesky factor of the covariance matrix directly from the parameters which would be useful for the M matrices.
  • The “folk theorem” states that poor performance usually signals some problems with the model. Does the model converge? Does it recover simulated data well? Are the Rhat and n_eff statistics reasonable?
  • lambda[1] ~ gamma(1,1); lambda[2] ~ gamma(1,1); could be vectorized as lambda ~ gamma(1,1);

#3

You’re doing a lot of redundant work here.

First, I wouldn’t recommend parameterizing in terms of precision (lambda). Just use the scales sigma as parameters directly—it’s much easier to reason about priors that way.

Second, don’t recompute repeated terms. For example, just use T[2,1] = T[1,2] in the definition of the covariance matrices.

Third, add priors. They help immensely with controlling both the statistical and computational priors of models.

Overall, we tend to factor our covariance priors as you have done, but we tend to do it with an LKJ prior on the correlation matrix and then normal or maybe student-t priors on the scales. There’s a discussion in the regression chapter of the manual. This approach is computationally more efficient and tends to fit better, too.

Vectorize everything, so that it’s just lambda ~ gamma(1, 1) (though I’d really encourage something more like sigma ~ normal(0, 1).

Then take use theta ~ multi_normal(mu, T) outside of the loop so that it only needs to solve the matrix T once. Not much you can do to make the second multi-normal faster as you have a different value for each entry.


#4

Hi! I’m trying to improve my hierarchical multivariate model as well, and hope it’s okay to post the model here:

There are IRT models on top of a hierarchical multivariate model (with diagonal level 1 error variances and unstructured level 2 error variances ). Any suggestions will be greatly appreciated!

data {
  int<lower=1> N0;
  int<lower=1> jj0[N0];
  int<lower=0,upper=1> y0[N0];
  real beta0[N0];
  real alpha0[N0];

  int<lower=1> N1;
  int<lower=1> jj1[N1];
  int<lower=0,upper=1> y1[N1];
  real beta1[N1];
  real alpha1[N1];

  int<lower=1> N2;
  int<lower=1> jj2[N2];
  int<lower=0,upper=1> y2[N2];
  real beta2[N2];
  real alpha2[N2];

  int<lower=1> J;
  matrix[3,2] X;
  matrix[3,2] X_new;
  matrix[2,2] R;
  matrix[3,3] Omega;
  cov_matrix[2] wishart_sigma;
  int<lower=1> wishart_df;
}

parameters {
  vector[3] THETA[J];

  //level2/////////////////////////////////////////////////////////////////////
  vector[2] beta;
  cov_matrix[2] L2_cov;
  vector[2] pi_i[J];

  //level1////////////////////////////////////////////////////////////////////
  vector<lower=0>[3] L1_sigma;
}

transformed parameters{
  real theta0[J];
  real theta1[J];
  real theta2[J];

  matrix[3, 3] L1_cov;
  L1_cov= quad_form_diag(Omega, L1_sigma);

  for (n in 1:J)
    theta0[n]=THETA[n,1];
  for (n in 1:J)
    theta1[n]=THETA[n,2];
  for (n in 1:J)
    theta2[n]=THETA[n,3];
}

model {
  vector[N0] atb_0;
  vector[N1] atb_1;
  vector[N2] atb_2;

  // priors://////////////////////////////////////////////////////////////////
  theta0 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  theta2 ~ normal(0, 1);

  to_vector(beta) ~ normal(0, 5);
  L2_cov ~ inv_wishart(wishart_df,wishart_sigma);
  L1_sigma~uniform (0, 20);

  for (n in 1:N0)
    atb_0[n]=alpha0[n]*(theta0[jj0[n]] - beta0[n]);
  y0 ~ bernoulli_logit(atb_0);

  for (n in 1:N1)
    atb_1[n]=alpha1[n]*(theta1[jj1[n]] - beta1[n]);
  y1 ~ bernoulli_logit(atb_1);

  for (n in 1:N2)
    atb_2[n]=alpha2[n]*(theta2[jj2[n]] - beta2[n]);
  y2 ~ bernoulli_logit(atb_2);

  //random effects:////////////////////////////////////////////////////////////
  pi_i ~ multi_normal(beta,L2_cov);

  //likelihood:////////////////////////////////////////////////////////////////

  for (n in 1:J)
    THETA[n]~ multi_normal(X * pi_i[n],L1_cov);
}

generated quantities {
  vector[3] THETA_PRED[J];
  for (n in 1:J)
    THETA_PRED[n]=multi_normal_rng(X_new * pi_i[n],L1_cov);
}

[edit: escaped and indented code]


#5

Not only OK, but encouraged. I escaped it with

```
code
```

so it’d be easier to read.


#6

I’d recommend reading the manual chapter on regression—it goes over our recommended priors and also provides efficiency and programming tips. The main takeaways to look for are:

  • use a scaled LKJ rather than inverse Wishart prior
  • work with Cholesky factor parameterizations for efficiency (huge savings in more than a couple dimensions)
  • vectorize the multivariate normals (even bigger efficiency savings because you only solve once)
  • don’t use interval-constrained priors (like on L1_sigma), and if you do insist on going against our recommendations, you should declare L1_sigma with <lower = 0, upper = 20>—Stan requires the support to match the constraints and the onstraints allow L1_sigma = 30, but that is out of support for the log density
  • vectorize assignments, e.g.,
    atb_1 = alpha1 .* (theta1[jj1] - beta1); and
    and
    theta0 = THETA[ , 1];

#7

Thanks, Bob. These comments are very helpful!


#8

Thanks so much!


#9

Actually, this model does converge when I do the model checking. I believe it is the loop that makes this model need more time to run. The LKJ prior on the correlation matrix is quite helpful and I also simplified some redundant codes. Thank you so much!