 Rstan much slower than self-coded MCMC

I have the following code for a hierarchical model that I coded with MCMC in R before. Surprisingly, this Stan code runs super slow (on Mac OS) comparing to my MCMC in R. Did I do anything obviously silly to slow it down? I have attached all the testing code and data here if anyone could help me out. Any comments would be appreciated.

functions{
// a function to calculate the square root of f ratios
real sqrt_f_ratio(real nu_j, real nu_0, real rho_0, row_vector omega1){
real alpha;
real front;
real omega2;
omega2 = omega1^2 + omega1^2;
alpha = 4 * nu_0 / rho_0^2;
front = (tgamma(nu_j + 1) * tgamma(nu_0)) /
(tgamma(nu_j) * tgamma(nu_0 + 1));
return (sqrt(front * (alpha / (alpha + omega2))^(nu_j - nu_0)));
}
}

data {
int<lower=1> N;       // number of locations
int<lower=1> P;       // number of variables
int<lower=1> L;       // number of frequencies
matrix[N, 2] s;       // locations N by 2
matrix[L, 2] omega;   // frequencies L by 2
matrix[N, P] Y;       // data N row-vectors of length P
cov_matrix[P] Sigma;  // covariance matrix of amplitudes
}

parameters {
vector<lower=0>[P] Tau2;       // nugget variance
vector<lower=0>[P+1] Nu;       // nu's for spectral fct f_0 ... f_p
vector<lower=0>[P+1] Rho;
matrix[L, P]  A;                  // amplitudes L row-vectors of length P
matrix[L, P]  B;
}

model {
matrix[P,P] D1[L,N];              // L by N array of diagonal matrices of size P
matrix[P,P] D2[L,N];
matrix[N,P] W;                      // N by P matrix for the mean
row_vector[P] Lsum;            // P by 1 row-vector to initiate summation over L
vector[P] sfr;                        // sqrt of f_ratio as a vector of length P

// priors
for(p in 1:(P+1)){
Nu[p] ~ lognormal(-2, 1);      // nu_p = exp(Normal) i.e. log(nu_p) ~ Normal
Rho[p] ~ lognormal(0, 1);
}
for(p in 1:P){
Tau2[p] ~ inv_gamma(0.1, 0.1);
}

for (l in 1:L){  //single index of matrix refers to its row
A[l] ~ multi_normal(rep_vector(0,P), Sigma);  // P by 1 row-vector amplitude
B[l] ~ multi_normal(rep_vector(0,P), Sigma);  // P by 1 row-vector amplitude
for(j in 1:P){
sfr[j] = sqrt_f_ratio(Nu[j+1], Nu, Rho[j+1], omega[l]);
}
for(i in 1:N){
D1[l,i] = cos(omega[l]*s[i]') * diag_matrix(sfr);
// diagonal matrix of size P
D2[l,i] = sin(omega[l]*s[i]') * diag_matrix(sfr);
// diagonal matrix of size P
}
}
for (i in 1:N){
Lsum = rep_row_vector(0, P);                      // P by 1 row-vector of all 0
for(l in 1:L){                                    // to sum over l
W[i] = Lsum + A[l] * D1[l,i] + B[l] * D2[l,i];  // P by 1 row-vector after summing over l
}
Y[i] ~ multi_normal(W[i], diag_matrix(Tau2));
}

}

And, here is the R part to read everything in and run Stan.

library(RCurl)
url1 <- "https://raw.githubusercontent.com/wenlongG/data/main/Y.csv"
url2 <- "https://raw.githubusercontent.com/wenlongG/data/main/s.csv"
N = nrow(Y)
P <- ncol(Y)
L <- 100
nu.vec_True <- c(1,1,4)
rho_True <- c(2,2,4)
grid <- seq(-pi, pi, length.out = 10)
omega <- expand.grid(grid, grid)
# use a simplified Sigma for now, instead of the mixture model
Sigma <- 0.2 * diag(P) + 0.8

## input data for Stan
stan_data <- list(N = N,
P = P,
L = L,
s = s,
omega = omega,
Y = Y,
Sigma = Sigma)

## initial values for Stan
init_0 <- function() {
list(Tau2 = rep(0.2, P),
Nu = nu.vec_True, # initiate with true value
Rho = rho_True,
A = matrix(rnorm(P*L),nrow = L, ncol = P),
B = matrix(rnorm(P*L),nrow = L, ncol = P)
)
}

## call stan to run HMC
library("rstan")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit0 <- stan(model_code = stan0,
data = stan_data,
chains = 4,
init = init_0,
#warmup = 500,
iter = 40,
verbose = TRUE
)

Please start to use the multi normal cholesky and do use vectorized expressions instead of for loops.

You should probably look at the diagnostics first to see how much computational effort is being expended and why. So check on treedepth, the adaptation matrix, divergences, there might just be something very wrong with your model that doesn’t show up in another sampler.

You’re definitely right I’m supposed to vectorize some operations to ease the computation. I thought STAN should be faster than R as it’s source coded in C, even with many loops.

It looks like the diagnostics command is more for checking convergence? I’m afraid it’s not there yet. At this point, I hope there’s a profiling tool in STAN to check where the bottleneck is. Because, with the current code, it is tooo slow to finish even 100 iterations within 2 hours on a mac pro. Though I’m not sure whether it’s a problem of the model or a computation/coding issue.

there is a profiling tool in more recent stan versions…so consider switching to cmdstanr and use the latest 2.28.2 which allows to define profile blocks.

There’s a couple things here

1. You are doing everything row-wise while stan, like R and fortran, is column major order. Doing access in the wrong order essentially makes the computer fight itself to run your program.

2. There’s several places that should be vectorized such as

for(p in 1:P){
Tau2[p] ~ inv_gamma(0.1, 0.1);
}

to

Tau2 ~ inv_gamma(0.1, 0.1);

(I’d also check inv gamma is a good prior for this model as the graph looks kind of odd)

1. idt this needs to be multi-normal which is very expensive.
Y[i] ~ multi_normal(W[i], diag_matrix(Tau2));

This can just be

To_vector(Y) ~ normal(to_vector(W), to_vector(rep_vector(N, Tau2)));

Edit: I’m on mobile now and am not sure if this line is right but it’s something like the above

1. Make things that are just data like Lsum once in the transformed data block instead of once every loop

Lsum = rep_row_vector(0, P);

1. Lsum seems to serve no purpose so I would just remove it

2. The line here

for(l in 1:L){                                    // to sum over l
W[i] = Lsum + A[l] * D1[l,i] + B[l] * D2[l,i];  // P by 1 row-vector after summing over l
}

This just overwrites the row in i multiple times. This is the same as writing

W[i] = Lsum + A[L] * D1[L,i] + B[L] * D2[L,i];  // P by 1 row-vector after summing over l

I think if you fix things to get rid of the row major order indexing and do some vectorization this should go a lot faster. I also recommend starting with a simpler model and then adding more to it as you verify your simpler model works.

The other big thing here is that keep in mind Stan is not doing just MCMC like gibbs but uses the NUTS sampler. Wall time is less relevant than something like Neff / second.

8 Likes

Wow, your detailed comment is super helpful! I did modify the code for column major order, which speed up quite a lot!

For comment #2, the Tau^2 is actually a vector, with your suggestion to remove the loop:

Will it get me a vector that each element is independently drew from inv_gamma? Or it will set all elements to the same value?

For comment #4&5, Lsum is reset to 0, for each i, so that W[i] could be a sum over the index l from 1 to L in the original code as below.

So I have modified the nested loop to the following, is it legit to use "W += " here?

for (i in 1:N){
W = rep_vector(0, P);                    // P by 1 col-vector of all 0
for(l in 1:L){                                   // to sum over l, get mean vector of vector Y_i
W  += D1[l,i] * col(A, l) + D2[l,i] * col(B, l);   // P by 1 col-vector after summing over l
}
col(Y,i) ~ normal(W, Tau2);
}
1 Like

Different elements for each!

Yep!

Depending on the size of L it may be beneficial to wrap that loop into a function and use reduce_sum() over the N columns.

Also if D1 is a matrix with only the diagonals filled out it would be better to have it as a

vector[P] D1[L,N];

and I recommend pre-computing

in the transformed data block (and use dot_product(omega[l], s[i]))so it’s only done once in the program and then re-used on each iteration of nuts.

EDIT: Also my mind is a little foggy and I’m not sure what your program looks like now, but looking at the line below I think there’s a way with some linear algebra you can make D1 into a matrix of arrays and change the below to just do a matrix x vector multiplication

for(l in 1:L){                                   // to sum over l, get mean vector of vector Y_i
W  += D1[l,i] * col(A, l) + D2[l,i] * col(B, l);   // P by 1 col-vector after summing over l
}

This has been mentioned tangentially a few times but I just wanted to emphasize it: “running super slow” isn’t a well-defined comparison for generic Markov transitions.

A Stan program specifies a target density function, typically interpreted as a posterior density function. The optimality of a Stan program itself refers only to how well the evaluation of that target density has been implicated. Once we consider automatic differentiation the main concern is the full reverse mode execution which evaluates the density and its gradient together. For example the optimizations suggested by @stevebronder all influence the implementation the original model but do not affect the model itself.

Stan’s dynamic Hamiltonian Monte Carlo sampler uses many gradient evaluations per Markov transition – up to 1024 with default settings. The more complicated the geometry of the target density function around a current point, and the worse the adaptation is configured, the more gradient evaluations will be needed for the next transition. Only changing the model, or sometimes tweaking the adaptation routine, can change the number of gradient evaluations per iteration.

When looking at the total speed of running a certain number of iterations one cannot disentangle the number of gradient evaluations from the cost of each evaluation. To really understand the bottleneck one needs to look at these behaviors individually.

At the beginning of a Stan run the cost of evaluating the gradient is estimated and shared. One can also estimate the cost per gradient evaluation by dividing the total run time by the number of gradient evaluations, which is available through the get_sampler_params function; see for example Identity Crisis. The profile functionality in the most recent release of Stan will give the most precise timings.

In summary comparison between the cost of a gradient evaluation from a Stan program and a hand-written implementation can inform inefficient Stan programming. If those costs are similar, however, then any longer run time with Stan is due to longer Hamiltonian trajectories due to complex target geometries. Stan will push these trajectories as far as it can to maximize the efficiency of the statistical exploration, and that statistical performance cannot be quantified by looking a run times alone.

1 Like

Just wanted to add that “super slow” might need some additional details: what do you compare? A common problem with naive comparison of Stan to say Gibbs/Metropolis/… is measuring the time it takes to perform N iterations in each. But Stan does usually produce much “better” (less autocorrelated) samples than simpler algorithms (while using more computation time per sample). So the relevant quantity is not time per iteration but something like effective samples (as in Bulk effective sample size (bulk-ESS) — ess_bulk • posterior) per second or time to reach given effective sample size.

Another concern is that before you compare speed you need to compare correctness. If the model is difficult, it is possible that your sampler is not computing the correct posterior while Stan takes a long time but computes a correct one (or at least lets you know it failed via high Rhat/low ESS/divergent transitions/…).

If it is not super expensive to run your sampler, you can check its correctness via simulation-based calibration - the SBC package (which I develop) lets you do this relatively easily, see e.g. the intro vignette for basic concepts and Implementing a new backend (algorithm) + frequentist approximations • SBC for connecting SBC to your sampler. The relevant paper with full background is [1804.06788] Validating Bayesian Inference Algorithms with Simulation-Based Calibration

Best of luck with your model!

2 Likes