I’m seeing poor performance when using reduce_sum to parallelize the likelihood computation (much worse with 4 threads than with 1), and the problem persists even as the problem size grows. I’m hoping to gain some insight into this behavior and hopefully improve the threaded performance.
Toy data
This isn’t my actual application, but it’s a simple model that produces the same problematic results. The generative model here is
## variance components:
v_beta, v_resid ~ independent half-Cauchy(0, 1) priors
## M-dimensional latent random effects
beta | var components ~ iid Gaussian(0, sqrt(v_beta / M))
## N-dimensional outcome
y | beta, var components ~ indep Gaussian(X beta, sqrt(v_resid))
where X is a deterministic N × M design matrix with standardized columns.
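Written out (a LaTeX transcription of the spec above; the half-Cauchy reflects the positivity constraints in the Stan code below):

$$
\begin{aligned}
v_\beta,\ v_\text{resid} &\sim \text{Cauchy}^{+}(0, 1) \\
\beta_m \mid v_\beta &\overset{\text{iid}}{\sim} \text{Normal}\!\left(0,\ \sqrt{v_\beta / M}\right), \quad m = 1, \dots, M \\
y_n \mid \beta, v_\text{resid} &\overset{\text{indep}}{\sim} \text{Normal}\!\left(x_n^\top \beta,\ \sqrt{v_\text{resid}}\right), \quad n = 1, \dots, N.
\end{aligned}
$$

A toy dataset can be generated in R using: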
## generate toy data: N = 500, M = 200
set.seed(1)
N <- 500
M <- 200
v_resid <- 0.6
v_beta <- 0.4
X <- apply(matrix(rnorm(N * M), N, M), 2, scale)  # standardized columns
beta <- rnorm(M, 0, sqrt(v_beta / M))
resid <- rnorm(N, 0, sqrt(v_resid))
y <- as.vector(X %*% beta + resid)
stan_data <- list(M = M, N = N, y = y, X = X, grainsize = 1)
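As a quick sanity check (my own addition, not needed for fitting), the empirical variance decomposition should roughly match the simulation settings:

## signal and noise variances should come out near v_beta and v_resid
var(as.vector(X %*% beta))  # roughly 0.4
var(resid)                  # roughly 0.6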
Stan code
Here is my Stan implementation. The likelihood of y | . portion of the model is the part I will be parallelizing, and the problem size scales with N. The model block contains two lpdf statements for the likelihood; one is commented out to produce the sequential version and the other to produce the parallel version.
functions {
  // partial sum of the log likelihood over a contiguous slice of y;
  // reduce_sum supplies the slice along with its start/end indices
  real partial_sum(array[] real y_slice,
                   int start, int end,
                   real resid_sd_par,
                   array[] real linpred) {
    return normal_lpdf(y_slice | linpred[start:end], resid_sd_par);
  }
}
data {
  int<lower=0> N;         // number of individuals
  int<lower=0> M;         // number of predictors
  array[N] real y;        // outcome
  matrix[N, M] X;         // predictors
  int<lower=1> grainsize; // grainsize for reduce_sum
}
parameters {
  real<lower=0> v_beta;  // expected SS of random effects
  real<lower=0> v_resid; // residual variance component
  vector[M] beta;        // latent random effects
}
transformed parameters {
  real sdb = sqrt(v_beta / M);              // sd of each beta
  real resid_sd = sqrt(v_resid);            // noise standard deviation
  array[N] real Xb = to_array_1d(X * beta); // linear predictors
}
model {
  profile("priors") {
    // priors on variance components
    target += cauchy_lpdf(v_beta | 0, 1.0);
    target += cauchy_lpdf(v_resid | 0, 1.0);
  }
  profile("beta_lpdf") {
    // effects | genotypes, variance components
    target += normal_lpdf(beta | 0, sdb);
  }
  profile("likelihood of y | .") {
    // ONE OF THESE WILL BE COMMENTED OUT
    // lmm_test.par.stan version:
    target += reduce_sum(partial_sum, y, grainsize, resid_sd, Xb);
    // lmm_test.seq.stan version:
    target += normal_lpdf(y | Xb, resid_sd);
  }
}
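For reference, reduce_sum should be a pure reorganization of the same computation: it splits the sliced argument y into contiguous blocks and accumulates the partial sums, so both versions evaluate the identical log likelihood

$$
\log p(y \mid Xb, \sigma) \;=\; \sum_j \sum_{n = \text{start}_j}^{\text{end}_j} \log \text{Normal}\!\left(y_n \mid Xb_n,\ \sigma\right),
$$

with σ = resid_sd.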
The only difference between the two models is whether the likelihood is evaluated with normal_lpdf or with reduce_sum.
Performance
Despite higher CPU usage, the parallel model is much slower. Here are timings as N increases from 500 to 5,000 to 20,000:
library(cmdstanr)

## compile models
mod_par <- cmdstan_model('lmm_test.par.stan', cpp_options = list(stan_threads = TRUE))
mod_seq <- cmdstan_model('lmm_test.seq.stan', cpp_options = list(stan_threads = FALSE))
## sequential model N=500, M=200
# 1.8 seconds, CPU usage up to 100%
mod_fit_seq <- mod_seq$sample(data=stan_data,threads_per_chain=1,chains=1)
## parallel model N=500, M=200
# 8 seconds, CPU usage up to 392%
mod_fit_par <- mod_par$sample(data=stan_data,threads_per_chain=4,chains=1)
## generate toy data N=5000; M=200
set.seed(1);N=5000;M=200;v_resid=.6;v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)
## sequential model N=5000, M=200
# 20.3 seconds, CPU usage up to 100%
mod_fit_seq <- mod_seq$sample(data=stan_data,threads_per_chain=1,chains=1)
## parallel model N=5000, M=200
# 96.9 seconds, CPU usage up to 320%
mod_fit_par <- mod_par$sample(data=stan_data,threads_per_chain=4,chains=1)
## generate toy data N=20000; M=200; fewer iterations because I'm impatient
set.seed(1);N=20000;M=200;v_resid=.6;v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)
## sequential model N=20000, M=200
# 33.3 seconds, CPU usage up to 100%
mod_fit_seq <- mod_seq$sample(data=stan_data,threads_per_chain=1,chains=1,
iter_warmup=400, iter_sampling=0)
## parallel model N=20000, M=200
# 97.2 seconds, CPU usage up to 316%
mod_fit_par <- mod_par$sample(data=stan_data,threads_per_chain=4,chains=1,
iter_warmup=400, iter_sampling=0)
The profiles look like this:
## comparing the profiles:
mod_fit_seq$profiles()
[[1]]
name thread_id total_time forward_time reverse_time
1 beta_lpdf 140177091551360 0.02311280 0.01950600 0.003606730
2 likelihood of y | . 140177091551360 1.26541000 1.01081000 0.254603000
3 priors 140177091551360 0.00609434 0.00547151 0.000622826
chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 9329 0 9329 1
2 9329 0 9329 1
3 18658 0 9329 1
mod_fit_par$profiles()
[[1]]
name thread_id total_time forward_time reverse_time
1 beta_lpdf 140237382762624 0.02608020 0.0184004 0.007679830
2 likelihood of y | . 140237382762624 69.50300000 69.2208000 0.282135000
3 priors 140237382762624 0.00761505 0.0068122 0.000802849
chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 7437 0 7437 1
2 7437 0 7437 1
3 14874 0 7437 1
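To pull out just the likelihood row for side-by-side comparison, a small convenience helper (my own, not part of cmdstanr):

## extract the timing row for the likelihood profile block
lik_profile <- function(fit) {
  p <- fit$profiles()[[1]]
  p[p$name == "likelihood of y | .", c("total_time", "forward_time", "reverse_time")]
}
lik_profile(mod_fit_seq)
lik_profile(mod_fit_par)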
Increasing N seems to mitigate the problem only slightly, which suggests to me that this isn't about fixed per-call overhead. Curiously, CPU utilization, eyeballed from a system monitor on a Skylake machine with 4 physical cores, appears to decline modestly as the problem size goes up.
For what it’s worth, playing around with grainsize doesn’t seem to help, though I’ll spare you the sea of output.
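For concreteness, the kind of sweep I ran looks like this (the grainsize values shown are illustrative, not the exhaustive set I tried):

## larger grainsize means fewer, bigger slices per reduce_sum call
for (g in c(1, 100, 1000, N %/% 4)) {
  stan_data$grainsize <- g
  mod_par$sample(data = stan_data, threads_per_chain = 4, chains = 1,
                 iter_warmup = 400, iter_sampling = 0)
}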
Am I making an obvious mistake here? Any insights are greatly appreciated!