Dear All,
I have been experimenting with Stan 2.23's reduce_sum functionality to multi-thread a quantile regression model I am running. In parallel, I have looked into the gains from applying a QR decomposition to the same regression. What confuses me is that while each trick yields a considerable runtime gain on its own, the combination is even slower than the barebones model that uses neither technique.
The runtimes are as follows:

- Barebones model: 96 seconds
- QR decomposition: 50 seconds
- reduce_sum: 76 seconds
- QR + reduce_sum: 110 seconds

I understand that I am not making the most of reduce_sum on my local machine with only 4 cores, but I am trying all this in preparation for moving it into the VM environment I have set up.
Could anyone shed light on this and, if at all possible, point out what I am doing or understanding wrong?
The generated data is repackaged differently for each model (transposed X matrix for the QR decomposition, a grainsize added for reduce_sum, …). All models are built through cmdstanr's cmdstan_model().
# simulate 500 observations with a quadratic mean and heteroskedastic noise;
# defining x first avoids the accidental global assignments that `<-` inside
# data.frame() produces (and the subsequent select() renaming)
x <- runif(500, 0, 1)
df1 <- data.frame(y  = rnorm(500, 1 + 2 * x - 1 * x^2, x),
                  x  = x,
                  x2 = x^2)
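For context, the corresponding data lists differ only in packaging; here is a minimal sketch (the number of quantiles, the quantile levels, and the grainsize are placeholders rather than the exact values behind the timings):

X <- as.matrix(df1[, c("x", "x2")])

stan_data <- list(N = nrow(df1), M = ncol(X), K = 3,
                  tau = c(0.25, 0.50, 0.75),  # placeholder quantile levels
                  X = X,                      # the QR models receive t(X) instead
                  y = df1$y)

# the reduce_sum variants additionally expect a grainsize
stan_data_rs <- c(stan_data, list(grainsize = 1))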
Barebones quantile regression model (adapted from Corson Areshenkoff's blog post), runtime approximately 96 seconds:
functions {
  // quantile ("pinball") loss: u * (tau - I[u < 0])
  real q_loss(real u, real tau) {
    return u * (tau - ((u < 0) ? 1 : 0));
  }
  // asymmetric Laplace log density with unit scale
  real ald_lpdf(real y, real mu, real tau) {
    return log(tau) + log(1 - tau) - q_loss(y - mu, tau);
  }
}
data {
  int<lower=1> N;                // number of observations
  int<lower=1> M;                // number of predictors
  int<lower=1> K;                // number of quantiles
  real<lower=0, upper=1> tau[K]; // quantiles
  matrix[N, M] X;                // predictor matrix
  real y[N];                     // response variable
}
parameters {
  vector[M] beta[K];             // slopes, one vector per quantile
  real alpha[K];                 // intercepts, one per quantile
}
model {
  for (n in 1:N) {
    for (k in 1:K) {
      target += ald_lpdf(y[n] | alpha[k] + dot_product(beta[k], X[n]), tau[k]);
    }
  }
}
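For reference, ald_lpdf above is the log density of the asymmetric Laplace distribution (with unit scale) that underpins Bayesian quantile regression, built from the pinball loss:

$$
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}[u < 0]\bigr),
\qquad
\log f(y \mid \mu, \tau) = \log\tau + \log(1-\tau) - \rho_\tau(y - \mu),
$$

so maximizing this likelihood in $\mu$ is equivalent to minimizing the quantile loss at level $\tau$.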
Centered QR decomposition (as in Michael Betancourt's blog post) clocks in at 50 seconds:
functions {
  // quantile ("pinball") loss: u * (tau - I[u < 0])
  real q_loss(real u, real tau) {
    return u * (tau - ((u < 0) ? 1 : 0));
  }
  // asymmetric Laplace log density with unit scale
  real ald_lpdf(real y, real mu, real tau) {
    return log(tau) + log(1 - tau) - q_loss(y - mu, tau);
  }
}
data {
  int<lower=1> N;                // number of observations
  int<lower=1> M;                // number of predictors
  int<lower=1> K;                // number of quantiles
  real<lower=0, upper=1> tau[K]; // quantiles
  matrix[M, N] X;                // predictor matrix (transposed)
  real y[N];                     // response variable
}
transformed data {
  matrix[M, N] X_centered;
  matrix[N, M] Q;
  matrix[M, M] R;
  matrix[M, M] R_inv;
  // center each predictor
  for (m in 1:M) {
    X_centered[m] = X[m] - mean(X[m]);
  }
  // compute, thin, and then scale the QR decomposition
  Q = qr_Q(X_centered')[, 1:M] * N;
  R = qr_R(X_centered')[1:M, ] / N;
  R_inv = inverse(R);
}
parameters {
  vector[M] beta_tilde[K];       // slopes on the Q basis
  real alpha[K];                 // intercepts
}
transformed parameters {
  vector[M] beta[K];             // slopes on the original scale
  for (k in 1:K) {
    beta[k] = R_inv * beta_tilde[k];
  }
}
model {
  for (n in 1:N) {
    for (k in 1:K) {
      target += ald_lpdf(y[n] | alpha[k] + dot_product(beta_tilde[k], Q[n]), tau[k]);
    }
  }
}
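To spell out the reparameterization: with the thin decomposition of the centered design, $X_c' = QR$, the linear predictor can be rewritten on the orthogonal basis,

$$
X_c'\,\beta = Q R \beta = Q \tilde\beta,
\qquad
\tilde\beta = R\beta
\quad\Longleftrightarrow\quad
\beta = R^{-1}\tilde\beta,
$$

which is why the model samples $\tilde\beta$ against $Q$ and recovers $\beta$ in transformed parameters, leaving the posterior unchanged while decorrelating the design. (The code scales by $N$ where Betancourt's post uses $\sqrt{N-1}$; this is harmless as long as $Q$ and $R$ are rescaled consistently, since $QN \cdot R/N = QR$.)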
A version without the centered QR decomposition, making use of reduce_sum (based on the Stan user documentation), runs for 75 seconds:
functions {
  // quantile ("pinball") loss: u * (tau - I[u < 0])
  real q_loss(real u, real tau) {
    return u * (tau - ((u < 0) ? 1 : 0));
  }
  // asymmetric Laplace log density with unit scale
  real ald_lpdf(real y, real mu, real tau) {
    return log(tau) + log(1 - tau) - q_loss(y - mu, tau);
  }
  // partial sum over a slice of the observations, for reduce_sum
  real partial_sum(real[] y_slice,
                   int start, int end,
                   real[] alpha, vector[] beta, matrix X, real[] tau,
                   int K) {
    real target_tmp = 0.0;
    for (k in 1:K) {
      for (n in start:end) {
        target_tmp += ald_lpdf(y_slice[n - start + 1] | alpha[k] + dot_product(beta[k], X[n]), tau[k]);
      }
    }
    return target_tmp;
  }
}
data {
  int<lower=1> N;                // number of observations
  int<lower=1> M;                // number of predictors
  int<lower=1> K;                // number of quantiles
  real<lower=0, upper=1> tau[K]; // quantiles
  matrix[N, M] X;                // predictor matrix
  real y[N];                     // response variable
  int<lower=1> grainsize;        // suggested slice size for reduce_sum
}
parameters {
  vector[M] beta[K];             // slopes, one vector per quantile
  real alpha[K];                 // intercepts
}
model {
  target += reduce_sum(partial_sum, y, grainsize,
                       alpha, beta, X, tau,
                       K);
}
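For completeness, this is roughly how I compile and run the threaded variants (the file name is a stand-in, and the argument names follow the current cmdstanr interface):

library(cmdstanr)

# threading must be switched on at compile time for reduce_sum to
# actually run in parallel
mod_rs <- cmdstan_model("quantile_reduce_sum.stan",
                        cpp_options = list(stan_threads = TRUE))

# run chains sequentially so all four local cores go to reduce_sum
fit_rs <- mod_rs$sample(data = stan_data_rs,
                        chains = 4,
                        parallel_chains = 1,
                        threads_per_chain = 4)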
And finally, the combined version, which takes 110 seconds and is thus even slower than the barebones model:
functions {
  // quantile ("pinball") loss: u * (tau - I[u < 0])
  real q_loss(real u, real tau) {
    return u * (tau - ((u < 0) ? 1 : 0));
  }
  // asymmetric Laplace log density with unit scale
  real ald_lpdf(real y, real mu, real tau) {
    return log(tau) + log(1 - tau) - q_loss(y - mu, tau);
  }
  // partial sum over a slice of the observations, for reduce_sum
  real partial_sum(real[] y_slice,
                   int start, int end,
                   real[] alpha, vector[] beta_tilde, matrix Q, real[] tau,
                   int K) {
    real target_tmp = 0.0;
    for (k in 1:K) {
      for (n in start:end) {
        target_tmp += ald_lpdf(y_slice[n - start + 1] | alpha[k] + dot_product(beta_tilde[k], Q[n]), tau[k]);
      }
    }
    return target_tmp;
  }
}
data {
  int<lower=1> N;                // number of observations
  int<lower=1> M;                // number of predictors
  int<lower=1> K;                // number of quantiles
  real<lower=0, upper=1> tau[K]; // quantiles
  matrix[M, N] X;                // predictor matrix (transposed)
  real y[N];                     // response variable
  int<lower=1> grainsize;        // suggested slice size for reduce_sum
}
transformed data {
  matrix[M, N] X_centered;
  matrix[N, M] Q;
  matrix[M, M] R;
  matrix[M, M] R_inv;
  // center each predictor
  for (m in 1:M) {
    X_centered[m] = X[m] - mean(X[m]);
  }
  // compute, thin, and then scale the QR decomposition
  Q = qr_Q(X_centered')[, 1:M] * N;
  R = qr_R(X_centered')[1:M, ] / N;
  R_inv = inverse(R);
}
parameters {
  vector[M] beta_tilde[K];       // slopes on the Q basis
  real alpha[K];                 // intercepts
}
transformed parameters {
  vector[M] beta[K];             // slopes on the original scale
  for (k in 1:K) {
    beta[k] = R_inv * beta_tilde[k];
  }
}
model {
  target += reduce_sum(partial_sum, y, grainsize,
                       alpha, beta_tilde, Q, tau,
                       K);
}
For completeness, the session info:
> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cmdstanr_0.0.0.9000 quantreg_5.55 SparseM_1.78 rstan_2.19.3
[5] StanHeaders_2.19.2 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5
[9] purrr_0.3.4 readr_1.3.1 tidyr_1.0.3 tibble_3.0.1
[13] ggplot2_3.3.0 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 lubridate_1.7.8 lattice_0.20-41 prettyunits_1.1.1
[5] ps_1.3.3 digest_0.6.25 assertthat_0.2.1 R6_2.4.1
[9] cellranger_1.1.0 backports_1.1.7 MatrixModels_0.4-1 reprex_0.3.0
[13] stats4_4.0.0 httr_1.4.1 pillar_1.4.4 rlang_0.4.6
[17] readxl_1.3.1 rstudioapi_0.11 callr_3.4.3 Matrix_1.2-18
[21] checkmate_2.0.0 labeling_0.3 loo_2.2.0 munsell_0.5.0
[25] broom_0.5.6 compiler_4.0.0 modelr_0.1.7 pkgconfig_2.0.3
[29] pkgbuild_1.0.8 tidyselect_1.1.0 gridExtra_2.3 codetools_0.2-16
[33] audio_0.1-7 matrixStats_0.56.0 fansi_0.4.1 crayon_1.3.4
[37] dbplyr_1.4.3 withr_2.2.0 grid_4.0.0 nlme_3.1-147
[41] jsonlite_1.6.1 gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0
[45] posterior_0.0.2 magrittr_1.5 scales_1.1.1 cli_2.0.2
[49] stringi_1.4.6 farver_2.0.3 fs_1.4.1 xml2_1.3.2
[53] ellipsis_0.3.1 generics_0.0.2 vctrs_0.3.0 tools_4.0.0
[57] glue_1.4.1 hms_0.5.3 abind_1.4-5 processx_3.4.2
[61] parallel_4.0.0 inline_0.3.15 colorspace_1.4-1 rvest_0.3.5
[65] beepr_1.3 haven_2.2.0