QR decomposition and reduce_sum undercutting each other's runtime gains?

Dear All,

I have been experimenting with Stan 2.23’s reduce_sum functionality to multi-thread a quantile regression model I am running. In parallel, I have looked into the gains from applying the QR decomposition to the same regression. What confuses me is that while both tricks yield considerable runtime gains on their own, when combined the runtime is even longer than for the barebones model that uses neither technique.

The runtimes are as follows: barebones model 96 seconds, QR decomposition 50 seconds, reduce_sum 76 seconds, and QR + reduce_sum 110 seconds. I understand that I am not making the most of the reduce_sum feature on my local machine with only 4 cores, but I am testing all of this before moving it into the VM environment I set up.

Could anyone shed light on this and, if at all possible, on what I am doing or understanding wrong?

The generated data I am using is repackaged differently for each variant (transposed X matrix for the QR decomposition, grainsize added for reduce_sum, …); a sketch of the repackaging follows the data-generation code below. All models are built through cmdstan_model().

library(tidyverse) # for the %>% pipe and select()

df1 <- data.frame(x <- runif(5e+02, 0, 1),
                  x2 <- x^2,
                  y <- rnorm(5e+02, 1 + 2*x - 1*x2, x)) %>%
  select(y=3,
         x=1,
         x2=2)
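
To make the repackaging concrete, here is a minimal sketch of the two data lists (the list names are illustrative); the QR variants take the transposed X, and the reduce_sum variants additionally need a grainsize:

qrr_data <- list(N = nrow(df1),
                 M = 2,                # predictors: x and x2
                 K = 3,                # quantiles
                 tau = seq(0.25, 0.75, 0.25),
                 X = data.matrix(df1[, c("x", "x2")]),
                 y = df1$y,
                 grainsize = 1)        # grainsize = 1 lets the scheduler choose slice sizes

qrr_data_t <- qrr_data
qrr_data_t$X <- t(qrr_data$X)          # the QR models expect X as matrix[M, N]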

Barebones model (adapted from Corson Areshenkoff’s blog post), runtime approximately 96 seconds.
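
For reference, q_loss and ald_lpdf below implement the check loss and the corresponding (unscaled) asymmetric Laplace log-density:

$$\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}[u < 0]\bigr), \qquad \log f(y \mid \mu, \tau) = \log \tau + \log(1 - \tau) - \rho_\tau(y - \mu)$$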

functions{
  real q_loss(real u, real tau){
    return (u * (tau - ((u < 0) ? 1 : 0)));
  }
  real ald_lpdf(real y, real mu, real tau){
    return log(tau) + log(1 - tau)
    - q_loss(y - mu, tau);
  }
}
data {
  int<lower=1> N; //number of observations
  int<lower=1> M; //number of predictors
  int<lower=1> K; //number of quantiles
  real<lower=0, upper=1> tau[K]; //quantiles
  matrix[N, M] X; // predictor matrix
  real y[N]; //response variable
}
parameters{
  vector[M] beta[K]; //parameter
  real alpha[K];
}
model{
  for(n in 1:N){
    for(k in 1:K){
      target += ald_lpdf(y[n] | (alpha[k] + dot_product(beta[k], X[n])), tau[k]);
    }
  }
}

Centered QR decomposition (as in Michael Betancourt’s blog post), clocks in at 50 seconds.
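
As a reminder of why the transformed parameters block recovers beta: writing the (scaled) thin decomposition of the centered design as $X_c^\top = QR$, the linear predictor satisfies

$$X_c^\top \beta = Q(R\beta) = Q\tilde\beta, \qquad \text{so} \qquad \beta = R^{-1}\tilde\beta.$$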

functions{
  real q_loss(real u, real tau){
    return (u * (tau - ((u < 0) ? 1 : 0)));
  }
  real ald_lpdf(real y, real mu, real tau){
    return log(tau) + log(1 - tau)
    - q_loss(y - mu, tau);
  }
}
data {
  int<lower=1> N; //number of observations
  int<lower=1> M; //number of predictors
  int<lower=1> K; //number of quantiles
  real<lower=0, upper=1> tau[K]; //quantiles
  matrix[M, N] X; // predictor matrix
  real y[N]; //response variable
}
transformed data {
  matrix[M, N] X_centered;
  matrix[N, M] Q;
  matrix[M, M] R;
  matrix[M, M] R_inv;
  
  for (m in 1:M){
    X_centered[m] = X[m] - mean(X[m]);
  }
  
  // Compute, thin, and then scale QR decomposition
  Q = qr_Q(X_centered')[, 1:M] * N;
  R = qr_R(X_centered')[1:M, ] / N;
  R_inv = inverse(R);
}
parameters {
  vector[M] beta_tilde[K];
  real alpha[K];
}
transformed parameters {
  vector[M] beta[K];
  for(k in 1:K){
    beta[k] = R_inv * beta_tilde[k];
  }
}
model{
  for(n in 1:N){
    for(k in 1:K){
      target += ald_lpdf(y[n] | (alpha[k] + dot_product(beta_tilde[k], Q[n])), tau[k]);
    }
  }
}

Version without the centered QR decomposition, making use of reduce_sum, based on the Stan User’s Guide; runs for 75 seconds.
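
reduce_sum relies on the log-likelihood being a sum of independent per-observation terms, which can be accumulated in parallel over slices of the data:

$$\sum_{n=1}^{N} \sum_{k=1}^{K} \ell_{nk} \;=\; \sum_{j} \sum_{n \in \text{slice}_j} \sum_{k=1}^{K} \ell_{nk}$$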

functions{
  real q_loss(real u, real tau){
    return (u * (tau - ((u < 0) ? 1 : 0)));
  }
  real ald_lpdf(real y, real mu, real tau){
    //print("Y: ", y, "; Mu: ", mu, "; Tau: ", tau, ";");
    return log(tau) + log(1 - tau) - q_loss(y - mu, tau);
  }
  real partial_sum(real[] y_slice,
                   int start, int end,
                   real[] alpha, vector[] beta, matrix X, real[] tau,
                   int K){
    real target_tmp = 0.0;
    for(k in 1:K){
      for(n in start:end){
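        // y_slice contains y[start:end], so global index n maps to local index n - start + 1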
        target_tmp += ald_lpdf(y_slice[n-start+1] | (alpha[k] + dot_product(beta[k], X[n])), tau[k]);
      }
    }
    return target_tmp;
  }
}
data {
  int<lower=1> N; //number of observations
  int<lower=1> M; //number of predictors
  int<lower=1> K; //number of quantiles
  real<lower=0, upper=1> tau[K]; //quantiles
  matrix[N, M] X; // predictor matrix
  real y[N]; //response variable
  int<lower=1> grainsize;
}
parameters{
  vector[M] beta[K]; //parameter
  real alpha[K];
}
model{
  target += reduce_sum(partial_sum, y, grainsize,
                       alpha, beta, X, tau,
                       K);
}
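
For anyone reproducing this, a minimal sketch of how a threaded model is compiled and run with cmdstanr (the file name is illustrative; recent cmdstanr versions request threads per chain at sampling time rather than through set_num_threads):

library(cmdstanr)

# Threading support must be enabled at compile time
mod <- cmdstan_model("qrr_reduce_sum.stan",
                     cpp_options = list(stan_threads = TRUE))

# Recent cmdstanr: set the thread count per chain when sampling
fit <- mod$sample(data = qrr_1_data,
                  chains = 4,
                  threads_per_chain = 4)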

And finally, the combined version, which takes 110 seconds and is thus even slower than the barebones model:

functions{
  real q_loss(real u, real tau){
    return (u * (tau - ((u < 0) ? 1 : 0)));
  }
  real ald_lpdf(real y, real mu, real tau){
    return log(tau) + log(1 - tau)
    - q_loss(y - mu, tau);
  }
  real partial_sum(real[] y_slice,
                   int start, int end,
                   real[] alpha, vector[] beta_tilde, matrix Q, real[] tau,
                   int K){
    real target_tmp = 0.0;
    for(k in 1:K){
      for(n in start:end){
        target_tmp += ald_lpdf(y_slice[n-start+1] | (alpha[k] + dot_product(beta_tilde[k], Q[n])), tau[k]);
      }
    }
    return target_tmp;
  }
}
data {
  int<lower=1> N; //number of observations
  int<lower=1> M; //number of predictors
  int<lower=1> K; //number of quantiles
  real<lower=0, upper=1> tau[K]; //quantiles
  matrix[M, N] X; // predictor matrix
  real y[N]; //response variable
  int<lower=1> grainsize;
}
transformed data {
  matrix[M, N] X_centered;
  matrix[N, M] Q;
  matrix[M, M] R;
  matrix[M, M] R_inv;
  
  for (m in 1:M){
    X_centered[m] = X[m] - mean(X[m]);
  }
  
  // Compute, thin, and then scale QR decomposition
  Q = qr_Q(X_centered')[, 1:M] * N;
  R = qr_R(X_centered')[1:M, ] / N;
  R_inv = inverse(R);
}
parameters {
  vector[M] beta_tilde[K];
  real alpha[K];
}
transformed parameters {
  vector[M] beta[K];
  for(k in 1:K){
    beta[k] = R_inv * beta_tilde[k];
  }
}
model{
  target += reduce_sum(partial_sum, y, grainsize,
                       alpha, beta_tilde, Q, tau,
                       K);
}

Finally, the session info:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cmdstanr_0.0.0.9000 quantreg_5.55       SparseM_1.78        rstan_2.19.3       
 [5] StanHeaders_2.19.2  forcats_0.5.0       stringr_1.4.0       dplyr_0.8.5        
 [9] purrr_0.3.4         readr_1.3.1         tidyr_1.0.3         tibble_3.0.1       
[13] ggplot2_3.3.0       tidyverse_1.3.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6       lubridate_1.7.8    lattice_0.20-41    prettyunits_1.1.1 
 [5] ps_1.3.3           digest_0.6.25      assertthat_0.2.1   R6_2.4.1          
 [9] cellranger_1.1.0   backports_1.1.7    MatrixModels_0.4-1 reprex_0.3.0      
[13] stats4_4.0.0       httr_1.4.1         pillar_1.4.4       rlang_0.4.6       
[17] readxl_1.3.1       rstudioapi_0.11    callr_3.4.3        Matrix_1.2-18     
[21] checkmate_2.0.0    labeling_0.3       loo_2.2.0          munsell_0.5.0     
[25] broom_0.5.6        compiler_4.0.0     modelr_0.1.7       pkgconfig_2.0.3   
[29] pkgbuild_1.0.8     tidyselect_1.1.0   gridExtra_2.3      codetools_0.2-16  
[33] audio_0.1-7        matrixStats_0.56.0 fansi_0.4.1        crayon_1.3.4      
[37] dbplyr_1.4.3       withr_2.2.0        grid_4.0.0         nlme_3.1-147      
[41] jsonlite_1.6.1     gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0         
[45] posterior_0.0.2    magrittr_1.5       scales_1.1.1       cli_2.0.2         
[49] stringi_1.4.6      farver_2.0.3       fs_1.4.1           xml2_1.3.2        
[53] ellipsis_0.3.1     generics_0.0.2     vctrs_0.3.0        tools_4.0.0       
[57] glue_1.4.1         hms_0.5.3          abind_1.4-5        processx_3.4.2    
[61] parallel_4.0.0     inline_0.3.15      colorspace_1.4-1   rvest_0.3.5       
[65] beepr_1.3          haven_2.2.0       

This is way out of my depth, but I am tagging @bbbales2 in the hope that he has time to provide some feedback on this.

@patrickmokre Sorry for the delay in responding. This is weird.

Can you post code to turn df1 into data that can be passed to the Stan models? I would like to investigate this.


Yes, of course; my apologies for the late reply. Here is the full pipeline between data generation and running the first model.

library(tidyverse) # for the %>% pipe and select()
library(rstan)

df1 <- data.frame(x <- runif(5e+02, 0, 1),
                  x2 <- x^2,
                  y <- rnorm(5e+02, 1 + 2*x - 1*x2, x)) %>%
  select(y=3,
         x=1,
         x2=2)

qrr_1_data <-
  list(N=dim(df1)[1],
       M=dim(as.matrix(df1[,c(2:3)]))[2],
       K=length(seq(0.25, 0.75, 0.25)),
       tau=seq(0.25, 0.75, 0.25),
       X=df1[,c(2:3)] %>% data.matrix(),
       y=df1[,1])

qrr_1 <- stan_model("stan_03_qrr_1.stan")
system.time(qrr_1_fit <- sampling(qrr_1,
                                  data=qrr_1_data,
                                  iter=2000,
                                  warmup=1000,
                                  chains=4))
print(qrr_1_fit)

I ran this model on my computer with one chain and 4 threads and got:

Base model: 20s
QR: 11s
Threading: 17s
Threading + QR: 10s

That isn’t great for threading, but it’s different from your initial numbers.

Here is the script I ran, with some code at the bottom for printing the timings. I didn’t double-check that all the chains were mixing to the same result (I can get to this eventually):

library(cmdstanr)
library(tidyverse) # for the %>% pipe and select()

df1 <- data.frame(x <- runif(5e+02, 0, 1),
                  x2 <- x^2,
                  y <- rnorm(5e+02, 1 + 2*x - 1*x2, x)) %>%
  select(y=3,
         x=1,
         x2=2)

qrr_1_data <-
  list(N=dim(df1)[1],
       M=dim(as.matrix(df1[,c(2:3)]))[2],
       K=length(seq(0.25, 0.75, 0.25)),
       tau=seq(0.25, 0.75, 0.25),
       X=df1[,c(2:3)] %>% data.matrix(),
       y=df1[,1],
       grainsize = 1)

qrr_2_data = qrr_1_data
qrr_2_data$X = t(qrr_1_data$X) # It looks like the QR code takes a transposed X

set_num_threads(4)

m = cmdstan_model("m.stan", cpp_options = list(stan_threads = TRUE))
mq = cmdstan_model("mq.stan", cpp_options = list(stan_threads = TRUE))
mp = cmdstan_model("mp.stan", cpp_options = list(stan_threads = TRUE))
mpq = cmdstan_model("mpq.stan", cpp_options = list(stan_threads = TRUE))

fit1 = m$sample(data = qrr_1_data, chains = 1, cores = 1)
fit2 = mq$sample(data = qrr_2_data, chains = 1, cores = 1)
fit3 = mp$sample(data = qrr_1_data, chains = 1, cores = 1)
fit4 = mpq$sample(data = qrr_2_data, chains = 1, cores = 1)

fit1$time()$total
fit2$time()$total
fit3$time()$total
fit4$time()$total

Everything worked, except that I needed to transpose X to pass it to the QR code. Can you try to reproduce this using the script I posted?


Hi,
Indeed, the times check out; I’ll have to check the differences between your script and mine, though. The runtime gains also hold for more chains, and the beta estimates are consistent. The intercept terms differ between the QR-decomposed and non-QR-decomposed versions, but this must be due to the centering of the X matrix.
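
For intuition on the intercept shift: centering each predictor column by its mean makes the intercept absorb the mean effect,

$$\alpha_{\text{centered}} = \alpha + \bar{x}^\top \beta,$$

so the slope estimates agree while the intercepts differ by this offset.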

These are the results:

    variable             m           mq            mp          mpq
1       lp__ -2640.9397450 -2641.018465 -2641.0945050 -2640.897675
2   alpha[1]     0.9420926     1.317050     0.9418446     1.314959
3   alpha[2]     0.9846038     1.672231     0.9823121     1.668793
4   alpha[3]     1.0225634     2.035041     1.0269739     2.030561
5  beta[1,1]     1.8541424     1.890002     1.8486589     1.844029
6  beta[2,1]     2.2431359     2.269715     2.2540305     2.277254
7  beta[3,1]     2.8730527     2.852415     2.8279028     2.833837
8  beta[1,2]    -1.6733167    -1.709190    -1.6682788    -1.670023
9  beta[2,2]    -1.3387654    -1.370006    -1.3541820    -1.384277
10 beta[3,2]    -1.3332794    -1.299509    -1.2853914    -1.287499

Thank you so much!


I re-ran the estimations, adding versions with non-centered X matrices for both the “pure” QR decomposition and the QR decomposition (“QRD”) combined with the partial-sum optimization. This reduces the runtime further and leads to very similar parameter estimates across all procedures. As a next step, I will bring this comparison to a larger dataset and report back.

Runtimes:

         m       mq      mq2       mp      mpq     mpq2
1 83.26845 53.22989 39.02749 66.85869 37.92012 27.01449 

Parameter Estimates:

         var             m           mq           mq2            mp          mpq          mpq2
1       lp__ -2619.7286700 -2619.879730 -2619.7917350 -2619.8946550 -2620.003020 -2619.7577450
2   alpha[1]     0.9514278     1.304908     0.9511056     0.9508009     1.304154     0.9450342
3   alpha[2]     0.9643038     1.664485     0.9669359     0.9599234     1.667383     0.9661077
4   alpha[3]     0.9819285     1.993653     0.9884846     0.9907698     1.996367     0.9855001
5  beta[1,1]     1.7130072     1.715811     1.7138846     1.7086123     1.722444     1.7386835
6  beta[2,1]     2.3958376     2.390738     2.3866548     2.4158604     2.412941     2.3917538
7  beta[3,1]     3.0189695     2.974214     2.9750865     2.9489679     2.988294     2.9867562
8  beta[1,2]    -1.5322860    -1.528503    -1.5405695    -1.5274898    -1.535095    -1.5582014
9  beta[2,2]    -1.5080936    -1.507999    -1.5071372    -1.5300649    -1.525162    -1.5087138
10 beta[3,2]    -1.5186338    -1.464896    -1.4663589    -1.4389903    -1.471531    -1.4827606

Non-centered QRD Stan code:

functions{
  real q_loss(real u, real tau){
    return (u * (tau - ((u < 0) ? 1 : 0)));
  }
  real ald_lpdf(real y, real mu, real tau){
    return log(tau) + log(1 - tau)
    - q_loss(y - mu, tau);
  }
}
data {
  int<lower=1> N; //number of observations
  int<lower=1> M; //number of predictors
  int<lower=1> K; //number of quantiles
  real<lower=0, upper=1> tau[K]; //quantiles
  matrix[M, N] X; // predictor matrix
  real y[N]; //response variable
}
transformed data {
  matrix[N, M] Q;
  matrix[M, M] R;
  matrix[M, M] R_inv;
  
  // Compute, thin, and then scale QR decomposition
  Q = qr_Q(X')[, 1:M] * N;
  R = qr_R(X')[1:M, ] / N;
  R_inv = inverse(R);
}
parameters {
  vector[M] beta_tilde[K];
  real alpha[K];
}
transformed parameters {
  vector[M] beta[K];
  for(k in 1:K){
    beta[k] = R_inv * beta_tilde[k];
  }
}
model{
  for(n in 1:N){
    for(k in 1:K){
      target += ald_lpdf(y[n] | (alpha[k] + dot_product(beta_tilde[k], Q[n])), tau[k]);
    }
  }
}

Non-centered QRD Stan code including the partial-sum optimization:

functions{
  real q_loss(real u, real tau){
    return (u * (tau - ((u < 0) ? 1 : 0)));
  }
  real ald_lpdf(real y, real mu, real tau){
    return log(tau) + log(1 - tau)
    - q_loss(y - mu, tau);
  }
  real partial_sum(real[] y_slice,
                   int start, int end,
                   real[] alpha, vector[] beta_tilde, matrix Q, real[] tau,
                   int K){
    real target_tmp = 0.0;
    for(k in 1:K){
      for(n in start:end){
        target_tmp += ald_lpdf(y_slice[n-start+1] | (alpha[k] + dot_product(beta_tilde[k], Q[n])), tau[k]);
      }
    }
    return target_tmp;
  }
}
data {
  int<lower=1> N; //number of observations
  int<lower=1> M; //number of predictors
  int<lower=1> K; //number of quantiles
  real<lower=0, upper=1> tau[K]; //quantiles
  matrix[M, N] X; // predictor matrix
  real y[N]; //response variable
  int<lower=1> grainsize;
}
transformed data {
  matrix[N, M] Q;
  matrix[M, M] R;
  matrix[M, M] R_inv;
  
  // Compute, thin, and then scale QR decomposition
  Q = qr_Q(X')[, 1:M] * N;
  R = qr_R(X')[1:M, ] / N;
  R_inv = inverse(R);
}
parameters {
  vector[M] beta_tilde[K];
  real alpha[K];
}
transformed parameters {
  vector[M] beta[K];
  for(k in 1:K){
    beta[k] = R_inv * beta_tilde[k];
  }
}
model{
  target += reduce_sum(partial_sum, y, grainsize,
                       alpha, beta_tilde, Q, tau,
                       K);
}