Matrix operations (1x1) run 10x slower than using reals

Hello, I am having very slow performance for time series modeling that involves matrix operations. I am showing a 10x slowdown over the equivalent operations with reals.
I have searched topics and found that perhaps it has to do with the compiler, but I don’t really know anything about compilers or configuring them. I’ve given some information about my setup at the end of this post.

The performance drop with matrix algebra is fairly prohibitive of the hierarchical state-space modeling that I am trying to do, which requires the kalman-bucy matrix operations at each time t.

I have modified the arma11.stan demo to create a minimal example. I added a simple line of algebra in the for loop. In one model, it operates on reals, and in the other, it operates on 1x1 matrices. The slowdown from 1.2s to 13s seems disproportionate.

The problem is identical between Rtools 35 and Rtools 40 / R 3.6.3 and R 4.0.0.
Is this typical, and is there any way to speed up matrix operations?

Model with reals:

data {
  int<lower=1> T;       // number of observations
  real y[T];            // observed outputs
}
parameters {
  real mu;              // mean term
  real phi;             // autoregression coeff
  real<lower=0> sigma;  // noise scale
}
model {
  vector[T] nu;         // prediction for time t
  vector[T] err;        // error for time t
  
 //ADDED TEST VARS
  real b[T];
  real a;
  a = 1.0;
  b[1] =1.0;
  
  nu[1] <- mu + phi * mu;   // assume err[0] == 0
  err[1] <- y[1] - nu[1];
  for (t in 2:T) {
//ADDED TEST ALGEBRA
  	b[t] = a*a'+t;
	
    nu[t] <- mu + phi * y[t-1];
    err[t] <- y[t] - nu[t];
  }
// priors
  mu ~ normal(0,10);
  phi ~ normal(0,2);
  sigma ~ cauchy(0,5);
// likelihood
  err ~ normal(0,sigma);
}

Model with 1x1 matrices:

data {
  int<lower=1> T;       // number of observations
  real y[T];            // observed outputs
}
parameters {
  real mu;              // mean term
  real phi;             // autoregression coeff
  real<lower=0> sigma;  // noise scale
}
model {
  vector[T] nu;         // prediction for time t
  vector[T] err;        // error for time t

 //ADDED TEST VARS
  matrix[1,1] b[T];
  matrix[1,1] a;
  a[1,1]= 1.0;
  b[1][1,1] =1.0;
   
  nu[1] <- mu + phi * mu;   // assume err[0] == 0
  err[1] <- y[1] - nu[1];
  for (t in 2:T) {
//ADDED TEST ALGEBRA
	b[t] = a*a'+t;

    nu[t] <- mu + phi * y[t-1];
    err[t] <- y[t] - nu[t];
  }
// priors
  mu ~ normal(0,10);
  phi ~ normal(0,2);
  sigma ~ cauchy(0,5);
// likelihood
  err ~ normal(0,sigma);
}

R script for generating data and comparing the two models:

library(rstan)
Sys.setenv(LOCAL_CPPFLAGS = '-march=corei7 -mtune=corei7') #Recommended when rstan is called (PC has i7 3770)
options(mc.cores = parallel::detectCores())

m1<-stan_model("ar1_test_real.stan")
m2<-stan_model("ar1_test_matrix.stan")

mu <- -1.25;
sigma <- 0.75;
phi <- 0.2;
Time <- 1000;

err <- rnorm(Time,0,sigma);
nu <- rep(0,Time);
y <- rep(0,Time);
y[1] <- err[1] + mu + phi * mu;
for (t in 2:Time) y[t] <- err[t] + (mu + phi * y[t-1])

fit1 <- sampling(m1, data=list("T"=Time, "y"=y), chains=1, iter=2000, warmup=1000, seed=123)
fit2 <- sampling(m2,  data=list("T"=Time, "y"=y),chains=1, iter=2000, warmup=1000, seed=123)

Console output with run times:

> fit1 <- sampling(m1, data=list("T"=Time, "y"=y), chains=1, iter=2000, warmup=1000, seed=123)

SAMPLING FOR MODEL 'ar1_test_real' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.637 seconds (Warm-up)
Chain 1:                0.607 seconds (Sampling)
Chain 1:                1.244 seconds (Total)
Chain 1: 
> fit2 <- sampling(m2,  data=list("T"=Time, "y"=y),chains=1, iter=2000, warmup=1000, seed=123)

SAMPLING FOR MODEL 'ar1_test_matrix' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.001 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 10 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 6.953 seconds (Warm-up)
Chain 1:                6.616 seconds (Sampling)
Chain 1:                13.569 seconds (Total)
Chain 1: 
  • Operating System: Windows 7
  • RStan Version: 2.19.3
  • Output of writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars"))):
CXX14FLAGS=-O3 -mtune=corei7 -march=corei7 -Wno-unused-variable -Wno-unused-function

Without really knowing anything about compilers, I did change -O2 flags to -O3 before running the above experiment.
Here are some of the modified flags in Makeconf

CXX11 = $(BINPREF)g++ $(M_ARCH)
CXX11FLAGS = -O3 -Wall $(DEBUGFLAG) -mfpmath=sse -msse2 -mstackrealign
CXX11PICFLAGS =
CXX11STD = -std=gnu++11
## these settings are for gcc >= 8
CXX14 = $(CXX11)
CXX14FLAGS = $(CXX11FLAGS)
CXX14PICFLAGS =
CXX14STD = -std=gnu++14
  • Output of devtools::session_info("rstan")
- Session info -----------------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.0 (2020-04-24)
 os       Windows 7 x64 SP 1          
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2020-05-31                  

- Packages ---------------------------------------------------------------------------------------------------------------------------------------------------------
 package      * version   date       lib source        
 assertthat     0.2.1     2019-03-21 [1] CRAN (R 4.0.0)
 backports      1.1.6     2020-04-05 [1] CRAN (R 4.0.0)
 BH             1.72.0-3  2020-01-08 [1] CRAN (R 4.0.0)
 callr          3.4.3     2020-03-28 [1] CRAN (R 4.0.0)
 checkmate      2.0.0     2020-02-06 [1] CRAN (R 4.0.0)
 cli            2.0.2     2020-02-28 [1] CRAN (R 4.0.0)
 colorspace     1.4-1     2019-03-18 [1] CRAN (R 4.0.0)
 crayon         1.3.4     2017-09-16 [1] CRAN (R 4.0.0)
 desc           1.2.0     2018-05-01 [1] CRAN (R 4.0.0)
 digest         0.6.25    2020-02-23 [1] CRAN (R 4.0.0)
 ellipsis       0.3.1     2020-05-15 [1] CRAN (R 4.0.0)
 evaluate       0.14      2019-05-28 [1] CRAN (R 4.0.0)
 fansi          0.4.1     2020-01-08 [1] CRAN (R 4.0.0)
 farver         2.0.3     2020-01-16 [1] CRAN (R 4.0.0)
 ggplot2      * 3.3.1     2020-05-28 [1] CRAN (R 4.0.0)
 glue           1.4.1     2020-05-13 [1] CRAN (R 4.0.0)
 gridExtra      2.3       2017-09-09 [1] CRAN (R 4.0.0)
 gtable         0.3.0     2019-03-25 [1] CRAN (R 4.0.0)
 inline         0.3.15    2018-05-18 [1] CRAN (R 4.0.0)
 isoband        0.2.1     2020-04-12 [1] CRAN (R 4.0.0)
 labeling       0.3       2014-08-23 [1] CRAN (R 4.0.0)
 lattice        0.20-41   2020-04-02 [1] CRAN (R 4.0.0)
 lifecycle      0.2.0     2020-03-06 [1] CRAN (R 4.0.0)
 loo            2.2.0     2019-12-19 [1] CRAN (R 4.0.0)
 magrittr       1.5       2014-11-22 [1] CRAN (R 4.0.0)
 MASS           7.3-51.5  2019-12-20 [1] CRAN (R 4.0.0)
 Matrix         1.2-18    2019-11-27 [1] CRAN (R 4.0.0)
 matrixStats    0.56.0    2020-03-13 [1] CRAN (R 4.0.0)
 mgcv           1.8-31    2019-11-09 [1] CRAN (R 4.0.0)
 munsell        0.5.0     2018-06-12 [1] CRAN (R 4.0.0)
 nlme           3.1-147   2020-04-13 [1] CRAN (R 4.0.0)
 pillar         1.4.4     2020-05-05 [1] CRAN (R 4.0.0)
 pkgbuild       1.0.8     2020-05-07 [1] CRAN (R 4.0.0)
 pkgconfig      2.0.3     2019-09-22 [1] CRAN (R 4.0.0)
 pkgload        1.0.2     2018-10-29 [1] CRAN (R 4.0.0)
 praise         1.0.0     2015-08-11 [1] CRAN (R 4.0.0)
 prettyunits    1.1.1     2020-01-24 [1] CRAN (R 4.0.0)
 processx       3.4.2     2020-02-09 [1] CRAN (R 4.0.0)
 ps             1.3.3     2020-05-08 [1] CRAN (R 4.0.0)
 R6             2.4.1     2019-11-12 [1] CRAN (R 4.0.0)
 RColorBrewer   1.1-2     2014-12-07 [1] CRAN (R 4.0.0)
 Rcpp           1.0.4.6   2020-04-09 [1] CRAN (R 4.0.0)
 RcppEigen      0.3.3.7.0 2019-11-16 [1] CRAN (R 4.0.0)
 rlang          0.4.6     2020-05-02 [1] CRAN (R 4.0.0)
 rprojroot      1.3-2     2018-01-03 [1] CRAN (R 4.0.0)
 rstan        * 2.19.3    2020-02-11 [1] CRAN (R 4.0.0)
 rstudioapi     0.11      2020-02-07 [1] CRAN (R 4.0.0)
 scales         1.1.1     2020-05-11 [1] CRAN (R 4.0.0)
 StanHeaders  * 2.19.2    2020-02-11 [1] CRAN (R 4.0.0)
 testthat       2.3.2     2020-03-02 [1] CRAN (R 4.0.0)
 tibble         3.0.1     2020-04-20 [1] CRAN (R 4.0.0)
 utf8           1.1.4     2018-05-24 [1] CRAN (R 4.0.0)
 vctrs          0.3.0     2020-05-11 [1] CRAN (R 4.0.0)
 viridisLite    0.3.0     2018-02-01 [1] CRAN (R 4.0.0)
 withr          2.2.0     2020-04-20 [1] CRAN (R 4.0.0)

[1] D:/Program Files/R/R-4.0.0/library
2 Likes

Urgh…not nice. Thanks for reporting. Is it possible for you to run this with cmdstanr so that you can easily use the latest 2.23?

Tagging @rok_cesnovar for advise on where to file this as an issue?

1 Like

Hm. You are definitely more efficient with scalar then 1x1 matrices, but a 10x slowdown does seem excessive.

Let me throw this in cmdstanr to see if we can repeat this with the latest versions. If yes, this is probably something to discuss in Math.

2 Likes

I went ahead and tried this in cmdstanr and I have the same result:
(First run is reals only, second run is 1x1 matrices).

+   data = list("T"=Time, "y"=y),
+   seed = 123,
+   chains = 1,
+   cores = 1
+ )
Running MCMC with 1 chain(s) on 1 core(s)...

Running ar1_test_real.exe "id=1" random "seed=123" data "file=D:/TEMP/RtmpAbWQxN/standata-4798175b495f.json" output \
  "file=D:/TEMP/RtmpAbWQxN/ar1_test_real-202006020848-1-645678.csv" "method=sample" "save_warmup=0" "algorithm=hmc" "engine=nuts" adapt "engaged=1"
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 1.6 seconds.
> fit <- model$sample(
+   data = list("T"=Time, "y"=y),
+   seed = 123,
+   chains = 1,
+   cores = 1
+ )
Running MCMC with 1 chain(s) on 1 core(s)...

Running ar1_test_matrix.exe "id=1" random "seed=123" data "file=D:/TEMP/RtmpAbWQxN/standata-47983332258e.json" output \
  "file=D:/TEMP/RtmpAbWQxN/ar1_test_matrix-202006020848-1-a5210a.csv" "method=sample" "save_warmup=0" "algorithm=hmc" "engine=nuts" adapt "engaged=1"
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 13.1 seconds.
> 

Thanks. Will investigate and make an issue in the Stan Math repository. If you can post the cmdstanr script that would help.

Not totally sure what you mean. I just ran it via R as below using the same stan code and simulation as the first post:

library(cmdstanr)
model<-cmdstan_model("ar1_test_matrix.stan")
model0<-cmdstan_model("ar1_test_real.stan")

fit <- model0$sample(
  data = list("T"=Time, "y"=y),
  seed = 123,
  chains = 1,
  cores = 1
)

fit <- model$sample(
  data = list("T"=Time, "y"=y),
  seed = 123,
  chains = 1,
  cores = 1
)
1 Like

That is what I meant yes. Saves me a few minutes translating rtsan to cmdstanr :) Thanks!

1 Like

Gotcha. Thanks for looking into this!

I went ahead and tried something out of curiosity. I coded a completely non-vectorized Kalman-Bucy filter for my model for speed comparisons.

Exhibit A: Kalman filter in 1x1 matrix algebra:

	for(t in 1:T){
		z= y[t, 2:(p+1)]';
		u[1] = sigmoid(x[1], b0, b1);				
		
		//Prediction
		xp = Ad*x + Bd*u;
		Pp = Ad*P*Ad'+Qd;
		
		//Correction
		r = (z - H * xp);
		S = H*Pp*H'+R;
		K = (Sinv*(H*Pp))';
		x = xp + K * r;
		
		err[t] = r';
	}

Exhibit B: Kalman filter with reals (P, x, xp, etc) and for loops for matrix operations:

	for(t in 1:T){
		z= y[t, 2:(p+1)]';
		u = sigmoid(x, b0, b1);				
		
		//Prediction
		xp = Ad[1,1]*x + Bd[1,1]*u;
		Pp = Ad[1,1]*P*Ad[1,1]+Qd[1,1];
		
		//Correction
		for(h in 1:p){
			r[h] = (z[h] - H[h,1] * xp);
		}
		K = rep_matrix(0.0,q,p);
		for(k in 1:p){
			for(l in 1:p){
				K[1,k] += Sinv[k,l]*H[l,1]*Pp;
			}
		}
		x=xp;
		for(i in 1:p){
			x +=  K[1,i] * r[i];
		}
		err[t] = r';
	}

Result A, with matrix operations:

Chain 1: Iteration: 480 / 600 [ 80%]  (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%]  (Sampling)
Chain 1: Iteration: 600 / 600 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 25.138 seconds (Warm-up)
Chain 1:                17.262 seconds (Sampling)
Chain 1:                42.4 seconds (Total)
Chain 1: 

Result B, with for loops:

Chain 1: Iteration: 480 / 600 [ 80%]  (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%]  (Sampling)
Chain 1: Iteration: 600 / 600 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 7.915 seconds (Warm-up)
Chain 1:                8.076 seconds (Sampling)
Chain 1:                15.991 seconds (Total)

Sweet.

1 Like