# Matrix operations (1x1) run 10x slower than using reals

Hello, I am having very slow performance for time series modeling that involves matrix operations. I am showing a 10x slowdown over the equivalent operations with reals.
I have searched topics and found that perhaps it has to do with the compiler, but I don’t really know anything about compilers or configuring them. I’ve given some information about my setup at the end of this post.

The performance drop with matrix algebra is fairly prohibitive of the hierarchical state-space modeling that I am trying to do, which requires the kalman-bucy matrix operations at each time t.

I have modified the arma11.stan demo to create a minimal example. I added a simple line of algebra in the for loop. In one model, it operates on reals, and in the other, it operates on 1x1 matrices. The slowdown from 1.2s to 13s seems disproportionate.

The problem is identical between Rtools 35 and Rtools 40 / R 3.6.3 and R 4.0.0.
Is this typical, and is there any way to speed up matrix operations?

Model with reals:

``````data {
int<lower=1> T;       // number of observations
real y[T];            // observed outputs
}
parameters {
real mu;              // mean term
real phi;             // autoregression coeff
real<lower=0> sigma;  // noise scale
}
model {
vector[T] nu;         // prediction for time t
vector[T] err;        // error for time t

real b[T];
real a;
a = 1.0;
b[1] =1.0;

nu[1] <- mu + phi * mu;   // assume err[0] == 0
err[1] <- y[1] - nu[1];
for (t in 2:T) {
b[t] = a*a'+t;

nu[t] <- mu + phi * y[t-1];
err[t] <- y[t] - nu[t];
}
// priors
mu ~ normal(0,10);
phi ~ normal(0,2);
sigma ~ cauchy(0,5);
// likelihood
err ~ normal(0,sigma);
}
``````

Model with 1x1 matrices:

``````data {
int<lower=1> T;       // number of observations
real y[T];            // observed outputs
}
parameters {
real mu;              // mean term
real phi;             // autoregression coeff
real<lower=0> sigma;  // noise scale
}
model {
vector[T] nu;         // prediction for time t
vector[T] err;        // error for time t

matrix[1,1] b[T];
matrix[1,1] a;
a[1,1]= 1.0;
b[1][1,1] =1.0;

nu[1] <- mu + phi * mu;   // assume err[0] == 0
err[1] <- y[1] - nu[1];
for (t in 2:T) {
b[t] = a*a'+t;

nu[t] <- mu + phi * y[t-1];
err[t] <- y[t] - nu[t];
}
// priors
mu ~ normal(0,10);
phi ~ normal(0,2);
sigma ~ cauchy(0,5);
// likelihood
err ~ normal(0,sigma);
}
``````

R script for generating data and comparing the two models:

``````library(rstan)
Sys.setenv(LOCAL_CPPFLAGS = '-march=corei7 -mtune=corei7') #Recommended when rstan is called (PC has i7 3770)
options(mc.cores = parallel::detectCores())

m1<-stan_model("ar1_test_real.stan")
m2<-stan_model("ar1_test_matrix.stan")

mu <- -1.25;
sigma <- 0.75;
phi <- 0.2;
Time <- 1000;

err <- rnorm(Time,0,sigma);
nu <- rep(0,Time);
y <- rep(0,Time);
y[1] <- err[1] + mu + phi * mu;
for (t in 2:Time) y[t] <- err[t] + (mu + phi * y[t-1])

fit1 <- sampling(m1, data=list("T"=Time, "y"=y), chains=1, iter=2000, warmup=1000, seed=123)
fit2 <- sampling(m2,  data=list("T"=Time, "y"=y),chains=1, iter=2000, warmup=1000, seed=123)
``````

Console output with run times:

``````> fit1 <- sampling(m1, data=list("T"=Time, "y"=y), chains=1, iter=2000, warmup=1000, seed=123)

SAMPLING FOR MODEL 'ar1_test_real' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1:
Chain 1:
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1:
Chain 1:  Elapsed Time: 0.637 seconds (Warm-up)
Chain 1:                0.607 seconds (Sampling)
Chain 1:                1.244 seconds (Total)
Chain 1:
> fit2 <- sampling(m2,  data=list("T"=Time, "y"=y),chains=1, iter=2000, warmup=1000, seed=123)

SAMPLING FOR MODEL 'ar1_test_matrix' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.001 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 10 seconds.
Chain 1:
Chain 1:
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1:
Chain 1:  Elapsed Time: 6.953 seconds (Warm-up)
Chain 1:                6.616 seconds (Sampling)
Chain 1:                13.569 seconds (Total)
Chain 1:
``````
• Operating System: Windows 7
• RStan Version: 2.19.3
• Output of `writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))`:
``````CXX14FLAGS=-O3 -mtune=corei7 -march=corei7 -Wno-unused-variable -Wno-unused-function
``````

Without really knowing anything about compilers, I did change -O2 flags to -O3 before running the above experiment.
Here are some of the modified flags in Makeconf

``````CXX11 = \$(BINPREF)g++ \$(M_ARCH)
CXX11FLAGS = -O3 -Wall \$(DEBUGFLAG) -mfpmath=sse -msse2 -mstackrealign
CXX11PICFLAGS =
CXX11STD = -std=gnu++11
## these settings are for gcc >= 8
CXX14 = \$(CXX11)
CXX14FLAGS = \$(CXX11FLAGS)
CXX14PICFLAGS =
CXX14STD = -std=gnu++14
``````
• Output of `devtools::session_info("rstan")`
``````- Session info -----------------------------------------------------------------------------------------------------------------------------------------------------
setting  value
version  R version 4.0.0 (2020-04-24)
os       Windows 7 x64 SP 1
system   x86_64, mingw32
ui       RStudio
language (EN)
collate  English_United States.1252
ctype    English_United States.1252
tz       America/New_York
date     2020-05-31

- Packages ---------------------------------------------------------------------------------------------------------------------------------------------------------
package      * version   date       lib source
assertthat     0.2.1     2019-03-21 [1] CRAN (R 4.0.0)
backports      1.1.6     2020-04-05 [1] CRAN (R 4.0.0)
BH             1.72.0-3  2020-01-08 [1] CRAN (R 4.0.0)
callr          3.4.3     2020-03-28 [1] CRAN (R 4.0.0)
checkmate      2.0.0     2020-02-06 [1] CRAN (R 4.0.0)
cli            2.0.2     2020-02-28 [1] CRAN (R 4.0.0)
colorspace     1.4-1     2019-03-18 [1] CRAN (R 4.0.0)
crayon         1.3.4     2017-09-16 [1] CRAN (R 4.0.0)
desc           1.2.0     2018-05-01 [1] CRAN (R 4.0.0)
digest         0.6.25    2020-02-23 [1] CRAN (R 4.0.0)
ellipsis       0.3.1     2020-05-15 [1] CRAN (R 4.0.0)
evaluate       0.14      2019-05-28 [1] CRAN (R 4.0.0)
fansi          0.4.1     2020-01-08 [1] CRAN (R 4.0.0)
farver         2.0.3     2020-01-16 [1] CRAN (R 4.0.0)
ggplot2      * 3.3.1     2020-05-28 [1] CRAN (R 4.0.0)
glue           1.4.1     2020-05-13 [1] CRAN (R 4.0.0)
gridExtra      2.3       2017-09-09 [1] CRAN (R 4.0.0)
gtable         0.3.0     2019-03-25 [1] CRAN (R 4.0.0)
inline         0.3.15    2018-05-18 [1] CRAN (R 4.0.0)
isoband        0.2.1     2020-04-12 [1] CRAN (R 4.0.0)
labeling       0.3       2014-08-23 [1] CRAN (R 4.0.0)
lattice        0.20-41   2020-04-02 [1] CRAN (R 4.0.0)
lifecycle      0.2.0     2020-03-06 [1] CRAN (R 4.0.0)
loo            2.2.0     2019-12-19 [1] CRAN (R 4.0.0)
magrittr       1.5       2014-11-22 [1] CRAN (R 4.0.0)
MASS           7.3-51.5  2019-12-20 [1] CRAN (R 4.0.0)
Matrix         1.2-18    2019-11-27 [1] CRAN (R 4.0.0)
matrixStats    0.56.0    2020-03-13 [1] CRAN (R 4.0.0)
mgcv           1.8-31    2019-11-09 [1] CRAN (R 4.0.0)
munsell        0.5.0     2018-06-12 [1] CRAN (R 4.0.0)
nlme           3.1-147   2020-04-13 [1] CRAN (R 4.0.0)
pillar         1.4.4     2020-05-05 [1] CRAN (R 4.0.0)
pkgbuild       1.0.8     2020-05-07 [1] CRAN (R 4.0.0)
pkgconfig      2.0.3     2019-09-22 [1] CRAN (R 4.0.0)
pkgload        1.0.2     2018-10-29 [1] CRAN (R 4.0.0)
praise         1.0.0     2015-08-11 [1] CRAN (R 4.0.0)
prettyunits    1.1.1     2020-01-24 [1] CRAN (R 4.0.0)
processx       3.4.2     2020-02-09 [1] CRAN (R 4.0.0)
ps             1.3.3     2020-05-08 [1] CRAN (R 4.0.0)
R6             2.4.1     2019-11-12 [1] CRAN (R 4.0.0)
RColorBrewer   1.1-2     2014-12-07 [1] CRAN (R 4.0.0)
Rcpp           1.0.4.6   2020-04-09 [1] CRAN (R 4.0.0)
RcppEigen      0.3.3.7.0 2019-11-16 [1] CRAN (R 4.0.0)
rlang          0.4.6     2020-05-02 [1] CRAN (R 4.0.0)
rprojroot      1.3-2     2018-01-03 [1] CRAN (R 4.0.0)
rstan        * 2.19.3    2020-02-11 [1] CRAN (R 4.0.0)
rstudioapi     0.11      2020-02-07 [1] CRAN (R 4.0.0)
scales         1.1.1     2020-05-11 [1] CRAN (R 4.0.0)
StanHeaders  * 2.19.2    2020-02-11 [1] CRAN (R 4.0.0)
testthat       2.3.2     2020-03-02 [1] CRAN (R 4.0.0)
tibble         3.0.1     2020-04-20 [1] CRAN (R 4.0.0)
utf8           1.1.4     2018-05-24 [1] CRAN (R 4.0.0)
vctrs          0.3.0     2020-05-11 [1] CRAN (R 4.0.0)
viridisLite    0.3.0     2018-02-01 [1] CRAN (R 4.0.0)
withr          2.2.0     2020-04-20 [1] CRAN (R 4.0.0)

[1] D:/Program Files/R/R-4.0.0/library
``````
2 Likes

Urgh…not nice. Thanks for reporting. Is it possible for you to run this with cmdstanr so that you can easily use the latest 2.23?

Tagging @rok_cesnovar for advise on where to file this as an issue?

1 Like

Hm. You are definitely more efficient with scalar then 1x1 matrices, but a 10x slowdown does seem excessive.

Let me throw this in cmdstanr to see if we can repeat this with the latest versions. If yes, this is probably something to discuss in Math.

2 Likes

I went ahead and tried this in cmdstanr and I have the same result:
(First run is reals only, second run is 1x1 matrices).

``````+   data = list("T"=Time, "y"=y),
+   seed = 123,
+   chains = 1,
+   cores = 1
+ )
Running MCMC with 1 chain(s) on 1 core(s)...

Running ar1_test_real.exe "id=1" random "seed=123" data "file=D:/TEMP/RtmpAbWQxN/standata-4798175b495f.json" output \
"file=D:/TEMP/RtmpAbWQxN/ar1_test_real-202006020848-1-645678.csv" "method=sample" "save_warmup=0" "algorithm=hmc" "engine=nuts" adapt "engaged=1"
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup)
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup)
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup)
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup)
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1 finished in 1.6 seconds.
> fit <- model\$sample(
+   data = list("T"=Time, "y"=y),
+   seed = 123,
+   chains = 1,
+   cores = 1
+ )
Running MCMC with 1 chain(s) on 1 core(s)...

Running ar1_test_matrix.exe "id=1" random "seed=123" data "file=D:/TEMP/RtmpAbWQxN/standata-47983332258e.json" output \
"file=D:/TEMP/RtmpAbWQxN/ar1_test_matrix-202006020848-1-a5210a.csv" "method=sample" "save_warmup=0" "algorithm=hmc" "engine=nuts" adapt "engaged=1"
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup)
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup)
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup)
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup)
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1 finished in 13.1 seconds.
>
``````

Thanks. Will investigate and make an issue in the Stan Math repository. If you can post the cmdstanr script that would help.

Not totally sure what you mean. I just ran it via R as below using the same stan code and simulation as the first post:

``````library(cmdstanr)
model<-cmdstan_model("ar1_test_matrix.stan")
model0<-cmdstan_model("ar1_test_real.stan")

fit <- model0\$sample(
data = list("T"=Time, "y"=y),
seed = 123,
chains = 1,
cores = 1
)

fit <- model\$sample(
data = list("T"=Time, "y"=y),
seed = 123,
chains = 1,
cores = 1
)
``````
1 Like

That is what I meant yes. Saves me a few minutes translating rtsan to cmdstanr :) Thanks!

1 Like

Gotcha. Thanks for looking into this!

I went ahead and tried something out of curiosity. I coded a completely non-vectorized Kalman-Bucy filter for my model for speed comparisons.

Exhibit A: Kalman filter in 1x1 matrix algebra:

``````	for(t in 1:T){
z= y[t, 2:(p+1)]';
u[1] = sigmoid(x[1], b0, b1);

//Prediction

//Correction
r = (z - H * xp);
S = H*Pp*H'+R;
K = (Sinv*(H*Pp))';
x = xp + K * r;

err[t] = r';
}
``````

Exhibit B: Kalman filter with reals (P, x, xp, etc) and for loops for matrix operations:

``````	for(t in 1:T){
z= y[t, 2:(p+1)]';
u = sigmoid(x, b0, b1);

//Prediction

//Correction
for(h in 1:p){
r[h] = (z[h] - H[h,1] * xp);
}
K = rep_matrix(0.0,q,p);
for(k in 1:p){
for(l in 1:p){
K[1,k] += Sinv[k,l]*H[l,1]*Pp;
}
}
x=xp;
for(i in 1:p){
x +=  K[1,i] * r[i];
}
err[t] = r';
}
``````

Result A, with matrix operations:

``````Chain 1: Iteration: 480 / 600 [ 80%]  (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%]  (Sampling)
Chain 1: Iteration: 600 / 600 [100%]  (Sampling)
Chain 1:
Chain 1:  Elapsed Time: 25.138 seconds (Warm-up)
Chain 1:                17.262 seconds (Sampling)
Chain 1:                42.4 seconds (Total)
Chain 1:
``````

Result B, with for loops:

``````Chain 1: Iteration: 480 / 600 [ 80%]  (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%]  (Sampling)
Chain 1: Iteration: 600 / 600 [100%]  (Sampling)
Chain 1:
Chain 1:  Elapsed Time: 7.915 seconds (Warm-up)
Chain 1:                8.076 seconds (Sampling)
Chain 1:                15.991 seconds (Total)
``````

Sweet.

1 Like