30%-40% drop in sampling time using stanc O1 optimizations

avehtari · July 27, 2022, 1:30pm

I’m reporting here some results from using recent stanc O1 optimization. The O1 optimization was enabled in 2.29 and some fixes in 2.30 has made it working well with many models I’ve tested. O1 optimization is not on by default as it still needs more testing before making it default, but it’s easy to enable it, so it’s worth testing if you have slow sampling and the model has vector or array parameters.

Especially, the possibility of using a struct-of-arrays memory mapping for vector and array parameters can make a big difference. I see for many posteriors 25%-40% drop in sampling time. The posteriors for which I see big speed-ups are for example GLMs with a large number of predictors, basis function based splines and GPs, and covariance matrix based GPs. Struct-of-arrays memory mapping changes how the parameter and gradient values are stored, but can be made only for vectors and arrays that are not individually indexed, plus there are some other constraints.

for cmdstan_model() add
stanc_options = list("O1")

for brm() add (works at least for GLMs)
backend="cmdstanr", stan_model_args=list(stanc_options = list("O1"))

brms generated code is not (yet) optimized for benefiting from --O1, so there maybe more brms models that will benefit from the --O1 later.

You can also check which parameters can use structure of array (SoA) memory mapping by new debug-mem-patterns option:

model <- cmdstan_model(stan_file=file, compile=FALSE)
model$check_syntax(stanc_options = list("debug-mem-patterns", "O1"))

You see either AoS (array-of-structures) or SoA (structure-of-arrays). If you see some SoA, you may expect sampling speed differences. Please report your experiences, for example, in this thread

There is more information about “Stan compiler optimization levels” and “New optimization to better utilize vectorization and memory throughput” in Release notes Release of CmdStan 2.29 – The Stan Blog (although the release notes is for 2.29, it’s better use 2.30 (or later))

Thanks for all Stan C++ / stanc developers for making Stan faster

wds15 · July 27, 2022, 1:51pm

Thanks Aki for reporting this. Very encouraging.

Also a big thanks to @stevebronder who was/is key driver behind this thing for years… in fact, the entire stanc3 folks were chased by Steve to make varmat a reality (at least this is how I observed all of this from a GitHub notification reader).

It’s time to try this out now as it sounds!

wds15 · July 28, 2022, 9:26am

I was curious to try it out and took the models from my two R packages. So the model in the R packageOncoBayes2 does not benefit at all, since all data types end-up being AoS. I expected this, given that I loop over everything. The model in RBesT in contrast came down to being all SoA! I was not sure if that would work out like this, given that I do index into vectors, but the indexing always happens “on block”, so no loops. The model is really tiny and it’s not really needed to speedup this model, but running super long chains I am getting about 12% speedup … without changing any line of code!

The model is a one-level random effects model applied to very few data-lines usually, but it’s nice to see what works with SoA.

(odd things to write about AoS vs SoA… SoS?)

dotPiano · August 6, 2022, 9:46am

Since you are asking for feedback: I tried both O1 and Oexperimental on a nested distributional model in brms with about 2500 parameters

model <- brm(
  bf(value ~ 1 + (1|a) + (1|a:b) + (1|a:b:c), sigma ~ 1 +(1|a)), df, backend = "cmdstanr", 
  stan_model_args=list(stanc_options = list("O1")
)

but I don’t see any difference in the runtime (Ubuntu 20.04 on WSL, GCC 9.4.0, cmdstan 2.30.1).

Could it be that this is due to the fact that brms emits for loops instead of vector operations? I’m not sure to what extent stan can vectorize this kind of code:

real partial_log_lik_lpmf(...) {
   // ... 
   // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = Intercept_sigma + rep_vector(0.0, N);
    for (n in 1 : N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn] + r_2_1[J_2[nn]] * Z_2_1[nn]
               + r_3_1[J_3[nn]] * Z_3_1[nn];
    }
    for (n in 1 : N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      sigma[n] += r_4_sigma_1[J_4[nn]] * Z_4_sigma_1[nn];
    }
    for (n in 1 : N) {
      // apply the inverse link function
      sigma[n] = exp(sigma[n]);
    }
    // ...
  }

avehtari · August 6, 2022, 10:11am

I don’t recommend using Oexperimental. It’s known to have unsafe optimizations, and also the compilation failed for the two models I tried.

Yes. Maybe I should have mentioned that I have tested these, too, and the reason is that currently the hierarchical models are easiest to code with loops. To get better performance, we need sparse matrix support, which is in the pipeline, but unfortunately will still take a lot of work to be done. In hierarchical models, the hierarchy can be presented in the design matrix as data, but that design matrix will be mostly zeros. When doing the design matrix times parameter vector multiplication. using a dense matrix would unnecessarily do all the multiplications by zeros and also have a big memory usage. To avoid this brms is instead using for loops to add hierachical model terms together, but then that causes the overhad in the Stan autodiff. Sparse matrices would have the same memory usage as for loops, but the sparse matrix times parameter vector multiplication would be handled by Eigen which would take care of efficient memory access and use of modern CPU features and we would get the SoA speedup.

dotPiano · August 6, 2022, 4:17pm

Thanks for the clarification!

wds15 · August 8, 2022, 8:28am

I did investigate this a bit more and found that

reduce_sum seems to block SoA to work right now… I don’t see why this is the case, so maybe/hopefully that can be fixed.
We do not need full blown sparse matrices as it looks. Simple vectorized indexing will do the trick already and hopefully result in nice speedups.

While I don’t know how easy it is for @paul.buerkner to handle the following scheme in brms, I am hoping that is is easy to implement and does not have side-effects:

  // compute partial sums of the log-likelihood
  real partial_log_lik_lpmf(int[] seq, int start, int end, data vector Y, data matrix Xc, vector b, real Intercept, real sigma, data int[] J_1, data vector Z_1_1, vector r_1_1) {
    real ptarget = 0;
    int N = end - start + 1;
    // initialize linear predictor term
    //vector[N] mu;  = Intercept + rep_vector(0.0, N);
    vector[N] mu = rep_vector(Intercept, N);
    /*
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn];
    }
    */
    mu += r_1_1[J_1[start:end]] .* Z_1_1[start:end];
    ptarget += normal_id_glm_lpdf(Y[start:end] | Xc[start:end], mu, b, sigma);
    return ptarget;
  }

I filed respective reports:

github.com/stan-dev/stanc3

[BUG] `reduce_sum` blocks `SoA`

opened 08:15AM - 08 Aug 22 UTC

wds15

### Current Behavior: Stan models using `reduce_sum` cannot benefit from the …`SoA` optimisation whenever `reduce_sum` is used with a data only slicing argument. ### Expected Behavior: If the argument to be sliced for `reduce_sum` is data only, then `SoA` should work just fine, given that the shared arguments are simply copied as a whole. Moreover, whenever the slicing argument is an array of parameters, then `SoA` should also work given that an array ( `std::vector`) is being chopped into pieces. ### Model Code <details><summary>Model which exhibits the issue</summary> ```stan // likelihood including constants if (!prior_only) { // reduce_sum is blocking SoA, but it should not since the slice // argument is data in this case //target += reduce_sum(partial_log_lik_lpmf, seq, grainsize, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1); // this call instead gives me a full SoA program target += partial_log_lik_lpmf(seq| 1, N, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1); } ``` </details> ### Environment: Stan 2.30.0 ### Anything else: Here is the full Stan model, which is a minor modification of a brms generated one. So the goal is to get brms to create `reduce_sum` programs which can benefit from `SoA` in addition to within-chain parallelisation. [soa-example-brms-soa_stan.txt](https://github.com/stan-dev/stanc3/files/9280024/soa-example-brms-soa_stan.txt)

github.com/paul-buerkner/brms

sampling speed improvements

opened 02:11PM - 27 Jul 22 UTC

avehtari

efficiency

Two potential speed-ups for spline and HSGP models (without random effects) #…# 1. use _glm more often. E.g. formula `y ~ s(x)` generates code with ``` vector[N] mu = Intercept + rep_vector(0.0, N) + Xs * bs + Zs_1_1 * s_1_1; target += normal_lpdf(Y | mu, sigma); ``` this can changed to use _glm (in the cases when there is _glm for the family used) ``` target += normal_id_glm_lpdf(Y | append_col(Xs,Zs_1_1), Intercept, append_row(bs,s_1_1), sigma); ``` This provided 80% drop in sampling time. This should be applicable for example for models of the form `y ~ x1 + x2 + s(x1) + s(x2)` ## 2. make code compatible with --O1 SoA See stanc --O1 optimization discourse post https://discourse.mc-stan.org/t/30-40-drop-in-sampling-time-using-stanc-o1-optimizations/28347 Even if not changing to use _glm, the code can be changed to benefit from stanc optimization --O1 and structure-of-arrays memory mapping, e.g. changing ``` // initialize linear predictor term vector[N] mu = Intercept + rep_vector(0.0, N) + Xs * bs + Zs_1_1 * s_1_1; // initialize linear predictor term vector[N] sigma = Intercept_sigma + rep_vector(0.0, N) + Xs_sigma * bs_sigma + Zs_sigma_1_1 * s_sigma_1_1; for (n in 1:N) { // apply the inverse link function sigma[n] = exp(sigma[n]); } target += normal_lpdf(Y | mu, sigma); ``` to ``` vector[N] mu = Intercept + Xs * bs + Zs_1_1 * s_1_1; // initialize linear predictor term vector[N] sigma = Intercept_sigma + Xs_sigma * bs_sigma + Zs_sigma_1_1 * s_sigma_1_1; // apply the inverse link function sigma = exp(sigma); target += normal_lpdf(Y | mu, sigma); ``` drops the sampling time 40% when using --O1. In the first code, use of SoA is blocked by adding data vector `rep_vector(0.0, N)` and by for loop over sigma. This speedup would work also with families that don't have _glm function in Stan.

Let’s hope this can be sorted out. Some benchmarking of the proposed pattern above would be great (for now without reduce_sum, so it requires this snippet in the model block to run with SoA:

  // likelihood including constants
  if (!prior_only) {
    // reduce_sum is blocking SoA, but it should not since the slice
    // argument is data in this case
    //target += reduce_sum(partial_log_lik_lpmf, seq, grainsize, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1);
    // this call instead gives me a full SoA program
    target += partial_log_lik_lpmf(seq| 1, N, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1);
  }

rok_cesnovar · August 8, 2022, 8:34am

Currently, using UDFs prevents the use of SoA. This is obviously a big limitation of O1, but it was very difficult to make the optimizations reliable in those cases. Now that O1 has gotten (more) stable, we should explore this.

wds15 · August 8, 2022, 8:41am

Oh dear… I did not know that. UDFs are certainly needed for reduce_sum. Thanks for clarifying that.

… and yeah… we need UDFs to work with SoA for many programs.

rok_cesnovar · August 8, 2022, 8:54am

Yeah, for sure. It does complicate the analysis and optimization, so we are taking it step by step. Now that O1 has gotten more widely tested and the early returns are that its great, especially the SoA stuff, we can move forward.

avehtari · August 8, 2022, 8:54am

I tried vectorized indexing, but it didn’t work (without using reduce_sum). If you did get it working, then I assume I made some mistake.

wds15 · August 8, 2022, 9:38am

…but now I am confused, because when I do this:

target += partial_log_lik_lpmf(seq| 1, N, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1);

in the model block, then I do get SoA data types. the partial_log_lik_lpmf is a UDF. Am I missing something?

This seems to work just fine:

mu += r_1_1[J_1[start:end]] .* Z_1_1[start:end];

(Note that I was not able to compile and run the model on my machine this morning due to a setup glitch, but to the best of my knowledge this is giving the same results as the for loop)

avehtari · August 8, 2022, 10:27am

Double checking that by working you mean r_1_1 parameter is SoA and you see speed-up?

wds15 · August 8, 2022, 10:32am

Not easily readable output, but it looks to me like r_1_1 is SoA:

I was not able to run the program and test for speedups, since my AV scanner stopped me from running the program on my corporate laptop.

WardBrian · August 8, 2022, 12:58pm

Yes, function inlining was enabled in this most recent update which can make it look like SoA is supported in non-recursive UDFs. We obviously are not able to in-line into a reduce sum call, though, so we’d need much more complicated analysis

wds15 · August 8, 2022, 1:23pm

I see. Since SoA targets large models where we need high-performance, it would make sense to get reduce_sum to work with it (given that reduce_sum is there for just the same reason). I don’t have any idea how complicated all of that is, but if there is a way to make this work with “simple” functions, then that’s worthwhile I guess. “Simple” in a sense that it’s doable to check things sufficiently.

The thing is that reduce_sum can scale performance in a brute force way and from the things reported so far, the SoA trick already saves a factor of about 2x, but having to give that up just for reduce_sum isn’t a broken leg on big machines, but it does hurt.

(I really don’t have a clue how hard these things are… all I can do is looby for it :D ).

WardBrian · August 8, 2022, 1:33pm

It’s certainly worth looking into, the person who could probably best say how reasonable it is is @stevebronder

stevebronder · August 8, 2022, 3:31pm

Having SoA support UDF functions is hard because of the combinatorics of all the possible function arguments being SoA or AoS. An example is like the function signature below, where we have 120 possible signatures because of the combinations of SoA or AoS matrices / vectors.

matrix foo(matrix a, vector b, matrix c, vector d, vector g);

So in the optimization step for SoA I think what we could do is:

A first pass of the optimization step where we just assume any UDF is going to work with SoA.
one pass that just looks at the UDFs and parameters / transformed parameters promoted to SoA to see if it’s possible for that UDF to actually use SoA.
With the result of (2), re-run (1) with the starting values that cannot be SoA as the results from (1) and (2).
I think we need to keep running 1-3 until the results from 3 have not changed between iterations.

We also need to keep track of which combinations we need to generate for each UDF signature. Then when we generate the C++ we need to generate each function signature and body such that they will work with the SoA / AoS combinations.

I think it’s very possible to support SoA for general UDF functions, but we need to be pretty careful about how we go about it.

wds15 · August 8, 2022, 6:51pm

But isn’t reduce_sum stuff easier? The thing is that the partial log lik functions do only return a scalar. So any SoA going into the partial log lik function will be reduced to just a real number. Maybe that’s an easier case to deal with for a start? All you’d need to ensure is that inside the partial log lik function only vectorized things are being done to the input arguments. That’s it.

stevebronder · August 8, 2022, 7:38pm

All the rules of UDFs apply even for scalar output so the optimization still needs to check which combinations of inputs are valid for the function body

Topic		Replies	Views
Have you been using some of the latest features of Stan? General	14	2579	November 12, 2021
R memory-conservation strategies with Stan Modeling specification , performance	4	614	February 28, 2021
Improving Stan sampling speed Modeling fitting-issues	2	546	March 7, 2023
Fine tuning for polynomial Posteriors Modeling performance	1	38	December 3, 2024
Stanc optimization flags CmdStan	3	492	August 4, 2023

30%-40% drop in sampling time using stanc O1 optimizations

Related topics