30%-40% drop in sampling time using stanc O1 optimizations

I’m reporting here some results from using recent stanc O1 optimization. The O1 optimization was enabled in 2.29 and some fixes in 2.30 has made it working well with many models I’ve tested. O1 optimization is not on by default as it still needs more testing before making it default, but it’s easy to enable it, so it’s worth testing if you have slow sampling and the model has vector or array parameters.

Especially, the possibility of using a struct-of-arrays memory mapping for vector and array parameters can make a big difference. I see for many posteriors 25%-40% drop in sampling time. The posteriors for which I see big speed-ups are for example GLMs with a large number of predictors, basis function based splines and GPs, and covariance matrix based GPs. Struct-of-arrays memory mapping changes how the parameter and gradient values are stored, but can be made only for vectors and arrays that are not individually indexed, plus there are some other constraints.

for cmdstan_model() add
stanc_options = list("O1")

for brm() add (works at least for GLMs)
backend="cmdstanr", stan_model_args=list(stanc_options = list("O1"))

brms generated code is not (yet) optimized for benefiting from --O1, so there maybe more brms models that will benefit from the --O1 later.

You can also check which parameters can use structure of array (SoA) memory mapping by new debug-mem-patterns option:

model <- cmdstan_model(stan_file=file, compile=FALSE)
model$check_syntax(stanc_options = list("debug-mem-patterns", "O1"))

You see either AoS (array-of-structures) or SoA (structure-of-arrays). If you see some SoA, you may expect sampling speed differences. Please report your experiences, for example, in this thread

There is more information about “Stan compiler optimization levels” and “New optimization to better utilize vectorization and memory throughput” in Release notes Release of CmdStan 2.29 – The Stan Blog (although the release notes is for 2.29, it’s better use 2.30 (or later))

Thanks for all Stan C++ / stanc developers for making Stan faster

12 Likes

Thanks Aki for reporting this. Very encouraging.

Also a big thanks to @stevebronder who was/is key driver behind this thing for years… in fact, the entire stanc3 folks were chased by Steve to make varmat a reality (at least this is how I observed all of this from a GitHub notification reader).

It’s time to try this out now as it sounds!

3 Likes

I was curious to try it out and took the models from my two R packages. So the model in the R packageOncoBayes2 does not benefit at all, since all data types end-up being AoS. I expected this, given that I loop over everything. The model in RBesT in contrast came down to being all SoA! I was not sure if that would work out like this, given that I do index into vectors, but the indexing always happens “on block”, so no loops. The model is really tiny and it’s not really needed to speedup this model, but running super long chains I am getting about 12% speedup … without changing any line of code!

The model is a one-level random effects model applied to very few data-lines usually, but it’s nice to see what works with SoA.

(odd things to write about AoS vs SoA… SoS?)

3 Likes

Since you are asking for feedback: I tried both O1 and Oexperimental on a nested distributional model in brms with about 2500 parameters

model <- brm(
  bf(value ~ 1 + (1|a) + (1|a:b) + (1|a:b:c), sigma ~ 1 +(1|a)), df, backend = "cmdstanr", 
  stan_model_args=list(stanc_options = list("O1")
)

but I don’t see any difference in the runtime (Ubuntu 20.04 on WSL, GCC 9.4.0, cmdstan 2.30.1).

Could it be that this is due to the fact that brms emits for loops instead of vector operations? I’m not sure to what extent stan can vectorize this kind of code:

real partial_log_lik_lpmf(...) {
   // ... 
   // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] sigma = Intercept_sigma + rep_vector(0.0, N);
    for (n in 1 : N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn] + r_2_1[J_2[nn]] * Z_2_1[nn]
               + r_3_1[J_3[nn]] * Z_3_1[nn];
    }
    for (n in 1 : N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      sigma[n] += r_4_sigma_1[J_4[nn]] * Z_4_sigma_1[nn];
    }
    for (n in 1 : N) {
      // apply the inverse link function
      sigma[n] = exp(sigma[n]);
    }
    // ...
  }

I don’t recommend using Oexperimental. It’s known to have unsafe optimizations, and also the compilation failed for the two models I tried.

Yes. Maybe I should have mentioned that I have tested these, too, and the reason is that currently the hierarchical models are easiest to code with loops. To get better performance, we need sparse matrix support, which is in the pipeline, but unfortunately will still take a lot of work to be done. In hierarchical models, the hierarchy can be presented in the design matrix as data, but that design matrix will be mostly zeros. When doing the design matrix times parameter vector multiplication. using a dense matrix would unnecessarily do all the multiplications by zeros and also have a big memory usage. To avoid this brms is instead using for loops to add hierachical model terms together, but then that causes the overhad in the Stan autodiff. Sparse matrices would have the same memory usage as for loops, but the sparse matrix times parameter vector multiplication would be handled by Eigen which would take care of efficient memory access and use of modern CPU features and we would get the SoA speedup.

1 Like

Thanks for the clarification!

1 Like

I did investigate this a bit more and found that

  1. reduce_sum seems to block SoA to work right now… I don’t see why this is the case, so maybe/hopefully that can be fixed.
  2. We do not need full blown sparse matrices as it looks. Simple vectorized indexing will do the trick already and hopefully result in nice speedups.

While I don’t know how easy it is for @paul.buerkner to handle the following scheme in brms, I am hoping that is is easy to implement and does not have side-effects:

  // compute partial sums of the log-likelihood
  real partial_log_lik_lpmf(int[] seq, int start, int end, data vector Y, data matrix Xc, vector b, real Intercept, real sigma, data int[] J_1, data vector Z_1_1, vector r_1_1) {
    real ptarget = 0;
    int N = end - start + 1;
    // initialize linear predictor term
    //vector[N] mu;  = Intercept + rep_vector(0.0, N);
    vector[N] mu = rep_vector(Intercept, N);
    /*
    for (n in 1:N) {
      // add more terms to the linear predictor
      int nn = n + start - 1;
      mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn];
    }
    */
    mu += r_1_1[J_1[start:end]] .* Z_1_1[start:end];
    ptarget += normal_id_glm_lpdf(Y[start:end] | Xc[start:end], mu, b, sigma);
    return ptarget;
  }

I filed respective reports:

Let’s hope this can be sorted out. Some benchmarking of the proposed pattern above would be great (for now without reduce_sum, so it requires this snippet in the model block to run with SoA:

  // likelihood including constants
  if (!prior_only) {
    // reduce_sum is blocking SoA, but it should not since the slice
    // argument is data in this case
    //target += reduce_sum(partial_log_lik_lpmf, seq, grainsize, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1);
    // this call instead gives me a full SoA program
    target += partial_log_lik_lpmf(seq| 1, N, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1);
  }
1 Like

Currently, using UDFs prevents the use of SoA. This is obviously a big limitation of O1, but it was very difficult to make the optimizations reliable in those cases. Now that O1 has gotten (more) stable, we should explore this.

Oh dear… I did not know that. UDFs are certainly needed for reduce_sum. Thanks for clarifying that.

… and yeah… we need UDFs to work with SoA for many programs.

Yeah, for sure. It does complicate the analysis and optimization, so we are taking it step by step. Now that O1 has gotten more widely tested and the early returns are that its great, especially the SoA stuff, we can move forward.

I tried vectorized indexing, but it didn’t work (without using reduce_sum). If you did get it working, then I assume I made some mistake.

…but now I am confused, because when I do this:

target += partial_log_lik_lpmf(seq| 1, N, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, r_1_1);

in the model block, then I do get SoA data types. the partial_log_lik_lpmf is a UDF. Am I missing something?

This seems to work just fine:

mu += r_1_1[J_1[start:end]] .* Z_1_1[start:end];

(Note that I was not able to compile and run the model on my machine this morning due to a setup glitch, but to the best of my knowledge this is giving the same results as the for loop)

Double checking that by working you mean r_1_1 parameter is SoA and you see speed-up?

Not easily readable output, but it looks to me like r_1_1 is SoA:

I was not able to run the program and test for speedups, since my AV scanner stopped me from running the program on my corporate laptop.

1 Like

Yes, function inlining was enabled in this most recent update which can make it look like SoA is supported in non-recursive UDFs. We obviously are not able to in-line into a reduce sum call, though, so we’d need much more complicated analysis

I see. Since SoA targets large models where we need high-performance, it would make sense to get reduce_sum to work with it (given that reduce_sum is there for just the same reason). I don’t have any idea how complicated all of that is, but if there is a way to make this work with “simple” functions, then that’s worthwhile I guess. “Simple” in a sense that it’s doable to check things sufficiently.

The thing is that reduce_sum can scale performance in a brute force way and from the things reported so far, the SoA trick already saves a factor of about 2x, but having to give that up just for reduce_sum isn’t a broken leg on big machines, but it does hurt.

(I really don’t have a clue how hard these things are… all I can do is looby for it :D ).

3 Likes

It’s certainly worth looking into, the person who could probably best say how reasonable it is is @stevebronder

3 Likes

Having SoA support UDF functions is hard because of the combinatorics of all the possible function arguments being SoA or AoS. An example is like the function signature below, where we have 120 possible signatures because of the combinations of SoA or AoS matrices / vectors.

matrix foo(matrix a, vector b, matrix c, vector d, vector g);

So in the optimization step for SoA I think what we could do is:

  1. A first pass of the optimization step where we just assume any UDF is going to work with SoA.
  2. one pass that just looks at the UDFs and parameters / transformed parameters promoted to SoA to see if it’s possible for that UDF to actually use SoA.
  3. With the result of (2), re-run (1) with the starting values that cannot be SoA as the results from (1) and (2).
  4. I think we need to keep running 1-3 until the results from 3 have not changed between iterations.

We also need to keep track of which combinations we need to generate for each UDF signature. Then when we generate the C++ we need to generate each function signature and body such that they will work with the SoA / AoS combinations.

I think it’s very possible to support SoA for general UDF functions, but we need to be pretty careful about how we go about it.

3 Likes

But isn’t reduce_sum stuff easier? The thing is that the partial log lik functions do only return a scalar. So any SoA going into the partial log lik function will be reduced to just a real number. Maybe that’s an easier case to deal with for a start? All you’d need to ensure is that inside the partial log lik function only vectorized things are being done to the input arguments. That’s it.

All the rules of UDFs apply even for scalar output so the optimization still needs to check which combinations of inputs are valid for the function body