Any ideas for speeding up this chunk of my transformed parameters block?

I’m developing a model, and it’s taking longer to fit than I would like. Using profile tools, I have determined that a piece of the transformed parameters block is taking about half the total runtime. It’s doing math on three sets of largish vectors. Here’s the code:

...
transformed parameters {
...
        for (v in 1:3) {
            int s = 1;
            vec_simp[v] = re_stim_simp[v][stim_id_simp] 
                        + p0[v, s, sid_simp] 
                        + delta[v, s, sid_simp]
                        .* (1 - exp(-alpha[v, s, sid_simp].*t_simp));

            s = 2;
            vec_cplx[v] = re_stim_cplx[v][stim_id_cplx]
                    + p0[v, s, sid_cplx] 
                    + delta[v, s, sid_cplx]
                    .* (1 - exp(-alpha[v, s, sid_cplx].*t_cplx));
        }
...
}
...

stim_id_simp, sid_simp, stim_id_cplx, and sid_cplx are data: int arrays of length about 7200 (they will eventually be about twice as long).
t_simp and t_cplx are vectors of the same length. All the other variables are transformed parameters.

Transformed parameter declarations

In case it is relevant, here are the declarations of all the transformed parameters from the snippet above. n_sub is ~120, nstim_simp and nstim_cplx are both 60, and n_simp and n_cplx are both 7200.

    array[3] matrix[2, n_sub] re_sid_p0;
    array[3] matrix[2, n_sub] re_sid_delta;
    array[3] matrix[2, n_sub] re_sid_alpha;
    array[3] vector[nstim_simp] re_stim_simp;
    array[3] vector[nstim_cplx] re_stim_cplx;

    // likelihood temp variables
    array[3] vector[n_simp] vec_simp;
    array[3] vector[n_cplx] vec_cplx;

    // v: variable, 1:3
    // s: simple/complex 1:2
    // sid: subject id, 1:n_sub
    
    array[3, 2] vector[n_sub] p0;
    array[3, 2] vector[n_sub] delta;
    array[3, 2] vector<lower=0>[n_sub] alpha;

I’m hoping to figure out a way to speed the model up. I have plenty of memory and processors, but I would have to substantially restructure the code to use something like reduce_sum. If that’s the best answer, so be it, but I was hoping I was doing something really inefficient that would be obvious to the experts on here.

Any suggestions?

I went ahead down the reduce_sum route, but my code might still benefit from any efficiency gains that are to be had in the chunk included in the previous post. We’ll see how that goes.

Maybe special functions like expm1 and fma could be tried? Also, turn on STAN_NO_RANGE_CHECKS to get rid of range checks (once you know the model works).
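For the range checks, one way to do that (assuming you build through CmdStan; treat this as a sketch) is to set the flag in make/local and rebuild the model:

    # <cmdstan>/make/local -- only once you trust the model's indexing
    STAN_NO_RANGE_CHECKS=true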


When you have operations that mix many large inputs, it can be more efficient to use loops. This is especially the case when working with array types (real[] and int[]).

As a simple example, consider adding two vectors after exponentiating each element:

vector[N] a = ...;
vector[N] b = ...;

vector[N] result = exp(a) + exp(b); 

Each exp call will traverse its input and apply the operation elementwise, for a total of 2 * N traversals (i.e., each element of a is exponentiated and then each element of b is exponentiated), and the addition then makes another pass over the results.

Having said that, this isn’t always the case with vector/matrix types, as Stan uses the lazy evaluation provided by the Eigen C++ library. This means that the generated code can recognise that the elementwise exponentials of the two vectors are being added, and make only N traversals to evaluate the whole expression.

However, this fusion is not always possible, especially when multiple operations are being applied. It is also not possible when the input types are arrays, as these do not use the Eigen C++ data structures and so cannot take advantage of the same lazy evaluation.

Long story short, try using a single loop instead of vectorised operations and see if that gives you any performance benefit.
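For the toy example above, the single-loop version would look like this (a minimal sketch, using the same N, a, and b):

    vector[N] result;
    for (n in 1:N) {
      // one pass over the data; no exp(a) / exp(b) temporaries are materialised
      result[n] = exp(a[n]) + exp(b[n]);
    }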


Thanks @andrjohns and @wds15 for the suggestions! First I tried the suggestion to loop, like:

...
            for (i in 1:n_simp) {
                loop_simp[v][i] = re_stim_simp[v][stim_id_simp[i]]
                            + p0[v, s, sid_simp[i]] 
                            + delta[v, s, sid_simp[i]]
                            .* (1 - exp(-alpha[v, s, sid_simp[i]].*t_simp[i]));
            }
...

That didn’t end up saving any time in this case (it was trivially slower than the vectorized version).

Then I turned to composed functions to see what effect that might have. The only one I could see that was applicable (although there might be some algebraic identity I am failing to see) was fma, like:

...
            vec_simp[v] = fma(
                delta[v, s, sid_simp],  
                (1 - exp(-alpha[v, s, sid_simp].*t_simp)), 
                re_stim_simp[v][stim_id_simp] + p0[v, s, sid_simp]
             );
...

That did speed up execution, but only by about 4 seconds (113s vs 117s), with the difference showing up exclusively in the “reverse_time” column of the profile output. I was a little surprised it compiled, because the relevant function reference does not indicate that fma is vectorized.

Turning on STAN_NO_RANGE_CHECKS looks like it saved another 5 seconds relative to the fma version.

The good news is that re-writing the model to use reduce_sum gave me a ~4x speedup. I can propagate the fma change into that version as well. I appreciate the help and advice!

Sometimes computations just take a while, but other times a simple change can dramatically reduce runtimes. Perhaps I’m just facing the former situation.

Isn’t expm1 usable here? 1 - exp(x) = -(exp(x) - 1) = -expm1(x). But sure… that may not be the bottleneck.
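Applied to your snippet, the term could be rewritten as something like this (just a sketch, reusing your variable names and the fma form you already tried):

    // 1 - exp(-alpha .* t)  is the same as  -expm1(-alpha .* t)
    vec_simp[v] = fma(
        delta[v, s, sid_simp],
        -expm1(-alpha[v, s, sid_simp] .* t_simp),
        re_stim_simp[v][stim_id_simp] + p0[v, s, sid_simp]
    );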

… but getting 4x with reduce_sum is quite good… how many cores do you use?

Also make sure to use the num_threads feature from CmdStan. This way you specify the total number of threads to use and the number of chains to run. The Intel TBB should then figure out how best to distribute the resources (early-finishing chains will free up resources and give them to still-running ones).

You’re right, of course, about expm1. I read it as exp(x - 1), but that was incorrect! I’ll try that next.

Regarding the reduce_sum, I’m using 10 cores per chain. Top shows four processes running at 800-900%, but I haven’t tested whether that’s because of, e.g., memory bottlenecks or whether another process on the machine is taking up the remainder. (Aside: for me, “quite good” is more-or-less equivalent to “very good”, but I know for other people it means something more like “not so good”.)

I’m using the cmdstanpy interface and calling with

    chains=4,
    threads_per_chain=10,

I assume (hope) that num_threads defaults to -1.

uhm… then the cmdstanX packages do not (yet) support what I mean. First: you should always run chains in parallel before running them with multiple threads. Parallel chains are more efficient than within-chain parallelism.

Assuming that you have 20 cores (say), it’s best to start CmdStan with the num_threads=20 argument. This tells CmdStan to use 20 threads and to run your 4 chains on all of them. How these are mapped to individual chains is left to the Intel TBB scheduler: fast-finishing chains free up resources for those that need longer to run. You would have to make the CmdStan call manually to get that feature, which we have had since 2.28… or @mitzimorris, did this feature make it into CmdStanPy yet?
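A sketch of such a call (the executable and file names are placeholders, and the model must have been compiled with STAN_THREADS=true):

    ./my_model sample num_chains=4 \
        data file=my_data.json \
        output file=output.csv \
        num_threads=20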

yes, this feature has been in CmdStanPy since release 1.0 - “Hello, World” — CmdStanPy 1.0.1 documentation

the logic is still too twisty for my taste and there’s not enough documentation.

  • to take advantage of TBB multithreading, the model must be instantiated or compiled with cpp_options={'STAN_THREADS':'true'}.

  • the sample method has arguments:

    • chains - default 4
    • parallel_chains - default num cores
    • threads_per_chain - default 1
    • force_one_process_per_chain - default False
  • threads_per_chain is intended to be used with reduce_sum and map_rect; prior to the introduction of the CmdStan argument num_chains, threads_per_chain maps to CmdStan’s num_threads argument.

  • with CmdStanPy 1.0, we added support for CmdStan’s num_chains argument; by default, CmdStan’s num_threads will then be set to chains * threads_per_chain.

this would be a little cleaner if CmdStanPy didn’t support older versions of CmdStan, and at some point we might be able to do away with the parallel_chains arg, as I find all this very confusing to explain.
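Putting those options together, a typical CmdStanPy call looks something like this (a sketch; the model and data file names are placeholders):

    from cmdstanpy import CmdStanModel

    # compile with threading enabled so the TBB scheduler can share threads
    model = CmdStanModel(
        stan_file='my_model.stan',
        cpp_options={'STAN_THREADS': 'true'},
    )

    fit = model.sample(
        data='my_data.json',
        chains=4,              # number of MCMC chains
        threads_per_chain=10,  # threads available to reduce_sum within each chain
    )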


Cool! That’s new to me, but great to read that this is available… to quote the doc you refer to:

As of version CmdStan 2.28, it is possible to run the NUTS-HMC sampler on multiple chains from within a single executable using threads. This has the potential to speed up sampling. It also reduces the overall memory footprint required for sampling, as all chains share the same copy of the input data. When using within-chain parallelization, all chains started within a single executable can share all the available threads, and once a chain finishes the threads will be reused.

That’s exactly why I advocate the feature so much!


I think I’m telling cmdstanpy to sample four chains in parallel, with each chain given 10 cores. I believe this makes use of all 40 cores I have available. In practice, I end up using 80-90% of the total CPU available on the computer. There are a lot of options to control this stuff, but I’m glad that the defaults usually work!


not sure this is ideal. How many cores do you have? 10?

As @mitzimorris writes, you need to use the num_chains and num_threads arguments from the CmdStan base installation. Maybe you can inspect the CmdStan call issued by CmdStanPy?

I have 40 cores. They are being nearly maxed out by my sampling.

I appreciate the advice to optimize my calculations, and I’m grateful for the efforts of developers to implement efficient parallel computations.

I haven’t yet tuned the grain size to see if that takes me from nearly maxed out to fully maxed out, but at this point, I am not sure what problem we are trying to solve.

Quick update: using expm1 resulted in a 10% speed-up relative to not using it. That’s a nice bump!