# Reduce_sum results in much slower run times, even for large datasets

I’m seeing poor performance using `reduce_sum` to parallelize the computation of the likelihood (much worse using 4 threads vs 1) and this problem persists even in high dimensions. I’m hoping to gain some insight on this behavior and hopefully improve the threaded performance.

## Toy data

This isn’t my actual application, but it’s a simple model that produces the same problematic results. The generative model here is

``````## variance components:
v_beta, v_resid ~ Cauchy priors
## M-dimensional latent random effects
beta | var components ~ iid Gaussian(0, sqrt(v_beta / M))
## N-dimensional outcome
y | beta, var components ~ indep Gaussian(X beta, sqrt(v_resid))
``````

where `X` is a deterministic `N` x `M` design matrix with standardized columns. A toy dataset can be generated in R using

``````## generate toy data N=500;M=200
set.seed(1)
N=500
M=200
v_resid=.6
v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)
``````

## Stan code

Here is my Stan implementation. The `likelihood of y | .` portion of the model is the part I will be parallelizing. The problem size here scales with `N`. There are two possible lpdf evaluations here, one of which can be commented out for the sequential version and one for the parallel version.

``````functions {
real partial_sum(array[] real y_slice,
int start, int end,
real resid_sd_par,
array[] real linpred) {
return normal_lpdf(y_slice | linpred[start:end], resid_sd_par);
}
}

data {
int<lower=0> N;  // number of individuals
int<lower=0> M;  // number of predictors
array[N] real y; // outcome
matrix[N,M] X;   // predictors
int<lower=1> grainsize;
}

parameters {
real<lower=0> v_beta;  // expected SS of random effects
real<lower=0> v_resid; // residual variance component
vector[M] beta;        // latent random effects
}

transformed parameters {
real sdb = sqrt(v_beta/M);            // sd of each beta
real resid_sd=sqrt(v_resid);          // noise standard dev
array[N] real Xb=to_array_1d(X*beta); // linear predictors
}

model {
profile("priors") {
// priors on variance components
target +=  cauchy_lpdf(v_beta | 0, 1.0);
target +=  cauchy_lpdf(v_resid | 0, 1.0);
}
profile("beta_lpdf"){
// effects | genotypes, variance components
target += normal_lpdf(beta | 0, sdb);
}
profile("likelihood of y | .") {
// ONE OF THESE WILL BE COMMENTED OUT
// lmm_test.par.stan version:
target += reduce_sum(partial_sum, y, grainsize, resid_sd, Xb);
// lmm_test.seq.stan version:
target +=  normal_lpdf(y | Xb, resid_sd);
}
}
``````

The only difference between the two models is whether I call `normal_lpdf` or `reduce_sum`.

## Performance

Despite more CPU usage in the parallel code, the model is much slower. Here are timings increasing `N` from 500 to 5,000 to 20,000:

``````## compile models
mod_par <- cmdstan_model('lmm_test.par.stan', cpp_options = list(stan_threads = TRUE))
mod_seq <- cmdstan_model('lmm_test.seq.stan', cpp_options = list(stan_threads = FALSE))

## sequential model N=500, M=200
# 1.8 seconds, CPU usage up to 100%
## parallel model N=500, M=200
# 8 seconds, CPU usage up to 392%

## generate toy data N=5000; M=200
set.seed(1);N=5000;M=200;v_resid=.6;v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)

## sequential model N=5000, M=200
# 20.3 seconds, CPU usage up to 100%
## parallel model N=5000, M=200
# 96.9 seconds, CPU usage up to 320%
## sequential model N=500, M=200

## generate toy data N=20000; M=200; fewer iterations because im impatient
set.seed(1);N=20000;M=200;v_resid=.6;v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)

## sequential model N=20000, M=200
# 33.3 seconds, CPU usage up to 100%
iter_warmup=400, iter_sampling=0)
## parallel model N=20000, M=200
# 97.2 seconds, CPU usage up to 316%
iter_warmup=400, iter_sampling=0)
``````

The profiles look like this:

``````## comparing the profiles:
mod_fit_seq\$profiles()
[[1]]
1           beta_lpdf 140177091551360 0.02311280   0.01950600  0.003606730
2 likelihood of y | . 140177091551360 1.26541000   1.01081000  0.254603000
3              priors 140177091551360 0.00609434   0.00547151  0.000622826
chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1        9329              0           9329                 1
2        9329              0           9329                 1
3       18658              0           9329                 1

mod_fit_par\$profiles()
[[1]]
1           beta_lpdf 140237382762624  0.02608020    0.0184004  0.007679830
2 likelihood of y | . 140237382762624 69.50300000   69.2208000  0.282135000
3              priors 140237382762624  0.00761505    0.0068122  0.000802849
chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1        7437              0           7437                 1
2        7437              0           7437                 1
3       14874              0           7437                 1
``````

Increasing N seems to mitigate the problem only slightly, which suggests to me that this isn’t about overhead. Curiously, the computational intensity, as eyeballed by examining my CPU usage on an Skylake machine with 4 physical cores, appears to modestly decline as problem size goes up.

For what it’s worth, playing around with `grainsize` doesn’t seem to help, though I’ll spare you the sea of output.

Am I making an obvious mistake here? Any insights are greatly appreciated!

`reduce_sum` shines when you can pack a large amount of computation inside `partial_sum`.

In this example, the only computation that happens in parallel is a vectorized call to `normal_lpdf`, which should already be super fast. My guess is that any speedup by doing this in parallel is simply outweighed by the overhead. You could do a bit better by putting the computation of the linear predictor inside `partial_sum`, but probably this still wouldn’t be worth it in this case (especially compared to using `normal_id_glm`).

1 Like

When working with `reduce_sum`, any arguments that aren’t ‘sliced’ get copied in full to every core/thread. This copy cost is much greater for parameters than for data (since both the value and the gradient need to get copied).

When calling: `array[N] real Xb=to_array_1d(X*beta); // linear predictors`, the entire `N` values are upgraded to parameters. It will probably be more efficient to copy `X` and `beta` separately, so that only `M` values are parameters.

And to second @jsocolar’s point, `reduce_sum` needs the complexity of the likelihood to outweigh the overhead of parallelism. If you have access to a GPU, you’ll likely get much better performance by using `normal_id_glm` with opencl

2 Likes

@jsocolar That makes sense, but I was expecting things to not be so catostrophic in terms of overhead when I made `N` big enough.

The two biggest factors here seems to be:

• @andrjohns 's comment re: the overhead associated with constructing `N` parameters, which is easily avoided,
• @jsocolar 's suggestion of moving the linear predictor computation inside the `reduce_sum`, which both distributes more computation and gets around the unecessary parameter issue

Implementing the same model but defining:

``````functions {
real partial_sum(array[] real y_slice,
int start, int end,
real resid_sd_par,
matrix X,
vector beta, int M, int N) {
vector[end-start+1] linpred = X[start:end,:]*beta;
return normal_lpdf(y_slice | linpred, resid_sd_par);
}
}
``````

made a huge difference! Now, the parallel version takes ~16 seconds for the N=20,000 case, versus ~33 for the sequential version, and ~97 for the old parallel version!

Side note: the way I’ve implemented slicing `X` into chunks for parallel computation of the linear predictor above wrorks, but it feels clumsy and fails unless I manually set `grainsize` > 1. Perhaps someone could suggest a more elegant solution?

Also, since I’m essentially doing a map-reduce at this point by chunking `X`, would it make more sense to implement this using `map_rect`?

Thanks for the help!

2 Likes

There are also two further changes you can make here for optimisation. Using the `normal_id_glm` likelihood has more efficient gradients/constructions for each of the individual parameters (`X`, `beta`, and `resid_sd_par`) and so will (in almost all cases) sample more efficiently. Additionally, this allows Stan to calculate the gradients only with respect to `beta` during the likelihood construction, as the `X*beta` multiplication upgrades the entire vector to a parameter which needs gradients calculated for each element. Similar to the reduction in copy-cost, this means that Stan will only calculate the gradients for the `M` `beta` parameters, rather than for the `N` entries in the `X*beta` vector.

Finally, if you don’t need to include normalising constants (e.g., if you’re not using `loo`), then you can use the un-normalised density, which will reduce the amount of computation needed.

Putting that all together, your `partial_sum` function would then look like:

``````functions {
real partial_sum(array[] real y_slice,
int start, int end,
real resid_sd_par,
matrix X,
vector beta, int M, int N) {
// Include a dummy 0 for the intercept
// Change the suffix to _lpdf if you need normalising constants
return normal_id_glm_lupdf(y_slice | X[start:end,:], 0, beta, resid_sd_par);
}
}
``````

@jscolar has rightly pointed out that the need for normalising constants is for model comparison via Bayes factors, rather than `loo`, so the `_lupdf` suffix would be recommended if you’re not using Bayes factors.

1 Like

Oh almost forgot! It would also be a good idea to try out the new Stanc optimisation flag (`O1`), this will enable a more efficient handling of large matrix parameters which could also help here.

More details in the blog post here: Release of CmdStan 2.29 – The Stan Blog

I tried these changes but didn’t see any performance improvement. The cflags `O1`, `O0`, and `Oexperimental` all yielded identical results for both the sequential and multithreaded variants, as did replacing `lpdf` terms with `lupdf` terms.

Swapping `normal_lpdf` for `normal_id_glm_lpdf` didn’t help either, though I think that might be because I had to code to add in a call to `to_vector`:

`````` real partial_sum_lpdf(array[] real y_slice,
int start, int end,
real resid_sd_par,
matrix X,
vector beta, int M, int N) {
return normal_id_glm_lpdf(to_vector(y_slice) | X[start:end,:], 0, beta, resid_sd_par);
}
``````

It seems that `reduce_sum` needs the sliced variable to be an array, but `normal_id_glm_lupdf` needs a scalar or vector. Is there a nice solution to this? I could always add in an explicit loop over the values of `y_slice` and hope the compiler makes up for it, but this doesn’t feel like a very clean option.