Reduce_sum results in much slower run times, even for large datasets

I’m seeing poor performance using reduce_sum to parallelize the computation of the likelihood (much worse using 4 threads vs 1) and this problem persists even in high dimensions. I’m hoping to gain some insight on this behavior and hopefully improve the threaded performance.

Toy data

This isn’t my actual application, but it’s a simple model that produces the same problematic results. The generative model here is

## variance components:
v_beta, v_resid ~ Cauchy priors
## M-dimensional latent random effects
beta | var components ~ iid Gaussian(0, sqrt(v_beta / M))
## N-dimensional outcome
y | beta, var components ~ indep Gaussian(X beta, sqrt(v_resid))

where X is a deterministic N x M design matrix with standardized columns. A toy dataset can be generated in R using

## generate toy data N=500;M=200
set.seed(1)
N=500
M=200
v_resid=.6
v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)

Stan code

Here is my Stan implementation. The likelihood of y | . portion of the model is the part I will be parallelizing. The problem size here scales with N. There are two possible lpdf evaluations here, one of which can be commented out for the sequential version and one for the parallel version.

functions {
 real partial_sum(array[] real y_slice,
                      int start, int end,
                      real resid_sd_par,
                      array[] real linpred) {
    return normal_lpdf(y_slice | linpred[start:end], resid_sd_par);
  }
}

data {
  int<lower=0> N;  // number of individuals
  int<lower=0> M;  // number of predictors
  array[N] real y; // outcome
  matrix[N,M] X;   // predictors
  int<lower=1> grainsize;
}

parameters {
  real<lower=0> v_beta;  // expected SS of random effects
  real<lower=0> v_resid; // residual variance component
  vector[M] beta;        // latent random effects
}

transformed parameters {
  real sdb = sqrt(v_beta/M);            // sd of each beta
  real resid_sd=sqrt(v_resid);          // noise standard dev
  array[N] real Xb=to_array_1d(X*beta); // linear predictors
}

model {
  profile("priors") {
    // priors on variance components
    target +=  cauchy_lpdf(v_beta | 0, 1.0);
    target +=  cauchy_lpdf(v_resid | 0, 1.0);
  }
  profile("beta_lpdf"){
    // effects | genotypes, variance components
    target += normal_lpdf(beta | 0, sdb);
  }
  profile("likelihood of y | .") {
    // ONE OF THESE WILL BE COMMENTED OUT
    // lmm_test.par.stan version:
    target += reduce_sum(partial_sum, y, grainsize, resid_sd, Xb);
    // lmm_test.seq.stan version:
    target +=  normal_lpdf(y | Xb, resid_sd);
  }
}

The only difference between the two models is whether I call normal_lpdf or reduce_sum.

Performance

Despite more CPU usage in the parallel code, the model is much slower. Here are timings increasing N from 500 to 5,000 to 20,000:

## compile models
mod_par <- cmdstan_model('lmm_test.par.stan', cpp_options = list(stan_threads = TRUE))
mod_seq <- cmdstan_model('lmm_test.seq.stan', cpp_options = list(stan_threads = FALSE))

## sequential model N=500, M=200
# 1.8 seconds, CPU usage up to 100%
mod_fit_seq <- mod_seq$sample(data=stan_data,threads_per_chain=1,chains=1) 
## parallel model N=500, M=200
# 8 seconds, CPU usage up to 392%
mod_fit_par <- mod_par$sample(data=stan_data,threads_per_chain=4,chains=1)

## generate toy data N=5000; M=200
set.seed(1);N=5000;M=200;v_resid=.6;v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)

## sequential model N=5000, M=200
# 20.3 seconds, CPU usage up to 100%
mod_fit_seq <- mod_seq$sample(data=stan_data,threads_per_chain=1,chains=1) 
## parallel model N=5000, M=200
# 96.9 seconds, CPU usage up to 320%
mod_fit_par <- mod_par$sample(data=stan_data,threads_per_chain=4,chains=1)
## sequential model N=500, M=200

## generate toy data N=20000; M=200; fewer iterations because im impatient
set.seed(1);N=20000;M=200;v_resid=.6;v_beta=.4
X <- apply(matrix(rnorm(N*M),N,M),2,scale)
beta <- rnorm(M,0,sqrt(v_beta/M))
resid <- rnorm(N,0,sqrt(v_resid))
y <- as.vector(X%*%beta + resid)
stan_data <- list(M=M, N=N, y=y, X=X, grainsize=1)

## sequential model N=20000, M=200
# 33.3 seconds, CPU usage up to 100%
mod_fit_seq <- mod_seq$sample(data=stan_data,threads_per_chain=1,chains=1,
    iter_warmup=400, iter_sampling=0) 
## parallel model N=20000, M=200
# 97.2 seconds, CPU usage up to 316%
mod_fit_par <- mod_par$sample(data=stan_data,threads_per_chain=4,chains=1,
    iter_warmup=400, iter_sampling=0) 

The profiles look like this:

## comparing the profiles:
mod_fit_seq$profiles()
[[1]]
                 name       thread_id total_time forward_time reverse_time
1           beta_lpdf 140177091551360 0.02311280   0.01950600  0.003606730
2 likelihood of y | . 140177091551360 1.26541000   1.01081000  0.254603000
3              priors 140177091551360 0.00609434   0.00547151  0.000622826
  chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1        9329              0           9329                 1
2        9329              0           9329                 1
3       18658              0           9329                 1

mod_fit_par$profiles()
[[1]]
                 name       thread_id  total_time forward_time reverse_time
1           beta_lpdf 140237382762624  0.02608020    0.0184004  0.007679830
2 likelihood of y | . 140237382762624 69.50300000   69.2208000  0.282135000
3              priors 140237382762624  0.00761505    0.0068122  0.000802849
  chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1        7437              0           7437                 1
2        7437              0           7437                 1
3       14874              0           7437                 1

Increasing N seems to mitigate the problem only slightly, which suggests to me that this isn’t about overhead. Curiously, the computational intensity, as eyeballed by examining my CPU usage on an Skylake machine with 4 physical cores, appears to modestly decline as problem size goes up.

For what it’s worth, playing around with grainsize doesn’t seem to help, though I’ll spare you the sea of output.

Am I making an obvious mistake here? Any insights are greatly appreciated!

reduce_sum shines when you can pack a large amount of computation inside partial_sum.

In this example, the only computation that happens in parallel is a vectorized call to normal_lpdf, which should already be super fast. My guess is that any speedup by doing this in parallel is simply outweighed by the overhead. You could do a bit better by putting the computation of the linear predictor inside partial_sum, but probably this still wouldn’t be worth it in this case (especially compared to using normal_id_glm).

1 Like

When working with reduce_sum, any arguments that aren’t ‘sliced’ get copied in full to every core/thread. This copy cost is much greater for parameters than for data (since both the value and the gradient need to get copied).

When calling: array[N] real Xb=to_array_1d(X*beta); // linear predictors, the entire N values are upgraded to parameters. It will probably be more efficient to copy X and beta separately, so that only M values are parameters.

And to second @jsocolar’s point, reduce_sum needs the complexity of the likelihood to outweigh the overhead of parallelism. If you have access to a GPU, you’ll likely get much better performance by using normal_id_glm with opencl

2 Likes

@jsocolar That makes sense, but I was expecting things to not be so catostrophic in terms of overhead when I made N big enough.

The two biggest factors here seems to be:

  • @andrjohns 's comment re: the overhead associated with constructing N parameters, which is easily avoided,
  • @jsocolar 's suggestion of moving the linear predictor computation inside the reduce_sum, which both distributes more computation and gets around the unecessary parameter issue

Implementing the same model but defining:

functions {
 real partial_sum(array[] real y_slice,
                      int start, int end,
                      real resid_sd_par,
                      matrix X,
                      vector beta, int M, int N) {
    vector[end-start+1] linpred = X[start:end,:]*beta;
    return normal_lpdf(y_slice | linpred, resid_sd_par);
  }
}

made a huge difference! Now, the parallel version takes ~16 seconds for the N=20,000 case, versus ~33 for the sequential version, and ~97 for the old parallel version!

Side note: the way I’ve implemented slicing X into chunks for parallel computation of the linear predictor above wrorks, but it feels clumsy and fails unless I manually set grainsize > 1. Perhaps someone could suggest a more elegant solution?

Also, since I’m essentially doing a map-reduce at this point by chunking X, would it make more sense to implement this using map_rect?

Thanks for the help!

2 Likes

There are also two further changes you can make here for optimisation. Using the normal_id_glm likelihood has more efficient gradients/constructions for each of the individual parameters (X, beta, and resid_sd_par) and so will (in almost all cases) sample more efficiently. Additionally, this allows Stan to calculate the gradients only with respect to beta during the likelihood construction, as the X*beta multiplication upgrades the entire vector to a parameter which needs gradients calculated for each element. Similar to the reduction in copy-cost, this means that Stan will only calculate the gradients for the M beta parameters, rather than for the N entries in the X*beta vector.

Finally, if you don’t need to include normalising constants (e.g., if you’re not using loo), then you can use the un-normalised density, which will reduce the amount of computation needed.

Putting that all together, your partial_sum function would then look like:

functions {
 real partial_sum(array[] real y_slice,
                      int start, int end,
                      real resid_sd_par,
                      matrix X,
                      vector beta, int M, int N) {
    // Include a dummy 0 for the intercept
    // Change the suffix to _lpdf if you need normalising constants
    return normal_id_glm_lupdf(y_slice | X[start:end,:], 0, beta, resid_sd_par);
  }
}

@jscolar has rightly pointed out that the need for normalising constants is for model comparison via Bayes factors, rather than loo, so the _lupdf suffix would be recommended if you’re not using Bayes factors.

1 Like

Oh almost forgot! It would also be a good idea to try out the new Stanc optimisation flag (O1), this will enable a more efficient handling of large matrix parameters which could also help here.

More details in the blog post here: Release of CmdStan 2.29 – The Stan Blog

I tried these changes but didn’t see any performance improvement. The cflags O1, O0, and Oexperimental all yielded identical results for both the sequential and multithreaded variants, as did replacing lpdf terms with lupdf terms.

Swapping normal_lpdf for normal_id_glm_lpdf didn’t help either, though I think that might be because I had to code to add in a call to to_vector:

 real partial_sum_lpdf(array[] real y_slice,
                      int start, int end,
                      real resid_sd_par,
                      matrix X,
                      vector beta, int M, int N) {
    return normal_id_glm_lpdf(to_vector(y_slice) | X[start:end,:], 0, beta, resid_sd_par);
  }

It seems that reduce_sum needs the sliced variable to be an array, but normal_id_glm_lupdf needs a scalar or vector. Is there a nice solution to this? I could always add in an explicit loop over the values of y_slice and hope the compiler makes up for it, but this doesn’t feel like a very clean option.