OpenCL with discrete distributions

With little experience in parallel programming, I’m currently trying to understand whether I should try to set up OpenCL on my Linux server to speed up the sampling of my model.

I found this section of the User’s Guide, and I understand that I won’t be able to get OpenCL speed gains for either the Dirichlet or categorical distribution computations that I use in my model.

However, at the same time I use some of the OpenCL implemented distributions for my priors. Until now, I have used threading with 4 parallel chains. Am I right to assume that for such a model with OpenCL enabled,

  • any implemented distribution on the list linked above would be computed using OpenCL, and
  • any other distribution would be computed without OpenCL …

… while enabling OpenCL would result in threading being disabled entirely as OpenCL and threading apparently cannot be used at the same time?

Would I expect speed gains for the OpenCL-implemented distributions on the one hand, but a general slowdown due to disabled threading on the other hand? Or would I even have to assume that OpenCL, in my case, will not bring any meaningful speed gain, because each model evaluation can only be as fast as its slowest (not implemented) element?

And out of curiosity - why won’t Stan let me run some distribution computations in parallel using OpenCL?

Thanks!

Stan has several OpenCL implementations for discrete distributions. Here’s the complete list:

https://mc-stan.org/docs/stan-users-guide/parallelization.html#opencl

You can see that it includes discrete distributions like Bernoulli, binomial, and Poisson, and is missing some continuous distributions like the Dirichlet and multivariate normal. This is just a matter of where the devs have focused effort. @stevebronder should know more, as he’s been doing a lot of this coding.

Thanks for the clarification, corrected in original post.

Then it would be interesting to know how performance is affected when a model uses a mix of already-implemented and not-yet-implemented distributions.

I tested different configurations on a VM (host machine: Dell R740, 2* Intel Xeon Silver 4116 CPU @ 2.10GHz, 8vCores) with Ubuntu 24.04 LTS, intel-opencl-icd, intel-oneapi-runtime-opencl, R, cmdstanr.
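For reference, the OpenCL runs can be set up via cmdstanr’s documented `cpp_options` and `opencl_ids` arguments. This is only a sketch of that workflow (the file names `model.stan` and the data list are placeholders, not the exact commands used in the test):

```r
library(cmdstanr)

# Compile with the OpenCL backend enabled (sets STAN_OPENCL during compilation)
mod <- cmdstan_model("model.stan", cpp_options = list(stan_opencl = TRUE))

# Sample, pointing Stan at OpenCL platform 0, device 0 (cf. the clinfo output below)
fit <- mod$sample(
  data = stan_data,         # placeholder: list of N, J, I, j_n, i_n, y
  chains = 4,
  parallel_chains = 4,
  opencl_ids = c(0, 0)      # c(platform index, device index)
)
```

Leaving out `cpp_options` and `opencl_ids` gives the corresponding no-OpenCL baseline.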

*****@*********:~$ clinfo -l
Platform #0: Intel(R) OpenCL
 `-- Device #0: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz

Test model (using only OpenCL supported distributions):

data {
  int<lower=0> N;   // number of observations
  int<lower=0> J;   // number of individuals (subjects)
  int<lower=0> I;   // number of elections (items)
  array[N] int j_n; // mapping observation to individual
  array[N] int i_n; // mapping observation to election
  array[N] int y;   // outcome vector
}
parameters {
  vector<lower=0>[I] lambda;    // election specific amplification effect
  vector[J] theta;              // latent individual voting propensity 
  vector[I] alpha;              // election specific effect
  real mu_y;                    // global mean
  real<lower=0> sigma_lambda;   // sigma for election specific amplification effect
  real<lower=0> sigma_alpha;    // sigma for election specific effect
}
model {
  // priors
  theta ~ normal(0,1);
  lambda ~ lognormal(0,sigma_lambda);
  alpha ~ student_t(3, mu_y, sigma_alpha);
  mu_y ~ student_t(3, 0, 1);
  sigma_lambda ~ student_t(3, 0, 1);
  sigma_alpha ~ student_t(3, 0, 1);
  // likelihood
  y ~ bernoulli_logit(lambda[i_n] .* theta[j_n] + alpha[i_n]);
}

Test data has N=8979, I=14, J=1000.
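The actual test data is not included in the thread; a hypothetical simulation with the stated dimensions, matching the generative structure of the model above, might look like:

```r
# Hypothetical test data with N = 8979, I = 14, J = 1000
# (illustrative only; all parameter values below are assumptions)
set.seed(42)
N <- 8979; I <- 14; J <- 1000
j_n <- sample(1:J, N, replace = TRUE)          # observation -> individual
i_n <- sample(1:I, N, replace = TRUE)          # observation -> election
theta  <- rnorm(J)                             # latent voting propensities
alpha  <- rnorm(I)                             # election-specific effects
lambda <- rlnorm(I, 0, 0.5)                    # amplification effects
eta <- lambda[i_n] * theta[j_n] + alpha[i_n]
y <- rbinom(N, 1, plogis(eta))                 # Bernoulli-logit outcomes
stan_data <- list(N = N, J = J, I = I, j_n = j_n, i_n = i_n, y = y)
```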

No OpenCL, 4 parallel chains: 36.7 seconds.
OpenCL, 4 parallel chains: 592.5 seconds.
OpenCL, 4 sequential chains: 697.0 seconds.

I can see 100% vCPU load on 4 vCores without OpenCL support, and on all 8 vCores with it.

I’m trying to make sense of what I observe, but I struggle to see why the same model compiled with OpenCL would underperform so badly. From what I’ve read elsewhere, other users have consistently seen at least slight performance gains, which led them to enable OpenCL by default whenever available.

I suspect either a configuration problem or a misunderstanding of the use case on my side…

Note that you’re using CPU OpenCL here, so the workload can only be parallelised across the number of cores that are available - which is markedly fewer than the number of compute units in a GPU.

I would guess that the additional overhead of moving parameters/data in and out of the OpenCL context is outweighing the benefits of parallelism.

That’s pretty striking! My guess is that some resource is bottlenecked. How much contention is there for the CPU and memory in the host machine? I’m pretty sure Stan uses the Intel TBB thread pooling library in the installation, which may be used by other processes.

@stevebronder might know what’s up.

Hi Joffdd! Thanks for taking a look at the OpenCL backend. Sadly, parallelization is not free, and what you’re seeing here is most likely the overhead OpenCL incurs when going between the host and the device. For the simple model here, you would need a lot more data before the cost of setting up the parallelization becomes worth it. Even when the host and device are the same device, most heterogeneous compute libraries assume you are transferring from one device to another, so you still pay the cost of copying memory even with one device.

Since we originally wrote the OpenCL backend, there are some new ways to have the host and device share memory when available. I have my hands full atm, but at some point I would like to come back to that.

-S.