CmdStan OpenCL GPU problems and wiki page

Oh, I tested that, too, but then it just stays forever showing

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 

so I assumed that version is not supported. Here’s part of the hpp:

        lp_accum__.add(
          bernoulli_logit_glm_lpmf<false>(Y_opencl__, X_opencl__, Intercept,
            b));

While I was writing this it did proceed to iteration 100, but it seems to be about 10x slower than the CPU version.

With n=54,p=1536 or n=1e4,p=100?

@tadej do you have a feeling for the n and p at which a moderate GPU outperforms the CPU for bernoulli_logit_glm_lpmf?

Currently the latter (n=1e4,p=100), as then the model is simpler.

With GPU

Gradient evaluation took 0.082194 seconds
1000 transitions using 10 leapfrog steps per transition would take 821.94 seconds

With CPU

Gradient evaluation took 0.001563 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.63 seconds.
Adjust your expectations accordingly!

At least I’ve been able to make some progress and see a 53x speed difference between CPU and GPU! If only it were in the other direction.


I have a cold and I’m not able to focus on my actual work, so instead I’m doing this kind of mindless testing.

The gradient evaluation number for the GPU version is not relevant as it also counts JIT compiling and data transfers.

With these n and p you probably won’t see blazing speedups, but it should not be that bad…

It’s very appreciated! You’ve already done a lot of work here, so no worries if you don’t have the time for this, but when you do: does clinfo also show other platforms with CPUs available? The device you’re using seems pretty good, but I’m wondering if the transfer cost from the CPU to the GPU is the issue.
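
If it helps, here’s a little standalone C++ listing (just a sketch, nothing to do with Stan’s internals) that prints every OpenCL platform and device it finds, basically what clinfo -l shows; the platform/device indices should be the ones opencl_platform_id / opencl_device_id expect, since Stan Math picks devices by platform and device index:

// Sketch only: list OpenCL platforms and devices, similar to `clinfo -l`.
// Build with e.g.:  g++ list_cl.cpp -lOpenCL
#include <CL/cl.h>
#include <iostream>
#include <vector>

int main() {
  cl_uint n_platforms = 0;
  clGetPlatformIDs(0, nullptr, &n_platforms);
  std::vector<cl_platform_id> platforms(n_platforms);
  clGetPlatformIDs(n_platforms, platforms.data(), nullptr);

  for (cl_uint p = 0; p < n_platforms; ++p) {
    char pname[256];
    clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);
    std::cout << "Platform " << p << ": " << pname << "\n";

    cl_uint n_devices = 0;
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &n_devices);
    std::vector<cl_device_id> devices(n_devices);
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, n_devices, devices.data(), nullptr);

    for (cl_uint d = 0; d < n_devices; ++d) {
      char dname[256];
      clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
      std::cout << "  Device " << d << ": " << dname << "\n";
    }
  }
}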

@rok_cesnovar @tadej I wonder if there is some dirty trick we can do here, like catching CL_OUT_OF_HOST_MEMORY when compiling the OpenCL code, waiting a few seconds while checking memory, and continuing compilation once enough memory opens up.
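
Something along these lines is what I have in mind; this is just a sketch of the idea, not how Stan Math actually builds its kernels, and build_with_retry is a made-up name:

// Sketch of the "dirty trick" idea (not actual Stan Math code):
// retry the OpenCL JIT compile a few times when the host is out of memory.
#include <CL/cl.h>
#include <chrono>
#include <stdexcept>
#include <string>
#include <thread>

// program/device would come from Stan Math's opencl_context in practice.
inline void build_with_retry(cl_program program, cl_device_id device,
                             const char* options, int max_retries = 5) {
  for (int attempt = 0; attempt <= max_retries; ++attempt) {
    cl_int err = clBuildProgram(program, 1, &device, options, nullptr, nullptr);
    if (err == CL_SUCCESS) {
      return;  // JIT compile succeeded
    }
    if (err != CL_OUT_OF_HOST_MEMORY || attempt == max_retries) {
      throw std::runtime_error("clBuildProgram failed with error "
                               + std::to_string(err));
    }
    // Host ran out of memory mid-compile: wait and hope some frees up
    // (e.g. another chain finishes its own JIT compile). A real version
    // could also check free memory here before retrying.
    std::this_thread::sleep_for(std::chrono::seconds(2));
  }
}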

It’s not that bad, it’s worse. While the gradient evaluation was 53x slower, the actual sampling is 100x slower.

Aki sent me some code for this:

SEED <- 1655
set.seed(SEED)
n <- 1e4
k <- 100
x <- rnorm(n)
xn <- matrix(rnorm(n*(k-1)),nrow=n)
a <- 2
b <- 3
sigma <- 1
y <- as.numeric(a + b*x + sigma*rnorm(n) > 0)
fake <- data.frame(x, xn, y)
data<-list(N=n,Y=y,X=cbind(x,xn),K=k)

library(cmdstanr)
modelc1_gpu = cmdstan_model("bern_glm.stan", quiet = FALSE, opencl = TRUE,
 # Change these for your device from running > clinfo -l
 opencl_platform_id = 2, opencl_device_id = 1,
  compiler_flags = "CXXFLAGS+=-O3 -mtune=native -march=native"); 

fit1_gpu = modelc1_gpu$sample(data = data, num_warmup = 100,
 num_samples = 300, num_chains = 2, num_cores = 1)

modelc1_cpu = cmdstan_model("bern_glm.stan", quiet = FALSE,
  compiler_flags = "CXXFLAGS+=-O3 -mtune=native -march=native"); 

fit1_cpu = modelc1_cpu$sample(data = data, num_warmup = 100,
 num_samples = 300, num_chains = 2, num_cores = 1)

and bern_glm.stan is:

data {
  int<lower=1> N;  // number of observations
  int Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
}
parameters {
  vector[K] b;  // population-level effects
  real Intercept;
}
model {
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 0, 10);
  // likelihood including all constants
  target += bernoulli_logit_glm_lpmf(Y | X, Intercept, b);
}

I’m running with a 1080 Ti. For n=1e4 and p=100, this is what I’m seeing for the GPU:

Running ./bern_glm 'id=1' random 'seed=1868491443' data 'file=/tmp/Rtmp5316H6/standata-1d2005c1ab1e6.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281216-1-b720e3.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 1 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 1 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 1 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 1 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 1 finished in 2.4 seconds.
Running ./bern_glm 'id=2' random 'seed=1502212411' data 'file=/tmp/Rtmp5316H6/standata-1d2005c1ab1e6.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281216-2-b720e3.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 2 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 2 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 2 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 2 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 2 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 2 finished in 2.2 seconds.

And for the CPU:

Running ./bern_glm 'id=1' random 'seed=2013950296' data 'file=/tmp/Rtmp5316H6/standata-1d20056a011bf.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281241-1-1621ed.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 1 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 1 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 1 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 1 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 1 finished in 3.2 seconds.
Running ./bern_glm 'id=2' random 'seed=477895878' data 'file=/tmp/Rtmp5316H6/standata-1d20056a011bf.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281241-2-1621ed.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 2 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 2 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 2 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 2 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 2 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 2 finished in 2.9 seconds.

Both chains finished succesfully.
Mean chain execution time: 3.1 seconds.
Total execution time: 6.2 seconds.

Running the chains for longer would probably have a more noticeable effect

For n=1e5 I’m seeing

Running MCMC with 2 chain(s) on 1 core(s)...

Running ./bern_glm 'id=1' random 'seed=1949552557' data 'file=/tmp/Rtmp5316H6/standata-1d20074342ff6.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281252-1-70faab.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 1 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 1 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 1 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 1 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 1 finished in 11.4 seconds.
Running ./bern_glm 'id=2' random 'seed=1673666080' data 'file=/tmp/Rtmp5316H6/standata-1d20074342ff6.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281252-2-70faab.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 2 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 2 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 2 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 2 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 2 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 2 finished in 11.4 seconds.

Both chains finished succesfully.
Mean chain execution time: 11.4 seconds.
Total execution time: 23.3 seconds.

And with the CPU:

fit1 = modelc1$sample(data = data, num_warmup = 100, num_samples = 300, num_chains = 2, num_cores = 1)
Running MCMC with 2 chain(s) on 1 core(s)...

Running ./bern_glm 'id=1' random 'seed=1617602711' data 'file=/tmp/Rtmp5316H6/standata-1d20078dda60c.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281255-1-c7e051.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 1 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 1 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 1 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 1 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 1 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 1 finished in 38.0 seconds.
Running ./bern_glm 'id=2' random 'seed=545479996' data 'file=/tmp/Rtmp5316H6/standata-1d20078dda60c.dat' output \
  'file=/tmp/Rtmp5316H6/bern_glm-202001281255-2-c7e051.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
  'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration:   1 / 400 [  0%]  (Warmup) 
Chain 2 Iteration: 100 / 400 [ 25%]  (Warmup) 
Chain 2 Iteration: 101 / 400 [ 25%]  (Sampling) 
Chain 2 Iteration: 200 / 400 [ 50%]  (Sampling) 
Chain 2 Iteration: 300 / 400 [ 75%]  (Sampling) 
Chain 2 Iteration: 400 / 400 [100%]  (Sampling) 
Chain 2 finished in 37.9 seconds.

Both chains finished succesfully.
Mean chain execution time: 38.0 seconds.
Total execution time: 76.3 seconds.

So about 3x faster on the GPU

Aki, I’m not sure exactly where the issues you’re seeing are coming from. Shooting from the hip, I’d say it’s one or a mix of:

  1. The driver needing an update. But it’s only a year old, so that seems like an odd source.
  2. The virtual environment, though I’m not sure exactly what about it would make data transfers so slow. There aren’t even a lot of data transfers in this scenario.
  3. Our docs for setting up GPU stuff are bad and you may have goofed something by accident because of that. Though I don’t see how that would cause a 100x slowdown, the docs are something we need to update anyway. I think we should remove the old GPU wikis on Stan Math and CmdStan and just use the doxygen page as a single source of instructions.

Also, I was able to run this with 8 cores and 8 chains with no issues, so that’s another mystery. Aki, is your compiler clang or g++?


gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

I can talk with our IT support when I’m back in the office. It’s probably easiest to sit down with one of them and go through this.

I tried running this also. n=10k, p=100 is a borderline input size for GPU GLMs I guess, as on my systems it’s only around 30% faster; 3x faster for n=2e4 and 6.5x for n=1e5.

The actual GPU model should not be a huge factor at these sizes, as this isn’t even near 100% occupancy.

I ran with 500/500 iterations as that seems closer to realistic iteration counts.

In general models that take around 2 seconds are not going to be faster on the GPU, as we have to pay the price of JIT, data transfers, etc. Also note that cmdstanr is currently using Rdump.

I just started with that and a small number of iterations to test that I can at least get it to run. The goal is to be able to run 4 chains in parallel for 1000+1000 iterations each with n>1e6.

What is the borderline if n<<p?

Sure, I’m interested in models+data (=posteriors) that currently take hours on my laptop.


These are some charts from @tadej back when we first added this to Stan Math.

There were further optimizations after that, but these are the general speedups one should expect.

The plan is to run end-to-end tests with cmdstanr for all GLMs on various hardware we have access to (and provide replication scripts so anyone can try it), but things got delayed for 3 months due to the delayed stanc3 release and now the issues with 2.22.

EDIT: this is all single chain

Thanks. How about very small n? Currently n=102, p=5966 with bernoulli_logit_glm and regularized horseshoe takes hours per chain on my laptop.

Another model I’d like to test has an (n×p matrix)*(p×n matrix) computation and multi_normal_lpdf with an n×n covariance matrix. This also takes hours on my laptop. I see that the multiplication is done, but is multi_normal_lpdf on the todo list?


In general models that take around 2 seconds are not going to be faster on the GPU, as we have to pay the price of JIT, data transfers, etc.

The second part of that sentence is not actually true. JIT happens only once, and at these sizes transfers take virtually no time. The “price” is actually the relatively large latency of launching GPU kernels and transfers (if I remember correctly they are in the range of 10–100 µs).
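
If anyone wants to see those latencies on their own machine, a rough standalone sketch like this one (not Stan code; it just times tiny blocking writes to the first GPU it finds, with no error checking) should land somewhere in that 10–100 µs per call range:

// Rough sketch (not Stan code): measure the fixed per-call latency of a tiny
// host-to-device transfer. The buffer is only 4 bytes, so essentially all of
// the measured time is launch/transfer latency, not bandwidth.
#include <CL/cl.h>
#include <chrono>
#include <iostream>

int main() {
  // Assumes the first platform has a GPU device (no error checking, it's a sketch).
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

  cl_int err;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  float x = 1.0f;
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(float), nullptr, &err);

  const int reps = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < reps; ++i) {
    // Blocking 4-byte write: dominated by latency, not bandwidth.
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(float), &x, 0, nullptr, nullptr);
  }
  clFinish(queue);
  auto end = std::chrono::steady_clock::now();

  double us = std::chrono::duration<double, std::micro>(end - start).count() / reps;
  std::cout << "average per-transfer latency: " << us << " us\n";

  clReleaseMemObject(buf);
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);
}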

How about very small n?

All GLMs are parallelized only across n, so you need n at least as large as the number of work units (which equals the number of CUDA cores for NVIDIA GPUs) to utilize the whole GPU. Ideally a few times as much.
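
To put rough numbers on that: a GTX 1080 Ti, for example, has 3584 CUDA cores, so n needs to be at least a few thousand just to give every core one observation, and ideally several times that. That lines up with the benchmarks above: only ~30% faster at n=1e4, but 6.5x at n=1e5.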

I always thought n&lt;p cases made no sense, so I made no effort to support them efficiently. Is this a real use case? Is it used often enough that we should support it?

Yes, it is a real use case, common in biomedical and genetic data. For Bayesian models with priors that make it sensible, see e.g. Sparsity information and regularization in the horseshoe and other shrinkage priors. As I mention above, running these models on my laptop takes hours, and we didn’t use bigger (bigger n and p) data sets as the computation would have taken too much time. Others can do fast inference for n>>p as the posterior is almost Gaussian, but if Stan could do fast inference for n<<p it would be great.


Yes, multi_normal_lpdf is on the short list. I think all the necessary GPU code is already merged; we just need to use it inside the function. I think the bottleneck there is the ldlt_factor.

I also tested multi-chain behaviour and it matches intuition. With 1 chain on 1 core the speedup was 4.5x for n=1e5, p=100, and with 4 chains it’s almost 10x (90 seconds vs 870 s).

I would really like to know what is up with your machine; I really suspect it’s the driver. It’s bizarre that you need that much RAM for JIT compiling.

Cool.

Me too. It’s just easier when I’m back at the university, so I can talk face to face with the admins, which might be next week.


double agree! Def a use case we need to try to capture (and at least point people / their IT folks to the right setup stuff)
