Aki sent me some code for this:
SEED <- 1655
set.seed(SEED)
n <- 1e4
k <- 100
x <- rnorm(n)
xn <- matrix(rnorm(n * (k - 1)), nrow = n)
a <- 2
b <- 3
sigma <- 1
y <- as.numeric(a + b * x + sigma * rnorm(n) > 0)
fake <- data.frame(x, xn, y)
data <- list(N = n, Y = y, X = cbind(x, xn), K = k)
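As a quick sanity check (this bit is mine, not Aki's): y is generated by thresholding a Gaussian latent variable, so a probit glm() should recover a ≈ 2 and b ≈ 3 on the first predictor. The Stan model below uses a logit link, so its coefficients will come out larger by a factor of roughly 1.6-1.8.

# Hypothetical sanity check, not part of the benchmark
check <- glm(y ~ x, family = binomial(link = "probit"), data = fake)
coef(check)  # expect roughly (Intercept) ~ 2, x ~ 3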
library(cmdstanr)

modelc1_gpu <- cmdstan_model("bern_glm.stan", quiet = FALSE, opencl = TRUE,
                             # Change these for your device (see `clinfo -l`)
                             opencl_platform_id = 2, opencl_device_id = 1,
                             compiler_flags = "CXXFLAGS+=-O3 -mtune=native -march=native")
fit1_gpu <- modelc1_gpu$sample(data = data, num_warmup = 100,
                               num_samples = 300, num_chains = 2, num_cores = 1)

modelc1_cpu <- cmdstan_model("bern_glm.stan", quiet = FALSE,
                             compiler_flags = "CXXFLAGS+=-O3 -mtune=native -march=native")
fit1_cpu <- modelc1_cpu$sample(data = data, num_warmup = 100,
                               num_samples = 300, num_chains = 2, num_cores = 1)
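If you're not sure which platform/device IDs to use, you can list them straight from R (assuming clinfo is installed on the system):

# Prints the OpenCL platforms and devices; the indices shown map to
# opencl_platform_id and opencl_device_id above
system("clinfo -l")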
And here is bern_glm.stan:

data {
  int<lower=1> N;              // number of observations
  int<lower=0, upper=1> Y[N];  // response variable
  int<lower=1> K;              // number of population-level effects
  matrix[N, K] X;              // population-level design matrix
}
parameters {
  vector[K] b;                 // population-level effects
  real Intercept;
}
model {
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 0, 10);
  // likelihood including all constants
  target += bernoulli_logit_glm_lpmf(Y | X, Intercept, b);
}
I’m running with a 1080 Ti. For n=1e4 and K=100 I’m seeing, for the GPU:
Running ./bern_glm 'id=1' random 'seed=1868491443' data 'file=/tmp/Rtmp5316H6/standata-1d2005c1ab1e6.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281216-1-b720e3.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 1 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 1 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 1 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 1 Iteration: 400 / 400 [100%] (Sampling)
Chain 1 finished in 2.4 seconds.
Running ./bern_glm 'id=2' random 'seed=1502212411' data 'file=/tmp/Rtmp5316H6/standata-1d2005c1ab1e6.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281216-2-b720e3.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 2 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 2 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 2 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 2 Iteration: 400 / 400 [100%] (Sampling)
Chain 2 finished in 2.2 seconds.
And for the CPU:
Running ./bern_glm 'id=1' random 'seed=2013950296' data 'file=/tmp/Rtmp5316H6/standata-1d20056a011bf.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281241-1-1621ed.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 1 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 1 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 1 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 1 Iteration: 400 / 400 [100%] (Sampling)
Chain 1 finished in 3.2 seconds.
Running ./bern_glm 'id=2' random 'seed=477895878' data 'file=/tmp/Rtmp5316H6/standata-1d20056a011bf.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281241-2-1621ed.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 2 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 2 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 2 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 2 Iteration: 400 / 400 [100%] (Sampling)
Chain 2 finished in 2.9 seconds.
Both chains finished successfully.
Mean chain execution time: 3.1 seconds.
Total execution time: 6.2 seconds.
Running the chains for longer would probably make the GPU's advantage more noticeable. For n=1e5 on the GPU I’m seeing:
Running MCMC with 2 chain(s) on 1 core(s)...
Running ./bern_glm 'id=1' random 'seed=1949552557' data 'file=/tmp/Rtmp5316H6/standata-1d20074342ff6.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281252-1-70faab.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 1 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 1 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 1 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 1 Iteration: 400 / 400 [100%] (Sampling)
Chain 1 finished in 11.4 seconds.
Running ./bern_glm 'id=2' random 'seed=1673666080' data 'file=/tmp/Rtmp5316H6/standata-1d20074342ff6.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281252-2-70faab.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 2 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 2 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 2 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 2 Iteration: 400 / 400 [100%] (Sampling)
Chain 2 finished in 11.4 seconds.
Both chains finished successfully.
Mean chain execution time: 11.4 seconds.
Total execution time: 23.3 seconds.
And with the CPU:

fit1_cpu <- modelc1_cpu$sample(data = data, num_warmup = 100, num_samples = 300, num_chains = 2, num_cores = 1)
Running MCMC with 2 chain(s) on 1 core(s)...
Running ./bern_glm 'id=1' random 'seed=1617602711' data 'file=/tmp/Rtmp5316H6/standata-1d20078dda60c.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281255-1-c7e051.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 1 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 1 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 1 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 1 Iteration: 400 / 400 [100%] (Sampling)
Chain 1 finished in 38.0 seconds.
Running ./bern_glm 'id=2' random 'seed=545479996' data 'file=/tmp/Rtmp5316H6/standata-1d20078dda60c.dat' output \
'file=/tmp/Rtmp5316H6/bern_glm-202001281255-2-c7e051.csv' 'method=sample' 'num_samples=300' 'num_warmup=100' \
'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 2 Iteration: 1 / 400 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 400 [ 25%] (Warmup)
Chain 2 Iteration: 101 / 400 [ 25%] (Sampling)
Chain 2 Iteration: 200 / 400 [ 50%] (Sampling)
Chain 2 Iteration: 300 / 400 [ 75%] (Sampling)
Chain 2 Iteration: 400 / 400 [100%] (Sampling)
Chain 2 finished in 37.9 seconds.
Both chains finished successfully.
Mean chain execution time: 38.0 seconds.
Total execution time: 76.3 seconds.
So the GPU is roughly 3x faster at n=1e5 (versus only about 1.3x at n=1e4).
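Just to spell out the arithmetic from the mean chain times above:

# Mean per-chain times reported in the logs above
gpu_1e4 <- mean(c(2.4, 2.2))    # 2.3 s
cpu_1e4 <- mean(c(3.2, 2.9))    # 3.05 s
gpu_1e5 <- mean(c(11.4, 11.4))  # 11.4 s
cpu_1e5 <- mean(c(38.0, 37.9))  # 37.95 s
cpu_1e4 / gpu_1e4  # ~1.3x speedup at n = 1e4
cpu_1e5 / gpu_1e5  # ~3.3x speedup at n = 1e5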
Aki, I’m not sure exactly where the issues you’re seeing are coming from. Shooting from the hip, I’d say it’s one or a mix of:
- The driver needing an update. But yours is only a year old, so that seems like an odd source.
- The virtual environment, though I’m not sure what exactly would make data transfers so slow there. There aren’t even many data transfers in this scenario.
- Our docs for setting up the GPU stuff are bad, and you may have goofed something by accident because of that. I don’t see how that would cause a 100x slowdown, but it’s something we need to update regardless: I think we should remove the old GPU wikis on stan-math and cmdstan and just use the doxygen page as the single source for instructions.

Also, I was able to run this with 8 cores and 8 chains with no issues, so there’s another mystery. Aki, is your compiler clang or g++?
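One way to check from R which toolchain CmdStan is using (a rough sketch; make/local only exists if the build was customized, otherwise CmdStan picks up whatever compiler is on the PATH):

library(cmdstanr)
cmdstan_path()  # the CmdStan installation cmdstanr points at

# A make/local file, if present, may pin CXX to a specific compiler
local_makefile <- file.path(cmdstan_path(), "make", "local")
if (file.exists(local_makefile)) print(readLines(local_makefile))

# Otherwise, check the default compilers on the PATH
system("g++ --version")
system("clang++ --version")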