Linear, parallell regression

rstan 2.18 should follow soon… but it won’t hurt to get familiar with cmdstan.

If the data in xs is stored differently with c(c(Y[1], X[1,]),...,c(Y[J], X[J,]) in xs[1], could we instead do

for (i in 1:J)
  lp =+ normal_lpdf(xs[i+(1+K)(i-1)] | 
    to_row_vector(xs[(1+i+(1+K)(i-1)):(1+K+i+(1+K)(i-1))]) * beta, sigma);

If you ignore the likely presence of errors with indices, would that solve it?

Edit: I don’t care about losing the vectorization of normal_lpdf. My endgame is related to the categorical_logit, which isn’t vectorized anyways.

Much better… but can’t you sort things in x_r directly in a column major format so that you can do

normal_lpdf(xs[...] | to_matrix(xs[...], J, ??) * beta, sigma )

(sorry, leaving some things unclear due to lack of time to look more closely).

Wait… what will also be ok to do is:

vector[J] mu;
for(i in 1:J) {
mu[j] = ... your row vector ... * beta
lp = norma_lpdf(xs | mu, sigma);

maybe that is easier for you and you should get a similar speedup.

Thanks. I’ll throw coffee at the problem until the indices are straightened out.

That was less than a one cup-sized problem.

Could easily remove the J, K, sigma and beta parts in bl_glm and use the values directly where needed?

This actually cleaned up the transformed data block too.

functions {
  vector bl_glm(vector sigma_beta, vector theta,
                real[] xs, int[] xi) {
	int J = xi[1];
	int K = xi[2];			
	real lp;
    real sigma = sigma_beta[1];
	vector[K] beta= sigma_beta[2:(K+1)];
    lp=normal_lpdf(xs[1:J] | to_matrix(xs[(J+1):(J*(K+1))],J,K) * beta, sigma);
    return [lp]';
data {
  int<lower = 0> K;
  int<lower = 0> shards;
  int<lower = 0> N;
  matrix[N,K] X;
  vector[N] Y;

transformed data {
vector[0] theta[shards];
  int<lower = 0> J = N / shards;      
  real x_r[shards, J * (K+1)];
  int x_i[shards, 2];
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(append_col(Y[pos:end],X[pos:end,]));
      x_i[k,1] = J;
      x_i[k,2] = K;
      pos += J;
parameters {
  vector[K] beta;
  real<lower=0> sigma;
model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
   target += sum(map_rect(bl_glm, append_row(sigma, beta),
                         theta, x_r, x_i));

I would not do that. You don’t gain much from that and the code is easier to read. I usually only worry about this stuff whenever things scale with the data size.

If you could benchmark this vs the implementation without map_rect at all, then we get a good feeling as to when the parallelization will be beneficial. So the when increasing the number of cores with the map_rect, you should take as reference the non map_rect version and/or the 1-core map_rect (after all this is not too easy).

Also… if you can, switch to the MPI thing… it gives you better performance as there is no overhead from a thread_local AD stack… but I am afraid that the makefiles are not ideal right now as shipped in 2.18. (so if this is an option, I can send you a patch for this… still need to sort this out)

Yes I have been trying - sorry I was vague above because I am confused on various fronts regarding what to do. For a noob a CmdStan I feel there is alot of latent knowledge that is obvious to pros like yourself but perplexing to mugs like me! :)

Anyhow, I’ve had another go in the mean time using the data and script in the first post here. I managed to compile CmdStan under linux and run the bernoulli example successfully. I then created a local file including the option ‘CXXFLAGS += -DSTAN_THREADS’ and rebuilt cmdstan which I think worked ok. So then I tried:

make examples/map_rect/m_par

It churned away producing various messages, but ultimately ended in an error:

collect2: error: ld returned 1 exit status
make/models:14: recipe for target 'examples/map_rect/m_par' failed
make: *** [examples/map_rect/m_par] Error 1

So I don’t know where to begin to look for the error. I also still don’t know where/when to specify the environment variable STAN_NUM_THREADS. (an example of latent knowledge perhaps!)

I always need some time each time I start again with cmdstan.

Could you also please add -pthread and try once more?

What system are you on (distribution, compiler)?

1 Like

I have played around a bit with the model above, as well as a sequential version without the map_rect-function. K=50 explanatory variables, N varies from 100 to 10 000. Tried 1, 2, 4, 5 and 10 shards.

Not sure if the times below makes sense! My CPU-load was only around 10% the whole time. Set STAN_NUM_THREADS to 12. Have a feeling that either I didn’t run it in parallell (but that doesn’t fit with the 10 shard time!), or that a standard regression computationally too easy.

edit: Note that times are in seconds, measures as “Total Time” by cmdstan. Single time is the time for the non-map-rect version (i.e. standard, vectorized model).

N shards time Single_time Rel_to_single
100 1 2.179 0.075 29.053
100 2 2.356 0.075 31.413
100 4 2.895 0.075 38.600
100 5 2.749 0.075 36.653
100 10 3.976 0.075 53.013
1,000 1 5.044 2.626 1.921
1,000 2 5.193 2.626 1.978
1,000 4 5.813 2.626 2.214
1,000 5 5.637 2.626 2.147
1,000 10 6.189 2.626 2.357
10,000 1 153.894 30.271 5.084
10,000 2 163.801 30.271 5.411
10,000 4 94.226 30.271 3.113
10,000 5 106.028 30.271 3.503
10,000 10 58.067 30.271 1.918

Uhm… another pitfall… cmdstan reports as total time what is actually the total CPU time - but you want the total wall time. So you should start the programs with the time command and use the time reported from that. (Sorry, could have mentioned, but forgot - I had to scratch my head myself quite a bit to get this)

It is normal to not see 600% CPU load for 6 cores, because we are using a very rudimentary C++11 technique.

Even with that I am not so surprised that map_rect has a hard time here. This is because Stan is fast! However, Stan becomes slower whenever the problem blows the CPU cache. In that case the multi-threading should help.

And as I wrote… it is actually better to run the MPI thing due to less overhead.


Hm, allright. I’ll fiddle around with timers in cmd another day. Thanks anyway.

This worked! Thanks!

AMD Threadripper 1920x 12 core cpu, Ubuntu 18.04, g++ 7.3, make 4.1.

I also found somewhere in the forum how to set the number of cores in the environment as follows:

export STAN_NUM_THREADS = 10

Is this the best way? To later change it I’m simply running it again and changing the number is this correct? Perhaps this is in the user guide and I missed it, but if not I suggest you add a little bit about it.

Anyhow - I saw your comment on timing. So using the time command, I ran the model on the data set provided above. Here were my timings:

1 18.644
2 12.56
4 18.904
8 18.844
12 22.995
16 14.661
20 12.595
24 22.963

(Edit: the linreg.stan took 4.78s)

Ok something looks funky I guess, but its late now. I’m happy to have gotten it to work at least! Thanks @Ole_Petter_Hansen for providing code and thread to experiment with. Thanks @wds15 for the helpful tips!

1 Like

Great, let’s make this work. If you beat me to it, I upload the r-script I used to generate data-files. Think you can just set STAN_NUM_THREADS to your number of cores, and then re-estimate the model where you let N and shards vary. Also add a standard linear regression stan file, which would be a usefulke benchmark.

linreg.stan (239 Bytes)

gen_dat.r (382 Bytes)

1 Like

Thanks, I will try possibly myself later.

to clarify what I meant with the CPU cache stuff: My expectation for linear models is that you are only getting speedups whenever there is really a lot of work to do. So N must be large (>10^4) and K should also be large (maybe 10^3).

Doing this parallelization stuff adds overhead, which is only compensated for if each chunk of work is large enough in itself.

@jroon, the way I would recommend fiddling with the threads is to call the stan program like
STAN_NUM_THREADS=XX ./your-stan-program ...
This way you can set this for each run independently… and yes we can probably improve the doc. Most of the doc is right now on the stan-math wiki for this stuff.

1 Like

So an observation. I’m running map_rect on those different data sets and Ive noticed that the number of active cores = the number of shards. For example, even if I set STAN_NUM_THREADS = 24 - the max on my system, if shards =1 then only one core is active. Maybe (probably) this is obvious to everyone, but I had not realised this. @Ole_Petter_Hansen, I’m going to change up the simulated data to better capture the number of cores on my machine.

cmdstan now just displays the config and quitely exits when STAN_NUM_THREADS different from 1. If it is one it samples as it should.

However, this only happens with the linear regression program above. cmdstan works well in parallell with the program below (from @Bob_Carpenter). Does anyone have an idea why it just exits? Can’t see any error messages, so don’t know where to look for bugs.

functions {
  vector bl_glm(vector mu_sigma, vector beta,
                real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
data {
  int<lower = 0> K;
  int<lower = 0> N;
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma),
                         beta, x_r, x_i));

I can’t help there @Ole_Petter_Hansen. I did a series of runs over the day using the original model from your first post and the data generating script you provided recently:

N K Shards linreg time (s) map_rect time (s) Ratio
1000 50 1 2.22 18.618 8.39
1000 50 4 2.138 20.234 9.46
1000 50 12 2.388 11.206 4.69
1000 50 24 2.265 21.391 9.44
10000 50 1 36.794 400.119 10.87
10000 50 4 40.887 296.588 7.25
10000 50 12 40.624 96.013 2.36
10000 50 24 39.703 89.769 2.26
100000 50 1 781.682 NA* #VALUE!
100000 50 4 954.53 3978.824 4.17
100000 50 12 854.336 2490.854 2.92
100000 50 24 828.47 1678.182 2.03
1000 200 4 13.126 141.935 10.81
1000 200 12 12.893 67.072 5.20
1000 200 24 12.897 70.384 5.46
10000 200 4 400.634 1934.303 4.83
10000 200 24 375.313 917.11 2.44

I got a bit fed up with it as the day went on so I skipped some options. However as you can see only in one case was map_rect faster. I suspect this is a fluke also since the linear model took much longer than the model immediately above it, I’m going to rerun this linear model to check it.
So on a second run with different random seed the last row linreg changed form 1450ish to 375

  • oh the NA is because this was clearly going to take many hours and I didn’t have the patience haha
1 Like

Thanks. Puzzling with the disadvantage relative to linreg.
I’ve given up on Windows for this - crazy how easy it is to do in Ubuntu! I’ll see if I can get similar results.

1 Like

[quote=“Ole_Petter_Hansen, post:22, topic:5230”]
Thanks. Puzzling with the disadvantage relative to linreg.[/quote]

I’m not that surprised. I was lurking on some of the development chat for this and if I understood correctly the parallelisation should work better for hierarchical models than linear models. Also I think @wds15 was dropping us hints that this might not work so well for this model :) But honestly - I don’t mind, at the start of the thread I had no clue how to go about this, so this has been worthwhile learning for me even if its not given efficiency gains :)

I know!! I spent about a day simply trying to install cmdstan in windows! In ubuntu I had the thing working in about an hour!