Linear, parallell regression

Ole_Petter_Hansen · August 15, 2018, 6:16pm

m_par.stan (1.3 KB)
sim_data.r (940.8 KB)

Hello all,

Here’s my “Hello world” in parallel. It is just a standard normal linear regression, but with K explanatory variables. This means that I need to “flatten” Y[1:J] and X[1:J,1:K] into a vector, where 1:J are the observations sent to one shard, and then stack these vectors on top of each other. the xi int array is useful for storing indices and such for keeping track of things.

The model works in the sense that several cores are humming along for the chain, and parameter estimates are correct. Happy to hear if anyone has a more elegant way of doing this…!

But hey - it works! :-) :-) :-)

-op

functions {
  vector bl_glm(vector sigma_beta, vector theta,
                real[] xs, int[] xi) {
	int J = xi[1];
	int K = xi[2];			
	real lp;
    real sigma = sigma_beta[1];
	vector[K] beta= sigma_beta[2:(K+1)];
	vector[J] Y = to_vector(xs[1:J]);
	matrix[J,K] X;
	for (k in 1:K)
	  X[,k] = to_vector(xs[(1+J*k):(J*(k+1))]);
    lp=normal_lpdf(Y | X * beta, sigma);
    return [lp]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> shards;
  int<lower = 0> N;
  matrix[N,K] X;
  vector[N] Y;
}

transformed data {
vector[0] theta[shards];
  int<lower = 0> J = N / shards;       // Obs per shard
  real x_r[shards, J * (K+1)];
  int x_i[shards, 2];
  {
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      x_r[k,1:J] = to_array_1d(Y[pos:end]);
      x_i[k,1] = J;
      x_i[k,2] = K;
      pos += J;
    }
  }
    {
	for (cov in 1:K){
	  int pos = 1;
      for (k in 1:shards) {
        int end = pos + J - 1;
        x_r[k,(1+J*cov):(J+J*cov)] = to_array_1d(X[pos:end,cov]);
        pos += J;
      }
	}
  }
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
   target += sum(map_rect(bl_glm, append_row(sigma, beta),
                         theta, x_r, x_i));
}

edit: Uploaded wrong data file

wds15 · August 16, 2018, 11:13am

Thanks for coding it up!

This part in the mapped function is fine from a dev perspective (the code is readable), but unfortunatley is causing a performance problem. The issue is that any floating point type variable in a function which returns a var type (= it involves one of the parameters) is itself considered as a var. Those var’s endup on the AD stack and the way this function is written will lead to an inflation of the AD tree. You should try to code up this model straight without map_rect. In that case the data X and y is not turned into a var and thus the thing runs a lot faster.

Now, what to do? We need to find a way to avoid the temporary form by assigning things in x_r to a variable. Instead we need to directly use things in x_r without the temporaries. A workaround is to call with the temporary expression with x_r another function which then turns this into nicer names.

Unless someone else is faster, I will try to come up with a program illustrating this.

(yes, this is a shortcoming of the current Stan functions, which is there for good reasons. Hopefully, we will be able to avoid this over-propagation in a future versions of Stan)

jroon · August 16, 2018, 11:18am

Thanks for coding this up, I’ve been waiting for a working code + model example to start playing.

Alas, I’m a total noob at CmdStan, usually using rstan. What I’m lacking here is how to make this go? Presumably I first ‘make’ the sim_data, then ‘make’ the model with some option for num_cores?

wds15 · August 16, 2018, 11:26am

rstan 2.18 should follow soon… but it won’t hurt to get familiar with cmdstan.

Ole_Petter_Hansen · August 16, 2018, 12:46pm

If the data in xs is stored differently with c(c(Y[1], X[1,]),...,c(Y[J], X[J,]) in xs[1], could we instead do

for (i in 1:J)
  lp =+ normal_lpdf(xs[i+(1+K)(i-1)] | 
    to_row_vector(xs[(1+i+(1+K)(i-1)):(1+K+i+(1+K)(i-1))]) * beta, sigma);

If you ignore the likely presence of errors with indices, would that solve it?

Edit: I don’t care about losing the vectorization of normal_lpdf. My endgame is related to the categorical_logit, which isn’t vectorized anyways.

wds15 · August 16, 2018, 1:05pm

Much better… but can’t you sort things in x_r directly in a column major format so that you can do

normal_lpdf(xs[...] | to_matrix(xs[...], J, ??) * beta, sigma )

(sorry, leaving some things unclear due to lack of time to look more closely).

wds15 · August 16, 2018, 1:11pm

Wait… what will also be ok to do is:

vector[J] mu;
for(i in 1:J) {
mu[j] = ... your row vector ... * beta
}
lp = norma_lpdf(xs | mu, sigma);

maybe that is easier for you and you should get a similar speedup.

Ole_Petter_Hansen · August 16, 2018, 1:26pm

Thanks. I’ll throw coffee at the problem until the indices are straightened out.

Ole_Petter_Hansen · August 16, 2018, 1:52pm

That was less than a one cup-sized problem.

Could easily remove the J, K, sigma and beta parts in bl_glm and use the values directly where needed?

This actually cleaned up the transformed data block too.

functions {
  vector bl_glm(vector sigma_beta, vector theta,
                real[] xs, int[] xi) {
	int J = xi[1];
	int K = xi[2];			
	real lp;
    real sigma = sigma_beta[1];
	vector[K] beta= sigma_beta[2:(K+1)];
    lp=normal_lpdf(xs[1:J] | to_matrix(xs[(J+1):(J*(K+1))],J,K) * beta, sigma);
    return [lp]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> shards;
  int<lower = 0> N;
  matrix[N,K] X;
  vector[N] Y;
}

transformed data {
vector[0] theta[shards];
  int<lower = 0> J = N / shards;      
  real x_r[shards, J * (K+1)];
  int x_i[shards, 2];
  {
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(append_col(Y[pos:end],X[pos:end,]));
      x_i[k,1] = J;
      x_i[k,2] = K;
      pos += J;
    }
  }
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
   target += sum(map_rect(bl_glm, append_row(sigma, beta),
                         theta, x_r, x_i));
}

wds15 · August 16, 2018, 2:25pm

I would not do that. You don’t gain much from that and the code is easier to read. I usually only worry about this stuff whenever things scale with the data size.

If you could benchmark this vs the implementation without map_rect at all, then we get a good feeling as to when the parallelization will be beneficial. So the when increasing the number of cores with the map_rect, you should take as reference the non map_rect version and/or the 1-core map_rect (after all this is not too easy).

Also… if you can, switch to the MPI thing… it gives you better performance as there is no overhead from a thread_local AD stack… but I am afraid that the makefiles are not ideal right now as shipped in 2.18. (so if this is an option, I can send you a patch for this… still need to sort this out)

jroon · August 16, 2018, 2:46pm

Yes I have been trying - sorry I was vague above because I am confused on various fronts regarding what to do. For a noob a CmdStan I feel there is alot of latent knowledge that is obvious to pros like yourself but perplexing to mugs like me! :)

Anyhow, I’ve had another go in the mean time using the data and script in the first post here. I managed to compile CmdStan under linux and run the bernoulli example successfully. I then created a local file including the option ‘CXXFLAGS += -DSTAN_THREADS’ and rebuilt cmdstan which I think worked ok. So then I tried:

make examples/map_rect/m_par

It churned away producing various messages, but ultimately ended in an error:

collect2: error: ld returned 1 exit status
make/models:14: recipe for target 'examples/map_rect/m_par' failed
make: *** [examples/map_rect/m_par] Error 1

So I don’t know where to begin to look for the error. I also still don’t know where/when to specify the environment variable STAN_NUM_THREADS. (an example of latent knowledge perhaps!)

wds15 · August 16, 2018, 5:51pm

I always need some time each time I start again with cmdstan.

Could you also please add -pthread and try once more?

What system are you on (distribution, compiler)?

Ole_Petter_Hansen · August 16, 2018, 7:22pm

I have played around a bit with the model above, as well as a sequential version without the map_rect-function. K=50 explanatory variables, N varies from 100 to 10 000. Tried 1, 2, 4, 5 and 10 shards.

Not sure if the times below makes sense! My CPU-load was only around 10% the whole time. Set STAN_NUM_THREADS to 12. Have a feeling that either I didn’t run it in parallell (but that doesn’t fit with the 10 shard time!), or that a standard regression computationally too easy.

edit: Note that times are in seconds, measures as “Total Time” by cmdstan. Single time is the time for the non-map-rect version (i.e. standard, vectorized model).


N	shards	time	Single_time	Rel_to_single

100	1	2.179	0.075	29.053
100	2	2.356	0.075	31.413
100	4	2.895	0.075	38.600
100	5	2.749	0.075	36.653
100	10	3.976	0.075	53.013
1,000	1	5.044	2.626	1.921
1,000	2	5.193	2.626	1.978
1,000	4	5.813	2.626	2.214
1,000	5	5.637	2.626	2.147
1,000	10	6.189	2.626	2.357
10,000	1	153.894	30.271	5.084
10,000	2	163.801	30.271	5.411
10,000	4	94.226	30.271	3.113
10,000	5	106.028	30.271	3.503
10,000	10	58.067	30.271	1.918

wds15 · August 16, 2018, 7:33pm

Uhm… another pitfall… cmdstan reports as total time what is actually the total CPU time - but you want the total wall time. So you should start the programs with the time command and use the time reported from that. (Sorry, could have mentioned, but forgot - I had to scratch my head myself quite a bit to get this)

It is normal to not see 600% CPU load for 6 cores, because we are using a very rudimentary C++11 technique.

Even with that I am not so surprised that map_rect has a hard time here. This is because Stan is fast! However, Stan becomes slower whenever the problem blows the CPU cache. In that case the multi-threading should help.

And as I wrote… it is actually better to run the MPI thing due to less overhead.

Ole_Petter_Hansen · August 16, 2018, 8:02pm

Hm, allright. I’ll fiddle around with timers in cmd another day. Thanks anyway.

jroon · August 16, 2018, 10:51pm

This worked! Thanks!

AMD Threadripper 1920x 12 core cpu, Ubuntu 18.04, g++ 7.3, make 4.1.

I also found somewhere in the forum how to set the number of cores in the environment as follows:

export STAN_NUM_THREADS = 10

Is this the best way? To later change it I’m simply running it again and changing the number is this correct? Perhaps this is in the user guide and I missed it, but if not I suggest you add a little bit about it.

Anyhow - I saw your comment on timing. So using the time command, I ran the model on the data set provided above. Here were my timings:

STAN_NUM_THREADS	Real Time
1	18.644
2	12.56
4	18.904
8	18.844
12	22.995
16	14.661
20	12.595
24	22.963

(Edit: the linreg.stan took 4.78s)

Ok something looks funky I guess, but its late now. I’m happy to have gotten it to work at least! Thanks @Ole_Petter_Hansen for providing code and thread to experiment with. Thanks @wds15 for the helpful tips!

Ole_Petter_Hansen · August 17, 2018, 5:11am

Great, let’s make this work. If you beat me to it, I upload the r-script I used to generate data-files. Think you can just set STAN_NUM_THREADS to your number of cores, and then re-estimate the model where you let N and shards vary. Also add a standard linear regression stan file, which would be a usefulke benchmark.

linreg.stan (239 Bytes)

gen_dat.r (382 Bytes)

wds15 · August 17, 2018, 6:43am

Thanks, I will try possibly myself later.

to clarify what I meant with the CPU cache stuff: My expectation for linear models is that you are only getting speedups whenever there is really a lot of work to do. So N must be large (>10^4) and K should also be large (maybe 10^3).

Doing this parallelization stuff adds overhead, which is only compensated for if each chunk of work is large enough in itself.

@jroon, the way I would recommend fiddling with the threads is to call the stan program like
STAN_NUM_THREADS=XX ./your-stan-program ...
This way you can set this for each run independently… and yes we can probably improve the doc. Most of the doc is right now on the stan-math wiki for this stuff.

jroon · August 17, 2018, 10:35am

So an observation. I’m running map_rect on those different data sets and Ive noticed that the number of active cores = the number of shards. For example, even if I set STAN_NUM_THREADS = 24 - the max on my system, if shards =1 then only one core is active. Maybe (probably) this is obvious to everyone, but I had not realised this. @Ole_Petter_Hansen, I’m going to change up the simulated data to better capture the number of cores on my machine.

Ole_Petter_Hansen · August 17, 2018, 12:28pm

cmdstan now just displays the config and quitely exits when STAN_NUM_THREADS different from 1. If it is one it samples as it should.

However, this only happens with the linear regression program above. cmdstan works well in parallell with the program below (from @Bob_Carpenter). Does anyone have an idea why it just exits? Can’t see any error messages, so don’t know where to look for bugs.

functions {
  vector bl_glm(vector mu_sigma, vector beta,
                real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma),
                         beta, x_r, x_i));
}

Topic		Replies	Views
Map_rect in the Stan language (MPI is coming) Developers	21	2677	July 14, 2020
Stuck at Warmup iteration with no error : CmdStanR CmdStan techniques , fitting-issues	48	3156	April 21, 2020
Map_rect tutorial performance comparison - small vs. big machine General performance	23	1987	March 2, 2019
Map_rect concurrent about to land Developers math	34	2122	July 23, 2018
Proposed parallelism RFC - Stan language bits Developers	14	1045	July 9, 2019

Linear, parallell regression

Related topics