Linear, parallel regression

m_par.stan (1.3 KB)
sim_data.r (940.8 KB)

Hello all,

Here’s my “Hello world” in parallel. It is just a standard normal linear regression, but with K explanatory variables. This means that I need to “flatten” Y[1:J] and X[1:J, 1:K] into a vector, where 1:J are the observations sent to one shard, and then stack these vectors on top of each other. The xi int array is useful for storing indices and such, to keep track of things.

The model works in the sense that several cores are humming along for the chain, and the parameter estimates are correct. Happy to hear if anyone has a more elegant way of doing this…!

But hey - it works! :-) :-) :-)

-op

functions {
  vector bl_glm(vector sigma_beta, vector theta,
                real[] xs, int[] xi) {
    int J = xi[1];                          // obs in this shard
    int K = xi[2];                          // number of covariates
    real lp;
    real sigma = sigma_beta[1];
    vector[K] beta = sigma_beta[2:(K + 1)];
    vector[J] Y = to_vector(xs[1:J]);       // Y sits in the first J slots of xs
    matrix[J, K] X;
    for (k in 1:K)                          // column k of X sits in xs[(1 + J*k):(J*(k+1))]
      X[, k] = to_vector(xs[(1 + J * k):(J * (k + 1))]);
    lp = normal_lpdf(Y | X * beta, sigma);
    return [lp]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> shards;
  int<lower = 0> N;
  matrix[N,K] X;
  vector[N] Y;
}

transformed data {
  vector[0] theta[shards];             // map_rect needs per-shard parameters; none here
  int<lower = 0> J = N / shards;       // obs per shard (assumes N divisible by shards)
  real x_r[shards, J * (K + 1)];       // per shard: Y first, then the K columns of X
  int x_i[shards, 2];                  // per shard: J and K
  {
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      x_r[k, 1:J] = to_array_1d(Y[pos:end]);
      x_i[k, 1] = J;
      x_i[k, 2] = K;
      pos += J;
    }
  }
  {
    for (cov in 1:K) {
      int pos = 1;
      for (k in 1:shards) {
        int end = pos + J - 1;
        x_r[k, (1 + J * cov):(J + J * cov)] = to_array_1d(X[pos:end, cov]);
        pos += J;
      }
    }
  }
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(sigma, beta),
                         theta, x_r, x_i));
}
 

edit: Uploaded wrong data file


Thanks for coding it up!

This part in the mapped function is fine from a dev perspective (the code is readable), but unfortunately it is causing a performance problem. The issue is that any floating-point variable in a function which returns a var type (i.e., it involves one of the parameters) is itself considered a var. Those vars end up on the AD stack, and the way this function is written will lead to an inflation of the AD tree. You should try to code up this model straight, without map_rect: in that case the data X and Y are not turned into vars, and thus the thing runs a lot faster.

Now, what to do? We need to avoid the temporaries formed by assigning things in x_r to local variables; instead we need to use things in x_r directly, without the temporaries. A workaround is to call another function with the temporary expressions involving x_r, which then binds them to nicer names.

Unless someone else is faster, I will try to come up with a program illustrating this.
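Roughly, the idea is something like this (an untested sketch; shard_lp is just a made-up helper name). Because Stan functions are instantiated with the types actually passed in, the slices of xs handed straight to shard_lp stay data instead of being copied into var-typed locals:

functions {
  // untested sketch: the xs slices are bound to data arguments of the helper,
  // so they are not promoted to var inside the var-returning bl_glm
  real shard_lp(real[] y, real[] x_flat, real sigma, vector beta,
                int J, int K) {
    return normal_lpdf(y | to_matrix(x_flat, J, K) * beta, sigma);
  }
  vector bl_glm(vector sigma_beta, vector theta,
                real[] xs, int[] xi) {
    int J = xi[1];
    int K = xi[2];
    return [shard_lp(xs[1:J], xs[(J + 1):(J * (K + 1))],
                     sigma_beta[1], sigma_beta[2:(K + 1)], J, K)]';
  }
}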

(Yes, this is a shortcoming of the current Stan functions, which is there for good reasons. Hopefully we will be able to avoid this over-propagation in a future version of Stan.)


Thanks for coding this up, I’ve been waiting for a working code + model example to start playing.

Alas, I’m a total noob at CmdStan, usually using rstan. What I’m missing here is how to make this go. Presumably I first ‘make’ the sim_data, then ‘make’ the model with some option for the number of cores?

rstan 2.18 should follow soon… but it won’t hurt to get familiar with cmdstan.
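The basic flow is roughly this (paths are placeholders; run from the cmdstan directory):

make examples/map_rect/m_par                              # compile m_par.stan placed in that folder
./examples/map_rect/m_par sample data file=path/to/sim_data.r

The number of threads is then set at run time through the STAN_NUM_THREADS environment variable (after building with threading enabled), not as a make option.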

If the data in xs is stored differently, so that each shard’s xs holds c(c(Y[1], X[1,]), ..., c(Y[J], X[J,])), could we instead do

real lp = 0;
for (i in 1:J)
  lp += normal_lpdf(xs[1 + (i - 1) * (K + 1)] |
    to_row_vector(xs[(2 + (i - 1) * (K + 1)):(i * (K + 1))]) * beta, sigma);

Modulo any remaining index errors, would that solve it?

Edit: I don’t care about losing the vectorization of normal_lpdf. My endgame is related to the categorical_logit, which isn’t vectorized anyways.

Much better… but can’t you sort things in x_r directly in column-major format, so that you can do

normal_lpdf(xs[...] | to_matrix(xs[...], J, ??) * beta, sigma )

(sorry, leaving some things unclear due to lack of time to look more closely).

Wait… what will also be ok to do is:

vector[J] mu;
for (i in 1:J) {
  mu[i] = ... your row vector ... * beta;
}
lp = normal_lpdf(... the Y entries of xs ... | mu, sigma);

maybe that is easier for you and you should get a similar speedup.
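Filled in for the row-major layout you described, that would be something like this (untested; note that Y is not contiguous in that layout, so it gets collected into a vector as well, which re-introduces some var copies of the data; the column-major layout avoids that):

vector[J] mu;
vector[J] y;
for (i in 1:J) {
  // record i occupies xs[(1 + (i-1)*(K+1)) : (i*(K+1))]: Y first, then the row of X
  y[i] = xs[1 + (i - 1) * (K + 1)];
  mu[i] = to_row_vector(xs[(2 + (i - 1) * (K + 1)):(i * (K + 1))]) * beta;
}
lp = normal_lpdf(y | mu, sigma);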

Thanks. I’ll throw coffee at the problem until the indices are straightened out.

That was less than a one cup-sized problem.

Couldn’t I just as easily remove the J, K, sigma, and beta locals in bl_glm and use the values directly where needed?

This actually cleaned up the transformed data block too.

functions {
  vector bl_glm(vector sigma_beta, vector theta,
                real[] xs, int[] xi) {
    int J = xi[1];
    int K = xi[2];
    real lp;
    real sigma = sigma_beta[1];
    vector[K] beta = sigma_beta[2:(K + 1)];
    // xs holds Y in the first J slots, then X column-major; no local copies of the data
    lp = normal_lpdf(xs[1:J] | to_matrix(xs[(J + 1):(J * (K + 1))], J, K) * beta, sigma);
    return [lp]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> shards;
  int<lower = 0> N;
  matrix[N,K] X;
  vector[N] Y;
}

transformed data {
  vector[0] theta[shards];
  int<lower = 0> J = N / shards;       // obs per shard
  real x_r[shards, J * (K + 1)];
  int x_i[shards, 2];
  {
    int pos = 1;
    for (k in 1:shards) {
      int end = pos + J - 1;
      // append_col gives a J x (K+1) matrix; to_array_1d flattens it
      // column-major, so Y lands first, then the K columns of X
      x_r[k] = to_array_1d(append_col(Y[pos:end], X[pos:end, ]));
      x_i[k, 1] = J;
      x_i[k, 2] = K;
      pos += J;
    }
  }
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(sigma, beta),
                         theta, x_r, x_i));
}
 

I would not do that. You don’t gain much from it, and keeping the named locals makes the code easier to read. I usually only worry about this stuff when things scale with the data size.

If you could benchmark this vs. the implementation without map_rect at all, then we would get a good feeling for when the parallelization becomes beneficial. So when increasing the number of cores with map_rect, you should take as reference the non-map_rect version and/or the 1-core map_rect run (after all, this comparison is not entirely straightforward).

Also… if you can, switch to the MPI thing… it gives you better performance, as there is no overhead from a thread_local AD stack… but I am afraid the makefiles as shipped in 2.18 are not ideal right now. (So if this is an option, I can send you a patch for this… I still need to sort this out.)

Yes, I have been trying - sorry I was vague above, because I am confused on various fronts regarding what to do. For a noob at CmdStan, I feel there is a lot of latent knowledge that is obvious to pros like yourself but perplexing to mugs like me! :)

Anyhow, I’ve had another go in the meantime using the data and script in the first post here. I managed to compile CmdStan under Linux and run the bernoulli example successfully. I then created a make/local file containing the option ‘CXXFLAGS += -DSTAN_THREADS’ and rebuilt CmdStan, which I think worked OK. So then I tried:

make examples/map_rect/m_par

It churned away producing various messages, but ultimately ended in an error:

collect2: error: ld returned 1 exit status
make/models:14: recipe for target 'examples/map_rect/m_par' failed
make: *** [examples/map_rect/m_par] Error 1

So I don’t know where to begin looking for the error. I also still don’t know where/when to specify the environment variable STAN_NUM_THREADS. (An example of latent knowledge, perhaps!)

I always need some time each time I start again with cmdstan.

Could you also please add -pthread and try once more?
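For threading, your make/local should then contain something like this (a sketch; I haven’t double-checked the exact variable names in the 2.18 makefiles):

CXXFLAGS += -DSTAN_THREADS -pthread
LDFLAGS += -pthread

A missing -pthread at the link step could explain the ld error you are seeing.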

What system are you on (distribution, compiler)?


I have played around a bit with the model above, as well as a sequential version without the map_rect function: K = 50 explanatory variables, N varying from 100 to 10,000, and 1, 2, 4, 5, and 10 shards.

Not sure if the times below make sense! My CPU load was only around 10% the whole time. I set STAN_NUM_THREADS to 12. I have a feeling that either I didn’t run it in parallel (but that doesn’t fit with the 10-shard time!), or a standard regression is computationally too easy.

edit: Note that times are in seconds, measured as “Total Time” by CmdStan. Single_time is the time for the non-map_rect version (i.e., the standard, vectorized model).

     N  shards     time  Single_time  Rel_to_single
   100       1    2.179        0.075         29.053
   100       2    2.356        0.075         31.413
   100       4    2.895        0.075         38.600
   100       5    2.749        0.075         36.653
   100      10    3.976        0.075         53.013
 1,000       1    5.044        2.626          1.921
 1,000       2    5.193        2.626          1.978
 1,000       4    5.813        2.626          2.214
 1,000       5    5.637        2.626          2.147
 1,000      10    6.189        2.626          2.357
10,000       1  153.894       30.271          5.084
10,000       2  163.801       30.271          5.411
10,000       4   94.226       30.271          3.113
10,000       5  106.028       30.271          3.503
10,000      10   58.067       30.271          1.918

Uhm… another pitfall… CmdStan reports as “Total Time” what is actually the total CPU time, but you want the total wall time. So you should start the programs with the time command and use the time reported by that. (Sorry, I could have mentioned this, but forgot - I had to scratch my head quite a bit myself to get this.)
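For example (paths are placeholders):

time ./examples/map_rect/m_par sample data file=path/to/sim_data.r

and then read off the “real” line that time prints.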

It is normal to not see 600% CPU load for 6 cores, because we are using a very rudimentary C++11 technique.

Even with that I am not so surprised that map_rect has a hard time here. This is because Stan is fast! However, Stan becomes slower whenever the problem blows the CPU cache. In that case the multi-threading should help.

And as I wrote… it is actually better to run the MPI thing due to less overhead.


Hm, alright. I’ll fiddle around with timers in CmdStan another day. Thanks anyway.

This worked! Thanks!

AMD Threadripper 1920x 12 core cpu, Ubuntu 18.04, g++ 7.3, make 4.1.

I also found somewhere in the forum how to set the number of cores in the environment as follows:

export STAN_NUM_THREADS=10

Is this the best way? To change it later, I simply run it again with a different number - is this correct? Perhaps this is in the user guide and I missed it, but if not, I suggest you add a little bit about it.

Anyhow - I saw your comment on timing. So using the time command, I ran the model on the data set provided above. Here were my timings:

STAN_NUM_THREADS  Real time (s)
               1         18.644
               2         12.560
               4         18.904
               8         18.844
              12         22.995
              16         14.661
              20         12.595
              24         22.963

(Edit: the linreg.stan took 4.78s)

OK, something looks funky I guess, but it’s late now. I’m happy to have gotten it to work at least! Thanks @Ole_Petter_Hansen for providing code and a thread to experiment with. Thanks @wds15 for the helpful tips!


Great, let’s make this work. In case it helps, I’m uploading the R script I used to generate the data files. I think you can just set STAN_NUM_THREADS to your number of cores, and then re-estimate the model while letting N and shards vary. I’m also adding a standard linear regression Stan file, which should be a useful benchmark.

linreg.stan (239 Bytes)

gen_dat.r (382 Bytes)


Thanks, I will possibly try it myself later.

To clarify what I meant with the CPU cache stuff: my expectation for linear models is that you only get speedups when there is really a lot of work to do. So N must be large (> 10^4) and K should also be large (maybe 10^3).

Doing this parallelization stuff adds overhead, which is only compensated for if each chunk of work is large enough in itself.

@jroon, the way I would recommend fiddling with the threads is to call the Stan program like
STAN_NUM_THREADS=XX ./your-stan-program ...
This way you can set it for each run independently… and yes, we can probably improve the doc. Most of the doc for this stuff currently lives on the stan-math wiki.


So, an observation. I’m running map_rect on those different data sets, and I’ve noticed that the number of active cores equals the number of shards. For example, even if I set STAN_NUM_THREADS=24 (the max on my system), if shards = 1 then only one core is active. Maybe (probably) this is obvious to everyone, but I had not realised this. @Ole_Petter_Hansen, I’m going to change up the simulated data to better capture the number of cores on my machine.

CmdStan now just displays the config and quietly exits when STAN_NUM_THREADS is different from 1. If it is 1, it samples as it should.

However, this only happens with the linear regression program above. CmdStan works well in parallel with the program below (from @Bob_Carpenter). Does anyone have an idea why it just exits? I can’t see any error messages, so I don’t know where to look for bugs.

functions {
  vector bl_glm(vector mu_sigma, vector beta,
                real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;    // obs per shard (in this program K is the number of shards)
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma),
                         beta, x_r, x_i));
}