Linear, parallell regression

jroon · August 17, 2018, 4:52pm

I can’t help there @Ole_Petter_Hansen. I did a series of runs over the day using the original model from your first post and the data generating script you provided recently:

N	K	Shards	linreg time (s)	map_rect time (s)	Ratio
1000	50	1	2.22	18.618	8.39
1000	50	4	2.138	20.234	9.46
1000	50	12	2.388	11.206	4.69
1000	50	24	2.265	21.391	9.44
—	—	—	—	—	—
10000	50	1	36.794	400.119	10.87
10000	50	4	40.887	296.588	7.25
10000	50	12	40.624	96.013	2.36
10000	50	24	39.703	89.769	2.26
—	—	—	—	—	—
100000	50	1	781.682	NA*	#VALUE!
100000	50	4	954.53	3978.824	4.17
100000	50	12	854.336	2490.854	2.92
100000	50	24	828.47	1678.182	2.03
—	—	—	—	—	—
1000	200	4	13.126	141.935	10.81
1000	200	12	12.893	67.072	5.20
1000	200	24	12.897	70.384	5.46
—	—	—	—	—	—
10000	200	4	400.634	1934.303	4.83
10000	200	24	375.313	917.11	2.44

I got a bit fed up with it as the day went on so I skipped some options. However as you can see only in one case was map_rect faster. I suspect this is a fluke also since the linear model took much longer than the model immediately above it, I’m going to rerun this linear model to check it.
So on a second run with different random seed the last row linreg changed form 1450ish to 375

oh the NA is because this was clearly going to take many hours and I didn’t have the patience haha

Ole_Petter_Hansen · August 17, 2018, 8:34pm

Thanks. Puzzling with the disadvantage relative to linreg.
I’ve given up on Windows for this - crazy how easy it is to do in Ubuntu! I’ll see if I can get similar results.

jroon · August 17, 2018, 8:40pm

[quote=“Ole_Petter_Hansen, post:22, topic:5230”]
Thanks. Puzzling with the disadvantage relative to linreg.[/quote]

I’m not that surprised. I was lurking on some of the development chat for this and if I understood correctly the parallelisation should work better for hierarchical models than linear models. Also I think @wds15 was dropping us hints that this might not work so well for this model :) But honestly - I don’t mind, at the start of the thread I had no clue how to go about this, so this has been worthwhile learning for me even if its not given efficiency gains :)

I know!! I spent about a day simply trying to install cmdstan in windows! In ubuntu I had the thing working in about an hour!

Ole_Petter_Hansen · August 18, 2018, 6:02am

Ok, finally got it up and running. Results below. Looks like at least here the N=1000, K=500 is getting close to the baseline. Set STAN_NUM_THREADS to -1 in all cases.

…but now that the infrastructure is up an running I’ll try some more diffcult models. Thinking ordinal logit, with weak priors -> large tree.


N	K	shards	Sec.	Sec. linear model	ratio

100	50	10	6.404	0.636	10.100
100	50	5	2.532	0.636	4
100	50	4	2.266	0.636	3.600
100	50	2	1.660	0.636	2.600
100	50	1	1.459	0.636	2.300
1,000	500	10	119.267	90.591	1.300
1,000	500	5	104.732	90.591	1.200
1,000	500	4	96.168	90.591	1.100
1,000	500	2	211.634	90.591	2.300
1,000	500	1	219.123	90.591	2.400

wds15 · August 18, 2018, 10:14am

Intersting to see all these results. Just a note: I created map_rect with models in mind which take days to compute. I am not saying you need to go there, but big hierarchical models are where I am expecting to see the payoff. The big hierarichal models have not too much of communication needs, but are costly for each unit (at least this is where this pays off).

… and again: if you are already on Ubuntu, then you should try MPI. However, please download the make/models file from the current cmdstan. Going MPI on Ubuntu is a matter of installing the openMPI package, see the stan-math wiki for details.

Is the table above created by a single script? If so, then I can offer to run this on our cluster.

Ole_Petter_Hansen · August 18, 2018, 10:34am

I can try to make a script of it. Yes, I will check out MPI.

Trying to fit a more complex model - but it does take a bit of head scratching to figure out how to squeeze data structures through a funnel-like two-dimensional array. Maybe I’ll get used to it.

Are the arguments of map_rect settled, or still a discussion?

wds15 · August 18, 2018, 10:38am

It is released in a stable version…you can expect it to stay like it.

This type of packing is standard for a few functions in stan right now. Things will hopefully improve once we have Stan 3. Another interesting potential future feature is the serialization thing which @Bob_Carpenter brought up in another thread, but this is only at the early design stage at the moment.

jroon · August 18, 2018, 11:20am

Mine was done manually - thats why I got fedup running various options, I would have to learn how to script it (or @Ole_Petter_Hansen if write one I can probably understand it and adapt as needed).

@wds15 - yes I appreciate you have big hierarchical models in mind. I’m not fully up to speed with the syntax for map_rect yet as I wasn’t sure til now that I could even figure out how to run it. I have use cases in mind for maybe 6 months from now, hence I’m trying to lay some groundwork for myself. I was also curious how it would run on a Threadripper CPU, as I have not seen anyone report using them for Stan before. They have fairly large cache compared to intel chips which might also effect how efficiently they run parallel STAN? Mine has 32MB L3 cache for example. (They just released an insane new chip also https://www.pcworld.com/article/3296378/components-processors/2nd-gen-threadripper-review-amds-32-core-cpu-is-insanely-fast.html )

Anyhow, I’m likely of limited assistance with coding models currently, but if you guys want me to run benchmark test models on my machine I’m happy to help so long as I don’t need to use the machine for something else!
Edit - I’ll have a go at installing MPI in the coming week.

Ole_Petter_Hansen · August 18, 2018, 4:22pm

Allright, here are some files for doing a benchmark.

1: All the attached files need to be in the same folder.
2: make linreg and reg_par (standard cmdstan-stuff)
3: make sure you have dplyr, magrittr, stargazer and rstan as available libraries in R.
4: run start_benchmark (had to give it a txt-extenstion to be allowed to upload it…you may remove it). You may need to issue chmod +x start_benchmark first.

Program will then loop over orders 2,3,4 and 5, with shards 1, 2, 4, 5 and 10 as well as a plain linear model. When it is done it runs the post.r -script, that reads in times and saves a file “results.html”. Note: If you change stuff in the gen_dat-file, (especially orders and shards), you may need to update the loops in start_benchmark).

start_benchmark.txt (681 Bytes)
gen_dat.r (378 Bytes)
post.r (383 Bytes)
linreg.stan (239 Bytes)
reg_par.stan (986 Bytes)

This my first Ubuntu script. Please let me know if something should be improved.

Ole_Petter_Hansen · August 18, 2018, 4:46pm

On second thought, you might want to use the attached script instead. It scales down the number of iterations when N increases. Otherwise it will take forever with the larger datasets (might even be more iterations than what is needed for a relative measurement here…)
start_benchmark2.txt (739 Bytes)

Ole_Petter_Hansen · August 18, 2018, 9:49pm

Got an ordinal probit up and running.

ord.stan (382 Bytes)
ord_par.stan (1.2 KB)

If anyone has better ideas for looping through individuals than this in the map_rect-function I’m all ears.

functions {
  vector o_prob(vector beta_c, vector theta,
                real[] xs, int[] xi) {
	int J = xi[1];
	int K = xi[2];			
	int D = xi[3];	
	real lp=0;
        vector[D] beta = beta_c[1:D];
	vector[K-1] c = beta_c[(1+D):(D+K-1)];
	vector[K] mu;
	for (n in 1:J){
          real eta=0;
          for(kk in 1:D)
            eta += xs[n+J*(kk-1)]*beta[kk];
	  mu[1] = 1-Phi(eta-c[1]);
	  for (k in 2:(K-1))
	    mu[k] = Phi(eta-c[k-1]) - Phi(eta-c[k]);
	  mu[K] = Phi(eta-c[K-1]);
	  lp += categorical_lpmf(xi[n+3] | mu);
	}
    return [lp]';
  }
}

As expected, with loops and a bunch of CDF-evaluations parallellisation pays off. Below results from a moderate number of iteration using the ordinal models above. 5 explanatory variables in all cases.


time_in_ms	order	shards	time_linear	ratio

5,232	2	10	882	5.932
799	2	5	882	0.906
757	2	4	882	0.858
779	2	2	882	0.883
944	2	1	882	1.070
1,775	3	10	4,557	0.390
1,433	3	5	4,557	0.314
2,008	3	4	4,557	0.441
3,487	3	2	4,557	0.765
5,385	3	1	4,557	1.182
10,359	4	10	22,897	0.452
10,914	4	5	22,897	0.477
12,210	4	4	22,897	0.533
20,666	4	2	22,897	0.903
34,035	4	1	22,897	1.486

jroon · August 18, 2018, 9:55pm

I was out for the day but just set the benchmark running. I changed a few settings but I’m gonna leave go now and lets see what happens :)
Edit: FYI the 5th order loop errored out because some varaible it was trying to create was 35.7GB in size! I restricted it to only order 2,3,4 now

Ole_Petter_Hansen · August 19, 2018, 5:12am

You could alternatively edit the line

K <- (10^order)/2

in the gen_dat.r-file, and e.g. set K to a fixed number.

jroon · August 19, 2018, 1:15pm

Results from your benchmark2 script!
Edit: @Ole_Petter_Hansen - results updated:


time_in_ms	order	shards	time_linear	ratio

24,736	2	24	313	79.029
9,066	2	10	313	28.965
2,578	2	4	313	8.236
1,938	2	2	313	6.192
45,863	3	24	38,704	1.185
34,860	3	10	38,704	0.901
66,252	3	4	38,704	1.712
121,380	3	2	38,704	3.136
4,821,764	4	24	10,231,311	0.471
8,803,770	4	10	10,231,311	0.860
12,528,026	4	4	10,231,311	1.224
19,583,774	4	2	10,231,311	1.914

Ole_Petter_Hansen · August 19, 2018, 2:09pm

Thanks! Could you please check if the standard linear regression actually has run? Times of 2ms is suspiciously fast.

jroon · August 19, 2018, 2:30pm

Ok so with the updated results now things make more sense.
In the above table the number of variables K = (10^order / 2). So the parallel approach only starts to make sense when N ~ 10000 and K ~ 5000 and with higher shard/core count.

Ole_Petter_Hansen · August 19, 2018, 8:10pm

Interesting! Is this with K=N/2? (i.e. the original gen_dat-file?). If so the order=4-rows are N=10000, K=5000. Doubling speed with 24-threads isn’t really that great. I suppose a linear model needs a ridiculously large N and K for parallellism to pay off!

However, a slighly more complex model (ordinal probit) give better performance for even moderately sized data sets (see above). Wonder what complexity is necessary to get linear reduction in time with #CPU’s.

I’m writing up a model with individual-level parameters, but the map_rect-syntax takes some getting used to.

jroon · August 19, 2018, 8:21pm

yes K=N/2.
Do you have a data generating file for the probit model? If so post it, and I’ll adapt the bash script to run that too.

Ole_Petter_Hansen · August 19, 2018, 8:36pm

gen_dat.r (675 Bytes)
ord_par.stan (1.2 KB)
ord.stan (382 Bytes)
start_benchmark.txt (717 Bytes)
post.r (391 Bytes)

Sure, attached. Maybe just let number of covariates (variable “D” in the gen_dat-file) be a constant, and just increase number of observations first?

jroon · August 19, 2018, 9:26pm

Ok here is the ordinal probit model for D=10.


time_in_ms	order	shards	time_linear	ratio

9,025	2	24	1,257	7.180
7,198	2	20	1,257	5.726
6,038	2	16	1,257	4.804
3,979	2	12	1,257	3.165
2,489	2	8	1,257	1.980
1,616	2	4	1,257	1.286
1,838	2	2	1,257	1.462
1,560	2	1	1,257	1.241
5,168	3	24	7,109	0.727
3,913	3	20	7,109	0.550
3,873	3	16	7,109	0.545
3,275	3	12	7,109	0.461
3,628	3	8	7,109	0.510
5,773	3	4	7,109	0.812
9,612	3	2	7,109	1.352
8,513	3	1	7,109	1.197
13,054	4	24	52,867	0.247
13,189	4	20	52,867	0.249
16,023	4	16	52,867	0.303
18,592	4	12	52,867	0.352
25,118	4	8	52,867	0.475
50,891	4	4	52,867	0.963
83,960	4	2	52,867	1.588
73,762	4	1	52,867	1.395

Topic		Replies	Views
MPI shard scaling General	5	1142	May 28, 2019
Parallel Options for Independent Regressions Modeling	4	539	April 28, 2021
Help with multi-threading a linear regression model and slicing the design matrix Modeling specification , matrix , paralellization	11	1565	November 20, 2021
Parallelising matrix multiplications Modeling	10	531	May 4, 2023
Reduce_sum parallelisation issue Modeling cmdstanr , multivariate-normal	12	1032	February 24, 2022

Linear, parallell regression

Related topics