Linear, parallell regression

I can’t help there @Ole_Petter_Hansen. I did a series of runs over the day using the original model from your first post and the data generating script you provided recently:

N K Shards linreg time (s) map_rect time (s) Ratio
1000 50 1 2.22 18.618 8.39
1000 50 4 2.138 20.234 9.46
1000 50 12 2.388 11.206 4.69
1000 50 24 2.265 21.391 9.44
10000 50 1 36.794 400.119 10.87
10000 50 4 40.887 296.588 7.25
10000 50 12 40.624 96.013 2.36
10000 50 24 39.703 89.769 2.26
100000 50 1 781.682 NA* #VALUE!
100000 50 4 954.53 3978.824 4.17
100000 50 12 854.336 2490.854 2.92
100000 50 24 828.47 1678.182 2.03
1000 200 4 13.126 141.935 10.81
1000 200 12 12.893 67.072 5.20
1000 200 24 12.897 70.384 5.46
10000 200 4 400.634 1934.303 4.83
10000 200 24 375.313 917.11 2.44

I got a bit fed up with it as the day went on so I skipped some options. However as you can see only in one case was map_rect faster. I suspect this is a fluke also since the linear model took much longer than the model immediately above it, I’m going to rerun this linear model to check it.
So on a second run with different random seed the last row linreg changed form 1450ish to 375

  • oh the NA is because this was clearly going to take many hours and I didn’t have the patience haha
1 Like

Thanks. Puzzling with the disadvantage relative to linreg.
I’ve given up on Windows for this - crazy how easy it is to do in Ubuntu! I’ll see if I can get similar results.

1 Like

[quote=“Ole_Petter_Hansen, post:22, topic:5230”]
Thanks. Puzzling with the disadvantage relative to linreg.[/quote]

I’m not that surprised. I was lurking on some of the development chat for this and if I understood correctly the parallelisation should work better for hierarchical models than linear models. Also I think @wds15 was dropping us hints that this might not work so well for this model :) But honestly - I don’t mind, at the start of the thread I had no clue how to go about this, so this has been worthwhile learning for me even if its not given efficiency gains :)

I know!! I spent about a day simply trying to install cmdstan in windows! In ubuntu I had the thing working in about an hour!

Ok, finally got it up and running. Results below. Looks like at least here the N=1000, K=500 is getting close to the baseline. Set STAN_NUM_THREADS to -1 in all cases.

…but now that the infrastructure is up an running I’ll try some more diffcult models. Thinking ordinal logit, with weak priors -> large tree.

N K shards Sec. Sec. linear model ratio
100 50 10 6.404 0.636 10.100
100 50 5 2.532 0.636 4
100 50 4 2.266 0.636 3.600
100 50 2 1.660 0.636 2.600
100 50 1 1.459 0.636 2.300
1,000 500 10 119.267 90.591 1.300
1,000 500 5 104.732 90.591 1.200
1,000 500 4 96.168 90.591 1.100
1,000 500 2 211.634 90.591 2.300
1,000 500 1 219.123 90.591 2.400

Intersting to see all these results. Just a note: I created map_rect with models in mind which take days to compute. I am not saying you need to go there, but big hierarchical models are where I am expecting to see the payoff. The big hierarichal models have not too much of communication needs, but are costly for each unit (at least this is where this pays off).

… and again: if you are already on Ubuntu, then you should try MPI. However, please download the make/models file from the current cmdstan. Going MPI on Ubuntu is a matter of installing the openMPI package, see the stan-math wiki for details.

Is the table above created by a single script? If so, then I can offer to run this on our cluster.

1 Like

I can try to make a script of it. Yes, I will check out MPI.

Trying to fit a more complex model - but it does take a bit of head scratching to figure out how to squeeze data structures through a funnel-like two-dimensional array. Maybe I’ll get used to it.

Are the arguments of map_rect settled, or still a discussion?

It is released in a stable version…you can expect it to stay like it.

This type of packing is standard for a few functions in stan right now. Things will hopefully improve once we have Stan 3. Another interesting potential future feature is the serialization thing which @Bob_Carpenter brought up in another thread, but this is only at the early design stage at the moment.

Mine was done manually - thats why I got fedup running various options, I would have to learn how to script it (or @Ole_Petter_Hansen if write one I can probably understand it and adapt as needed).

@wds15 - yes I appreciate you have big hierarchical models in mind. I’m not fully up to speed with the syntax for map_rect yet as I wasn’t sure til now that I could even figure out how to run it. I have use cases in mind for maybe 6 months from now, hence I’m trying to lay some groundwork for myself. I was also curious how it would run on a Threadripper CPU, as I have not seen anyone report using them for Stan before. They have fairly large cache compared to intel chips which might also effect how efficiently they run parallel STAN? Mine has 32MB L3 cache for example. (They just released an insane new chip also https://www.pcworld.com/article/3296378/components-processors/2nd-gen-threadripper-review-amds-32-core-cpu-is-insanely-fast.html )

Anyhow, I’m likely of limited assistance with coding models currently, but if you guys want me to run benchmark test models on my machine I’m happy to help so long as I don’t need to use the machine for something else!
Edit - I’ll have a go at installing MPI in the coming week.

Allright, here are some files for doing a benchmark.

1: All the attached files need to be in the same folder.
2: make linreg and reg_par (standard cmdstan-stuff)
3: make sure you have dplyr, magrittr, stargazer and rstan as available libraries in R.
4: run start_benchmark (had to give it a txt-extenstion to be allowed to upload it…you may remove it). You may need to issue chmod +x start_benchmark first.

Program will then loop over orders 2,3,4 and 5, with shards 1, 2, 4, 5 and 10 as well as a plain linear model. When it is done it runs the post.r -script, that reads in times and saves a file “results.html”. Note: If you change stuff in the gen_dat-file, (especially orders and shards), you may need to update the loops in start_benchmark).

start_benchmark.txt (681 Bytes)
gen_dat.r (378 Bytes)
post.r (383 Bytes)
linreg.stan (239 Bytes)
reg_par.stan (986 Bytes)

This my first Ubuntu script. Please let me know if something should be improved.

1 Like

On second thought, you might want to use the attached script instead. It scales down the number of iterations when N increases. Otherwise it will take forever with the larger datasets (might even be more iterations than what is needed for a relative measurement here…)
start_benchmark2.txt (739 Bytes)

1 Like

Got an ordinal probit up and running.

ord.stan (382 Bytes)
ord_par.stan (1.2 KB)

If anyone has better ideas for looping through individuals than this in the map_rect-function I’m all ears.

functions {
  vector o_prob(vector beta_c, vector theta,
                real[] xs, int[] xi) {
	int J = xi[1];
	int K = xi[2];			
	int D = xi[3];	
	real lp=0;
        vector[D] beta = beta_c[1:D];
	vector[K-1] c = beta_c[(1+D):(D+K-1)];
	vector[K] mu;
	for (n in 1:J){
          real eta=0;
          for(kk in 1:D)
            eta += xs[n+J*(kk-1)]*beta[kk];
	  mu[1] = 1-Phi(eta-c[1]);
	  for (k in 2:(K-1))
	    mu[k] = Phi(eta-c[k-1]) - Phi(eta-c[k]);
	  mu[K] = Phi(eta-c[K-1]);
	  lp += categorical_lpmf(xi[n+3] | mu);
	}
    return [lp]';
  }
}

As expected, with loops and a bunch of CDF-evaluations parallellisation pays off. Below results from a moderate number of iteration using the ordinal models above. 5 explanatory variables in all cases.

time_in_ms order shards time_linear ratio
5,232 2 10 882 5.932
799 2 5 882 0.906
757 2 4 882 0.858
779 2 2 882 0.883
944 2 1 882 1.070
1,775 3 10 4,557 0.390
1,433 3 5 4,557 0.314
2,008 3 4 4,557 0.441
3,487 3 2 4,557 0.765
5,385 3 1 4,557 1.182
10,359 4 10 22,897 0.452
10,914 4 5 22,897 0.477
12,210 4 4 22,897 0.533
20,666 4 2 22,897 0.903
34,035 4 1 22,897 1.486
1 Like

I was out for the day but just set the benchmark running. I changed a few settings but I’m gonna leave go now and lets see what happens :)
Edit: FYI the 5th order loop errored out because some varaible it was trying to create was 35.7GB in size! I restricted it to only order 2,3,4 now

1 Like

You could alternatively edit the line

K <- (10^order)/2

in the gen_dat.r-file, and e.g. set K to a fixed number.

Results from your benchmark2 script!
Edit: @Ole_Petter_Hansen - results updated:

time_in_ms order shards time_linear ratio
24,736 2 24 313 79.029
9,066 2 10 313 28.965
2,578 2 4 313 8.236
1,938 2 2 313 6.192
45,863 3 24 38,704 1.185
34,860 3 10 38,704 0.901
66,252 3 4 38,704 1.712
121,380 3 2 38,704 3.136
4,821,764 4 24 10,231,311 0.471
8,803,770 4 10 10,231,311 0.860
12,528,026 4 4 10,231,311 1.224
19,583,774 4 2 10,231,311 1.914
1 Like

Thanks! Could you please check if the standard linear regression actually has run? Times of 2ms is suspiciously fast.

Ok so with the updated results now things make more sense.
In the above table the number of variables K = (10^order / 2). So the parallel approach only starts to make sense when N ~ 10000 and K ~ 5000 and with higher shard/core count.

1 Like

Interesting! Is this with K=N/2? (i.e. the original gen_dat-file?). If so the order=4-rows are N=10000, K=5000. Doubling speed with 24-threads isn’t really that great. I suppose a linear model needs a ridiculously large N and K for parallellism to pay off!

However, a slighly more complex model (ordinal probit) give better performance for even moderately sized data sets (see above). Wonder what complexity is necessary to get linear reduction in time with #CPU’s.

I’m writing up a model with individual-level parameters, but the map_rect-syntax takes some getting used to.

1 Like

yes K=N/2.
Do you have a data generating file for the probit model? If so post it, and I’ll adapt the bash script to run that too.

gen_dat.r (675 Bytes)
ord_par.stan (1.2 KB)
ord.stan (382 Bytes)
start_benchmark.txt (717 Bytes)
post.r (391 Bytes)

Sure, attached. Maybe just let number of covariates (variable “D” in the gen_dat-file) be a constant, and just increase number of observations first?

1 Like

Ok here is the ordinal probit model for D=10.

time_in_ms order shards time_linear ratio
9,025 2 24 1,257 7.180
7,198 2 20 1,257 5.726
6,038 2 16 1,257 4.804
3,979 2 12 1,257 3.165
2,489 2 8 1,257 1.980
1,616 2 4 1,257 1.286
1,838 2 2 1,257 1.462
1,560 2 1 1,257 1.241
5,168 3 24 7,109 0.727
3,913 3 20 7,109 0.550
3,873 3 16 7,109 0.545
3,275 3 12 7,109 0.461
3,628 3 8 7,109 0.510
5,773 3 4 7,109 0.812
9,612 3 2 7,109 1.352
8,513 3 1 7,109 1.197
13,054 4 24 52,867 0.247
13,189 4 20 52,867 0.249
16,023 4 16 52,867 0.303
18,592 4 12 52,867 0.352
25,118 4 8 52,867 0.475
50,891 4 4 52,867 0.963
83,960 4 2 52,867 1.588
73,762 4 1 52,867 1.395
1 Like