I can’t help there @Ole_Petter_Hansen. I did a series of runs over the day using the original model from your first post and the data generating script you provided recently:
| N | K | Shards | linreg time (s) | map_rect time (s) | Ratio (map_rect / linreg) |
| --- | --- | --- | --- | --- | --- |
| 1000 | 50 | 1 | 2.22 | 18.618 | 8.39 |
| 1000 | 50 | 4 | 2.138 | 20.234 | 9.46 |
| 1000 | 50 | 12 | 2.388 | 11.206 | 4.69 |
| 1000 | 50 | 24 | 2.265 | 21.391 | 9.44 |
| 10000 | 50 | 1 | 36.794 | 400.119 | 10.87 |
| 10000 | 50 | 4 | 40.887 | 296.588 | 7.25 |
| 10000 | 50 | 12 | 40.624 | 96.013 | 2.36 |
| 10000 | 50 | 24 | 39.703 | 89.769 | 2.26 |
| 100000 | 50 | 1 | 781.682 | NA* | — |
| 100000 | 50 | 4 | 954.53 | 3978.824 | 4.17 |
| 100000 | 50 | 12 | 854.336 | 2490.854 | 2.92 |
| 100000 | 50 | 24 | 828.47 | 1678.182 | 2.03 |
| 1000 | 200 | 4 | 13.126 | 141.935 | 10.81 |
| 1000 | 200 | 12 | 12.893 | 67.072 | 5.20 |
| 1000 | 200 | 24 | 12.897 | 70.384 | 5.46 |
| 10000 | 200 | 4 | 400.634 | 1934.303 | 4.83 |
| 10000 | 200 | 24 | 375.313 | 917.11 | 2.44 |
I got a bit fed up with it as the day went on, so I skipped some options. As you can see, though, map_rect was faster in only one case. I suspect that one is a fluke, since the linear model there took much longer than the model immediately above it; I'm going to rerun that linear model to check.
So on a second run with a different random seed, the last-row linreg time changed from ~1450 s to 375 s.
Oh, the NA is because that run was clearly going to take many hours and I didn't have the patience, haha.
Thanks. Puzzling with the disadvantage relative to linreg.
I’ve given up on Windows for this - crazy how easy it is to do in Ubuntu! I’ll see if I can get similar results.
[quote=“Ole_Petter_Hansen, post:22, topic:5230”]
Thanks. Puzzling with the disadvantage relative to linreg.[/quote]
I'm not that surprised. I was lurking on some of the development chat for this, and if I understood correctly the parallelisation should work better for hierarchical models than for linear models. Also, I think @wds15 was dropping us hints that this might not work so well for this model :) But honestly, I don't mind: at the start of the thread I had no clue how to go about this, so this has been worthwhile learning for me even if it hasn't given efficiency gains :)
I know!! I spent about a day simply trying to install cmdstan in windows! In ubuntu I had the thing working in about an hour!
Ok, finally got it up and running. Results below. Looks like at least here the N=1000, K=500 is getting close to the baseline. Set STAN_NUM_THREADS to -1 in all cases.
…but now that the infrastructure is up and running I'll try some more difficult models. Thinking ordinal logit with weak priors -> large tree depth.
Interesting to see all these results. Just a note: I created map_rect with models in mind which take days to compute. I am not saying you need to go there, but big hierarchical models are where I expect to see the payoff. Big hierarchical models don't have much communication overhead, but each unit is costly to evaluate (at least, that is where this pays off).
… and again: if you are already on Ubuntu, then you should try MPI. However, please download the make/models file from the current cmdstan. Going MPI on Ubuntu is just a matter of installing the OpenMPI package; see the stan-math wiki for details.
Is the table above created by a single script? If so, then I can offer to run this on our cluster.
I can try to make a script of it. Yes, I will check out MPI.
Trying to fit a more complex model - but it does take a bit of head scratching to figure out how to squeeze data structures through a funnel-like two-dimensional array. Maybe I’ll get used to it.
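For what it's worth, the packing convention can be sketched outside of Stan. Here is a rough Python illustration of flattening a J-by-D covariate matrix column-major into a shard's flat real array `xs` and prepending the sizes to the int array `xi` (function and variable names here are my own, not from the original scripts):

```python
# Sketch of packing data for a map_rect shard: a J-by-D covariate
# matrix becomes a flat real array xs (column-major), and the integer
# array xi carries the sizes first, then the integer outcomes.
# Names (pack_shard, unpack_entry) are illustrative only.

def pack_shard(x_rows, y, K):
    """x_rows: list of J rows, each a list of D reals; y: J integer outcomes."""
    J = len(x_rows)
    D = len(x_rows[0])
    # Column-major flatten: entry (n, kk) lands at 1-based index
    # n + J*(kk-1), i.e. (n-1) + J*(kk-1) with Python's 0-based indexing.
    xs = [x_rows[n][kk] for kk in range(D) for n in range(J)]
    xi = [J, K, D] + list(y)  # sizes first, then the outcomes
    return xs, xi

def unpack_entry(xs, xi, n, kk):
    """Recover x[n][kk] (1-based n and kk) from the flat array."""
    J = xi[0]
    return xs[(n - 1) + J * (kk - 1)]
```

Round-tripping a small matrix through these two helpers is a quick way to convince yourself the indexing is consistent.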
Are the arguments of map_rect settled, or still a discussion?
It is released in a stable version… you can expect it to stay as it is.
This type of packing is standard for a few functions in Stan right now. Things will hopefully improve once we have Stan 3. Another interesting potential future feature is the serialization idea which @Bob_Carpenter brought up in another thread, but that is only at the early design stage at the moment.
Mine was done manually - that's why I got fed up running the various options. I would have to learn how to script it (or if @Ole_Petter_Hansen writes one, I can probably understand it and adapt it as needed).
@wds15 - yes, I appreciate you have big hierarchical models in mind. I'm not fully up to speed with the syntax for map_rect yet, as until now I wasn't sure I could even figure out how to run it. I have use cases in mind for maybe 6 months from now, hence I'm trying to lay some groundwork for myself. I was also curious how it would run on a Threadripper CPU, as I have not seen anyone report using them for Stan before. They have a fairly large cache compared to Intel chips, which might also affect how efficiently they run parallel Stan? Mine has 32 MB of L3 cache, for example. (They just released an insane new chip too: https://www.pcworld.com/article/3296378/components-processors/2nd-gen-threadripper-review-amds-32-core-cpu-is-insanely-fast.html )
Anyhow, I’m likely of limited assistance with coding models currently, but if you guys want me to run benchmark test models on my machine I’m happy to help so long as I don’t need to use the machine for something else!
Edit - I’ll have a go at installing MPI in the coming week.
Alright, here are some files for doing a benchmark.
1: All the attached files need to be in the same folder.
2: make linreg and reg_par (standard cmdstan-stuff)
3: make sure you have dplyr, magrittr, stargazer and rstan as available libraries in R.
4: run start_benchmark (I had to give it a .txt extension to be allowed to upload it… you may remove it). You may need to issue chmod +x start_benchmark first.
The program will then loop over orders 2, 3, 4 and 5, with shards 1, 2, 4, 5 and 10, as well as a plain linear model. When it is done, it runs the post.r script, which reads in the times and saves a file "results.html". Note: if you change things in the gen_dat file (especially orders and shards), you may need to update the loops in start_benchmark.
On second thought, you might want to use the attached script instead. It scales down the number of iterations when N increases. Otherwise it will take forever with the larger datasets (might even be more iterations than what is needed for a relative measurement here…) start_benchmark2.txt (739 Bytes)
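The overall loop structure of such a benchmark script can be sketched in Python (the binary names linreg/reg_par, the STAN_NUM_THREADS variable, and the CmdStan-style `sample data file=… output file=…` arguments are from this thread; the data/output file naming is hypothetical):

```python
# Sketch of the benchmark loop: enumerate the CmdStan invocations for
# each (order, shards) combination, plus the plain linear model.
# File naming (dat_*.R, out_*.csv) is illustrative, not from the
# original start_benchmark script.
import os

def build_cmd(model, order, shards):
    data_file = f"dat_{order}_{shards}.R"            # hypothetical naming
    out_file = f"out_{model}_{order}_{shards}.csv"   # hypothetical naming
    return [f"./{model}", "sample",
            "data", f"file={data_file}",
            "output", f"file={out_file}"]

def benchmark_plan(orders=(2, 3, 4, 5), shards=(1, 2, 4, 5, 10)):
    """Plain linreg once per order, then reg_par for every shard count."""
    plan = [("linreg", build_cmd("linreg", o, 1)) for o in orders]
    plan += [("reg_par", build_cmd("reg_par", o, s))
             for o in orders for s in shards]
    return plan

if __name__ == "__main__":
    os.environ["STAN_NUM_THREADS"] = "-1"  # use all available cores
    for name, cmd in benchmark_plan():
        print(" ".join(cmd))               # replace with subprocess.run(cmd)
```

Timing each run and scaling the iteration count down for large N (as start_benchmark2 does) would slot in around the `subprocess.run` call.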
If anyone has better ideas for looping through individuals than this in the map_rect-function I’m all ears.
```stan
functions {
  vector o_prob(vector beta_c, vector theta,
                real[] xs, int[] xi) {
    int J = xi[1];
    int K = xi[2];
    int D = xi[3];
    real lp = 0;
    vector[D] beta = beta_c[1:D];
    vector[K-1] c = beta_c[(1+D):(D+K-1)];
    vector[K] mu;
    for (n in 1:J) {
      real eta = 0;
      for (kk in 1:D)
        eta += xs[n + J*(kk-1)] * beta[kk];
      mu[1] = 1 - Phi(eta - c[1]);
      for (k in 2:(K-1))
        mu[k] = Phi(eta - c[k-1]) - Phi(eta - c[k]);
      mu[K] = Phi(eta - c[K-1]);
      lp += categorical_lpmf(xi[n+3] | mu);
    }
    return [lp]';
  }
}
```
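As a sanity check on the cutpoint arithmetic in that loop, the category probabilities telescope and sum to one for any ordered cutpoints. A quick Python port of the same per-observation calculation (my own translation, with Phi built from the error function):

```python
# Python version of the per-observation probability calculation in the
# ordinal probit function above, to sanity-check the cutpoint arithmetic.
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF, matching Stan's Phi()."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ordinal_probs(eta, c):
    """Category probabilities for an ordinal probit.

    eta: linear predictor; c: K-1 increasing cutpoints.
    Mirrors the mu[] assignments in the Stan function (0-based here).
    """
    K = len(c) + 1
    mu = [0.0] * K
    mu[0] = 1.0 - Phi(eta - c[0])
    for k in range(1, K - 1):
        mu[k] = Phi(eta - c[k - 1]) - Phi(eta - c[k])
    mu[K - 1] = Phi(eta - c[K - 2])
    return mu
```

Because the middle terms telescope, the probabilities sum to exactly 1, and they stay positive as long as the cutpoints are increasing.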
As expected, with loops and a bunch of CDF evaluations, parallelisation pays off. Below are results from a moderate number of iterations using the ordinal models above, with 5 explanatory variables in all cases.
I was out for the day but just set the benchmark running. I changed a few settings but I’m gonna leave go now and lets see what happens :)
Edit: FYI, the 5th-order loop errored out because some variable it was trying to create was 35.7 GB in size! I've restricted it to orders 2, 3, 4 now.
Ok so with the updated results now things make more sense.
In the above table the number of variables is K = 10^order / 2. So the parallel approach only starts to make sense when N ~ 10000 and K ~ 5000, and with a higher shard/core count.
Interesting! Is this with K = N/2 (i.e. the original gen_dat file)? If so, the order=4 rows are N=10000, K=5000. Doubling speed with 24 threads isn't really that great. I suppose a linear model needs a ridiculously large N and K for parallelism to pay off!
However, a slightly more complex model (ordinal probit) gives better performance even for moderately sized data sets (see above). I wonder what complexity is necessary to get a linear reduction in time with the number of CPUs.
I'm writing up a model with individual-level parameters, but the map_rect syntax takes some getting used to.