@flaxter, I would be happy to help get the improved code to the master branch ASAP. @wds15, ideally I guess we would want to combine our two Stan files and do a PR?
If I understand reduce_sum() correctly, it does not use any parallelism unless you explicitly enable threading at compile time? So the code would be good to go for single-core use as well?
I had some problems finding a good way to debug my code to assess that it gives the exact same result as the previous implementation. After debugging and discussions with @paul.buerkner, I chose to just debug it with print statements. Do you have any suggestions for a smarter approach to check that we get the exact same results with both model files?
Thanks! As I understand it, the best way to debug (i.e. to ensure that the models are exactly the same) is to do runs with fixed seeds and check that lp__ is identical.
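Complementary to the fixed-seed approach, a low-tech check is to print the accumulated log density from inside the model block itself and compare it across the two implementations on identical data and inits. A minimal sketch (any model-specific names aside, `target()` and `print()` are standard Stan functions):

```stan
model {
  // ... priors and likelihood exactly as in the model being checked ...

  // Debug only: print the log density accumulated so far, so it can be
  // compared line-for-line between the old and new implementations when
  // both are run on the same data with the same initial values.
  // Remove before production runs -- print() fires on every leapfrog
  // step and slows sampling considerably.
  print("lp = ", target());
}
```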
I have now updated the Stan model with the speed improvements @wds15 proposed, but without reduce_sum() for now. I have checked that it works with RStan 2.21.1.
I need to debug to be sure they both give the exact same result. As you note, we can run with a fixed seed and check lp__. Although, sometimes when the Stan file is changed, the compiler rearranges things so that the seed doesn't have the same effect for two different model codes. I don't know the inner workings well enough to know exactly why, but I expect the fact that we make alpha a vector instead of an array is an example. I think it should work by checking lp__ for specified data and parameter values. I will see if I can get those checks through RStan somehow.
So, I’ll try to debug the code now to check that it gives the exact same result.
I have now done explicit checks on the lp for specific parameters to assess that all implementations give the same lp__. When doing this I found a bug at lines 80-83:
When including the speedup improvement there, I get a different log_prob.
I have now commented out your previous code in that section @wds15, but maybe you can see what is wrong?
The current implementation without that final improvement gives roughly a 50% speedup, compared to 3-4x when including the part that is now commented out. Hence it would be really nice to get that final part to work.
Improved model (base_general_speed.stan - without improvement in E_deaths)
I think there are even more speedups on the table: The generated quantities block did not yet use the optimised expressions. So I did one more round of changes:
- corrected model (there was an off-by-one error; thanks to @MansMeg for nailing that one)
- modularised the model => now the generated quantities block also uses the optimised code with dot products. This makes the model again somewhat faster (but please have a look, as E_deaths[1] is now slightly different; to me this looks like only a numerical difference, which should not matter)
- all former model outputs are now part of the output (E_deaths, predictions, Rt, Rt_adj)
- the model is set up to be compatible with RStan and does not use the parallelisation offered by CmdStan 2.23 - but it can be easily switched as shown in the comments
- this time I checked that the log-lik value is exactly the same for a test draw
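For reference, the loop-to-dot-product rewrite looks roughly like this (a sketch only; `f`, `f_rev`, `prediction`, and `E_deaths` are illustrative names, not necessarily the exact ones in the model):

```stan
// Precompute the reversed kernel once, e.g. in transformed data:
vector[N] f_rev;
for (i in 1:N)
  f_rev[i] = f[N - i + 1];

// Before: inner accumulation loop, executed for every day n
E_deaths[n] = 0;
for (i in 1:(n - 1))
  E_deaths[n] += prediction[i] * f[n - i];

// After: the same sum as one dot product over contiguous slices,
// which the autodiff stack handles far more efficiently
E_deaths[n] = dot_product(head(prediction, n - 1), f_rev[(N - n + 2):N]);
```

Since the kernel is data, reversing it once up front costs nothing per iteration, and the contiguous slices are what let dot_product pay off.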
While I understand the desire to use RStan, I think it would be good to keep the model in this modularised form so that you can easily switch between RStan without parallelisation and CmdStan with parallelisation.
The extra speed should really give you guys a lot of room to improve the model in itself if that is possible.
This is great! Unfortunately I realized that we now have two different PRs to the original repository with more or less the same idea. I think we would like to merge them into one PR that handles both RStan and CmdStan and takes design matrices as input. How do we do this in the easiest way? Should I hack on your PR or should you hack on mine? =)
I opened the PR more or less to show how to use cmdstan via cmdstanr in a simple way. I am going to close mine and you can add the cmdstanr use if you want.
The differences are really a few lines. If you need help let me know.
Cool! I think @wds15 added some additional improvements in that PR. I can take a look later today and try to merge them together so we get the final improvements as well.
@flaxter, do you want me to add cmdstan as well for within-chain parallelism?
I am afraid we now have three places here for the code (@rok_cesnovar's, yours, mine). It would be great if you, @mans_magnusson, could coordinate what should finally be merged.
The model I wrote now tries to make it as simple as possible for the COVID-19 modelers at Imperial College to work with it - so all the same outputs are created, just a lot faster. As I sensed that RStan usability is a priority, the model can now be easily switched between running in parallel (requiring 2.23) and not.
The bits you added about generalizing the covariates being used are helpful as well, of course. It's just that these need additional data-prep steps, which I wanted to avoid dealing with myself.
So it would be great to settle all of this and I would personally, of course, be happy to see the uptake of reduce_sum coming with 2.23.
My contributions were minimal. I think @mans_magnusson should try merging the changes of your latest two models and we can then add the option of cmdstanr from my script. The latter is a piece of cake.
Now the proposed new model is included. I did not get much additional speedup from these last fixes, but they make within-chain parallelism easily available if interested. You can see the final PR with this here:
Please let me know if I missed anything.
So now I let @flaxter et al decide what to do with the PR.
Thanks all! We’re in the midst of revising our report (and the speedups helped immensely!); just need a few days to catch up with all of this great work.
We haven’t done any modeling with it beyond descriptive stuff, but were planning to at some point fit the data with this simplified model I’ve been working on:
We have coded up an age extension of Seth’s model.
Our code is on Github:
We are running 1 chain with 100 iterations and 50 warmup iterations with
`Rscript base-ages.r`
It takes 50 minutes on my HP machine with Ubuntu 18.04.4, R 3.6.3, and RStan 2.19.1.
Runtimes on a single core get quickly out of hand.
We would be very grateful if someone could point us to parts that can be sped up. We are keen to try to parallelise this over countries, using cmdstan 2.23.
You can do the same thing I did for Seth's model. So replace those inner for loops with dot products and use reduce_sum, as you can see from the speed flavors of the model which you have in your git already.
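A hedged sketch of what slicing the likelihood over countries with reduce_sum could look like (requires CmdStan 2.23+ compiled with STAN_THREADS; all names here are illustrative, and the age-structured model's actual data layout will differ):

```stan
functions {
  // Partial sum over a slice of countries. reduce_sum slices the first
  // array argument and passes the start/end indices of the slice.
  real country_lp(int[] m_slice, int start, int end,
                  int[,] deaths, vector[] E_deaths, real phi) {
    real lp = 0;
    for (i in 1:size(m_slice)) {
      int m = m_slice[i];  // actual country index
      lp += neg_binomial_2_lpmf(deaths[m] | E_deaths[m], phi);
    }
    return lp;
  }
}
model {
  int country_idx[M];
  for (m in 1:M) country_idx[m] = m;

  // ... priors ...

  // Serial version (works with RStan 2.21):
  // for (m in 1:M)
  //   target += neg_binomial_2_lpmf(deaths[m] | E_deaths[m], phi);

  // Parallel version (grainsize 1 lets the scheduler pick the split):
  target += reduce_sum(country_lp, country_idx, 1,
                       deaths, E_deaths, phi);
}
```

Because the per-country terms are independent, the partial sums add up to exactly the same target, so lp__ checks against the serial version still apply.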
I think Melodie has done those kinds of things. There may be one or two more tricks which would be great to incorporate (and we'd love to know of them!), but do you think parallelising the code and using CmdStan would give us good gains? And how should this best be done, as the Stan model is quite a bit different?