Convergence / latest version of Stan issues with our covid-19 model?

Seth:

I don’t think you should be concerned about stability of endpoints of 95% intervals. I don’t think these matter, or should matter, for your conclusions. If you care about tail events, fine, then incorporate that into your decision analysis.

Also see Section 10.5 of BDA3: How Many Simulation Draws Are Needed?

Maybe this model should be rewritten to take advantage of within-chain parallelism with the new reduce_sum feature about to land in 2.23? Do you have enough cores to drive this?

I don’t know the data sizes, but if your data-likelihood evaluation involves lots of terms, then this should speed things up a lot.
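
To give a flavour of it, here is a minimal sketch of the reduce_sum pattern, with made-up names and a plain negative binomial likelihood (not the covid19model code). In the released 2.23 convention the sliced argument comes first, then start and end:

    functions {
      // partial sum over a slice of the data; reduce_sum hands us the
      // slice plus its start/end indices into the full data set
      real partial_ll(int[] deaths_slice, int start, int end,
                      vector mu, real phi) {
        return neg_binomial_2_lpmf(deaths_slice | mu[start:end], phi);
      }
    }
    data {
      int<lower=1> N;
      int deaths[N];
    }
    parameters {
      vector<lower=0>[N] mu;
      real<lower=0> phi;
    }
    model {
      int grainsize = 1; // 1 = let the scheduler pick slice sizes
      mu ~ lognormal(0, 1);
      phi ~ normal(0, 5);
      // partial sums over slices of deaths are evaluated in parallel
      target += reduce_sum(partial_ll, deaths, grainsize, mu, phi);
    }

Compile with STAN_THREADS=true and set STAN_NUM_THREADS at run time to control the number of threads.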

I am happy to give a hand if you want to.


Please give us a hand! We have many, many cores (thanks to the generosity of Microsoft and Amazon) and this could be a huge speedup. How do we get started?

Is there a way for me to run the model with some fake data (ideally a subset)? Then I can edit the Stan file directly.

You need to install the current release candidate to get started; the final release should be out next week (some slight changes will happen to the facility, but nothing big).

EDIT: Oh, there is a README… let me start there… unless there is a super quick and dirty way to run the model (maybe a Stan data file with fake data, for example)

Here’s a branch where we can collaborate (https://github.com/flaxter/covid19model/tree/parallel). It’s set up to run with a subset of real data; look in parallel.r. (Happy to switch to synthetic data if that’s important.)

Are you wds15 on GitHub? I’ll add you.


Yes, I am wds15 on GitHub.

Thanks for the branch.

I think I also see opportunities to use the performance trick posted here. I’ll start working on that now and let you know if I have any success with a pull request.

Here we go. This was an interesting case and I had to write two flavours of this.

Since I decided that I want to parallelise over countries, I created my own data set using the base.r script with the defaults. This gave me 14 countries and 50-ish observations per country. My first attempt was to do the easy thing and merely parallelise the data log-likelihood evaluation of the negative binomial; see the base_rs.stan model for that. It is probably easiest to dive in there first to get the gist of this (though I haven’t commented it). Unfortunately, that did not give me any speedup at all.

OK… then I figured that you do a lot of computation in the transformed parameters block, which puts quite a substantial strain on the AD (automatic differentiation) system. Given how it was written, it was easy to push this model into the partial reducer function, which is now base_rs2.stan… and voila, now I am getting these timings on my 4-core laptop for 25 warmup / 25 sampling iterations on the base data set:

## 132s on 1 core (old model)
## 113s on 1 core (new model)
## 87s on 2 cores
## 66s on 3 cores
## 64s on 4 cores

So basically we can double the speed with 3 cores here. Not bad. Maybe you will get even better behaviour on machines with more CPU cache. One could improve things further by getting rid of the large (time x country) matrix, which is oversized for countries with fewer observations, and instead looping by country over data structures that have the correct number of entries per country. That would reduce the AD tape size and make it faster. A sketch of the per-country reducer idea is below.
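
Roughly like this, with made-up names and shapes (a sketch of the idea, not the actual code in base_rs2.stan). The per-country work lives inside the reducer, and each country only touches its own N_obs[m] days; in base_rs2.stan the prediction computation sits inside as well, I just pass it in here to keep the sketch short:

    functions {
      // hypothetical partial reducer over a slice of countries
      real country_ll(int[] country_slice, int start, int end,
                      int[] N_obs, int[,] deaths,
                      matrix prediction, real phi) {
        real ll = 0;
        for (m in start:end) {
          int n = N_obs[m]; // only the days country m has data for
          ll += neg_binomial_2_lpmf(deaths[m, 1:n] | prediction[1:n, m], phi);
        }
        return ll;
      }
    }

This would be called as target += reduce_sum(country_ll, countries, 1, N_obs, deaths, prediction, phi); where countries is just the int array 1, …, M.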

All of this runs with the 2.23 release candidate. Watch out for the required change to the model when 2.23 comes out, since a small convention will change (see the comment in the model).

Let me know if you have questions on this. If you want, we can have a quick meeting on it should you need more explanation of what I have done.


Very cool, looking forward to trying this on longer runs and more cores. Just to check: I need to install the CmdStan 2.23 release candidate, right? There’s no RStan 2.23 release candidate?

You need CPUs with a lot of fast cache to improve the situation. Right now it looks as if speedups are somewhat limited beyond 3 cores with the data set I used. Possibly more data will allow you to speed up further with increasing CPU count (because then a larger fraction of CPU time is spent in the partial reduce).

And yes, you have to use CmdStan 2.23 for this. I used my own R CmdStan wrapper, but you can use cmdstanr as well… it’s only available from GitHub at the moment, though (you’ll find it quickly).

I’m getting really big variance in the timings from chain to chain when chains are this short, so I’d be wary of these timings. Unless you’ve done something extra to eliminate that variance or average over many runs?

FYI, I discovered I’d been reading the model incorrectly (I’d had the covariate matrices transposed in my head), and there’s far less opportunity for this trick to make a difference than I thought, at least with the simple approach from that case study example. There’s still a lot of redundancy, though: for a given covariate and country, there are only a couple of unique values being multiplied by that covariate’s alpha. I’ve begun to think about approaches to eliminating that redundancy as well (a sketch of the generic idea is below), but it’ll take me a while to work out something general/flexible.
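
To sketch the generic version of that trick (a toy model with made-up names, not the covid19model structure): if a covariate only takes U distinct values, you can do U multiplies instead of N and reuse the products through multiple indexing:

    data {
      int<lower=1> N;               // observations
      int<lower=1> U;               // distinct covariate values (typically small)
      vector[U] uniq;               // the distinct values themselves
      int<lower=1, upper=U> idx[N]; // observation -> distinct-value lookup
      vector[N] y;
    }
    parameters {
      real alpha;
    }
    transformed parameters {
      vector[U] uniq_alpha = uniq * alpha; // U multiplies instead of N
      vector[N] eta = uniq_alpha[idx];     // reuse via indexing, cheap on the AD tape
    }
    model {
      alpha ~ normal(0, 1);
      y ~ normal(eta, 1);
    }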

So… I took a closer look at the model now and discovered substantial speedups from making more efficient use of what we have in Stan.

In your model you do lots of convolutions, which are programmed as double sums. It turns out that with a bit of reorganising of the data structures you can write the inner loop of these using dot_product. This drastically cuts down the AD tree size. I just pushed a third version of the model, base_rs3.stan, to the repository. I hope I haven’t messed up, but as far as I can see I am getting the same lp__ posterior. The timings are now:

## 132s on 1 core (old model)
## 113s on 1 core (new model)
## 2nd generation of new model with dot_product use
## 21s on 1 core (2nd gen model)
## 12s on 2 cores
## 10s on 3 cores
## 9s on 4 cores

So we speed up more than 6x by using dot_product (21s vs. 132s on a single core). Here is an example of what I mean:

      // old code using nested loops for a convolution
      E_deaths[1, m_slice] = 1e-15 * prediction[1, m_slice];
      for (i in 2:N2) {
        E_deaths[i, m_slice] = 0; // initialise before accumulating
        for (j in 1:(i-1)) {
          E_deaths[i, m_slice] += prediction[j, m_slice] * f[i-j, m] * ifr_noise[m];
        }
      }

      // new code using dot_product and a reversed f data-structure;
      // slices of length i-1 line the reversed kernel up exactly with
      // the j in 1:(i-1) terms of the loop above
      E_deaths[1, m_slice] = 1e-15 * prediction[1, m_slice];
      for (i in 2:N2) {
        E_deaths[i, m_slice] = ifr_noise[m]
                               * dot_product(sub_col(prediction, 1, m_slice, i-1),
                                             tail(f_rev[m], i-1));
      }
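
For reference, f_rev holds f with each column reversed in time. A plausible construction in transformed data (assuming f is declared as matrix[N2, M], as in the model; I haven’t spelled this out in the snippet above):

      vector[N2] f_rev[M]; // f in reverse time order, one vector per country
      for (m in 1:M)
        for (k in 1:N2)
          f_rev[m][k] = f[N2 - k + 1, m];

With this, tail(f_rev[m], i-1) is (f[i-1, m], …, f[1, m]), which is exactly the kernel the nested loop walks through.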

Doing this allows Stan to represent the inner loop using a single node in the AD tree instead of up to N2 nodes.

In words: on just 2 cores we speed up the model 11x compared to the original (132s down to 12s)!

I just ran the same chain with a fixed seed. This way I (should) get rid of all the variation that should not matter when we talk about relative speedups.


This is great!

I have been working on generalizing some parts of this model, making use of the same ideas. Will you do a PR to the covid19model repo? Please let me know when, so I can make use of these improvements.

/Måns

It’s in a branch on Seth’s repo (linked upthread).

One can probably continue improving things, but to me it looks as if it is now fast enough.

(So the usual game can continue: make the model better, then iterate again over the computational things.)

Cool!

I’ll take a look! So is this introducing both parallelism and more efficient Stan code in one?

/Måns

Yes, both things are in there now.

Have a look at the base_rs.stan model first (in that directory) if you want to see how parallelism is introduced “on its own”… it’s just that this did not speed things up at all, as a lot of the work is in the transformed parameters block.

Sebastian

That’s great!

I should have checked your code earlier. I just opened a PR to enable the model to take an arbitrary number of covariates (something Aki and I are working on) here:

Parts of it are similar to your code, but I extend the idea directly to the input data to deliver design matrices per country instead.

Would it be possible to include only the more efficient code and not the parallelism, or are these intertwined now?

/Måns

You can simply avoid the reduce_sum call and call the reducer function directly. Just tell it to reduce all countries at once.
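
Schematically, reusing the hypothetical partial_ll from the sketch upthread (made-up names, not the actual reducer in the branch):

    // with threading:
    // target += reduce_sum(partial_ll, deaths, grainsize, mu, phi);

    // without threading (works in plain RStan): call the reducer
    // directly over the full range; mathematically identical
    target += partial_ll(deaths, 1, N, mu, phi);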

Makes sense?

Although you do seem to gain some speed by running this with reduce_sum, even when using only one core. The reason is that the AD graph is evaluated differently.

The code I wrote also allows you to pass design matrices per country in directly… though a few more steps would be needed to generalize it.

Still catching up with all the great stuff happening above, but just to say that a version of the code without parallelism would certainly be a great help to RStan users at the moment!