Convergence / latest version of Stan issues with our covid-19 model?

Seth:

I don’t think you should be concerned about stability of endpoints of 95% intervals. I don’t think these matter, or should matter, for your conclusions. If you care about tail events, fine, then incorporate that into your decision analysis.

Also see Section 10.5 of BDA3: How Many Simulation Draws Are Needed?

Maybe this model should be rewritten to take advantage of within-chain parallelism with the new reduce_sum feature about to land in 2.23? Do you have enough cores to drive this?

I don’t know the data sizes, but if your data-likelihood evaluation involves lots of terms, then this should speed things up a lot.
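
To give a flavour of it, here is a minimal sketch of the reduce_sum pattern, with made-up names and a plain negative binomial likelihood (not the covid19model code). In the released 2.23 convention the sliced argument comes first, then start and end:

    functions {
      // partial sum over a slice of the data; reduce_sum hands us the
      // slice plus its start/end indices into the full data set
      real partial_ll(int[] deaths_slice, int start, int end,
                      vector mu, real phi) {
        return neg_binomial_2_lpmf(deaths_slice | mu[start:end], phi);
      }
    }
    data {
      int<lower=1> N;
      int deaths[N];
    }
    parameters {
      vector<lower=0>[N] mu;
      real<lower=0> phi;
    }
    model {
      int grainsize = 1; // 1 = let the scheduler pick slice sizes
      mu ~ lognormal(0, 1);
      phi ~ normal(0, 5);
      // partial sums over slices of deaths are evaluated in parallel
      target += reduce_sum(partial_ll, deaths, grainsize, mu, phi);
    }

Compile with STAN_THREADS=true and set STAN_NUM_THREADS at run time to control the number of threads.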

I am happy to give a hand if you want to.


Please give us a hand! We have many, many cores (thanks to the generosity of Microsoft and Amazon) and this could be a huge speedup. How do we get started?

Is there a way for me to run the model with some fake data (ideally a subset)? Then I can edit the Stan file directly.

You need to install the current release candidate to get started; the final release should be out next week (some slight changes will happen to the facility, but nothing big).

EDIT: Oh, there is a README… let me start there… unless there is a super quick and dirty way to run the model (maybe a Stan data file with fake data, for example)

Here’s a branch where we can collaborate (https://github.com/flaxter/covid19model/tree/parallel). It’s set up to run with a subset of real data; look in parallel.r. (Happy to switch to synthetic data if that’s important.)

Are you wds15 on GitHub? I’ll add you.


Yes, I am wds15 on GitHub.

Thanks for the branch.

I think I also see opportunities to use the performance trick posted here. I’ll start working on that now and let you know if I have any success with a pull request.

Here we go. This was an interesting case and I had to write two flavours of this.

Since I decided that I want to parallelise over countries, I created my own data set using the base.r script with the defaults. This gave me 14 countries and 50-ish observations per country. My first attempt was to do the easy thing and merely parallelise the data log-likelihood evaluation of the negative binomial; see the base_rs.stan model for that. It is probably easiest to dive in there first to get the gist of this (though I haven’t commented it). Unfortunately, that did not give me any speedup at all.

OK… then I figured that you do a lot of computation in the transformed parameters block, which puts quite a substantial strain on the AD (automatic differentiation) system. Given how it was written, it was easy to push this model into the partial reducer function, which is now base_rs2.stan… and voila, now I am getting these timings on my 4-core laptop for 25 warmup / 25 sampling iterations on the base data set:

## 132s on 1 core (old model)
## 113s on 1 core (new model)
## 87s on 2 cores
## 66s on 3 cores
## 64s on 4 cores

So basically we can double the speed with 3 cores here. Not bad. Maybe you will get even better behaviour on machines with more CPU cache. One could improve things further by getting rid of the large (time x country) matrix, which is oversized for countries with fewer observations, and instead looping by country over data structures that have the correct number of entries per country. That would reduce the AD tape size and make it faster. A sketch of the per-country reducer idea is below.
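
Roughly like this, with made-up names and shapes (a sketch of the idea, not the actual code in base_rs2.stan). The per-country work lives inside the reducer, and each country only touches its own N_obs[m] days; in base_rs2.stan the prediction computation sits inside as well, I just pass it in here to keep the sketch short:

    functions {
      // hypothetical partial reducer over a slice of countries
      real country_ll(int[] country_slice, int start, int end,
                      int[] N_obs, int[,] deaths,
                      matrix prediction, real phi) {
        real ll = 0;
        for (m in start:end) {
          int n = N_obs[m]; // only the days country m has data for
          ll += neg_binomial_2_lpmf(deaths[m, 1:n] | prediction[1:n, m], phi);
        }
        return ll;
      }
    }

This would be called as target += reduce_sum(country_ll, countries, 1, N_obs, deaths, prediction, phi); where countries is just the int array 1, …, M.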

All of this runs with the 2.23 release candidate. Watch out for the required change to the model when 2.23 comes out, since a small convention will change (see the comment in the model).

Let me know if you have questions on this. If you want, we can have a quick meeting on it should you need more explanation of what I have done.


Very cool, looking forward to trying this on longer runs and more cores. Just to check: I need to install the CmdStan 2.23 release candidate, right? There’s no RStan 2.23 release candidate?

You need CPUs with a lot of fast cache to improve the situation. Right now it looks as if speedups are somewhat limited beyond 3 cores with the data set I used. Possibly more data will allow you to speed up further with increasing CPU count (because then a larger fraction of CPU time is spent in the partial reduce).

And yes, you have to use CmdStan 2.23 for this. I used my own R CmdStan wrapper, but you can use cmdstanr as well… it’s only available from GitHub at the moment, though (you’ll find it quickly).

I’m getting really big variance in the timings from chain to chain when chains are this short, so I’d be wary of these timings. Unless you’ve done something extra to eliminate that variance or average over many runs?

FYI, I discovered I’d been reading the model incorrectly (I’d had the covariate matrices transposed in my head), and there’s far less opportunity for this trick to make a difference than I thought, at least with the simple approach from that case study example. There’s still a lot of redundancy, though: for a given covariate and country, there are only a couple of unique values being multiplied by that covariate’s alpha. I’ve begun to think about approaches to eliminating that redundancy as well (a sketch of the generic idea is below), but it’ll take me a while to work out something general/flexible.
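
To sketch the generic version of that trick (a toy model with made-up names, not the covid19model structure): if a covariate only takes U distinct values, you can do U multiplies instead of N and reuse the products through multiple indexing:

    data {
      int<lower=1> N;               // observations
      int<lower=1> U;               // distinct covariate values (typically small)
      vector[U] uniq;               // the distinct values themselves
      int<lower=1, upper=U> idx[N]; // observation -> distinct-value lookup
      vector[N] y;
    }
    parameters {
      real alpha;
    }
    transformed parameters {
      vector[U] uniq_alpha = uniq * alpha; // U multiplies instead of N
      vector[N] eta = uniq_alpha[idx];     // reuse via indexing, cheap on the AD tape
    }
    model {
      alpha ~ normal(0, 1);
      y ~ normal(eta, 1);
    }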

So… I took a closer look at the model now and discovered substantial speedups from making more efficient use of what we have in Stan.

In your model you do lots of convolutions, which are programmed as double sums. It turns out that with a bit of reorganising of the data structures you can write the inner loop of these using dot_product. This drastically cuts down the AD tree size. I just pushed a third version of the model, base_rs3.stan, to the repository. I hope I haven’t messed up, but as far as I can see I am getting the same lp__ posterior. The timings are now:

## 132s on 1 core (old model)
## 113s on 1 core (new model)
## 2nd generation of new model with dot_product use
## 21s on 1 core (2nd gen model)
## 12s on 2 cores
## 10s on 3 cores
## 9s on 4 cores

So we speed up more than 6x by using dot_product (21s vs. 132s on a single core). Here is an example of what I mean:

      // old code using nested loops for a convolution
      E_deaths[1, m_slice] = 1e-15 * prediction[1, m_slice];
      for (i in 2:N2) {
        E_deaths[i, m_slice] = 0; // initialise before accumulating
        for (j in 1:(i-1)) {
          E_deaths[i, m_slice] += prediction[j, m_slice] * f[i-j, m] * ifr_noise[m];
        }
      }

      // new code using dot_product and a reversed f data-structure;
      // slices of length i-1 line the reversed kernel up exactly with
      // the j in 1:(i-1) terms of the loop above
      E_deaths[1, m_slice] = 1e-15 * prediction[1, m_slice];
      for (i in 2:N2) {
        E_deaths[i, m_slice] = ifr_noise[m]
                               * dot_product(sub_col(prediction, 1, m_slice, i-1),
                                             tail(f_rev[m], i-1));
      }
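
For reference, f_rev holds f with each column reversed in time. A plausible construction in transformed data (assuming f is declared as matrix[N2, M], as in the model; I haven’t spelled this out in the snippet above):

      vector[N2] f_rev[M]; // f in reverse time order, one vector per country
      for (m in 1:M)
        for (k in 1:N2)
          f_rev[m][k] = f[N2 - k + 1, m];

With this, tail(f_rev[m], i-1) is (f[i-1, m], …, f[1, m]), which is exactly the kernel the nested loop walks through.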

Doing this allows Stan to represent the inner loop using a single node in the AD tree instead of up to N2 nodes.

In words: on just 2 cores we speed up the model 11x compared to the original (132s down to 12s)!

I just ran the same chain with a fixed seed. This way I (should) get rid of all the variation that should not matter when we talk about relative speedups.


This is great!

I have been working on generalizing some parts of this model, making use of the same ideas. Will you do a PR to the covid19model repo? Please let me know when, so I can make use of these improvements.

/Måns

It’s in a branch on Seth’s repo (linked upthread).

One can probably continue improving things, but to me it looks as if it is now fast enough.

(So the usual game can continue: make the model better, then iterate again over the computational things.)

Cool!

I’ll take a look! So is this introducing both parallelism and more efficient Stan code in one?

/Måns

Yes, both things are in there now.

Have a look at the base_rs.stan model first (in that directory) if you want to see how parallelism is introduced “on its own”… it’s just that this did not speed things up at all, as a lot of the work is in the transformed parameters block.

Sebastian

That’s great!

I should have checked your code earlier. I just opened a PR to enable the model to take an arbitrary number of covariates (something Aki and I are working on) here:

Parts of it are similar to your code, but I extend the idea directly to the input data to deliver design matrices per country instead.

Would it be possible to include only the more efficient code and not the parallelism, or are these intertwined now?

/Måns

You can simply avoid the reduce_sum call and call the reducer function directly. Just tell it to reduce all countries at once.
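
Schematically, reusing the hypothetical partial_ll from the sketch upthread (made-up names, not the actual reducer in the branch):

    // with threading:
    // target += reduce_sum(partial_ll, deaths, grainsize, mu, phi);

    // without threading (works in plain RStan): call the reducer
    // directly over the full range; mathematically identical
    target += partial_ll(deaths, 1, N, mu, phi);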

Makes sense?

Although you do seem to gain some speed by running this with reduce_sum, even when using only one core. The reason is that the AD graph is evaluated differently.

The code I wrote also allows you to pass design matrices per country in directly… though a few more steps would be needed to generalize it.

Still catching up with all the great stuff happening above, but just to say that a version of the code without parallelism would certainly be a great help to RStan users at the moment!