Strategies for fitting a large model via AWS EC2 services

Hi all,

I am working to fit a hierarchical IRT model using ‘brms’ in R, from which I intend to extract the estimate of a latent parameter (i.e. the person parameter in a GPCM model). So far, I have managed to fit the full version of the model on a subset of the full dataset with no issues. However, I need to fit the full model on a dataset of roughly 1.2 million observations. To do so, I turned to AWS’s EC2 service, as my personal machine is woefully under-equipped for the task. At the moment, I am working with a virtual cloud instance with the following specs:

OS: Ubuntu 18.04
CPU: 16 cores (32 threads)
RAM: 128 GB
Storage: 30 GB SSD

The dataset itself is roughly 140 MB, and by my estimates the 30 GB of storage should be more than sufficient to hold the dataset and the saved fitted model.

My issue is that when I start to fit the full model, R becomes unresponsive after it reports beginning the first warm-up samples. Given the specs I have rented and the time taken to fit the model on the subset, I expect fitting should take roughly 20 hours. However, after roughly 4 hours, I saw no progress in the sampling.

I would love to hear suggestions for fitting hierarchical models on large datasets. Is it possible that I need to rent an instance with greater compute capacity, or is it possible to make use of a GPU when fitting a model through brms in R? Alternatively, I could be missing an obvious mistake in how I have used AWS’s services.

One idea is to use brms::make_stancode() to get the Stan code for your model and brms::make_standata() to get the data list that code expects, then use ezStan, which has a slightly better progress indicator.
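As a sketch, extracting the code and data might look like this (`my_formula`, `my_family`, and `df` are placeholders for your actual model specification and data, not real objects):

```r
# Hedged sketch: pull the Stan program and data list out of brms.
# my_formula, my_family, and df are placeholders for your own objects.
library(brms)

code  <- make_stancode(my_formula, data = df, family = my_family)
sdata <- make_standata(my_formula, data = df, family = my_family)

# Save the Stan program so it can be inspected or hand-edited,
# then fitted through another interface such as ezStan.
writeLines(code, con = "model.stan")
```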

Alternatively, I’m guessing there’s an option in brms to save samples to CSV files as they are drawn, in which case you could use that option and then inspect the contents of those files to check on progress manually.
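For instance, rstan’s sampling() has a `sample_file` argument, and brm() forwards extra arguments to the sampler, so something like the following may stream draws to CSV as they are generated (this is a guess on my part; check the behaviour with your brms/rstan versions; `my_formula` and `df` are placeholders):

```r
# Hedged sketch: ask the sampler to write draws to CSV so progress can be
# monitored from outside R. sample_file is an rstan::sampling() argument
# that brm() should pass through via its `...`.
library(brms)

fit <- brm(my_formula, data = df,
           chains = 4, cores = 4,
           sample_file = "draws.csv")

# While sampling runs, the growing CSV file(s) can be checked from a
# shell, e.g. by counting lines with `wc -l`.
```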

Oh, though you report using a 16-core system. Hopefully you’re not running 16 chains in parallel, as this would be overkill in terms of why we run multiple chains. Eventually brms will include the within-chain multi-core parallelism enabled by reduce_sum, but as it stands I think you’d have to get the code and data as I suggested in my previous reply and add the reduce_sum bits by hand. Watch here for a possible tutorial on that latter tweaking.
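For reference, the hand edit typically means wrapping the likelihood in a partial-sum function. A minimal sketch for a simple Bernoulli likelihood (not your GPCM model; variable names are illustrative only) might look like:

```stan
functions {
  // Partial log likelihood over a slice of observations; reduce_sum
  // splits the data across threads and sums these contributions.
  real partial_log_lik(int[] y_slice, int start, int end, vector theta) {
    return bernoulli_logit_lpmf(y_slice | theta[start:end]);
  }
}
data {
  int<lower=1> N;
  int<lower=0, upper=1> y[N];
}
parameters {
  vector[N] theta;
}
model {
  int grainsize = 1;  // 1 lets the scheduler choose slice sizes
  theta ~ normal(0, 1);
  target += reduce_sum(partial_log_lik, y, grainsize, theta);
}
```

The compiled model then needs threading enabled (STAN_THREADS) for the slices to actually run in parallel.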

1 Like

I don’t think brms allows for GPU acceleration, and there are really only a couple of scenarios where you can use a GPU with Stan. If your model does a Cholesky decomposition, there’s some speedup available, and hierarchical models can indeed involve Cholesky decompositions, but unless you have a huge number of predictors in your design matrix, I don’t think it’s worth exploring. Given that it sounds like you have very tall data (few predictor columns, lots of observation rows), reduce_sum is your best bet for speeding things up.

EDIT: oops, I might be wrong in my pessimism regarding GPUs for tall data; I forgot that the GPU crew added support for accelerating GLMs. You might look into that, though I suspect it might be both easier and more performant (since the GPUs seem to max out at roughly 10x speedups according to the GPU-Stan paper) to just use reduce_sum instead.

EDIT 2: FYI, I made a post to check my intuition, and the GPU folks are a little less pessimistic than I am.

1 Like

Thanks for the great responses! Yes, my data is certainly tall, i.e. I only have 6 predictors but 1.2 million observations. As for the model specification, I have left the number of chains at the default of four.

It certainly appears that reduce_sum is the way to go then, so I will look into that. Am I understanding you right that ‘brms’ has not implemented reduce_sum support yet, and so 12 of my cores are going relatively unused (i.e. only one core/two threads are used per chain)? If that is right, then the best way to improve speed would be to use make_stancode, then modify the resulting Stan file to make use of reduce_sum and utilize the extra compute power?

Also, thank you for sharing the link to ezStan. The watch_stan() function seems incredibly useful!

Yes, your understanding is correct on both points.

Just be careful; there’s a weird bug in RStan (and hence ezStan) on Linux such that failing to constantly watch the CSV file as watch_stan() does causes the forked processes to self-terminate. This reminds me that I should post a call for help in figuring this one out.

EDIT: FYI it seems RStan 2.21 might have fixed the weird bug I mentioned above; I can’t seem to replicate it anymore.

1 Like

there’s a weird bug in RStan (and hence ezStan) on Linux such that failing to constantly watch the CSV file as watch_stan() does causes the forked processes to self-terminate.

Thanks for the heads up. From reading through the old post, it appears that if I do use ezStan, I can avoid the issue by making sure to call watch_stan() and not terminating it while the sampler is running, correct?

This sounds like you might have multiple rows with exactly the same predictors and could gain additional speedup as in

Otherwise +1 to what @mike-lawrence wrote

Good call. That applies to the likelihood part of the computations, but you should also check for redundancy earlier in the model’s computations. There’s a commented-code walkthrough here.

All but two of the predictor variables are low-cardinality categorical variables. The remaining two are one continuous variable and one categorical variable with roughly 20 categories. The target is a three-category variable as well, so I suspect you are right here. I will certainly try to modify the Stan file to incorporate this. Thanks for the links.

I will likely work on this extensively tomorrow and will try to incorporate both suggestions.

1 Like

Note that the sufficient statistics approach is completely feasible with brms without modifying the Stan file. If the family is categorical, you can collapse rows to multinomial counts. If the response is cumulative or another ordinal family, you can still collapse rows that are completely identical (same predictors AND response) and put the number of collapsed rows into the weights argument.
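A toy base-R sketch of the weights-based collapsing (hypothetical data; the count column `n` would then enter the model formula via something like `y | weights(n) ~ ...`):

```r
# Toy example: collapse rows that share the same predictors AND response,
# recording how many original rows each unique combination represents.
df <- data.frame(
  x1 = c("a", "a", "b", "b", "b"),
  y  = c(1L, 1L, 2L, 2L, 2L)
)

collapsed <- aggregate(list(n = rep(1L, nrow(df))),
                       by = df[c("x1", "y")], FUN = sum)

print(collapsed)
# Two unique (x1, y) combinations remain, with counts 2 and 3,
# so the likelihood loops over 2 rows instead of 5.
```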

Best of luck!


Out of curiosity (as well as in the interest of my wallet), would utilizing reduce_sum improve multi-core usage enough that I could take advantage of a 16-core machine? I am wondering whether it would be better to save myself the trouble of the larger instance and just work with an 8-core system (and use the reduce_sum functionality).

Thanks again for the help!

It all depends on the details, but I have seen cases where one chain was run on 20 cores on a single machine to gain massive speedups (days down to hours)… so the usual 4 chains consumed 80 cores in total across 4 machines. In that instance the model just scaled well given what needed to be calculated.

How that plays out in your case is hard to say until you have tried it out.

With reduce_sum you have the possibility to scale your performance (should you need it and have the resources). Sometimes, using reduce_sum even speeds up single-core performance due to better CPU cache usage.
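If you go the hand-edited reduce_sum route, running the model through cmdstanr lets you set threads per chain explicitly. A hedged sketch, assuming the edited program has been saved as model.stan and the data list as sdata (both names are illustrative here):

```r
# Hedged sketch: compile a reduce_sum-enabled model with threading
# support and spread 4 chains x 4 threads over a 16-core machine.
# "model.stan" and sdata are assumed to come from the make_stancode /
# make_standata step discussed earlier in the thread.
library(cmdstanr)

mod <- cmdstan_model("model.stan",
                     cpp_options = list(stan_threads = TRUE))

fit <- mod$sample(data = sdata,
                  chains = 4,
                  parallel_chains = 4,
                  threads_per_chain = 4)
```

With 8 cores you could instead try 4 chains with 2 threads each, and compare wall-clock time against the 16-core setup before committing to the larger instance.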

1 Like