Strategies for fitting a large model via AWS EC2 services

Hi all,

I am working to fit a hierarchical IRT model using ‘brms’ in R, from which I intend to extract the estimate of a latent parameter (i.e. the person parameter in a GPCM model). So far, I have managed to fit the full version of the model on a subset of the full dataset with no issues. However, I need to fit the full model on a dataset of roughly 1.2 million observations. To do so, I turned to AWS’s EC2 service, as my personal machine is woefully under-equipped for the task. At the moment, I am working with a virtual cloud instance with the following specs:

OS: Ubuntu 18.04
CPU: 16 cores (32 threads)
RAM: 128 GB
Storage: 30 GB SSD

The dataset itself is roughly 140 MB, and by my estimation the 30 GB of storage should be more than sufficient to hold the dataset and the saved fitted model.

My issue is that when I start to fit the full model, R becomes unresponsive after it reports beginning the first warm-up samples. Given the specs I have rented and the time taken to fit the model on the subset, I expect the full fit to take roughly 20 hours; however, after roughly 4 hours I have seen no progress in the sampling.

I would love to hear suggestions for fitting hierarchical models on large datasets. Do I need to rent an instance with greater compute capacity, or is it possible to make use of a GPU when fitting a model through brms in R? Of course, I may also be making an obvious mistake in how I have used AWS’s services.

1 Like

One idea is to use brms::make_stancode() to get the Stan code for your model and brms::make_standata() to get the data list in the format that code expects, then use ezStan, which has a slightly better progress indicator.
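Roughly, that first step might look like this (a sketch only; gpcm_formula, dat, and the cumulative() family are placeholders for whatever your actual model uses):

```r
library(brms)

# Placeholders: substitute your real formula, data, family, and priors
stan_code <- make_stancode(gpcm_formula, data = dat, family = cumulative())
stan_data <- make_standata(gpcm_formula, data = dat, family = cumulative())

writeLines(stan_code, "gpcm_model.stan")  # hand-edit this file if needed
# stan_data is a named list ready to hand to ezStan / rstan / cmdstanr
```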

Alternatively, I’m guessing there’s an option in brms to save samples to CSV files as they are drawn, in which case you could enable that option and inspect the contents of those files to check on the progress manually.
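If such an option exists, it might look roughly like the following; I’m assuming here (without having checked) that brm() forwards extra arguments to rstan::sampling(), whose sample_file argument streams draws to CSV as they are generated:

```r
# Assumption: extra arguments are passed through to rstan::sampling()
fit <- brm(gpcm_formula, data = dat, family = cumulative(),
           chains = 4, cores = 4,
           sample_file = "gpcm_draws.csv")  # rstan writes per-chain CSV files
# Watching those files grow (e.g. wc -l gpcm_draws*.csv) is a crude progress check
```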

Oh, though you report using a 16-core system. Hopefully you’re not bothering to run 16 chains in parallel, as this would be overkill given why we run multiple chains in the first place. Eventually brms will include the within-chain multi-core parallelism enabled by reduce_sum, but I think as it stands you’d have to get the code and data as I suggested in my previous reply and add the reduce_sum bits by hand. Watch here for a possible tutorial on that latter tweaking.

1 Like

I don’t think brms allows for GPU acceleration, and there are really only a couple of scenarios where you can use a GPU with Stan anyway. If your model does a Cholesky decomposition there’s some speedup available, and hierarchical models can indeed involve Cholesky decompositions, but unless you have a huge number of predictors in your design matrix I don’t think it’s worth exploring. Given that you seem to have very tall data (few predictor columns, lots of observation rows), reduce_sum is your best bet for speeding things up.

EDIT: oops, I might be wrong in my pessimism regarding GPUs for tall data; I forgot that the GPU crew added support for accelerating GLMs. You might look into that, though I suspect it would be both easier and more performant to just use reduce_sum instead, since the GPUs seem to max out at roughly 10x speedups according to the GPU-Stan paper.

EDIT 2: FYI, I made a post to check my intuition, and the GPU folks are a little less pessimistic than I am.

2 Likes

Thanks for the great responses! Yes, my data is certainly tall: I have only 6 predictors but 1.2 million observations. As for the model specification, I have left the number of chains at the default of four.

It certainly appears that reduce_sum is the way to go then, so I will look into that. Am I understanding you right that brms has not implemented reduce_sum support yet, and so 12 of my cores are going relatively unused (i.e. only one core/two threads are being used per chain)? If that is right, is the best way to improve speed to use make_stancode and then modify the resulting Stan file to make use of reduce_sum and utilize the extra compute power?

Also, thank you for sharing the link to ezStan. The watch_stan() function seems incredibly useful!

Yes, your understanding is correct on both points.

Just be careful; there’s a weird bug in RStan (and hence ezStan) on Linux such that failing to constantly watch the csv file as watch_stan() does causes the forked processes to self-terminate. This reminds me I should post a call for help in figuring this one out.

EDIT: FYI it seems RStan 2.21 might have fixed the weird bug I mentioned above; I can’t seem to replicate it anymore.

1 Like

there’s a weird bug in RStan (and hence ezStan) on Linux such that failing to constantly watch the csv file as watch_stan() does causes the forked processes to self-terminate.

Thanks for the heads up. From reading through the old post, it appears that if I do use ezStan, I can avoid the issue by making sure to call watch_stan() and not to terminate it while the sampler is running, correct?

This sounds like you might have multiple rows with exactly the same predictors, in which case you could gain additional speedup as described in the Stan User’s Guide.

Otherwise +1 to what @mike-lawrence wrote

1 Like

Good call. That applies to the likelihood part of the computations, but you should also check for redundancy earlier in the model’s computations. There’s a commented-code walkthrough here.

1 Like

All but two of the predictor variables are low-cardinality categorical variables. The remaining two are one continuous variable and one categorical variable with roughly 20 categories. The target is a three-category variable as well, so I suspect you are right here. I will certainly try to modify the Stan file to incorporate this. Thanks for the links.

I will likely work on this extensively tomorrow and will try to incorporate both suggestions.

1 Like

Note that the sufficient-statistics approach is completely feasible with brms without modifying the Stan file. If the family is categorical, you can collapse rows into multinomial counts. If the response is cumulative or another ordinal family, you can still collapse rows that are completely identical (same predictors AND response) and put the number of collapsed rows into the weights argument.
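For the categorical case, the collapsing might look roughly like this (a sketch only; the data frame dat, its three-level response y, and the predictors x1 and x2 are hypothetical stand-ins):

```r
library(dplyr)
library(tidyr)
library(brms)

# Collapse to one row per predictor combination, with counts per response category
dat_multi <- dat %>%
  count(x1, x2, y) %>%
  pivot_wider(names_from = y, values_from = n,
              values_fill = 0, names_prefix = "y") %>%
  mutate(size = y1 + y2 + y3)
# brms expects the multinomial response as a matrix column in the data
dat_multi$counts <- with(dat_multi, cbind(y1, y2, y3))

# Multinomial likelihood on the collapsed counts
fit <- brm(counts | trials(size) ~ x1 + x2,
           data = dat_multi, family = multinomial())
```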

Best of luck!

3 Likes

Out of curiosity (as well as in the interest of my wallet), would utilizing reduce_sum improve multi-core usage enough to take advantage of a 16-core machine? I am wondering whether it would be better to save myself the cost of the larger instance and just work with an 8-core system (while utilizing the reduce_sum functionality).

Thanks again for the help!

It all depends on the details, but I have seen cases where one chain was run on 20 cores on a single machine in order to gain massive speedups (days down to hours)… so the usual 4 chains consumed 80 cores in total across 4 machines. In that instance the model simply scaled well given what needed to be calculated.

How that plays out in your case is hard to say until you have tried it out.

With reduce_sum you have the possibility of scaling your performance (should you need it and have the resources). Sometimes the use of reduce_sum even speeds up single-core runs due to better CPU cache usage.

1 Like

Hmm, I think this could be useful for easing the use of sufficient statistics. However, I am still a bit confused about the implementation you are suggesting. The response family I am using is cumulative, but I am not sure how to implement the collapsing you mentioned, and I do not see a weights argument in brm(). Would you mind clarifying a bit further?

Thanks!

Oh, it appears to have moved to the “addition terms”; sorry for the confusion.
Maybe an example will clarify. With data:

y     x1    x2
1     4     2.5
2     3    -1
1     4     2.5

and formula y ~ x1 + x2 we get the exact same results as with data

y     x1    x2    w
1     4     2.5   2
2     3    -1     1

and formula y | weights(w) ~ x1 + x2, regardless of family or other settings, because a weight of 2 means we add the likelihood contribution of that row twice, which is the same as actually having the row appear twice in the data. This obviously only works when you have multiple rows that match in all predictors AND the outcome.
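In R, the collapsing itself is a one-liner; a minimal sketch, assuming dplyr and a data frame dat with the columns y, x1, x2 as above:

```r
library(dplyr)
library(brms)

# Collapse rows that are identical in the outcome AND all predictors,
# storing the number of duplicates as a weight
dat_w <- dat %>%
  count(y, x1, x2, name = "w")

# Equivalent to fitting the original, uncollapsed data
fit <- brm(y | weights(w) ~ x1 + x2,
           data = dat_w, family = cumulative())
```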

Does that make sense?

Thank you for the example! It cleared things up completely. In hindsight, this seems highly applicable to my dataset, since each row is a student’s response to one question on a test, meaning all but one of the right-hand columns are repeated roughly 43 times for each student (i.e. the test had 43 questions). Rough initial calculations indicate that collapsing and adding a weights column could reduce my dataset to about 7.3% of its current size.

However, I have just realized that one of my predictor columns identifies each of the 43 questions, which I have included as part of the IRT model. The inclusion of this column makes each row a unique student-question combination, so I do not see an obvious way around this issue…

Thanks again for the help!

Yeah, that sounds bad for my proposal. However, since you seem to have a lot of data with only a few predictors, approximate solutions might work very well. I’ve had some pretty good experience with INLA, which can be an order of magnitude faster than Stan (a similar approximation is being made available in Stan, but I don’t think it’s ready yet). I don’t think it supports ordinal responses out of the box, though, and if I understand you correctly, you have an ordinal response, right? While I think it is actually possible to hack INLA into doing ordinal regression, you might want to start with a Poisson / binomial regression and see whether you can express everything else you need in INLA and whether the speed is acceptable.
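Just to illustrate the kind of call I mean, here is a rough sketch, assuming the R-INLA package and a hypothetical data frame dat with a binary response correct, a predictor x, and person/item grouping factors:

```r
library(INLA)  # installed from https://www.r-inla.org, not CRAN

# Bernoulli response treated as binomial with one trial; the f() terms add iid
# random intercepts for persons and items, a rough analogue of the hierarchical part
fit_inla <- inla(correct ~ x + f(person, model = "iid") + f(item, model = "iid"),
                 family = "binomial", Ntrials = 1, data = dat)
summary(fit_inla)
```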

I’m late to the party, but I hope I can revive this conversation anyway. I’m a little puzzled by the statement here.

Isn’t the number of chains immaterial from a computational point of view? The distribution of post-warmup samples for each chain is approximately the posterior distribution, so the distribution of the aggregated samples is again the posterior. Since that holds for any number of chains, however many you pick depends on other factors, such as what hardware is available. If one has a lot of CPUs, and nothing else to do with them, what reason is there not to launch lots of chains?

I would expect the variance of the posterior approximation (I mean, looking at the variance of a histogram of samples) to be a function of the total number of post-warmup samples, whether that’s divided into a few long chains or a larger number of shorter chains. Is that variance actually a function of how the samples are divided up?

I don’t mean to be argumentative, I’m honestly perplexed. Thanks for any light you can shed on this topic.