I am relatively new to using Stan for Bayesian modelling, and I am currently working on fitting a hierarchical model to ecological data in R via rstan. After a few tests, the best version of the model takes a very long time (~1 week) to run for about 1000 iterations, as a large number of parameters are being estimated. This means that a proper run (which could require 10 times as many iterations) would take a huge amount of time on my 4-core computer.
I have been thinking that a way to circumvent this would be to run more chains for fewer iterations in order to obtain a reasonable sample size, and this has led me to the idea of renting a remote server with many cores on which to run several chains in parallel. I have never done this before, and I have read that it can be rather complicated to set up these machines (e.g. the ones offered by Google or Amazon) properly for someone with limited experience in programming. I was wondering if you have previous experience running Stan models with R on remote servers, and if you have advice on which would be a sensible option for someone with limited programming experience. Thanks a lot!
I’ve seen several questions about this recently, but as far as I know we don’t yet have any documentation/guide/tutorial on this. Personally I don’t have much experience running on remote servers, but it’s definitely possible and I know that many people do it. Does anyone on @Stan_Development_Team, or any Stan user reading this, want to take a shot at writing a short guide on setting this up using either RStan or CmdStanR (it would be great to have one for Python too)? Or does anyone know of an existing tutorial on this?
It may be that you’ve already coded the model in the most efficient way, but if you want to share your model on the forums in a separate post someone may be able to help you speed it up.
The default of 2000 iterations (1000 warmup + 1000 sampling) is usually plenty for Stan. In general I wouldn’t run for 10,000 unless you really need a larger effective sample size to estimate something to more decimal places than is typical (depends on your application).
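As a rough sketch of the arithmetic behind this (the helper name is made up for illustration): the number of post-warmup draws is chains × (iterations − warmup). Adding chains therefore adds draws, but every chain still has to run its own warmup, so spreading the work over more cores does not shorten the warmup phase itself.

```shell
#!/bin/sh
# Hypothetical helper: total post-warmup draws across all chains.
# Usage: post_warmup_draws CHAINS ITER WARMUP
post_warmup_draws() {
  chains=$1
  iter=$2      # total iterations per chain, warmup included
  warmup=$3    # warmup iterations per chain (discarded)
  echo $(( chains * (iter - warmup) ))
}

# rstan defaults: 4 chains, 2000 iterations, 1000 of them warmup.
post_warmup_draws 4 2000 1000   # prints 4000
# Eight chains with half as many sampling iterations give the same total,
# but each chain still runs its full 1000 warmup iterations.
post_warmup_draws 8 1500 1000   # prints 4000
```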
Hey @gio, I can help out with remote servers. Would you mind explaining what you’re trying to achieve? I can help with AWS, since it’s the most widely used, or with GCP. I myself sometimes rent virtual machines from Hetzner.
Hi–Mitzi set up some examples on Google Colab, but apparently there are some difficulties with that? I agree that it would be good to have a remote server workflow for Stan if that is possible. Also ccing Rok in case he has thoughts.
Andrew
I use AWS at work. The simplest approach for me was to put RStudio Server on it and log in through a web browser. It works with Ubuntu 20.04 and the dev release of RStudio Server (free version). I installed both rstan and cmdstan on it.
A word of caution: get at least 16 GB of RAM, preferably 32 GB. For some reason RStudio Server will sometimes error out, possibly because of an rstan crash or memory issues. You’ll get a screen that says you can’t log in. In that case you need to SSH into the AWS instance and move some RStudio files out of the way to clear its cache. Don’t delete the rstudio folder; move it to rstudio-old so that you can recover any unsaved files.
cd ~/.local/share
# next line is only needed if rstudio-old already exists
# sudo rm -rf rstudio-old
sudo mv rstudio rstudio-old
To recover any unsaved files, go to
~/.local/share/rstudio-old/sources/s-****
The files named ****-contents hold all the open tabs in whatever state RStudio autosaved them. These can be opened in RStudio.
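A minimal sketch for finding those autosaved files from the terminal, assuming the folder layout described above (the helper name is hypothetical, and the session folder names are matched with a wildcard since they vary):

```shell
#!/bin/sh
# Hypothetical helper: list RStudio autosave files under a moved-aside folder.
# Usage: list_autosaved [DIR]   (DIR defaults to ~/.local/share/rstudio-old)
list_autosaved() {
  dir=${1:-"$HOME/.local/share/rstudio-old"}
  # Nothing to list if the folder was never moved aside.
  [ -d "$dir/sources" ] || return 0
  # Session folders are named s-<something>; autosaved tabs end in -contents.
  find "$dir/sources" -type f -path '*/s-*/*' -name '*-contents' | sort
}

list_autosaved
```

Each path it prints can then be copied somewhere safe or opened in RStudio.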
Hi all and thanks very much for all the useful inputs!
@jonah Yes, sharing the model on the forums would be great. My colleagues and I will put together a reproducible example soon; it would be great to have feedback on it.
@serban-nicusor & @spinkney AWS sounds like a good place for me to start. I don’t currently have Linux installed on my machine, but I’ll try to get that sorted. I suppose there is no alternative for Windows, as RStudio Server is only available for Linux?
Hey @gio, if you have a Windows machine you can always install Windows Subsystem for Linux (you can read more about it here). Basically you’re running a Linux kernel alongside your Windows OS. I use that locally for almost everything, since it’s almost as good as a Linux VM. As a pro tip, you can also install the new Windows Terminal from the Microsoft Store to switch between distros more easily, as you can have Ubuntu 18 and 20 at the same time, for example.
Another solution is to use Docker locally. There’s a nice repository of RStudio images; basically it’s a pre-built container with everything you need to get you going.
With Docker it is as simple as docker run -it rocker/rstudio bash, which pulls the pre-built image and starts a shell inside the container, from which you can issue your commands.
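To get the browser-based RStudio Server rather than a shell, the rocker/rstudio image can be started with a published port and a password. The sketch below only prints the command (the helper name and the example password are placeholders, not part of the image). As far as I know the image does not ship with rstan, so that would still need to be installed inside the container.

```shell
#!/bin/sh
# Hypothetical helper: print the docker command that starts RStudio Server
# from the rocker/rstudio image. It prints rather than runs the command,
# so nothing is started until the output is piped to sh.
# Usage: rstudio_server_cmd PASSWORD [HOST_PORT]
rstudio_server_cmd() {
  password=$1
  host_port=${2:-8787}   # RStudio Server listens on 8787 inside the container
  echo "docker run --rm -d -p ${host_port}:8787 -e PASSWORD=${password} rocker/rstudio"
}

rstudio_server_cmd changeme
```

After running the printed command, RStudio should be reachable at http://localhost:8787 with the user name rstudio and the chosen password.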
PS: RStudio is also available for Windows on their download page.
I’m surprised Andrew didn’t mention the folk theorem: when you have computational problems, often there’s a problem with your model.
In much of the Stan literature, including the Stan User’s Guide, the models are written with the goal of readability and understandability; they aim to faithfully capture the process described by the model. While this makes Stan easy to understand, there are a lot of bumps hiding under the rug; not all teaching models are what you need when working with real-world data.
The simplest speedups come from vectorizing operations that can be vectorized, and the Stan team is busy vectorizing more and more functions. The next step is to find what can be parallelized and parallelize it accordingly. Perhaps the model needs better priors; perhaps there are reparameterizations that can speed things up here.
If you can share your code, or a simplified version thereof, it might be the case that this model can be rewritten and sped up.
Hi Gio, I blogged some instructions for R+Stan on DigitalOcean recently: http://www.robertgrantstats.co.uk/blog/25.html
You could skip Jupyter from this if you like and just run R scripts.
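For the plain R script route, a long run needs to keep going after the SSH session is closed. One common way to do that is nohup; the sketch below wraps it in a small helper (the helper name and the file names are placeholders, not taken from the blog post).

```shell
#!/bin/sh
# Hypothetical helper: start a command that survives an SSH logout,
# sending its output and errors to a log file.
# Usage: run_detached LOG_FILE COMMAND [ARGS...]
run_detached() {
  log_file=$1
  shift
  nohup "$@" > "$log_file" 2>&1 &
  echo "started PID $! (log: $log_file)"
}

# On the server this might look like (fit_model.R is a placeholder name):
#   run_detached fit.log Rscript fit_model.R
#   tail -f fit.log
```

It is worth having the R script save the fitted object to disk at the end (for example with saveRDS), so the results are still there when you log back in.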
Hi all! As many of you advised, I have uploaded the model code to the modeling section of the forums (Joint Species Distribution Model Performance). Thanks a lot for the valuable advice.