Best Practices for Stan in Production on AWS

Hi all,

I’m planning to put a Stan-based model into production (run on a schedule, serve some transformation of outputs to a website) on AWS, as close to “serverless” as possible. I was wondering what the community thinks is the best way to do this.

I’ve been customizing an Amazon Linux 2 AMI to run cmdstanpy. The production process is going to look like:

  1. Start the EC2 instance
  2. Trigger the cmdstanpy code
  3. A cron job on the instance copies the Stan output to S3 using the AWS CLI's `s3 sync`
  4. The instance shuts down on successful completion of (2), or on a timeout
  5. Downstream processes reference the Stan output on S3
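A minimal sketch of the on-instance driver for steps 2–4, assuming hypothetical file names (`model.stan`, `data.json`), bucket, and prefix; none of these specifics are from the post:

```python
"""Sketch of the on-instance driver: fit with cmdstanpy, sync to S3, halt.

The model file, data file, bucket name, and prefix below are all
illustrative placeholders.
"""
import subprocess


def build_sync_command(local_dir: str, bucket: str, prefix: str) -> list[str]:
    """Compose the `aws s3 sync` call used in step 3."""
    return ["aws", "s3", "sync", local_dir, f"s3://{bucket}/{prefix}"]


def run_pipeline() -> None:
    # Step 2: fit the model with cmdstanpy; CSV output lands in output_dir.
    from cmdstanpy import CmdStanModel
    model = CmdStanModel(stan_file="model.stan")
    model.sample(data="data.json", output_dir="stan_output")

    # Step 3: copy the Stan output to S3.
    subprocess.run(
        build_sync_command("stan_output", "my-bucket", "runs/latest"),
        check=True,
    )

    # Step 4: halt from inside the instance (requires shutdown permission;
    # a separate timeout/watchdog would cover the failure path).
    subprocess.run(["sudo", "shutdown", "-h", "now"], check=True)

# On the instance you would call run_pipeline() from the triggered script.
```

Folding the sync into the driver (rather than a separate cron job) has the advantage that the instance only shuts down after a confirmed upload.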

I’ll be running the same code against multiple independent datasets (these are IRT models on a variety of courses), so I’m hoping to scale this up by running the same image in multiple EC2 instances.
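One way to sketch that fan-out is a small launcher that starts one instance per course from the shared image; the AMI id, instance type, and `run_course.py` entry point here are placeholders, not details from the post:

```python
"""Sketch: launch one EC2 instance per course from the same image.

The AMI id, instance type, and the /opt/model/run_course.py entry point
are hypothetical placeholders.
"""


def render_user_data(course_id: str) -> str:
    """User-data script that fits one course's model, then halts."""
    return "\n".join([
        "#!/bin/bash",
        f"python3 /opt/model/run_course.py --course {course_id}",
        "shutdown -h now",
    ])


def launch_all(course_ids: list[str]) -> None:
    import boto3

    ec2 = boto3.client("ec2")
    for cid in course_ids:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI id
            InstanceType="c6i.xlarge",        # placeholder instance type
            MinCount=1,
            MaxCount=1,
            UserData=render_user_data(cid),
            # Terminate (rather than stop) when the script halts the box.
            InstanceInitiatedShutdownBehavior="terminate",
        )
```

Setting `InstanceInitiatedShutdownBehavior="terminate"` means the in-script `shutdown` cleans the instance up without a separate teardown step.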

Things I’ve learned so far:

  • httpstan is fiddly, and the zipped output it creates in .cache needs custom processing (it’s not valid JSON)
  • pystan is tightly integrated with a local httpstan server, and would need adjustment to run with a remote. You can’t just forward ports to localhost and be on your way.
  • “Amazon Linux 2” on ARM means compiling your own recent Python
  • I need a fairly beefy instance type to get performance comparable to my local (M1 Pro) machine. I’ve been using an a1.xlarge with ARM processors for testing and it’s not cutting it.

Things I believe to be true:

  • AMIs can’t easily be converted to Docker images, but the Amazon Linux AMI specifically has a pathway

Questions:

  • Is this, roughly, the pattern others have been using? Did you start with docker images instead?
  • SageMaker and its associated bells and whistles (Model Registry, etc.) look like a poor fit for an entirely batch workflow that needs a custom image. Has anybody tried using Stan within SageMaker recently?
  • What instance types and architectures are the best for running Stan code?

I have cmdstanpy deployed in SageMaker as a model, but if you just make it a processing step you won’t lose much. I install cmdstanpy into an AWS-maintained scikit-learn image via commands in a Dockerfile, push the image to ECR, and then have the SageMaker steps run on that image.
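A minimal Dockerfile along those lines, as a sketch only: the base-image account id, region, and tag below are placeholders (the real URI is account- and region-specific), and the toolchain install assumes a Debian-based image.

```dockerfile
# Hypothetical base: an AWS-maintained SageMaker scikit-learn image.
# The account id, region, and tag are placeholders.
FROM 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3

# cmdstanpy needs a C++ toolchain to build CmdStan and compiled models.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install cmdstanpy, then download and build CmdStan itself at image
# build time so jobs don't pay that cost on every run.
RUN pip install cmdstanpy && install_cmdstan
```

You would then `docker push` the built image to ECR and reference its URI when defining the processing step.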

I think the main downside of doing this is that you pay roughly 20% more for c5 instances that are roughly 20% slower than the c6i instances you can get on plain EC2.