I recently ran an introductory course for the Royal Statistical Society, which I did the last couple of years with rstan only, but now for the first time I offered a choice of rstan, pystan, cmdstanr or cmdstanpy. This gave me an opportunity to closely compare them. I offer up these pretty paltry observations in case they inspire ideas among developers of these and other interfaces.
I see some effort to converge on argument names bearing fruit. It would be nice to have more of this. Of course, it’s possible to have two options available, one deprecated, like warmup= and num_warmup=.
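Something like this toy sketch, which is not any interface’s real code, just the pattern I mean:

```python
import warnings

# toy sketch of keeping an old argument name alive while nudging people to the new one
def sample(num_warmup=1000, warmup=None, **kwargs):
    if warmup is not None:
        warnings.warn("warmup= is deprecated; use num_warmup=", DeprecationWarning)
        num_warmup = warmup
    print(f"running {num_warmup} warmup iterations")

sample(warmup=500)  # still works, but the user gets a gentle warning
```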
Suppose you want 1000 warmup draws and then 1000 retained after that. rstan requires warmup=1000, iter=2000, while all the others have some form of num_warmup=1000, num_samples=1000. That’s a potential trap for the unwary.
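To make the trap concrete, here is roughly how you ask for 1000 + 1000 in each (argument names from memory, so do double-check them; model and posterior stand for already-compiled model objects):

```python
# cmdstanpy: warmup and retained draws are separate arguments
fit = model.sample(chains=4, iter_warmup=1000, iter_sampling=1000)

# pystan 3: same idea, different names
fit = posterior.sample(num_chains=4, num_warmup=1000, num_samples=1000)

# rstan (in R): iter INCLUDES the warmup, so you need
#   stan(file, data = ..., warmup = 1000, iter = 2000)
```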
rstan and cmdstanr accept a list or a filename for data. cmdstanpy and pystan want only a dictionary. Writing to and reading from JSON in Python is a massive pain, at least in my experience, so it would be great if pystan and cmdstanpy took data files as an argument, and also offered their own data-writing function that could handle numpy arrays and the various float widths they might involve.
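In the meantime, a handful of lines does the job; this is just my own sketch of the sort of data-writing helper I mean (the function name is made up):

```python
import json
import numpy as np

def write_stan_data(path, data):
    # hypothetical helper: dump a dict of data to Stan-readable JSON,
    # converting numpy arrays and numpy scalar types along the way
    def default(obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()        # vectors/matrices become nested lists
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)          # covers float32, float64, ...
        raise TypeError(f"cannot serialise {type(obj)}")

    with open(path, "w") as f:
        json.dump(data, f, default=default)

write_stan_data("bernoulli.data.json", {"N": 5, "y": np.array([0, 1, 0, 0, 1])})
```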
cmdstanr, pystan and cmdstanpy all require you to compile the model and then sample. rstan allows a single function stan() to rule them all. I’m not saying I like rstan::stan(), just that it’s good to have options for everyone to pick from, especially beginners.
cmdstanr, cmdstanpy and pystan all have syntax for sampling that looks and feels like applying a sample method to a model object. I like that; I suppose others might not. In R especially, it is still quite unusual, and many people expect functions everywhere. rstan is all about the functions.
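For the record, the two-step, method-flavoured workflow in cmdstanpy looks roughly like this (the file name is just a stand-in):

```python
from cmdstanpy import CmdStanModel

# step 1: compile the Stan program (or reuse a cached executable)
model = CmdStanModel(stan_file="bernoulli.stan")

# step 2: sampling reads as a method on the model object
fit = model.sample(data={"N": 5, "y": [0, 1, 0, 0, 1]}, chains=4, seed=123)
print(fit.summary())

# rstan, by contrast, can wrap both steps in one call (in R):
#   fit <- stan(file = "bernoulli.stan", data = list(N = 5, y = c(0, 1, 0, 0, 1)))
```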
In general, I am annoyed by Python packages that are called x when you pip install them and then, when you go to import them, are suddenly called y. Like pystan. Maybe that’s just me, but every time I explain to learners that they have to look out for things like that, it deflates their enthusiasm a little.
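With pystan 3, if I have this right, it goes:

```python
# installed under one name...
#   pip install pystan
# ...imported under another
import stan
```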
It’s a shame about Jupyter notebooks, pystan and nest-asyncio. I don’t know enough about the issue to comment. Maybe there’s a way to get around it.
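The workaround I have seen suggested, though I haven’t tested it myself, is to let the notebook’s already-running event loop nest:

```python
# commonly suggested fix for pystan 3 inside Jupyter (untested by me)
import nest_asyncio
nest_asyncio.apply()

import stan  # then build and sample as usual
```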
pystan requires you to provide the data and the seed when you compile the model. Maybe you can override them in the sample method’s kwargs; I don’t know. This hurts my simulation-study soul, because if I want to swap the data repeatedly, I don’t want to recompile, or even check for code changes, each time.
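For anyone who hasn’t seen it, the pystan 3 pattern is roughly this (a deliberately tiny model, just for illustration):

```python
import stan

program_code = """
data { real y; }
parameters { real mu; }
model {
  mu ~ normal(0, 10);
  y ~ normal(mu, 1);
}
"""

# data and seed go in at build time...
posterior = stan.build(program_code, data={"y": 1.3}, random_seed=1)

# ...so swapping in a new simulated dataset means another stan.build() call
fit = posterior.sample(num_chains=4, num_samples=1000)
```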
I think that the only way to have different initial values for chains in cmdstanpy is to supply a file (JSON? oh no).
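As far as I can tell, that means something like this (a made-up bernoulli model and file names, just to show the shape of it):

```python
import json
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="bernoulli.stan")  # stand-in file name

# write one JSON file of initial values per chain...
init_files = []
for i, theta0 in enumerate([0.1, 0.3, 0.5, 0.7]):
    path = f"inits_chain{i + 1}.json"
    with open(path, "w") as f:
        json.dump({"theta": theta0}, f)
    init_files.append(path)

# ...and pass the list of file paths, one per chain, to sample()
fit = model.sample(data={"N": 5, "y": [0, 1, 0, 0, 1]}, chains=4, inits=init_files)
```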
I know readthedocs is convenient, and I’m not criticising anyone because my documentation is the worst, but really the hello world examples there are not sufficient for beginners, and the layout of the API reference takes some getting used to.
I’m thinking of just running with the two cmdstan interfaces next year, but we’ll see. I almost entirely use cmdstanr now, and as I said before, I’m advising any StataStan users to switch to cmdstanpy via the Python integration feature in Stata 16+ – and I’ll be writing a tutorial on that soon.
Ciao ciao
Robert