I am building some example materials that use Stan and investigate how well it scales to larger datasets, so I thought I would ask the community here. Has anyone seen particularly compelling pairings of a Stan model with a large dataset, preferably open source and with data about people or sensitive information? Any examples would be much appreciated. I have read some similar threads here, which were somewhat helpful but lacked complete pairings of a Stan model and its data.
I give a homework assignment using wage data from the U.S. Census.
But there isn’t a whole lot to it, since the individual data are already anonymized.
I’m not sure this is close enough to what you’re looking for, but I recently did an analysis of a publicly available dataset of >6000 individual shootings in Baltimore using rstanarm and posted the complete code along with the dataset here.
The particular subset of the data that I analyzed may not be large enough to interest you (n ≈ 6,000 shootings), but the underlying dataset provided with it covers every victim-based crime in Baltimore over the past 7 years (n ≈ 400,000). So, if you subset to a more common crime than shootings, you can probably run my code without too many modifications to obtain a reasonable case study of Stan on data that’s “big”, at least by the standards of public health.
BTW: the same dataset was discussed previously in this thread, which you can check out for some actual Stan code.
Regardless of the anonymisation, this is still a pretty good example, so thank you for sharing!
This is great. I will go through your repo, as there looks to be some interesting stuff inside. Thank you!