It seems we could look at a number of pre-existing data sources (eg discourse views and contributors, papers, StanCon attendance etc) to inform an inference of how many people use Stan (and/or use things that use Stan). We could also generate new data (eg via surveys etc). Do we know the answer and/or how best to work it out?
I’ve thought a tonne about surveying the Stan community, or potentially using capture-recapture techniques, but generating new data in this way would be a lot of work and my current funding wouldn’t cover it (that’ll change next year though!). I’d be interested in collaborating if anyone is interested. :)
As is hopefully obvious, I am keen to help. I can probably deploy people who can help make this happen if we know what we need them to do.
How about asking who has registered on Discourse? From that we could extrapolate the population size for those who we could reach by survey?
@avehtari: who would we survey and how would we get the survey to them? I feel I am being dim.
From memory, the hope was to use a snowball survey (once you finish, you forward it on to other people in the population), which is generally what you use when you have a hard-to-define population.
One of the stats grads here, Jonathan Auerbach, and I talked about starting multiple snowballs (Andrew’s blog, Discourse, Twitter, StanCon mailing lists, etc.) and then tracking which snowball each person was recruited by (and whether by more than one) as a way of measuring coverage.
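To make the overlap idea concrete, here is a minimal sketch of the classic two-source capture-recapture calculation, treating two snowballs as the two "captures". This is only an illustration, not a worked-out design: the function names are hypothetical, the counts are invented, and the estimators assume a closed population with independent samples and equal chances of being reached, which snowball recruitment would likely violate.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Lincoln-Petersen estimate of population size.

    n1: respondents reached by the first snowball
    n2: respondents reached by the second snowball
    m:  respondents reached by both
    """
    if m <= 0:
        raise ValueError("need at least one respondent in both samples")
    if m > min(n1, n2):
        raise ValueError("overlap cannot exceed either sample size")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; also defined when m == 0."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("overlap must lie between 0 and the smaller sample")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Invented counts, purely for illustration (not real Stan data).
n_discourse = 400  # hypothetical: reached via the Discourse snowball
n_twitter = 250    # hypothetical: reached via the Twitter snowball
n_both = 50        # hypothetical: reached via both

print(lincoln_petersen(n_discourse, n_twitter, n_both))
print(chapman(n_discourse, n_twitter, n_both))
```

The smaller the overlap relative to the two sample sizes, the larger the estimated population, which is why tracking who arrived via more than one snowball matters.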
This is a fascinating statistics question. Maybe I could post a blog on this and see if there are any thoughts.
@lauren I think you should name that design a snowball fight
That’s a fantastic name!!!
@andrewgelman you can, it’d be interesting to see what folks think.
What’s your definition of a Stan user? Is it someone who once downloaded Stan and ran a model? Or is it someone who uses Stan regularly, and if so, how regularly? What about people who use packages like brms or rstanarm or prophet that are built on top of Stan?
There are over 3K users registered on Discourse, but that doesn’t mean they’re regulars. Many of them only showed up once.
Blog post on the topic scheduled for Monday.
For context, UK academic departments (eg the one I sit in) are assessed (as part of the “Research Excellence Framework”, REF) on the basis of some criteria: the next assessment is imminent, but the one after that will be in 2026. When aggregated over each entire University, the outcome of that assessment process modulates the amount of (Quality-related Research, QR) income that the University receives from the UK government. So, it’s important to inform the assessment process with pertinent information.
One of the three assessment criteria is “impact”, which relates to the uptake of academic research outside of the academic discipline it came from and is measured in terms of “reach and significance” during the census period. Unfortunately, “reach and significance” is not defined quantitatively. However, the census period is defined: it covers people using the research during a specified period of time (eg 2021-2026).
The specific motivation for my question is that I’d like to understand how we (locally to my department) could quantify the “reach and significance” of any enhancements to Stan that (we hope!) might come out of our work between now and 2026. It seems natural to start by finding out what we (as the Stan community) know about how big we are now.
So, in answer to your question, I think I’d ideally like to know how many people and/or organisations are making use of specific subsets of Stan’s code (including in any packages that use Stan) during a specified period. I’d also like to know where they are based geographically, whether they work in academia, industry or government, the range of applications they are working on, etc.
That’s clearly hinting at scope creep, but hopefully helps explain the specific reason I asked the question, which I see as an important step towards quantifying the impact of our work to help enhance Stan.
I think this would be relevant outside of the REF as well - a useful addition to many grants/impact sections.
We were at the point where we were focussing on questions around this, plus an emphasis on barriers to entry. I can dig them up if there’s renewed interest. I believe the conclusion last time was that the time-cost of doing it wasn’t worth the expected benefit, but it could have changed! :)
I posted my question here: https://statmodeling.stat.columbia.edu/2019/12/09/how-many-stan-users-are-there/
This paper, “Using an Online Sample to Estimate the Size of an Offline Population”, might have useful ideas on how to estimate the number of users who would be unlikely to be reached by a survey.
Stack Overflow has the tag “Stan”, and 254 users (if I’m interpreting their graphics correctly) have listed Stan as one of the things they’re interested in. About once a week someone asks a Stan question.
254 is the number of questions with that tag; the number of users will be less than that.
doh!
Users do list their interests, but there’s no way to scrape that out of SO.
It’s an interesting question! We might want to think about targeting this question using a statistical/surveying method and then benchmarking our population estimates against other potential indicators as another coverage check (number of downloads of Stan in R packages relative to Python etc.).
One of the challenges for this is agreeing upon what the definition of a Stan user is.
I think we’re going to have a hangouts meeting on this topic to see if we can make some sort of plan. I’ve emailed folks whose email I have, but for those interested, we’re picking a time here, and if you put your full name in I can probably google your email for an invite. :) https://www.when2meet.com/?8493674-nr5Ai