Stan ecosystem usage metrics: R packages + related packages

This is my second post on Stan ecosystem usage; the first, Stan ecosystem metrics, has Scopus.com articles broken down by category over time. This post covers:

  • Downloads of Stan ecosystem and related packages on the RStudio mirror of CRAN. This is only available for R packages.

Monthly package downloads from the RStudio CRAN mirror on a log scale. Packages ggplot2 and Rcpp are provided as baseline usage rates of R packages. BART, lme4 and tensorflow are not part of the Stan ecosystem; they are included for context.
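For anyone who wants to reproduce numbers like these, here is a minimal sketch of one way to get them. It assumes the cranlogs package (which queries the RStudio mirror logs) is installed; that is not necessarily how the plot above was made. The helper names fetch_daily and monthly_totals and the toy data are made up for illustration.

```r
# Fetch daily download counts from the RStudio CRAN mirror logs.
# Assumption: the cranlogs package is installed and the network is reachable.
fetch_daily <- function(pkgs, from, to) {
  cranlogs::cran_downloads(packages = pkgs, from = from, to = to)
}

# Collapse daily counts (columns: date, count, package) into monthly totals.
monthly_totals <- function(daily) {
  daily$month <- format(as.Date(daily$date), "%Y-%m")
  aggregate(count ~ package + month, data = daily, FUN = sum)
}

# Tiny hand-made example (not real download data) to show the aggregation.
toy <- data.frame(
  date    = as.Date(c("2020-01-30", "2020-01-31", "2020-02-01", "2020-01-31")),
  count   = c(10, 20, 5, 7),
  package = c("pkgA", "pkgA", "pkgA", "pkgB")
)
toy_monthly <- monthly_totals(toy)
print(toy_monthly)
```

With a network connection, something like monthly_totals(fetch_daily(c("rstan", "Rcpp", "ggplot2"), from = "2016-01-01", to = "2020-12-31")) should give a table that can be plotted with a log-scaled y axis.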


Slopes added and extrapolated to 2030.
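A rough sketch of what fitting a slope on the log scale and extrapolating to 2030 can look like in base R. The data below are synthetic (a made-up package growing by a factor of about 1.58 per year), and the 95% prediction interval is my assumption about what a ribbon could represent, not necessarily what the plot uses.

```r
# Synthetic monthly download counts for a made-up package (not real data).
set.seed(1)
year  <- 2016 + (seq_len(60) - 0.5) / 12          # mid-month times, 2016 to 2020
count <- round(1000 * 10^(0.2 * (year - 2016)) * exp(rnorm(60, sd = 0.1)))
downloads <- data.frame(year = year, count = count)

# A straight line on the log10 scale corresponds to exponential growth.
fit <- lm(log10(count) ~ year, data = downloads)

# Extrapolate to 2030 and back-transform the fit and its prediction interval.
future   <- data.frame(year = 2021:2030)
pred     <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
forecast <- cbind(future, 10^pred)                # columns: year, fit, lwr, upr
print(forecast)
```

The interval only reflects noise around the fitted line; it says nothing about whether the trend itself will hold for another decade.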


I can add more R packages if desired; happy to answer questions.

Breck

3 Likes

I would not include the ribbons in the last plot. But here is an alternative way of looking at things from an infrastructure perspective: pagerank, which is essentially Google’s original algorithm applied to CRAN packages. If you do

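# Install the pagerank package from its GitHub repository
install.packages("https://github.com/andrie/pagerank/archive/master.tar.gz",
                 repos = NULL)
library(pagerank)
# Compute the pagerank of every package on the given CRAN mirror, highest first
pr <- compute_pagerank("https://cran.rstudio.com", decreasing = TRUE)
head(pr, 80)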

The rstan package is currently 80th out of 16119 packages

         Rcpp       ggplot2          MASS         dplyr        Matrix      magrittr       stringr       mvtnorm 
 0.0292606801  0.0141961468  0.0127345517  0.0107156933  0.0075686855  0.0069382619  0.0061439859  0.0055342496 
   data.table RcppArmadillo      jsonlite        tibble         rlang      survival          plyr          httr 
 0.0054958643  0.0054150667  0.0053252688  0.0051572938  0.0051009970  0.0050220072  0.0046550127  0.0046362659 
        tidyr         purrr         shiny       foreach        igraph            sp       lattice      reshape2 
 0.0042626508  0.0040827308  0.0039510623  0.0037369293  0.0037330274  0.0034590004  0.0031187486  0.0030352190 
   doParallel        raster     lubridate        scales           zoo  RColorBrewer          coda            R6 
 0.0026634119  0.0024912280  0.0024265575  0.0023214226  0.0023009021  0.0022627788  0.0021832618  0.0021770948 
         xml2     gridExtra         knitr        glmnet          nlme          boot         readr      numDeriv 
 0.0020232967  0.0020228031  0.0019229942  0.0018524435  0.0018318426  0.0018191508  0.0018061785  0.0017807931 
    RcppEigen           XML          mgcv          lme4           ape        digest    assertthat         RCurl 
 0.0017152923  0.0017104067  0.0016335523  0.0016284246  0.0016136271  0.0015780888  0.0015517581  0.0015164083 
       pracma          glue           rgl        Rdpack         Hmisc        gtools     htmltools            BH 
 0.0015020734  0.0014736499  0.0014718172  0.0014712523  0.0014693716  0.0014411179  0.0014252742  0.0013906720 
         curl       cluster         rgdal       stringi           car         rJava     rmarkdown       Formula 
 0.0013880809  0.0013225083  0.0012620477  0.0012600271  0.0012570148  0.0012358317  0.0012285247  0.0012176682 
  htmlwidgets         abind            sf        crayon        fields         e1071        plotly           DBI 
 0.0012108465  0.0012059995  0.0011814621  0.0011472009  0.0011197394  0.0010698513  0.0010401082  0.0009954857 
    checkmate          nnet      quadprog   matrixStats  randomForest         vegan         rpart         rstan 
 0.0009791682  0.0009676665  0.0009645115  0.0009575255  0.0009241805  0.0009160678  0.0009123694  0.0009026298 

which is a pagerank that is 14.55 times that of the average package

pr["rstan"] / mean(pr) # 14.54949 
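For anyone curious what a pagerank computation does under the hood, here is a minimal base-R sketch of the power-iteration idea on a made-up five-package dependency graph. This is not the code of the pagerank package; the package names, the damping factor of 0.85 and the way packages without dependencies are handled are assumptions for illustration.

```r
# Toy dependency edges: "from" depends on "to". Package names are made up.
edges <- data.frame(
  from = c("pkgB", "pkgC", "pkgD", "pkgD", "pkgE"),
  to   = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  stringsAsFactors = FALSE
)

toy_pagerank <- function(edges, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  pkgs <- sort(unique(c(edges$from, edges$to)))
  n <- length(pkgs)
  # A[i, j] = 1 if package j depends on package i
  A <- matrix(0, n, n, dimnames = list(pkgs, pkgs))
  A[cbind(match(edges$to, pkgs), match(edges$from, pkgs))] <- 1
  out_deg <- colSums(A)
  # Each package splits its "vote" evenly among the packages it depends on
  M <- sweep(A, 2, ifelse(out_deg == 0, 1, out_deg), "/")
  # Packages with no dependencies spread their vote uniformly
  M[, out_deg == 0] <- 1 / n
  r <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    r_new <- as.vector(damping * M %*% r + (1 - damping) / n)
    converged <- sum(abs(r_new - r)) < tol
    r <- r_new
    if (converged) break
  }
  names(r) <- pkgs
  sort(r, decreasing = TRUE)
}

toy_pr <- toy_pagerank(edges)
print(toy_pr)
print(toy_pr["pkgA"] / mean(toy_pr))  # same kind of ratio as for rstan above
```

The scores sum to one, so dividing by the mean is the same as multiplying by the number of packages.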

Most of the packages that are more important than rstan on this metric are utilities rather than statistics packages. MASS, survival, mgcv, cluster, nnet, and rpart are statistical packages that come with the default installation of R, which means they are broadly useful but have a leg up on all the non-recommended packages. The other packages of note, I think, are

  • coda (31st): Has been around a long time but the posterior package should be better
  • glmnet (36th): A supervised learning package that emphasizes elastic net penalization
  • lme4 (44th): A package for estimating frequentist hierarchical models
  • randomForest (77th): The canonical implementation (in R) of the most popular supervised learning approach these days

I think it is amazing that (R)Stan is essentially as fundamental to Bayesian modeling as randomForest is to supervised learning, but Bayesian modeling has been overtaken (by a lot) by supervised learning approaches in the decade since Stan was developed.

7 Likes

Here you go:

I charted the above packages with rstan included and Rcpp as a baseline.

Sorry for the ugly label placement.
RStan is doing really well; the dumb interpretation of the extrapolated slopes is that RStan passes Rcpp in 2028.
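The "passing year" reading comes down to where two straight lines on the log scale intersect. A small sketch with two synthetic, noise-free trends (made-up numbers, constructed so the lines cross in 2026, not the real rstan and Rcpp series); crossing_year is a hypothetical helper name.

```r
# Two synthetic packages: one small but fast-growing, one large but slow-growing.
years <- 2016:2020
fast <- data.frame(year = years, count = 10^(3 + 0.25 * (years - 2016)))
slow <- data.frame(year = years, count = 10^(5 + 0.05 * (years - 2016)))

fit_fast <- lm(log10(count) ~ year, data = fast)
fit_slow <- lm(log10(count) ~ year, data = slow)

# Solve a0 + a1 * x = b0 + b1 * x for x, the year where the fitted lines meet.
crossing_year <- function(fit_a, fit_b) {
  a <- coef(fit_a)
  b <- coef(fit_b)
  (b[["(Intercept)"]] - a[["(Intercept)"]]) / (a[["year"]] - b[["year"]])
}

cross <- crossing_year(fit_fast, fit_slow)
print(cross)
```

With real data the crossing year is very sensitive to small changes in either slope, which is one reason to treat it as a dumb interpretation.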

I’ll be posting time series of citation counts, which is where one really sees pytorch/tensorflow eclipsing the Stan ecosystem.

which(names(head(pr,10000)) %in% c("rstan","rstantools","bayesplot","rstanarm","loo","shinystan","projpred"))

This gives placements 80, 282, 456, 509, 510, 793, 6526 out of 16119, so the other packages are doing quite well, too (projpred being the most specialized of all these packages).
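One caveat with a which(... %in% ...) call: it returns positions in increasing order, not in the order the names were given, so the output alone does not say which placement belongs to which package. A small sketch using match() keeps the labels attached; pr_toy is a made-up stand-in for the real pr vector, and named_placements is a hypothetical helper.

```r
# Made-up stand-in for the sorted pagerank vector (highest first); not real values.
pr_toy <- c(Rcpp = 0.40, ggplot2 = 0.25, rstan = 0.15, loo = 0.12, bayesplot = 0.08)

named_placements <- function(pr, pkgs) {
  # match() gives the position of each name in names(pr), in the order of pkgs
  ranks <- match(pkgs, names(pr))
  names(ranks) <- pkgs
  sort(ranks)  # sort() drops NA, i.e. packages that are not in pr
}

placements <- named_placements(pr_toy, c("bayesplot", "rstan", "loo", "notapkg"))
print(placements)
```

The same function applied to the real pr vector and the seven Stan package names would return the placements with their package names.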

EDIT: fixed the total number of packages

2 Likes

Here the above packages ("rstan", "rstantools", "bayesplot", "rstanarm", "loo", "shinystan", "projpred") are plotted as time series. I am happy to share the code, but it is a bit of a train set, so let me know. Loo is running a close second.

1 Like

Now, I would remove the lines and restrict the time interval to the past. Downloads is a pretty crude metric. I think a pagerank style approach has the advantage of indicating that other developers (most of whom are not at Columbia) choose to build off of RStan.

1 Like

@stevebronder and some others suggested some additional packages. I pulled some more packages from the list at: https://www.ubuntupit.com/best-r-machine-learning-packages/

Log scale

One year later, the pagerank placements among 18472 CRAN packages are

rstan        77
rstantools  220
loo         357
bayesplot   389
brms        416
rstanarm    548
shinystan   938
projpred   1306
posterior  2970

so the pageranks have gone up. For comparison, here are the pagerank placements of some of the packages Breck plotted above:

glmnet  33
coda    34
lme4    45
mgcv    47
e1071   63
caret   83

1 Like

Current pagerank placements out of 20123 CRAN packages (it is possible that last time I had some labels in the wrong order)

     rstan    70
rstantools   164
 bayesplot   334
      brms   373
  rstanarm   377
       loo   497
 shinystan   648
  projpred  1288
 posterior  1437

For comparison, some other stats packages

glmnet    32
  coda    42
  lme4    50
  mgcv    54
 rjags   141
 greta  2097

1 rstan      775889
2 loo        560694
3 bayesplot  536456
4 rstantools 469867
5 posterior  338881
6 shinystan  294260
7 brms       291736
8 rstanarm   160874
9 projpred    51603

Pagerank is about dependencies, and the number of downloads during the last 12 months tells a quite different story (e.g. posterior might be downloaded a lot because of CmdStanR, which is not on CRAN)

 lme4       6104089
 mgcv       1399456
 coda       1243240
 glmnet     1235058
 rstan       775889
 loo         560694
 bayesplot   536456
 rstantools  469867
 posterior   338881
 shinystan   294260
 brms        291736
 rjags       237991
 rstanarm    160874
 projpred     51603
 greta         6437
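
To put a number on how different the two views are, here is a small sketch that takes the pagerank placements and 12-month download counts from the tables above, ranks the 15 packages within this list on each metric, and computes a Spearman rank correlation. The column names are made up for illustration.

```r
# Pagerank placements and 12-month download counts copied from the tables above.
stats <- data.frame(
  package = c("rstan", "rstantools", "bayesplot", "brms", "rstanarm", "loo",
              "shinystan", "projpred", "posterior", "glmnet", "coda", "lme4",
              "mgcv", "rjags", "greta"),
  pagerank_place = c(70, 164, 334, 373, 377, 497, 648, 1288, 1437,
                     32, 42, 50, 54, 141, 2097),
  downloads = c(775889, 469867, 536456, 291736, 160874, 560694, 294260,
                51603, 338881, 1235058, 1243240, 6104089, 1399456, 237991, 6437)
)

# Order within this list of 15 packages (1 = best) on each metric.
stats$pr_order <- rank(stats$pagerank_place)
stats$dl_order <- rank(-stats$downloads)

# Positive shift = does better on downloads than on pagerank.
stats$shift <- stats$pr_order - stats$dl_order

rho <- cor(stats$pr_order, stats$dl_order, method = "spearman")
print(stats[order(-stats$shift), c("package", "pr_order", "dl_order", "shift")])
print(rho)
```

Packages with a large positive shift, such as posterior, are the ones that are downloaded much more than their dependency-based placement would suggest.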

2 Likes