What are the top 100 (most downloaded) R packages in 2013? Thanks to RStudio’s recent release of the log files from their “0-cloud” CRAN mirror (which do not include downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for the months of January through May)!
By relying on the nice code that Felix Schonbrodt recently wrote for tracking package downloads, I have updated my installr R package with functions that enable the user to easily download and visualize the popularity of R packages over time. In this post I will share some nice plots and quick insights that can be drawn from this great data. The code for this analysis is given at the end of this post.
Top 8 most downloaded R packages – downloads over time
Let’s first have a look at the number of downloads per day over these 5 months for the top 8 most downloaded packages:
We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having far fewer downloads than other days. This is not surprising, since we know that the countries which use R the most have these days as rest days (see James Cheshire’s world map of R users). It is also interesting to note how some packages had exceptional peaks on some dates. For example, I wonder what happened on January 23rd 2013 to make the digest package suddenly get so many downloads, or why colorspace started getting more downloads from April 15th 2013.
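A minimal sketch of how such a weekly pattern can be checked in base R. The download log here is invented for illustration (it is not the RStudio data), and the weekend/weekday counts are assumptions chosen to mimic the effect:

```r
# Toy download log: one row per download (hypothetical data).
days <- seq(as.Date("2013-01-07"), as.Date("2013-01-20"), by = "day")  # two full weeks
is_weekend <- as.POSIXlt(days)$wday %in% c(0, 6)   # wday: 0 = Sunday, 6 = Saturday
n_per_day <- ifelse(is_weekend, 20L, 100L)         # assumed: fewer weekend downloads
downloads_log <- data.frame(date = rep(days, times = n_per_day), package = "plyr")

# Count downloads per calendar day.
counts <- table(downloads_log$date)
per_day <- data.frame(date = as.Date(names(counts)),
                      downloads = as.vector(counts))

# Average number of downloads for each day of the week (0 = Sunday, ..., 6 = Saturday).
per_day$wday <- as.POSIXlt(per_day$date)$wday
by_wday <- tapply(per_day$downloads, per_day$wday, mean)
print(by_wday)
```

Using the numeric `wday` field rather than `weekdays()` keeps the result independent of the locale.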
“Family tree” of the top 100 most downloaded R packages
We can extract from this data the top 100 most downloaded R packages. Moreover, we can create a matrix showing, for each package, which of our unique ids (censored IP addresses) has downloaded it. Using this indicator matrix, we can think of the “similarity” (or distance) between every two packages, and based on that we can create a hierarchical clustering of the packages, showing which packages “go along” with one another.
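A toy version of this idea in base R. The data are invented, and the binary (Jaccard) distance and average linkage are assumptions made for this sketch, not necessarily the choices behind the dendrogram in this post:

```r
# Toy log: which anonymised id downloaded which package (hypothetical data).
downloads_log <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  package = c("ggplot2", "plyr", "scales",
              "ggplot2", "plyr",
              "zoo", "xts", "plyr",
              "zoo", "xts")
)

# Indicator matrix: rows are packages, columns are ids,
# 1 = that id downloaded the package at least once.
package_ip_id <- unclass(table(downloads_log$package, downloads_log$ip_id)) > 0
mode(package_ip_id) <- "numeric"

# Distance between packages: share of ids that downloaded only one of the two,
# among the ids that downloaded at least one of them.
d <- dist(package_ip_id, method = "binary")

# Hierarchical clustering and the resulting "family tree".
hc <- hclust(d, method = "average")
plot(as.dendrogram(hc))

# Cut the tree into two groups of related packages.
groups <- cutree(hc, k = 2)
print(groups)
```

In this toy example xts and zoo are always downloaded by the same ids, so their distance is zero and they are merged first.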
With this analysis, you can locate a package on the list which you often use, and then see which other packages are “related” to that package. If you don’t know one of those packages, consider having a look at it, since other R users are clearly finding the two packages to be “of use”.
Such an analysis can (and should!) be extended. For example, we can imagine creating a “suggest a package” feature based on this data, utilizing the packages which you use, the OS that you use, and other parameters. But such coding is beyond the scope of this post.
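Still, the core of such a feature fits in a few lines. A rough sketch on invented data: `suggest_packages()` is a hypothetical helper written for this illustration (it is not part of installr), and scoring by raw co-download counts is only one of many possible choices:

```r
# Toy log: which anonymised id downloaded which package (hypothetical data).
downloads_log <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  package = c("ggplot2", "plyr", "scales",
              "ggplot2", "plyr",
              "zoo", "xts", "plyr",
              "zoo", "xts")
)

# Indicator matrix (packages x ids), as 0/1 numbers.
ind <- unclass(table(downloads_log$package, downloads_log$ip_id)) > 0
mode(ind) <- "numeric"

# co_downloads[i, j] = number of ids that downloaded both package i and package j.
co_downloads <- tcrossprod(ind)

# Hypothetical helper: rank the packages most often downloaded together
# with the ones you already use.
suggest_packages <- function(my_packages, co, n = 3) {
  scores <- colSums(co[my_packages, , drop = FALSE])
  scores <- scores[setdiff(names(scores), my_packages)]  # drop what you already have
  scores <- scores[scores > 0]                           # drop packages never co-downloaded
  head(sort(scores, decreasing = TRUE), n)
}

print(suggest_packages("zoo", co_downloads))
```

A real recommender would also need to normalise for overall popularity, otherwise very common packages would be suggested to everyone.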
Here is the “family tree” (dendrogram) of related packages:
To make it easier to navigate, here is a table of the top 100 R packages, with their titles and download counts:
Rank | Package | Title | Downloads |
---|---|---|---|
1 | plyr | Tools for splitting, applying and combining data | 84049 |
2 | digest | Create cryptographic hash digests of R objects | 83192 |
3 | ggplot2 | An implementation of the Grammar of Graphics | 82768 |
4 | colorspace | Color Space Manipulation | 81901 |
5 | stringr | Make it easier to work with strings | 77658 |
6 | RColorBrewer | ColorBrewer palettes | 66783 |
7 | reshape2 | Flexibly reshape data: a reboot of the reshape package | 64911 |
8 | zoo | S3 Infrastructure for Regular and Irregular Time Series (Z’s ordered observations) | 60844 |
9 | proto | Prototype object-based programming | 59043 |
10 | scales | Scale functions for graphics | 58369 |
11 | car | Companion to Applied Regression | 57453 |
12 | dichromat | Color Schemes for Dichromats | 56624 |
13 | gtable | Arrange grobs in tables | 54431 |
14 | munsell | Munsell colour system | 53183 |
15 | labeling | Axis Labeling | 51877 |
16 | Hmisc | Harrell Miscellaneous | 47836 |
17 | rJava | Low-level R to Java interface | 47731 |
18 | mvtnorm | Multivariate Normal and t Distributions | 46884 |
19 | bitops | Bitwise Operations | 45689 |
20 | rgl | 3D visualization device system (OpenGL) | 41001 |
21 | foreign | Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, .. | 37849 |
22 | XML | Tools for parsing and generating XML within R and S-Plus | 37153 |
23 | lattice | Lattice Graphics | 36597 |
24 | e1071 | Misc Functions of the Department of Statistics (e1071), TU Wien | 35180 |
25 | gtools | Various R programming tools | 35028 |
26 | sp | classes and methods for spatial data | 34786 |
27 | gdata | Various R programming tools for data manipulation | 34262 |
28 | Rcpp | Seamless R and C++ Integration | 33929 |
29 | MASS | Support Functions and Datasets for Venables and Ripley’s MASS | 33667 |
30 | Matrix | Sparse and Dense Matrix Classes and Methods | 30740 |
31 | lmtest | Testing Linear Regression Models | 30319 |
32 | survival | Survival Analysis | 30186 |
33 | caTools | Tools: moving window statistics, GIF, Base64, ROC AUC, etc | 29945 |
34 | multcomp | Simultaneous Inference in General Parametric Models | 29871 |
35 | RCurl | General network (HTTP/FTP/…) client interface for R | 28866 |
36 | knitr | A general-purpose package for dynamic report generation in R | 28104 |
37 | xtable | Export tables to LaTeX or HTML | 28091 |
38 | xts | eXtensible Time Series | 28058 |
39 | rpart | Recursive Partitioning | 27812 |
40 | evaluate | Parsing and evaluation tools that provide more details than the default | 27617 |
41 | RODBC | ODBC Database Access | 26131 |
42 | quadprog | Functions to solve Quadratic Programming Problems | 25433 |
43 | tseries | Time series analysis and computational finance | 25144 |
44 | DBI | R Database Interface | 24793 |
45 | nlme | Linear and Nonlinear Mixed Effects Models | 24360 |
46 | lme4 | Linear mixed-effects models using S4 classes | 24199 |
47 | reshape | Flexibly reshape data | 24118 |
48 | sandwich | Robust Covariance Matrix Estimators | 24016 |
49 | leaps | regression subset selection | 23666 |
50 | gplots | Various R programming tools for plotting data | 23251 |
51 | abind | Combine multi-dimensional arrays | 22758 |
52 | randomForest | Breiman and Cutler’s random forests for classification and regression | 22401 |
53 | Rcmdr | R Commander | 22131 |
54 | coda | Output analysis and diagnostics for MCMC | 21900 |
55 | maps | Draw Geographical Maps | 21550 |
56 | igraph | Network analysis and visualization | 21423 |
57 | formatR | Format R Code Automatically | 21049 |
58 | maptools | Tools for reading and handling spatial objects | 20957 |
59 | RSQLite | SQLite interface for R | 19671 |
60 | psych | Procedures for Psychological, Psychometric, and Personality Research | 19545 |
61 | KernSmooth | Functions for kernel smoothing for Wand & Jones (1995) | 19166 |
62 | rgdal | Bindings for the Geospatial Data Abstraction Library | 19064 |
63 | RcppArmadillo | Rcpp integration for Armadillo templated linear algebra library | 18899 |
64 | effects | Effect Displays for Linear, Generalized Linear, Multinomial-Logit, Proportional-Odds Logit Models and Mixed-Effects Models | 18843 |
65 | sem | Structural Equation Models | 18711 |
66 | vcd | Visualizing Categorical Data | 18589 |
67 | XLConnect | Excel Connector for R | 18230 |
68 | markdown | Markdown rendering for R | 18211 |
69 | timeSeries | Rmetrics – Financial Time Series Objects | 17932 |
70 | timeDate | Rmetrics – Chronological and Calendar Objects | 17838 |
71 | RJSONIO | Serialize R objects to JSON, JavaScript Object Notation | 17801 |
72 | cluster | Cluster Analysis Extended Rousseeuw et al | 17136 |
73 | scatterplot3d | 3D Scatter Plot | 17110 |
74 | nnet | Feed-forward Neural Networks and Multinomial Log-Linear Models | 17074 |
75 | fBasics | Rmetrics – Markets and Basic Statistics | 16278 |
76 | forecast | Forecasting functions for time series and linear models | 15638 |
77 | quantreg | Quantile Regression | 15509 |
78 | foreach | Foreach looping construct for R | 15405 |
79 | chron | Chronological objects which can handle dates and times | 15226 |
80 | plotrix | Various plotting functions | 15142 |
81 | matrixcalc | Collection of functions for matrix calculations | 15107 |
82 | aplpack | Another Plot PACKage: stem.leaf, bagplot, faces, spin3R, and some slider functions | 14654 |
83 | strucchange | Testing, Monitoring, and Dating Structural Changes | 14503 |
84 | iterators | Iterator construct for R | 14449 |
85 | mgcv | Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation | 14186 |
86 | kernlab | Kernel-based Machine Learning Lab | 14135 |
87 | SparseM | Sparse Linear Algebra | 13921 |
88 | tree | Classification and regression trees | 13871 |
89 | robustbase | Basic Robust Statistics | 13778 |
90 | vegan | Community Ecology Package | 13686 |
91 | devtools | Tools to make developing R code easier | 13488 |
92 | latticeExtra | Extra Graphical Utilities Based on Lattice | 13253 |
93 | modeltools | Tools and Classes for Statistical Models | 13233 |
94 | xlsx | Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files | 13097 |
95 | slam | Sparse Lightweight Arrays and Matrices | 13060 |
96 | TTR | Technical Trading Rules | 12894 |
97 | quantmod | Quantitative Financial Modelling Framework | 12892 |
98 | relimp | Relative Contribution of Effects in a Regression Model | 12692 |
99 | akima | Interpolation of irregularly spaced data | 12680 |
100 | memoise | Memoise functions | 12600 |
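A ranking like the one in this table can be produced from a raw log with base R alone. A small sketch on made-up data (the `package` column name is an assumption about the log layout):

```r
# Toy log: one row per download (hypothetical data).
downloads_log <- data.frame(
  package = c("plyr", "ggplot2", "plyr", "digest", "plyr", "ggplot2")
)

# Count downloads per package and keep the most downloaded ones.
downloads <- sort(table(downloads_log$package), decreasing = TRUE)
top_n <- head(downloads, 2)

# Arrange the result as a ranked table.
top_table <- data.frame(Rank = seq_along(top_n),
                        Package = names(top_n),
                        Downloads = as.vector(top_n))
print(top_table)
```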
R code
I hope you found this post useful, and will find new ways of using this interesting dataset. Note that there are issues with how much these numbers represent the “truth”, but for now, they are the most interesting estimate of it that I know of.
# get the latest installr package:
if (!require('devtools')) install.packages('devtools'); require('devtools')
install_github('talgalili/installr') # current devtools versions expect "user/repo"
require(installr)
# read the data (this will take a LOOOONG time)
RStudio_CRAN_data_folder 0)
mode(package_ip_id) <- "numeric"
dend_package_ip_id
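Independently of installr, the daily log files can also be fetched with base R. This is a sketch under assumptions: the URL layout (one gzipped CSV per day under a per-year folder) is what I believe the RStudio log site uses, and `log_url()` and `read_day()` are hypothetical helpers written for this example:

```r
# Build the (assumed) URL of the log file for each given day.
log_url <- function(day) {
  day <- as.Date(day)
  sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz",
          format(day, "%Y"), format(day, "%Y-%m-%d"))
}

# Download one day's log (unless already cached in `folder`) and read it.
read_day <- function(day, folder = tempdir()) {
  dest <- file.path(folder, basename(log_url(day)))
  if (!file.exists(dest)) {
    download.file(log_url(day), dest, mode = "wb", quiet = TRUE)
  }
  read.csv(gzfile(dest), stringsAsFactors = FALSE)
}

days <- seq(as.Date("2013-01-01"), as.Date("2013-01-03"), by = "day")
urls <- log_url(days)
print(urls)

# Needs network access and can take a long time for long date ranges, so not run here:
# logs <- do.call(rbind, lapply(as.character(days), read_day))
```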
P.S.: This post is a follow-up to my discovering, two days ago, how many people use my R package.
Tal,
Nice work! I’ve dreamed of having a decent count of R users to add to The Popularity of Data Analysis Software (https://bit.ly/statpop) since I first wrote it. As use of RStudio’s CRAN grows, I’ll finally have that and this wonderful list of packages as well. The package list will be a great help in optimizing the order in which I learn new packages.
Thanks!
Bob
An important point to note is that these are just for downloads off the RStudio CRAN mirror and there are a *lot* of other mirrors to choose from that might be used instead of their service.
Hi Gavin,
I felt I had been clear about it when I wrote that this is the data from only one CRAN mirror. But maybe I should clarify that a bit further. Thanks.
It’s unlikely that the pattern would change if a different CRAN mirror were used. The listing above would be pretty much the same.
The computation of package distances is a very nice idea! I agree, this analysis could be pushed further towards a recommendation engine.
Thanks Felix,
I hope to play with it, or see others playing with it some more…
You mention the peak on the 23rd of January for digest. This probably has to do with a new version which was released on the 21st. With the 21st being in the weekend and the time it takes for the package to reach the mirror, the peak is probably due to package updates. You also see a (smaller) peak around the 16th of February.
So the downloads are also related to the release of new versions of the package and the release of new R versions (e.g. the ‘anomaly’ around the 3rd of April).
Hi Jan van der Laan,
Good points – thanks for mentioning them 🙂
Compliments! Very interesting. I liked your graph; can you give us the code?
Hi Paolo,
The full code for the graphs is given at the end of the post.
That code relies on the {installr} package, so you’d need to install it first. If you are interested in the code from the package, you can see it here:
https://github.com/talgalili/installr/blob/master/R/RStudio_CRAN_data.r
With regards,
Tal
Commas would be nice.
Commas where?
Commas would be nice.
I am looking for the AED package in R but could not find it. I would appreciate it if anyone could suggest a reliable site to download the package from.
Thanks
BK
RStudio_CRAN_data_folder 0) <– this code throws me an error
The code seems incomplete. Can you provide a fully functional chunk of code, please?
The over-time top 8 most downloaded must’ve been by students completing their assignments!!!