
Top 100 R packages for 2013 (Jan-May)!

What are the top 100 (most downloaded) R packages in 2013? Thanks to RStudio's recent release of the log files from their “0-cloud” CRAN mirror (which do not include downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for January through May)!

Building on the nice code that Felix Schönbrodt recently wrote for tracking package downloads, I have updated my installr R package with functions that enable the user to easily download and visualize the popularity of R packages over time. In this post I will share some nice plots and quick insights drawn from this great data. The code for this analysis is given at the end of this post.
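To give a feel for what fetching the logs involves, here is a minimal sketch that only builds the list of daily log-file URLs for a date range. The URL pattern (one gzipped CSV per day, under a per-year folder on cran-logs.rstudio.com) and the helper name log_urls are assumptions made for illustration, not code taken from installr, and nothing is actually downloaded.

```r
# Sketch only: build the URLs of the daily log files for a date range.
# ASSUMPTION: the mirror publishes one file per day as <base>/<year>/<date>.csv.gz
# 'log_urls' is a hypothetical helper, not an installr function.
log_urls <- function(start, end, base_url = "http://cran-logs.rstudio.com") {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  paste0(base_url, "/", format(days, "%Y"), "/", days, ".csv.gz")
}

urls <- log_urls("2013-01-01", "2013-01-03")
print(urls)
```

Each URL could then be passed to download.file() and the result read with read.csv(), which is roughly the work that a download helper has to do for every day in the range.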

Top 8 most downloaded R packages – downloads over time

Let’s first have a look at the number of downloads per day over these 5 months for the top 8 most downloaded packages:

top_8_R_Packages_over_time

We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having far fewer downloads than other days. This is not surprising, since the countries that use R the most have these days as rest days (see James Cheshire’s world map of R users). It is also interesting to note how some packages had exceptional peaks on some dates. For example, I wonder what happened on January 23rd 2013 that the digest package suddenly got so many downloads, or why colorspace started getting more downloads from April 15th 2013.
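The weekly pattern can be checked numerically by averaging the daily download counts per weekday. The sketch below uses a made-up toy log (one row per download, with a hypothetical date column and invented volumes), so the numbers illustrate the computation only and are not taken from the real data.

```r
# Toy stand-in for the log data: one row per download.
# ASSUMPTION: the column name 'date' and all volumes are invented for illustration.
days <- seq(as.Date("2013-01-07"), as.Date("2013-02-03"), by = "day")  # four full weeks
is_weekend <- format(days, "%u") %in% c("6", "7")  # ISO weekday: 6 = Saturday, 7 = Sunday
n_per_day <- ifelse(is_weekend, 20, 100)           # weekend days made quieter on purpose
toy_log <- data.frame(date = rep(days, times = n_per_day))

# Count downloads per day, then average those counts within each weekday
downloads_per_day <- table(toy_log$date)
weekday <- format(as.Date(names(downloads_per_day)), "%u")
avg_by_weekday <- tapply(as.vector(downloads_per_day), weekday, mean)
print(avg_by_weekday)
```

Using the ISO weekday number ("%u", where Monday is 1) rather than weekday names keeps the result independent of the locale.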

“Family tree” of the top 100 most downloaded R packages

We can extract from this data the top 100 most downloaded R packages. Moreover, we can create a matrix showing, for each package, which of our unique ids (censored IP addresses) has downloaded it. Using this indicator matrix, we can think of the “similarity” (or distance) between every pair of packages, and based on that we can create a hierarchical clustering of the packages, showing which packages “go along” with one another.
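The idea of the indicator matrix and the clustering can be sketched on a tiny made-up log. The column names (ip_id, package), the binary (Jaccard-type) distance and the average linkage are all assumptions chosen for illustration; the post does not state which distance or linkage was actually used.

```r
# Toy log: one row per download.
# ASSUMPTION: column names and data are invented for illustration.
toy_log <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
  package = c("ggplot2", "plyr", "scales",
              "ggplot2", "plyr", "scales",
              "zoo", "xts",
              "zoo", "xts",
              "ggplot2", "plyr"),
  stringsAsFactors = FALSE
)

# Indicator matrix: rows = packages, columns = ids,
# 1 if that id downloaded the package at least once, 0 otherwise
counts <- unclass(table(toy_log$package, toy_log$ip_id))
package_ip_id <- (counts > 0) * 1

# Distance between packages based on which ids downloaded them,
# followed by hierarchical clustering and conversion to a dendrogram
d <- dist(package_ip_id, method = "binary")
hc <- hclust(d, method = "average")
dend_package_ip_id <- as.dendrogram(hc)

# Cut the tree into two groups to see which packages "go along" together
groups <- cutree(hc, k = 2)
print(groups)
```

Calling plot(dend_package_ip_id) would draw the corresponding tree. In this toy example the graphics-related packages end up in one branch and the time-series packages in the other.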

With this analysis, you can locate a package on the list that you often use, and then see which other packages are “related” to it. If you don’t know one of those related packages, consider having a look at it, since other R users are clearly finding the two packages to be “of use” together.

Such analysis can (and should!) be extended. For example, we can imagine creating a “suggest a package” feature based on this data, utilizing the packages you use, the OS you use, and other parameters. But such coding is beyond the scope of this post.
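Purely as an illustration of the simplest form such a feature could take, the sketch below ranks candidate packages by how many ids downloaded them together with a given package. The toy data, the column names and the helper suggest_packages are hypothetical, and a real recommender would need far more care (OS, popularity correction, and so on).

```r
# Toy log: one row per download.
# ASSUMPTION: column names and data are invented for illustration.
toy_log <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
  package = c("ggplot2", "plyr", "scales",
              "ggplot2", "plyr", "scales",
              "zoo", "xts",
              "zoo", "xts",
              "ggplot2", "plyr"),
  stringsAsFactors = FALSE
)

# Indicator matrix with rows = ids and columns = packages
m <- (unclass(table(toy_log$ip_id, toy_log$package)) > 0) * 1

# co[i, j] = number of ids that downloaded both package i and package j
co <- crossprod(m)

# 'suggest_packages' is a hypothetical helper: the n packages most often
# downloaded by the same ids as 'pkg', ignoring packages never seen with it
suggest_packages <- function(pkg, co, n = 3) {
  scores <- co[pkg, colnames(co) != pkg]
  scores <- scores[scores > 0]
  head(names(sort(scores, decreasing = TRUE)), n)
}

print(suggest_packages("ggplot2", co))
```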

Here is the “family tree” (dendrogram) of related packages:

Family_tree_of_Top_100_R_Packages

To make it easier to navigate, here is a table listing the top 100 R packages:

Rank | Package | Title | Downloads
1 | plyr | Tools for splitting, applying and combining data | 84049
2 | digest | Create cryptographic hash digests of R objects | 83192
3 | ggplot2 | An implementation of the Grammar of Graphics | 82768
4 | colorspace | Color Space Manipulation | 81901
5 | stringr | Make it easier to work with strings | 77658
6 | RColorBrewer | ColorBrewer palettes | 66783
7 | reshape2 | Flexibly reshape data: a reboot of the reshape package | 64911
8 | zoo | S3 Infrastructure for Regular and Irregular Time Series (Z’s ordered observations) | 60844
9 | proto | Prototype object-based programming | 59043
10 | scales | Scale functions for graphics | 58369
11 | car | Companion to Applied Regression | 57453
12 | dichromat | Color Schemes for Dichromats | 56624
13 | gtable | Arrange grobs in tables | 54431
14 | munsell | Munsell colour system | 53183
15 | labeling | Axis Labeling | 51877
16 | Hmisc | Harrell Miscellaneous | 47836
17 | rJava | Low-level R to Java interface | 47731
18 | mvtnorm | Multivariate Normal and t Distributions | 46884
19 | bitops | Bitwise Operations | 45689
20 | rgl | 3D visualization device system (OpenGL) | 41001
21 | foreign | Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, .. | 37849
22 | XML | Tools for parsing and generating XML within R and S-Plus | 37153
23 | lattice | Lattice Graphics | 36597
24 | e1071 | Misc Functions of the Department of Statistics (e1071), TU Wien | 35180
25 | gtools | Various R programming tools | 35028
26 | sp | classes and methods for spatial data | 34786
27 | gdata | Various R programming tools for data manipulation | 34262
28 | Rcpp | Seamless R and C++ Integration | 33929
29 | MASS | Support Functions and Datasets for Venables and Ripley’s MASS | 33667
30 | Matrix | Sparse and Dense Matrix Classes and Methods | 30740
31 | lmtest | Testing Linear Regression Models | 30319
32 | survival | Survival Analysis | 30186
33 | caTools | Tools: moving window statistics, GIF, Base64, ROC AUC, etc | 29945
34 | multcomp | Simultaneous Inference in General Parametric Models | 29871
35 | RCurl | General network (HTTP/FTP/…) client interface for R | 28866
36 | knitr | A general-purpose package for dynamic report generation in R | 28104
37 | xtable | Export tables to LaTeX or HTML | 28091
38 | xts | eXtensible Time Series | 28058
39 | rpart | Recursive Partitioning | 27812
40 | evaluate | Parsing and evaluation tools that provide more details than the default | 27617
41 | RODBC | ODBC Database Access | 26131
42 | quadprog | Functions to solve Quadratic Programming Problems | 25433
43 | tseries | Time series analysis and computational finance | 25144
44 | DBI | R Database Interface | 24793
45 | nlme | Linear and Nonlinear Mixed Effects Models | 24360
46 | lme4 | Linear mixed-effects models using S4 classes | 24199
47 | reshape | Flexibly reshape data | 24118
48 | sandwich | Robust Covariance Matrix Estimators | 24016
49 | leaps | regression subset selection | 23666
50 | gplots | Various R programming tools for plotting data | 23251
51 | abind | Combine multi-dimensional arrays | 22758
52 | randomForest | Breiman and Cutler’s random forests for classification and regression | 22401
53 | Rcmdr | R Commander | 22131
54 | coda | Output analysis and diagnostics for MCMC | 21900
55 | maps | Draw Geographical Maps | 21550
56 | igraph | Network analysis and visualization | 21423
57 | formatR | Format R Code Automatically | 21049
58 | maptools | Tools for reading and handling spatial objects | 20957
59 | RSQLite | SQLite interface for R | 19671
60 | psych | Procedures for Psychological, Psychometric, and Personality Research | 19545
61 | KernSmooth | Functions for kernel smoothing for Wand & Jones (1995) | 19166
62 | rgdal | Bindings for the Geospatial Data Abstraction Library | 19064
63 | RcppArmadillo | Rcpp integration for Armadillo templated linear algebra library | 18899
64 | effects | Effect Displays for Linear, Generalized Linear, Multinomial-Logit, Proportional-Odds Logit Models and Mixed-Effects Models | 18843
65 | sem | Structural Equation Models | 18711
66 | vcd | Visualizing Categorical Data | 18589
67 | XLConnect | Excel Connector for R | 18230
68 | markdown | Markdown rendering for R | 18211
69 | timeSeries | Rmetrics – Financial Time Series Objects | 17932
70 | timeDate | Rmetrics – Chronological and Calendar Objects | 17838
71 | RJSONIO | Serialize R objects to JSON, JavaScript Object Notation | 17801
72 | cluster | Cluster Analysis Extended Rousseeuw et al | 17136
73 | scatterplot3d | 3D Scatter Plot | 17110
74 | nnet | Feed-forward Neural Networks and Multinomial Log-Linear Models | 17074
75 | fBasics | Rmetrics – Markets and Basic Statistics | 16278
76 | forecast | Forecasting functions for time series and linear models | 15638
77 | quantreg | Quantile Regression | 15509
78 | foreach | Foreach looping construct for R | 15405
79 | chron | Chronological objects which can handle dates and times | 15226
80 | plotrix | Various plotting functions | 15142
81 | matrixcalc | Collection of functions for matrix calculations | 15107
82 | aplpack | Another Plot PACKage: stem.leaf, bagplot, faces, spin3R, and some slider functions | 14654
83 | strucchange | Testing, Monitoring, and Dating Structural Changes | 14503
84 | iterators | Iterator construct for R | 14449
85 | mgcv | Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation | 14186
86 | kernlab | Kernel-based Machine Learning Lab | 14135
87 | SparseM | Sparse Linear Algebra | 13921
88 | tree | Classification and regression trees | 13871
89 | robustbase | Basic Robust Statistics | 13778
90 | vegan | Community Ecology Package | 13686
91 | devtools | Tools to make developing R code easier | 13488
92 | latticeExtra | Extra Graphical Utilities Based on Lattice | 13253
93 | modeltools | Tools and Classes for Statistical Models | 13233
94 | xlsx | Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files | 13097
95 | slam | Sparse Lightweight Arrays and Matrices | 13060
96 | TTR | Technical Trading Rules | 12894
97 | quantmod | Quantitative Financial Modelling Framework | 12892
98 | relimp | Relative Contribution of Effects in a Regression Model | 12692
99 | akima | Interpolation of irregularly spaced data | 12680
100 | memoise | Memoise functions | 12600
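A ranking like the one in the table comes down to counting log rows per package and sorting the counts. Here is a minimal sketch on a toy log; the column name package and the data are assumptions for illustration, and with the real logs one would ask for the top 100 rather than the top 2.

```r
# Toy log: one row per download.
# ASSUMPTION: the column name 'package' and the data are invented for illustration.
toy_log <- data.frame(
  package = c("plyr", "plyr", "plyr", "digest", "digest", "zoo"),
  stringsAsFactors = FALSE
)

# Count downloads per package and sort from most to least downloaded
downloads <- sort(table(toy_log$package), decreasing = TRUE)

# Keep the top n packages as a small table
top_n <- head(as.data.frame(downloads, responseName = "Downloads",
                            stringsAsFactors = FALSE), 2)
names(top_n)[1] <- "Package"
print(top_n)
```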

R code

I hope you found this post useful, and will find new ways of using this interesting dataset. Note that there are open questions about how well these numbers represent the “truth” (they cover only one CRAN mirror), but for now, they are the most interesting estimate of it that I know of.

 

# get the latest installr package:
if (!require('devtools')) install.packages('devtools'); require('devtools')
install_github('talgalili/installr')
require(installr)
 
# read the data (this will take a LOOOONG time)
RStudio_CRAN_data_folder <- […]
[…] > 0)
mode(package_ip_id) <- "numeric"
dend_package_ip_id

P.S.: This post is a follow-up to my discovering, two days ago, how many people use my R package.

  • Bob Muenchen

    Tal,

    Nice work! I’ve dreamed of having a decent count of R users to add to The Popularity of Data Analysis Software (http://bit.ly/statpop) since I first wrote it. As use of RStudio’s CRAN grows, I’ll finally have that and this wonderful list of packages as well. The package list will be a great help in optimizing the order in which I learn new packages.

    Thanks!
    Bob

  • Gavin Simpson

    An important point to note is that these are just for downloads off the RStudio CRAN mirror and there are a *lot* of other mirrors to choose from that might be used instead of their service.

    • http://www.r-statistics.com/ Tal Galili

      Hi Gavin,

      I felt I had been clear about it when I wrote that this is the data of only one CRAN mirror. But maybe I should clarify that a bit further – thanks.

  • Felix Schönbrodt

    The computation of package distances is a very nice idea! I agree, this analysis could be pushed further towards a recommendation engine.

    • http://www.r-statistics.com/ Tal Galili

      Thanks Felix,
      I hope to play with it, or see others playing with it some more…

  • Jan van der Laan

    You mention the peak on the 23rd of January for digest. This probably has to do with a new version which was released on the 21st. With the 21st being in the weekend and the time it takes for the package to reach the mirror, the peak is probably package updates. You also see a (smaller) peak around the 16th of February.

    So the downloads are also related to the release of new versions of the package and the release of new R versions (e.g. the ‘anomaly’ around the 3rd of April).

    • http://www.r-statistics.com/ Tal Galili

      Hi Jan van der Laan,
      Good points – thanks for mentioning them :)

  • Paolo

    Compliments! Very interesting. I liked your graph; can you give us the code?

    vpaolo@yahoo.com

  • sadffdfd eee

    Commas would be nice.

    • http://www.r-statistics.com/ Tal Galili

      Commas where?
