R-statistics blog

Top 100 R packages for 2013 (Jan-May)!

What are the top 100 (most downloaded) R packages in 2013? Thanks to RStudio's recent release of the log files from their “0-cloud” CRAN mirror (which do not include downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for the months of January to May)!

By relying on the nice code that Felix Schonbrodt recently wrote for tracking package downloads, I have updated my installr R package with functions that enable the user to easily download and visualize the popularity of R packages over time. In this post I will share some nice plots and quick insights that can be drawn from this great data. The code for this analysis is given at the end of this post.

Top 8 most downloaded R packages – downloads over time

Let’s first have a look at the number of downloads per day over these 5 months for the top 8 most downloaded packages:

We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having far fewer downloads than other days. This is not surprising, since we know that the countries that use R the most have these days as rest days (see James Cheshire’s world map of R users). It is also interesting to note how some packages had exceptional peaks on some dates. For example, I wonder what happened on January 23rd 2013 that the digest package suddenly got so many downloads, or why colorspace started getting more downloads from April 15th 2013.
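To make the weekday effect concrete, here is a minimal sketch of how it could be measured. It runs on simulated data, not on the real logs: the `logs` data frame and its volumes (100 downloads on weekdays, 40 on weekends) are made up for illustration, and only a `date` column is assumed.

```r
# Toy stand-in for the log data: one row per download, with a `date` column.
# Four full weeks, starting on Monday 2013-01-07.
days <- seq(as.Date("2013-01-07"), as.Date("2013-02-03"), by = "day")

# Simulated daily volumes: fewer downloads on Saturday and Sunday.
# %u gives the ISO weekday as a number: 1 = Monday ... 7 = Sunday.
is_weekend <- format(days, "%u") %in% c("6", "7")
n_per_day  <- ifelse(is_weekend, 40, 100)
logs <- data.frame(date = rep(days, times = n_per_day))

# Count downloads per day, then average the daily counts per weekday.
daily <- as.data.frame(table(date = logs$date), stringsAsFactors = FALSE)
daily$date    <- as.Date(daily$date)
daily$weekday <- as.integer(format(daily$date, "%u"))
weekday_mean  <- tapply(daily$Freq, daily$weekday, mean)
print(weekday_mean)
```

On real data, the same `tapply()` step applied per package would show whether a package follows the general weekly pattern or has peaks of its own.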

“Family tree” of the top 100 most downloaded R packages

We can extract from this data the top 100 most downloaded R packages. Moreover, we can create a matrix showing, for each package, which of our unique ids (censored IP addresses) has downloaded it. Using this indicator matrix, we can think of the “similarity” (or distance) between every pair of packages, and based on that we can create a hierarchical clustering of the packages, showing which packages “go along” with one another.
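Here is a small sketch of that idea on a toy log with four packages and five ids. The data are invented, and the "binary" distance is only one reasonable choice for 0/1 data; I am not claiming it is the distance used for the dendrogram in this post.

```r
# Toy download log: each row is one download by one (anonymised) id.
logs <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5),
  package = c("ggplot2", "plyr", "plyr", "ggplot2", "plyr", "zoo", "xts",
              "zoo", "xts", "ggplot2", "zoo", "xts"),
  stringsAsFactors = FALSE
)

# Indicator matrix: rows = packages, columns = ids;
# 1 if that id downloaded the package at least once, 0 otherwise.
package_ip_id <- table(logs$package, logs$ip_id) > 0
mode(package_ip_id) <- "numeric"

# Distance between packages, then hierarchical clustering.
# method = "binary" is an assumption made for this sketch.
d    <- dist(package_ip_id, method = "binary")
hc   <- hclust(d)
dend <- as.dendrogram(hc)

# Cut the tree into two groups to see which packages "go along" together.
groups <- cutree(hc, k = 2)
print(groups)
# plot(dend)  # draws the "family tree"
```

In this toy example ggplot2 and plyr share most of their ids, as do xts and zoo, so the two pairs end up in separate branches.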

With this analysis, you can locate a package on the list that you often use, and then see which other packages are “related” to it. If you don’t know one of those related packages, consider having a look at it, since other R users are clearly finding the two packages to be “of use” together.

Such analysis can (and should!) be extended. For example, we can imagine creating a “suggest a package” feature based on this data, utilizing the packages that you use, the OS that you use, and other parameters. But such coding is beyond the scope of this post.
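Just to hint at what the simplest version of such a feature might look like, here is a rough sketch that ranks packages by how many ids they share with a given package. The matrix is a toy example, `suggest_packages()` is a hypothetical helper (it is not part of installr), and the OS and other parameters are ignored.

```r
# Toy indicator matrix: rows = packages, columns = ids (1 = downloaded).
m <- rbind(
  ggplot2 = c(1, 1, 0, 1, 0),
  plyr    = c(1, 1, 0, 0, 0),
  xts     = c(0, 0, 1, 1, 1),
  zoo     = c(0, 0, 1, 1, 1)
)

# Hypothetical helper: suggest the n packages that share the most ids with pkg.
suggest_packages <- function(m, pkg, n = 2) {
  shared <- tcrossprod(m)[pkg, ]          # number of ids shared with every package
  shared <- shared[names(shared) != pkg]  # drop the package itself
  names(sort(shared, decreasing = TRUE))[seq_len(min(n, length(shared)))]
}

print(suggest_packages(m, "plyr", n = 1))
print(suggest_packages(m, "zoo", n = 2))
```

Raw co-download counts favour packages that everybody downloads, so a real feature would probably need some normalisation by overall popularity.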

Here is the “family tree” (dendrogram) of related packages:

To make it easier to navigate, here is a table of the top 100 R packages, with their titles and download counts:

Rank Package Title Downloads
1 plyr Tools for splitting, applying and combining data 84049
2 digest Create cryptographic hash digests of R objects 83192
3 ggplot2 An implementation of the Grammar of Graphics 82768
4 colorspace Color Space Manipulation 81901
5 stringr Make it easier to work with strings 77658
6 RColorBrewer ColorBrewer palettes 66783
7 reshape2 Flexibly reshape data: a reboot of the reshape package 64911
8 zoo S3 Infrastructure for Regular and Irregular Time Series (Z’s ordered observations) 60844
9 proto Prototype object-based programming 59043
10 scales Scale functions for graphics 58369
11 car Companion to Applied Regression 57453
12 dichromat Color Schemes for Dichromats 56624
13 gtable Arrange grobs in tables 54431
14 munsell Munsell colour system 53183
15 labeling Axis Labeling 51877
16 Hmisc Harrell Miscellaneous 47836
17 rJava Low-level R to Java interface 47731
18 mvtnorm Multivariate Normal and t Distributions 46884
19 bitops Bitwise Operations 45689
20 rgl 3D visualization device system (OpenGL) 41001
21 foreign Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, .. 37849
22 XML Tools for parsing and generating XML within R and S-Plus 37153
23 lattice Lattice Graphics 36597
24 e1071 Misc Functions of the Department of Statistics (e1071), TU Wien 35180
25 gtools Various R programming tools 35028
26 sp classes and methods for spatial data 34786
27 gdata Various R programming tools for data manipulation 34262
28 Rcpp Seamless R and C++ Integration 33929
29 MASS Support Functions and Datasets for Venables and Ripley’s MASS 33667
30 Matrix Sparse and Dense Matrix Classes and Methods 30740
31 lmtest Testing Linear Regression Models 30319
32 survival Survival Analysis 30186
33 caTools Tools: moving window statistics, GIF, Base64, ROC AUC, etc 29945
34 multcomp Simultaneous Inference in General Parametric Models 29871
35 RCurl General network (HTTP/FTP/…) client interface for R 28866
36 knitr A general-purpose package for dynamic report generation in R 28104
37 xtable Export tables to LaTeX or HTML 28091
38 xts eXtensible Time Series 28058
39 rpart Recursive Partitioning 27812
40 evaluate Parsing and evaluation tools that provide more details than the default 27617
41 RODBC ODBC Database Access 26131
42 quadprog Functions to solve Quadratic Programming Problems 25433
43 tseries Time series analysis and computational finance 25144
44 DBI R Database Interface 24793
45 nlme Linear and Nonlinear Mixed Effects Models 24360
46 lme4 Linear mixed-effects models using S4 classes 24199
47 reshape Flexibly reshape data 24118
48 sandwich Robust Covariance Matrix Estimators 24016
49 leaps regression subset selection 23666
50 gplots Various R programming tools for plotting data 23251
51 abind Combine multi-dimensional arrays 22758
52 randomForest Breiman and Cutler’s random forests for classification and regression 22401
53 Rcmdr R Commander 22131
54 coda Output analysis and diagnostics for MCMC 21900
55 maps Draw Geographical Maps 21550
56 igraph Network analysis and visualization 21423
57 formatR Format R Code Automatically 21049
58 maptools Tools for reading and handling spatial objects 20957
59 RSQLite SQLite interface for R 19671
60 psych Procedures for Psychological, Psychometric, and Personality Research 19545
61 KernSmooth Functions for kernel smoothing for Wand & Jones (1995) 19166
62 rgdal Bindings for the Geospatial Data Abstraction Library 19064
63 RcppArmadillo Rcpp integration for Armadillo templated linear algebra library 18899
64 effects Effect Displays for Linear, Generalized Linear, Multinomial-Logit, Proportional-Odds Logit Models and Mixed-Effects Models 18843
65 sem Structural Equation Models 18711
66 vcd Visualizing Categorical Data 18589
67 XLConnect Excel Connector for R 18230
68 markdown Markdown rendering for R 18211
69 timeSeries Rmetrics – Financial Time Series Objects 17932
70 timeDate Rmetrics – Chronological and Calendar Objects 17838
71 RJSONIO Serialize R objects to JSON, JavaScript Object Notation 17801
72 cluster Cluster Analysis Extended Rousseeuw et al 17136
73 scatterplot3d 3D Scatter Plot 17110
74 nnet Feed-forward Neural Networks and Multinomial Log-Linear Models 17074
75 fBasics Rmetrics – Markets and Basic Statistics 16278
76 forecast Forecasting functions for time series and linear models 15638
77 quantreg Quantile Regression 15509
78 foreach Foreach looping construct for R 15405
79 chron Chronological objects which can handle dates and times 15226
80 plotrix Various plotting functions 15142
81 matrixcalc Collection of functions for matrix calculations 15107
82 aplpack Another Plot PACKage: stem.leaf, bagplot, faces, spin3R, and some slider functions 14654
83 strucchange Testing, Monitoring, and Dating Structural Changes 14503
84 iterators Iterator construct for R 14449
85 mgcv Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation 14186
86 kernlab Kernel-based Machine Learning Lab 14135
87 SparseM Sparse Linear Algebra 13921
88 tree Classification and regression trees 13871
89 robustbase Basic Robust Statistics 13778
90 vegan Community Ecology Package 13686
91 devtools Tools to make developing R code easier 13488
92 latticeExtra Extra Graphical Utilities Based on Lattice 13253
93 modeltools Tools and Classes for Statistical Models 13233
94 xlsx Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files 13097
95 slam Sparse Lightweight Arrays and Matrices 13060
96 TTR Technical Trading Rules 12894
97 quantmod Quantitative Financial Modelling Framework 12892
98 relimp Relative Contribution of Effects in a Regression Model 12692
99 akima Interpolation of irregularly spaced data 12680
100 memoise Memoise functions 12600

R code

I hope you found this post useful, and that you will find new ways of using this interesting dataset. Note that it is an open question how well these numbers represent the “truth” (they come from a single CRAN mirror), but for now they are the most interesting estimate of it that I know of.

 

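# get the latest installr package:
if (!require('devtools')) install.packages('devtools'); require('devtools')
install_github('talgalili/installr')
require(installr)

# read the data (this will take a LOOOONG time)
# NOTE: the right-hand sides below were lost from the original listing; they are
# a reconstruction (assumed date range, assumed use of installr's log helpers).
RStudio_CRAN_data_folder <- download_RStudio_CRAN_data(START = '2013-01-01', END = '2013-05-31')
my_RStudio_CRAN_data <- format_RStudio_CRAN_data(read_RStudio_CRAN_data(RStudio_CRAN_data_folder))

# indicator matrix (package x unique id) for the top 100 packages, then cluster
top_100 <- names(tail(sort(table(my_RStudio_CRAN_data$package)), 100))
in_top <- my_RStudio_CRAN_data$package %in% top_100
package_ip_id <- (table(as.character(my_RStudio_CRAN_data$package[in_top]), my_RStudio_CRAN_data$ip_id[in_top]) > 0)
mode(package_ip_id) <- "numeric"
dend_package_ip_id <- as.dendrogram(hclust(dist(package_ip_id)))
plot(dend_package_ip_id)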

P.S.: This post is a follow-up to my discovering, two days ago, how many people use my R package.
