A speed test comparison of plyr, data.table, and dplyr

Guest post by Jake Russ

For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.

plyr is extremely efficient and user friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative screwdrivers to use.

I asked for some help on the manipulatr Google group, and their feedback led me to data.table and dplyr, a new, still in-progress package by Hadley.

What follows is a speed comparison of these three packages, incorporating all the feedback from the manipulatr folks. They found it informative, so Tal asked me to write it up as a reproducible example.

Comparison of ave, ddply and data.table

A guest post by Paul Hiemstra.
————

Fortran and C programmers often say that interpreted languages like R are nice and all, but lack in terms of speed. How fast something works in R depends greatly on how it is implemented, i.e. which packages and functions one uses. A prime example, which shows up regularly on the R-help list, is letting a vector grow as you perform an analysis. In pseudo-code this might look like:

dum = NULL
for(i in 1:100000) {
   new_outcome = i^2   # placeholder for "...do some stuff..."
   dum = c(dum, new_outcome)   # dum is copied into a longer vector on every pass
}

The problem here is that dum is continuously growing in size. Each c() call makes R allocate memory for a new, longer object and copy the old contents into it, which is terribly slow. Preallocating dum to its final length greatly improves performance. Alternatively, using the apply family of functions, or functions from the plyr package, avoids this kind of problem. But even among more advanced methods there are large performance differences between implementations.
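
As a minimal sketch of the difference (the function names grow, prealloc and applied are made up for this illustration, and i^2 merely stands in for the real computation), the three approaches can be written and timed like this:

```r
n <- 1e4

# Growing: every c() call builds a new, longer vector and copies the old values.
grow <- function(n) {
  dum <- NULL
  for (i in seq_len(n)) {
    dum <- c(dum, i^2)   # i^2 stands in for "new_outcome"
  }
  dum
}

# Preallocating: the vector is created once at full length and then filled.
prealloc <- function(n) {
  dum <- numeric(n)
  for (i in seq_len(n)) {
    dum[i] <- i^2
  }
  dum
}

# apply-style: vapply() handles the allocation of the result itself.
applied <- function(n) vapply(seq_len(n), function(i) i^2, numeric(1))

# All three return the same vector; only the running time differs.
system.time(grow(n))
system.time(prealloc(n))
system.time(applied(n))
```

The gap between grow() and the other two widens quickly as n increases, because the total amount of copying grows roughly with the square of n.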

Take the next example. We create a dataset which has two columns, one with values (e.g. amount of rainfall) and the other with a category (e.g. monitoring station id). We would like to know the mean value per category. One way is to use for loops, but I’ll skip that one for now. Three possibilities exist that I know of: ddply (plyr), ave (base R) and data.table. The piece of code at the end of this post compares these three methods. The outcome in terms of speed (elapsed time in seconds) is:

   datsize noClasses  tave tddply tdata.table
1    1e+05        10 0.091  0.035       0.011
2    1e+05        50 0.102  0.050       0.012
3    1e+05       100 0.105  0.065       0.012
4    1e+05       200 0.109  0.101       0.010
5    1e+05       500 0.113  0.248       0.012
6    1e+05      1000 0.123  0.438       0.012
7    1e+05      2500 0.146  0.956       0.013
8    1e+05     10000 0.251  3.525       0.020
9    1e+06        10 0.905  0.393       0.101
10   1e+06        50 1.003  0.473       0.100
11   1e+06       100 1.036  0.579       0.105
12   1e+06       200 1.052  0.826       0.106
13   1e+06       500 1.079  1.508       0.109
14   1e+06      1000 1.092  2.652       0.111
15   1e+06      2500 1.167  6.051       0.117
16   1e+06     10000 1.338 23.224       0.132
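
For readers who have not used these functions, here is a small base-R sketch of what a "mean per category" looks like on toy data (the data frame d and its values are invented for this illustration). Note that ave() returns one value per row, whereas the ddply and data.table calls return one row per group; tapply() is used here only to show that per-group shape without extra packages.

```r
# Toy data: six rainfall values from three monitoring stations.
d <- data.frame(value = c(1, 3, 2, 4, 10, 20),
                cat   = c("a", "a", "b", "b", "c", "c"))

# ave() gives one value per *row*: the group mean repeated for each member.
per_row <- with(d, ave(value, cat))

# tapply() gives one value per *group*. The plyr and data.table equivalents,
#   ddply(d, .(cat), summarise, value = mean(value))
#   data.table(d)[, mean(value), by = cat]
# likewise return one row per group.
per_group <- with(d, tapply(value, cat, mean))

per_row
per_group
```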

It is quite obvious that ddply performs very badly when the number of unique categories is large. The ave function performs better. However, the data.table option is by far the best one, easily outperforming both alternatives. Hadley Wickham (author of plyr) responded to this:

This is a drawback of the way that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), but I’m still thinking about how to overcome this fundamental limitation of the ddply approach.
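
A minimal sketch of the two ddply styles Hadley contrasts, on invented toy data and assuming plyr is installed; whether the second call is exactly the replacement he had in mind is my assumption:

```r
library(plyr)

# Toy data, made up for this illustration.
d <- data.frame(value = c(1, 3, 2, 4, 10, 20),
                cat   = c("a", "a", "b", "b", "c", "c"))

# Building a data.frame by hand for every group (the slower pattern).
res_df <- ddply(d, .(cat), function(g) data.frame(value = mean(g$value)))

# Letting summarise() compute the per-group summary instead.
res_sum <- ddply(d, .(cat), summarise, value = mean(value))

res_df
res_sum
```

Both calls return the same one-row-per-category data frame; the difference only matters for speed when there are many groups.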

I hope this comparison is of use to readers. And remember, think before complaining that R is slow :) .

Paul (e-mail: [email protected])

ps This blogpost is based on discussions on the R-help and manipulatr mailing lists:
http://www.mail-archive.com/[email protected]/msg142797.html
http://groups.google.com/group/manipulatr/browse_thread/thread/5e8dfed85048df99

R code to perform the comparison

library(plyr)       # ddply
library(ggplot2)
library(data.table)
theme_set(theme_bw())
datsize = c(1e5, 1e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 1e4)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  # Random values, each with a random category label between 0 and noClasses
  expdata = data.frame(value = runif(x$datsize),
                      cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)

  # Time the per-category mean with each method (t[3] is elapsed seconds)
  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  # mean() on a whole data frame no longer works in current R,
  # so take the mean of the value column explicitly
  t2 = system.time(res2 <- ddply(expdata, .(cat), function(d) mean(d$value)))
  t3 = system.time(res3 <- expdataDT[, mean(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')

res

# melt() comes from reshape2; the trailing "+" keeps the plot call in one expression
ggplot(reshape2::melt(res, id.vars = c("datsize", "noClasses")),
       aes(x = noClasses, y = log(value), color = variable)) +
  facet_wrap(~ datsize) +
  geom_line()

Article about plyr published in JSS, and the citation was added to the new plyr (version 1.5)

The plyr package (by Hadley Wickham) is one of the few R packages that I can claim to have used in all of my statistical projects. So whenever a new version of plyr comes out I tend to be excited about it (as I was when version 1.2 came out with support for parallel processing).

So it is no surprise that the new release of plyr 1.5 got me curious. While going through the news file with the new features and bug fixes, I noticed that Hadley had also (quietly) released another version of plyr, numbered 1.4.1, six days ago, prior to 1.5. That version included only one addition, but a very important one: a new citation reference to use when citing the plyr package. Here is how to use it:

install.packages("plyr") # to upgrade to the latest release
citation("plyr")

The output gives both a simple text version and a BibTeX entry for LaTeX users. Here it is (note the link, where you can download the article and read it yourself):

To cite plyr in publications use:
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data
Analysis. Journal of Statistical Software, 40(1), 1-29. URL
http://www.jstatsoft.org/v40/i01/.
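
If you only want the BibTeX part, toBibtex() extracts it from the citation object. The sketch below calls citation() without a package name, which cites R itself, so that it runs even where plyr is not installed; citation("plyr") can be passed through toBibtex() in the same way.

```r
# citation() with no argument returns the citation for R itself.
cit <- citation()

# toBibtex() converts the citation object into BibTeX lines.
bib <- toBibtex(cit)
print(bib)
```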

I hope more R contributors and users will make use of the citation() function (see ?citation) in the future.

Using the {plyr} (1.2) package parallel processing backend with windows

Hadley Wickham has just announced the release of a new R package “reshape2” which is (as Hadley wrote) “a reboot of the reshape package”. Alongside it, Hadley announced the release of plyr 1.2.1 (now faster and with support for parallel computation!).
Both releases are exciting because of the significant speed increase they have gained.

Yet in the case of the new plyr package, an even more interesting addition is the parallel processing backend.

    Reminder: what the `plyr` package is all about

    (as written in Hadley’s announcement)

    plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

    • fit the same model to each patient subset of a data frame
    • quickly calculate summary statistics for each group
    • perform group-wise transformations like scaling or standardising

    It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

    • totally consistent names, arguments and outputs
    • convenient parallelisation through the foreach package
    • input from and output to data.frames, matrices and lists
    • progress bars to keep track of long running operations
    • built-in error recovery, and informative error messages
    • labels that are maintained across all transformations

    Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

    You can find out more at http://had.co.nz/plyr/, including a 20 page introductory guide, http://had.co.nz/plyr/plyr-intro.pdf.  You can ask questions about plyr (and data-manipulation in general) on the plyr mailing list. Sign up at http://groups.google.com/group/manipulatr

    What’s new in `plyr` (1.2.1)

    The exciting news about the release of the new plyr version is the added support for parallel processing.

    l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE, applies functions in parallel using a parallel backend registered with the foreach package.

    The new package also has some minor changes and bug fixes, all can be read here.

    In his original announcement, Hadley gave an example of using the new parallel backend with the doMC package for Unix/Linux.  On Windows (the OS I’m using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently released only for “REvolution R” and not yet for R 2.11 (see more about it here).  Thanks to the kind help of Tao Shi, though, there is a solution for Windows users who want a parallel processing backend for plyr.

    All you need is to install the doSMP package, according to the instructions in the post “Parallel Multicore Processing with R (on Windows)“, and then use it like this:


    require(plyr) # make sure you have 1.2 or later installed
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.1)
    system.time(llply(x, wait))
    #  user  system elapsed
    #     0       0       2
    require(doSMP)
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers)
    system.time(llply(x, wait, .parallel = TRUE))
    #  user  system elapsed
    #  0.09    0.00    1.11

    Update (03.09.2012): the above code will no longer work with updated versions of R (R 2.15 etc.)

    Trying to run it will result in the error message:

    Loading required package: doSMP
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
      there is no package called ‘doSMP’
    

    And trying to install the package will give the error message:

    > install.packages("doSMP")
    Installing package(s) into ‘D:/R/library’
    (as ‘lib’ is unspecified)
    Warning message:
    package ‘doSMP’ is not available (for R version 2.15.0)
    

    You can fix this by replacing the {doSMP} package with the {doParallel}+{foreach} packages. Here is how:

    if(!require(foreach)) install.packages("foreach")
    if(!require(doParallel)) install.packages("doParallel")
    # require(doSMP) # will no longer work...
    library(foreach)
    library(doParallel)
    workers <- makeCluster(2) # My computer has 2 cores
    registerDoParallel(workers)
    
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.3)
    system.time(llply(x, wait)) # 6 sec
    system.time(llply(x, wait, .parallel = TRUE)) # 3.53 sec
    

    New versions for ggplot2 (0.8.8) and plyr (1.0) were released today

    CRAN hosts a great many packages, but a few stand out for their widespread use (and quality). Hadley Wickham’s ggplot2 and plyr are two such packages.
    And today (through twitter) Hadley has updated the rest of us with the news:

    just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats

    Going to the CRAN website shows that plyr has gone through the bigger update; its last update (before the current one) took place on 2009-06-23. Now, over a year later, we are presented with plyr version 1, which includes new functions, new features, some bug fixes and much-anticipated speed improvements.
    ggplot2 has made a small step from version 0.8.7 to 0.8.8; it was previously updated on 2010-03-03.

    I, and I am sure many other R users, am very thankful for the amazing work that Hadley Wickham is doing (both on his code, and in helping other useRs on the help lists). So Hadley, thank you!

    Here is the complete change-log list for both packages:
    Continue reading “New versions for ggplot2 (0.8.8) and plyr (1.0) were released today”