A speed test comparison of plyr, data.table, and dplyr

Guest post by Jake Russ

For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.

plyr is extremely user friendly and efficient for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know what alternative tool to reach for.

I asked for some help on the manipulator Google group, and their feedback led me to data.table and to dplyr, a new, still-in-progress package by Hadley.

What follows is a speed comparison of these three packages incorporating all the feedback from the manipulator folks. They found it informative, so Tal asked me to write it up as a reproducible example.


Let’s start by making a data frame which fits my description above, but make it reproducible:

set.seed(42)
 
types <- c("A", "B", "C", "D", "E", "F")
 
obs <- 4e+07
 
one <- data.frame(id = as.factor(seq(from = 1, to = 80000, by = 1)), percent = round(runif(obs, 
    min = 0, max = 1), digits = 2), type = as.factor(sample(types, obs, replace = TRUE)))
 
print(object.size(one), units = "GB")
## 0.6 Gb
summary(one)
##        id              percent     type       
##  1      :     500   Min.   :0.00   A:6672132  
##  2      :     500   1st Qu.:0.25   B:6663570  
##  3      :     500   Median :0.50   C:6668009  
##  4      :     500   Mean   :0.50   D:6668684  
##  5      :     500   3rd Qu.:0.75   E:6660437  
##  6      :     500   Max.   :1.00   F:6667168  
##  (Other):39997000

I’ll start the testing with plyr, using ddply, but I’ll also show the difference between subsetting a data frame from within a ddply call and doing the subset first, outside the call. Then I offer a third way, using plyr’s count function, to achieve the same result.

library(plyr)
 
## Test 1 (plyr): Use ddply and subset one with [ ] style indexing from
## within the ddply call.
 
typeSubset <- c("A", "C", "E")
 
system.time(test1 <- ddply(one[one$type %in% typeSubset, ], .(id), summarise, 
    percent_total = sum(percent)))
##    user  system elapsed 
##  104.51   21.23  125.81
## Test 2 (plyr): Use ddply but subset one outside of the ddply call
 
two <- subset(one, type %in% typeSubset)
 
system.time(test2 <- ddply(two, .(id), summarise, percent_total = sum(percent)))
##    user  system elapsed 
##  101.20   46.14  147.64
## Test 3 (plyr): For a simple sum, an alternative is to use plyr's count
## function
 
system.time(test3 <- count(two, "id", "percent"))
##    user  system elapsed 
##    5.90    0.22    6.12

Doing the subset outside of the ddply call did speed things up, but not as much as I originally thought it would. For my particular project, doing the subset outside of the ddply call reduced the run time to 12 hours. So largely this is still a “wrong tool” problem, rather than a “when to subset” problem.

Next, I’ll try data.table. For this test, and for the dplyr one below, I’ll operate on the data frame that has already been subset:

library(data.table)
 
## Test 4 (data.table): Speed test for package data.table
 
## Define the data table
three <- data.table(two, key = c("id"))
 
tables()  # check that the key columns are correct
##      NAME        NROW  MB COLS            KEY
## [1,] three 20,000,578 310 id,percent,type id 
## Total: 310MB
## Operate on it
system.time(test4 <- three[, list(percent_total = sum(percent)), by = key(three)])
##    user  system elapsed 
##    0.17    0.01    0.19

dplyr is not currently available on CRAN but you can install it from github with:

devtools::install_github("hadley/assertthat")
devtools::install_github("hadley/dplyr")
library(dplyr)
 
## Test 5 (dplyr): Speed test for package dplyr
 
fourDf <- group_by(two, id)
 
system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))
##    user  system elapsed 
##    1.49    0.03    1.52
 
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.01       Rcpp_0.10.4      data.table_1.8.8 plyr_1.8        
## [5] knitr_1.4.1     
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.3   evaluate_0.4.7 formatR_0.9   
## [5] stringr_0.6.2  tools_3.0.1

Both data.table and dplyr were able to reduce the problem to just a few seconds. If you’re looking for pure speed, data.table is the clear winner. However, it is my understanding that data.table's syntax can be frustrating, so if you’re already used to the ‘Hadley ecosystem’ of packages, dplyr is a formidable alternative, even if it is still in the early stages.

Have a nice day!

  • hadley

    You can also do

    fiveDf <- group_by(as.data.table(two), id)
    system.time(test6 <- summarise(fiveDf, percent_total = sum(percent)))

    to use dplyr commands with a data table backend.

  • Matthew Dowle

    Can you explain in more detail when data.table syntax is frustrating? If what you mean is it has to be learnt, then sure. But what’s frustrating other than that?

    • hadley

      I think the challenge is that you overload operators to mean different things inside and outside of [. This means that this:

      dt[f(x)]

      is not equivalent to

      y <- f(x)
      dt[y]

      for different f. (e.g. !, unique)

      That makes it difficult to perform a complex operation a step at a time – you either achieve your goal totally or fail. SQL has a similar problem.
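
      As a rough illustration with a small made-up keyed table, the “not join” reading of ! only exists inside [:

      library(data.table)
      DT <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
      DT[!"a"]      # inside [ ]: not-join, i.e. all rows whose key is not "a"
      # y <- !"a"   # outside [ ]: plain logical negation, which errors on a string,
      # DT[y]       # so the two-step version cannot be written the same way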

      As per my tweet, dplyr has over 11 functions that replicate part of [.data.table. This makes dplyr considerably more verbose, but each function corresponds to a simple verb, so you can string together complicated operations through a combination of simple and explicit primitives, checking your results as you go. This doesn't necessarily come at a performance cost, since you could do everything lazily and only force computation when the user needs results.
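
      As a sketch of that step-at-a-time style on the data frame one from the post (the intermediate names are made up):

      library(dplyr)
      step1 <- filter(one, type %in% c("A", "C", "E"))          # verb 1: keep only the wanted types
      step2 <- group_by(step1, id)                              # verb 2: declare the grouping
      test  <- summarise(step2, percent_total = sum(percent))   # verb 3: one row per group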

      To be clear, I don't think this makes dplyr "better" than data.table, we're just optimising different features.

      • Matt Dowle

        You’ve picked on one small corner of data.table here. I encourage readers of this article and these comments to take a look at data.table and decide for themselves. Perhaps start with the 19 independent reviews here:
        http://crantastic.org/packages/data-table

    • Jake Russ

      Matthew,

      I thank Hadley for taking the time to give a specific example, but as the author of the post, let me say that my statement was not referencing any particular problem with data.table. It was an offhanded summary of the feedback I received when I canvassed many folks for help with plyr. I admit I chose a poor way to phrase that, as it was not intended as a slight against data.table. I see now that it reads that way.

      I’m actually indebted to you, since the use of data.table reduced my particular issue, which spawned this post, to under 30 seconds, when my previous approach took 13 excruciating hours. For that, I thank you.

      • Matt Dowle

        Thanks for your comments, much appreciated. The trouble is that these offhand comments spread quickly on the internet. Trey Causey tweeted this article as “data.table wins, but syntax is suboptimal”, for example. For the record: the syntax of data.table is actually very simple and flexible. The essence of it is `DT[i, j, by]`. And you read it out loud as “from DT subset by ‘i’, then do ‘j’ grouped by ‘by'”.
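
        As a rough sketch on the article’s two data frame (the particular subset and grouping are only an example):

        library(data.table)
        DT <- as.data.table(two)
        DT[type == "A",                       # i: which rows   ("subset by i")
           .(percent_total = sum(percent)),   # j: what to do   ("then do j")
           by = id]                           # by: the groups  ("grouped by by")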

    • Dan Keshet

      One nit for me has been that data.table syntax (especially :=) is not well understood by StatET in Eclipse, which marks every instance as an error.

      • Matt Dowle

        Didn’t know that, thanks for highlighting. I’ve now asked Stephan if it’s at all possible to add := to the StatET definition.

  • Shannon Callan

    I just started doing a few tasks in data.table rather than plyr. It’s been working well and this article encourages me to explore data.table further. Thanks for sharing your work!

  • Sean Bergstedt

    I’ve experimented with data.table. As a novice user of the R language, I’m looking to find those packages worth investing time in learning as part of a collection of most recent R best practice modules. I’m new to the R analysis process and would prefer not to invest time (at least up front) in learning legacy functions especially if there are more efficient and intuitive alternatives. I want to cut to the chase with packages that ease the steepness of my learning curve, and I definitely see data.table as a candidate for serving as my primary tool kit for querying/reshaping/transforming data.

    However, with that said, some of the challenges of using the package were evident (1) in the way I need to manage versions of DTs, and (2) in formulating a generalized, standardized approach to slicing and dicing data into a desired format.

    On the first point, say I need to maintain different states of subset/queried data in separate tables. I struggle to see how I can fork data states if I’m constantly forced to accept changes to my original table outside of copying the data table before performing manipulation. Perhaps, there is a way to do this that I missed in the documentation?

    For the second point, why are there multiple ways to specify i, j, and k parameters? It’s confusing keeping track of how to use the API appropriately because it depends on my prior knowledge of much of R’s legacy functionality. As a new user, I see this and begin wondering what other simple, more patterned packages are available to achieve similar outcomes. The way I see it, the efficiency gains in processing can never be realized if I can’t figure out if I’m using the right “door”. Give me a single entry. Perhaps this is what other writers are referring to when they speak of “syntax frustration”.

    Outside of these areas, I’ve been very pleased with the package, as it does help me achieve the desired transformation. With some minor tweaks, I could definitely see this package being a dominant utility in generalized approaches to R statistical analysis.

    • Matt Dowle

      This is a bit wordy to be able to respond to. I’ll try but I’ll have to guess…
      On (1) I suspect you think that setkey() is necessary before grouping? It isn’t. Grouping doesn’t depend on anything prior or any state. You have a point if you want to binary search on different columns than the primary key, and we intend to add secondary keys (set2key) to address that. Unless you update by reference using :=, all queries return a new data.table, so I’m not sure what you mean.
      On (2) honestly I can’t guess what you mean here. What different ways of specifying i, j and k are there? If you know SQL then i=WHERE, j=SELECT, k=GROUP BY. That doesn’t vary. Have you run example(data.table) and worked through the examples at the prompt? That’s the quickest way to pick it up (by example) as suggested at the top of ?data.table.
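
      On (1), a minimal sketch of the copy-vs-reference distinction (the columns here are made up):

      library(data.table)
      DT  <- data.table(id = 1:3, v = c(10, 20, 30))
      agg <- DT[, .(total = sum(v))]   # an ordinary query returns a new table; DT is untouched
      DT2 <- copy(DT)                  # take an explicit copy to fork the data state
      DT2[, v := v * 2]                # := updates DT2 by reference; DT stays unchanged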

  • Jonas

    I reran the code in this article and now dplyr is two times faster than data.table. Did the dplyr package change substantially since September?

    • Tal Galili

      Hello Jonas,

      There has been a recent official announcement of the package:
      http://blog.rstudio.org/2014/01/17/introducing-dplyr/

      And it seems there has been a bunch of commits in the past several months:
      https://github.com/hadley/dplyr/graphs/commit-activity

      So the answer to your question is probably: yes.

    • Arun Srinivasan

      Hi Jonas, that’s not quite the case. The issue is that `data.table` does not require `setkey` to aggregate. However, `dplyr` requires `group_by` before `summarise`. So it’s not quite fair to measure *just* the time to `summarise`. We’ve pointed out these things quite nicely here, in case you’re interested: http://arunsrinivasan.github.io/dplyr_benchmark/

      Here is how it should be timed…

      **Note 1:** I also add the time to create the data.table, which will normally not be necessary because you can read the data in directly with `fread`.

      **Note 2:** I have compared data.table version 1.8.10 (CRAN) to the development version of dplyr (on github) here.

      # No key at all:
      system.time({
      three <- as.data.table(two)
      test4 <- three[, list(percent_total = sum(percent)), by = "id"]
      })

      This takes approx. 6.333 seconds on my laptop.

      library(dplyr)
      system.time({
      fourDf <- group_by(two, id)
      test5 <- summarise(fourDf, percent_total = sum(percent))
      })

      This takes 8.71 seconds.

  • Guest

    First of all I need to thank both the developers of dplyr and data.table. I use them both.

    data.table: for simple things where speed is more important than transparency
    dplyr: for the grammar when transparency is more important than speed
    (mostly when multiple operations are used)

    I wonder if something changed, as I can’t replicate the given example anymore:

    > ## Test 5 (dplyr): Speed test for package dplyr
    >
    > system.time(fourDf <- group_by(two, id))
    > system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))
    > str(test5)
    'data.frame': 1 obs. of 1 variable:
    $ percent_total: num 1e+07

    this should have been a data frame with 80,000 rows like test4

    > str(test4)
    Classes 'data.table' and 'data.frame': 80000 obs. of 2 variables:
    $ id : Factor w/ 80000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
    $ percent_total: num 138 116 124 118 119 ...
    - attr(*, "sorted")= chr "id"
    - attr(*, ".internal.selfref")=<externalptr>

    > sessionInfo()

    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)
    locale:
    [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    [3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
    [9] LC_ADDRESS=C LC_TELEPHONE=C
    [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods
    [7] base
    other attached packages:
    [1] data.table_1.8.10 plyr_1.8 dplyr_0.1.1

    loaded via a namespace (and not attached):
    [1] assertthat_0.1 Rcpp_0.10.6 tools_3.0.2

  • Floris Padt

    Just did some testing and am actually amazed by the performance of both data.table and dplyr on my home system. Performance is almost identical, and I am very grateful to both Hadley and Matthew. I hope that someday multiple cores can be used.

    The test was executed in sequence for data.table and dplyr, so both got equal CPU time. Although multiple CPUs were available, only one was used. The system has 64 GB of RAM and runs Ubuntu 13.10. Max memory used was 37 GB.

    install.packages("data.table")
    install.packages("dplyr")
    install.packages("microbenchmark")

    require(data.table)
    require(dplyr)
    require(microbenchmark)

    set.seed(42)

    types <- c("A", "B", "C", "D", "E", "F")
    obs <- 4e+08

    one <- data.frame(id = as.factor(seq(from = 1, to = 8e+05, by = 1)),
    percent = round(runif(obs, min = 0, max = 1), digits = 2),
    type = as.factor(sample(types, obs, replace = TRUE)))

    typeSubset <- c("A", "C", "E")

    > NROW(one)
    [1] 400000000

    test_dt <- function(){
    two <- subset(one, type %in% typeSubset)
    three <- data.table(two, key = c("id"))
    test4 <- three[, list(percent_total = sum(percent)),
    by = key(three)]
    }

    test_dp <- function(){
    two <- subset(one, type %in% typeSubset)
    four <- group_by(two, id)
    test5 <- four %.%
    summarise(percent_total = sum(percent))
    }

    m_dt <- microbenchmark(testdt = test_dt(), times = 2)
    m_dp <- microbenchmark(testdp = test_dp(), times = 2)

    > m_dt
    Unit: seconds
    expr min lq median uq max neval
    testdt 129.5227 129.5227 130.3159 131.109 131.109 2

    > m_dp
    Unit: seconds
    expr min lq median uq max neval
    testdp 133.1595 133.1595 134.7981 136.4368 136.4368 2
