A speed test comparison of plyr, data.table, and dplyr

ssssssspeed_521872450_d085d1e928

Guest post by Jake Russ

For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.

plyr is extremely efficient and user friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative screwdrivers to use.

I asked for some help on the manipulator Google group , and their feedback led me to data.table and dplyr, a new, and still in progress, package project by Hadley.

What follows is a speed comparison of these three packages incorporating all the feedback from the manipulator folks. They found it informative, so Tal asked me to write it up as a reproducible example.


Let’s start by making a data frame which fits my description above, but make it reproducible:

set.seed(42)
 
types <- c("A", "B", "C", "D", "E", "F")
 
obs <- 4e+07
 
one <- data.frame(id = as.factor(seq(from = 1, to = 80000, by = 1)), percent = round(runif(obs,
    min = 0, max = 1), digits = 2), type = as.factor(sample(types, obs, replace = TRUE)))
 
print(object.size(one), units = "GB")
## 0.6 Gb
summary(one)
##        id              percent     type
##  1      :     500   Min.   :0.00   A:6672132
##  2      :     500   1st Qu.:0.25   B:6663570
##  3      :     500   Median :0.50   C:6668009
##  4      :     500   Mean   :0.50   D:6668684
##  5      :     500   3rd Qu.:0.75   E:6660437
##  6      :     500   Max.   :1.00   F:6667168
##  (Other):39997000

I’ll start the testing with plyr, using ddply, but I’ll also show the difference between subsetting a data frame from within a ddply call and doing the subset first from outside the call. Then I offer a third way to use plyr‘s count function to achieve the same result.

library(plyr)
 
## Test 1 (plyr): Use ddply and subset one with [ ] style indexing from
## within the ddply call.
 
typeSubset <- c("A", "C", "E")
 
system.time(test1 <- ddply(one[one$type %in% typeSubset, ], .(id), summarise,
    percent_total = sum(percent)))
##    user  system elapsed
##  104.51   21.23  125.81
## Test 2 (plyr):, Use ddply but subset one outside of the ddply call
 
two <- subset(one, type %in% typeSubset)
 
system.time(test2 <- ddply(two, .(id), summarise, percent_total = sum(percent)))
##    user  system elapsed
##  101.20   46.14  147.64
## Test 3 (plyr): For a simple sum, an alternative is to use plyr's count
## function
 
system.time(test3 <- count(two, "id", "percent"))
##    user  system elapsed
##    5.90    0.22    6.12

Doing the subset outside of the ddply call did speed things up, but not as much I as orinially thought it would. For my particular project, doing the subset outside of the ddply call reduced the run time to 12 hours. So largely this is still a “wrong tool” problem, rather than a “when to subset” problem.

Next, I’ll try data.table and for this test and the dplyr one below I’ll operate on the data frame which has been pre-subset:

library(data.table)
 
## Test 4 (data.table): Speed test for package data.table
 
## Define the data table
three <- data.table(two, key = c("id"))
 
tables()  # check that the key columns are correct
##      NAME        NROW  MB COLS            KEY
## [1,] three 20,000,578 310 id,percent,type id
## Total: 310MB
## Operate on it
system.time(test4 <- three[, list(percent_total = sum(percent)), by = key(three)])
##    user  system elapsed
##    0.17    0.01    0.19

dplyr is not currently available on CRAN but you can install it from github with:

devtools::install_github("assertthat")
devtools::install_github("dplyr")
library(dplyr)
 
## Test 5 (dplyr): Speed test for package dplyr
 
fourDf <- group_by(two, id)
 
system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))
##    user  system elapsed
##    1.49    0.03    1.52
 
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] dplyr_0.01       Rcpp_0.10.4      data.table_1.8.8 plyr_1.8
## [5] knitr_1.4.1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.3   evaluate_0.4.7 formatR_0.9
## [5] stringr_0.6.2  tools_3.0.1

Both data.table and dplyr were able to reduce the problem to less than a few seconds. If you’re looking for pure speed data.table is the clear winner. However, it is my understanding that data.table‘s syntax can be frustrating, so if you’re already used to the ‘Hadley ecosystem’ of packages, dplyr is a formitable alternative, even if it is still in the early stages.

Have a nice day!

(Editor’s edition: image credit link)

27 thoughts on “A speed test comparison of plyr, data.table, and dplyr”

  1. You can also do

    fiveDf <- group_by(as.data.table(two), id)
    system.time(test6 <- summarise(fiveDf, percent_total = sum(percent)))

    to use dplyr commands with a data table backend.

  2. Can you explain in more detail when data.table syntax is frustrating? If what you mean is it has to be learnt, then sure. But what’s frustrating other than that?

    1. I think the challenge is that you overload operators to mean different things inside and outside of [. This means that this:

      dt[f(x)]

      is not equivalent to

      y <- f(x)
      df[y]

      for different f. (e.g. !, unique)

      That makes it difficult to perform a complex operation a step at a time – you either achieve your goal totally or fail. SQL has a similar problem.

      As per my tweet, dplyr has at over 11 functions that replicate part of [.data.table. This makes dplyr considerably more verbose, but each function corresponds to a simple verb, so you can string together complicated operations through a combination of simple and explicit primitives, checking your results as you go. This doesn't necessarily come at a performance cost, since you could do everything lazily and only force computation when the user needs results.

      To be clear, I don't think this makes dplyr "better" than data.table, we're just optimising different features.

    2. Matthew,

      I thank Hadley for taking the time to give a specific example, but as the author of the post, let me say that my statement was not referencing any particular problem with data.table. It was an offhanded summary of the feedback I received when I canvased many folks for help with plyr. I admit I chose a poor way to phrase that, as it was not intended as a slight of data.table. I see now that it reads that way.

      I’m actually indebted to you, since the use of data.table reduced my particular issue, which spawned this post, to under 30 seconds, when my previous approach took 13 excruciating hours. For that, I thank you.

      1. Thanks for your comments, much appreciated. Trouble is these off hand comments spread quickly on the internet. Trey Causey tweeted this article as “data.table wins, but syntax is suboptimal”, for example. For the record: the syntax of data.table is actually very simple and flexible. The essence of it is `DT[i, j, by]`. And you read it out loud as “from DT subset by ‘i’, then do ‘j’ grouped by ‘by'”.

    3. One nit for me has been that data.table syntax (especially :=) is not well-understood by StatET in eclipse, marking every instance as an error.

      1. Didn’t know that, thanks for highlighting. I’ve now asked Stephan if it’s at all possible to add := to the StatET definition.

  3. I just started doing a few tasks in data.table rather than plyr. It’s been working well and this article encourages me to explore data.table further. Thanks for sharing your work!

  4. I’ve experimented with data.table. As a novice user of the R language, I’m looking to find those packages worth investing time in learning as part of a collection of most recent R best practice modules. I’m new to the R analysis process and would prefer not to invest time (at least up front) in learning legacy functions especially if there are more efficient and intuitive alternatives. I want to cut to the chase with packages that ease the steepness of my learning curve, and I definitely see data.table as a candidate for serving as my primary tool kit for querying/reshaping/transforming data.

    However, with that said, some of the challenges of using the package were evident (1) in the way I need to manage versions of DTs, and (2) in formulating a generalized, standardized approach to slicing and dicing data into a desired format.

    On the first point, say I need to maintain different states of subset/queried data in separate tables. I struggle to see how I can fork data states if I’m constantly forced to accept changes to my original table outside of copying the data table before performing manipulation. Perhaps, there is a way to do this that I missed in the documentation?

    For the second point, why are there multiple ways to specify i, j, and k parameters? It’s confusing keeping track of how to use the API appropriately because it depends on my prior knowledge of much of R’s legacy functionality. As a new user, I see this and begin wondering what other simple, more patterned packages are available to achieve the similar outcomes. The way I see it, the efficiency gains in processing can never be realized if I can’t figure out if I’m using the right “door”. Give me a single entry. Perhaps this is what other writers are referring to when they speak of “syntax frustration”.

    Outside of these area, I’ve been very pleased with the package as it does help me achieve the desired transformation. With some minor tweaks, I could definitely see this package being a dominant utility in generalized approaches to R statistical analysis.

    1. This is a bit wordy to be able to respond to. I’ll try but I’ll have to guess…
      On (1) I suspect you think that setkey() is necessary before grouping? It isn’t. Grouping doesn’t depend on anything prior or any state. You have a point if you want to binary search to different columns than the primary key, and we intend to add secondary keys (set2key) to address that. Unless you update by reference using :=, all queries return a new data.table, so I’m not sure what you mean.
      On (2) honestly I can’t guess what you mean here. What different ways of specifying i, j and k are there? If you know SQL then i=WHERE, j=SELECT, k=GROUP BY. That doesn’t vary. Have you run example(data.table) and worked through the examples at the prompt? That’s the quickest way to pick it up (by example) as suggested at the top of ?data.table.

  5. I reran the code in this article and now dplyr is two times faster than data.table. Did the dplyr package change substantially since September?

    1. Hi Jonas, that’s not quite the case. The issue is that `data.table` does not require `setkey` to `aggregate`. However `dplyr` requires `group_by` before to `summarise`. So, it’s not quite fair to measure *just* the time to `summarise`. We’ve pointed out these things quite nicely here, in case you’re interested: http://arunsrinivasan.github.io/dplyr_benchmark/

      Here is how it should be timed…

      **Note 1:** I also add the time to create the data.table, which will normally not be the case because you can directly use “fread”.

      **Note 2:** I have compared data.table version 1.8.10 (CRAN) to the development version of dplyr (on github) here.

      # No key at all:
      system.time({
      three <- as.data.table(two)
      test4 <- three[, list(percent_total = sum(percent)), by = "id"]
      })

      This takes app. 6.333 seconds on my laptop.

      library(dplyr)
      system.time({
      fourDf <- group_by(two, id)
      test5 <- summarise(fourDf, percent_total = sum(percent))
      })

      This takes 8.71 seconds.

    2. Hi Jonas, that’s not quite the case. The issue is that `data.table` does not require `setkey` to `aggregate`. However `dplyr` requires `group_by` before to `summarise`. So, it’s not quite fair to measure *just* the time to `summarise`. We’ve pointed out these things quite nicely here, in case you’re interested: http://arunsrinivasan.github.io/dplyr_benchmark/

      Here is how it should be timed…

      **Note 1:** I also add the time to create the data.table, which will normally not be the case because you can directly use “fread”.

      **Note 2:** I have compared data.table version 1.8.10 (CRAN) to the development version of dplyr (on github) here.

      # No key at all:
      system.time({
      three <- as.data.table(two)
      test4 <- three[, list(percent_total = sum(percent)), by = "id"]
      })

      This takes app. 6.333 seconds on my laptop.

      library(dplyr)
      system.time({
      fourDf <- group_by(two, id)
      test5 <- summarise(fourDf, percent_total = sum(percent))
      })

      This takes 8.71 seconds.

  6. First of all I need to thank both the developers of dplyr and data.table. I use them both.

    data.table: for simple things where speed is more important than transparency
    dplyr: for the grammar when transparency is more important than speed
    (mostly when multiple operations are used)

    I wonder if something changed as I can’t replicate the given example anymore

    > ## Test 5 (dplyr): Speed test for package dplyr
    >
    > system.time(fourDf
    > system.time(test5
    > str(test5)
    ‘data.frame': 1 obs. of 1 variable:
    $ percent_total: num 1e+07

    this should have been a data.frame with 80.000 rows like test(4)

    > str(test4)
    Classes ‘data.table’ and ‘data.frame': 80000 obs. of 2 variables:
    $ id : Factor w/ 80000 levels “1”,”2″,”3″,”4″,..: 1 2 3 4 5 6 7 8 9 10 …
    $ percent_total: num 138 116 124 118 119 …
    – attr(*, “sorted”)= chr “id”
    – attr(*, “.internal.selfref”)=

    > sessionInfo()

    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)
    locale:
    [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    [3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
    [9] LC_ADDRESS=C LC_TELEPHONE=C
    [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods
    [7] base
    other attached packages:
    [1] data.table_1.8.10 plyr_1.8 dplyr_0.1.1

    loaded via a namespace (and not attached):
    [1] assertthat_0.1 Rcpp_0.10.6 tools_3.0.2

  7. Just did some testing and actually am amazed by the performance on my household system for both data.table and dplyr. Performance Is almost similar and I am very gratefull to both Hadley and Matthew. I hope that someday multicores can be used,

    The test is executed in sequence for data.table and dplyr so both get equal CPU. Although multiple CPU’s were available only 1 is used. The system contains 64GB and is running ubuntu 13.10. Max memory used was 37GB.

    install.packages(“data.table”)
    install.packages(“dplyr”)
    install.packages(“microbenchmark”)

    require(data.table)
    require(dplyr)
    require(microbenchmark)

    set.seed(42)

    types <- c("A", "B", "C", "D", "E", "F")
    obs <- 4e+08

    one <- data.frame(id = as.factor(seq(from = 1, to = 8e+05, by = 1)),
    percent = round(runif(obs, min = 0, max = 1), digits = 2),
    type = as.factor(sample(types, obs, replace = TRUE)))

    typeSubset NROW(one)
    [1] 400.000.000

    test_dt <- function(){
    two <- subset(one, type %in% typeSubset)
    three <- data.table(two, key = c("id"))
    test4 <- three[, list(percent_total = sum(percent)),
    by = key(three)]
    }

    test_dp <- function(){
    two <- subset(one, type %in% typeSubset)
    four <- group_by(two, id)
    test5 <- four %.%
    summarise(percent_total = sum(percent))
    }

    m_dt <- microbenchmark(testdt = test_dt(), times = 2)
    m_dp m_dt
    Unit: seconds
    expr min lq median uq max neval
    testdt 129.5227 129.5227 130.3159 131.109 131.109 2

    > m_dp
    Unit: seconds
    expr min lq median uq max neval
    testdp 133.1595 133.1595 134.7981 136.4368 136.4368 2

    1. Hi Floris, Here are the benchmarks run on dplyr (commit 1527 from github) and data.table (commit 1306 from github):

      Summary: data.table takes 26 seconds and dplyr 170 seconds.

      I moved “two” outside in both functions as it just hinders measuring “just” the aggregation time accurately. Also, I’ve avoided the overhead of creating: 1) keys – not necessary for grouping at all and 2) creation of data.table – we can use `setDT` instead of data.table() as the latter makes an unnecessary copy. This allows to measure *just* the time to aggregate pretty accurately. And here’s the code to replicate the results:

      require(data.table)
      require(dplyr)
      require(microbenchmark)

      set.seed(42)
      types <- c("A", "B", "C", "D", "E", "F")
      obs <- 4e+08

      one <- data.table(id = as.factor(seq(from = 1, to = 8e+05, by = 1)),
      percent = round(runif(obs, min = 0, max = 1), digits = 2),
      type = as.factor(sample(types, obs, replace = TRUE)))
      setDF(one)
      typeSubset <- c("A", "C", "E")

      two <- subset(one, type %in% typeSubset)
      test_dt <- function() {
      test4 <- setDT(two)[, list(percent_total = sum(percent)), by = id]
      setDF(two)
      test4
      }

      test_dp <- function(){
      four <- group_by(two, id) %.% summarise(percent_total = sum(percent))
      }

      microbenchmark(testdt = test_dt(), testdp = test_dp(), times = 3)
      Unit: seconds
      expr min lq median uq max neval
      testdt 22.69444 24.43425 26.17406 26.70787 27.24168 3
      testdp 146.19651 158.67427 171.15202 172.56525 173.97848 3

  8. I believe this post should be updated. dpyr has improved substantially since its release, and it is still improving. On this particular test, it is faster than data table. On my machine it is 30% faster.

    system.time(test4 fourDf system.time(test5 three system.time(test4 fourDf system.time(test5 system.time({three <- data.table(one, key = c("id"));test4 system.time(test5 % summarise(percent_total = sum(percent)))
    user system elapsed
    2.94 0.07 3.03

    As of today (dplyr_0.2.0.99, data.table_1.9.2), dplyr slightly lags behind data.table on this measure, (because of more expensive indexing) but outperforms it otherwise. I practice, the performance differential is not large, and the choice is dictated more by syntactic convenience. I recommend data.table for its fread(), setnames() and a few other unique functions, and for relatively small transformations which will be readable and concise. For more complex code, dplyr is more maintainable. Moreover, it’s possible to use data tables in dplyr, and data.table functions play well with magrittr, so there’s no conflict here.

    1. Hi gappy3000, `data.table` has improved tremendously as well since this post. Like I said in my answer above, the timings should include `group_by` as well. `data.table’s` syntax includes the grouping and there’s no reason not to (which I see you’ve done). On my system that takes 2.6 (data.table) vs 7.6 seconds (dplyr) if I include `data.table(.)` creation. Without it `data.table` takes 1.2 seconds.

      Note that with `setDT()`, `fread()` and `setDF()` (new in v1.9.3), it is not necessary to copy a data.frame to a data.table – that’s not timing just “aggregate” anyway.

      The points on readability, maintainability are all quite subjective and much more opinionated and therefore quite hard to argue on. DT[, list(.), by=.] is not hard to maintain, really! It resembles R even more than %>% I’d argue (which you’d deny :)).

      On features, Hadley has recently tweeted here: https://twitter.com/hadleywickham/status/482706640110886912 that the support for data.tables isn’t that great. It also makes quite some copies (to adhere to dplyr philosophy).

      Many features of data.table are not possible through dplyr interface (yet?) – rolling joins, Selecting only necessary columns while join (memory-efficiency) or directly manipulating columns while join, joins with “nomatch=0L”, joins with “mult=first/last/all”, by-without-by, sub-assignment by reference, preserving order while summarising, join in `j` for each group

      1. Thanks. In general I agree. I think that an accurate benchmarking of the two packages would require many more tests. For the time being, my anecdotal experience is: i) data.table is from 1x to 3x faster than both dplyr and pandas. ii) fread is irreplaceable and one of the fastest flat file readers I know in any languages; iii) I find dplyr code the most readable, as it integrates nicely with the chaining operator, and my manipulations usually involve more than data.table syntax allows; sometimes I mix and match (when execution is a bottleneck), but I like that dplyr separates the data format layer from the manipulation one. iv) I agree, simple-to-average complexity transformations are simple in data.table.

        1. Thanks. I’ll reply in order: i) We’ve some preliminary benchmarks on 1/2 a billion rows and see substantial differences (only against dplyr though as of now). We really hope to get that out ASAP. ii) Thanks again. Matt is awesome and deserves all the praise! iii) I don’t quite follow which operation that dplyr performs is not possible using data.table, seriously. I was always under the impression that dplyr dint have all data.table functionalities. Would be great if you could explain a bit on that.. (perhaps on data.table mailing list?) iv) Once again, I work with genomics data and my entire pipeline uses data.table (for it’s speed *and* syntax) and they have some pretty complex operations (even before becoming a developer of data.table). It’s the reason I started to contribute to data.table in the first place! I’d have to disagree on that (that complex operations aren’t data.table’s forte) as well therefore.

  9. data.table saved me so much time. the syntax does require some getting used to, but aint that big of deal. thanks for the post

Leave a Reply