Guest post by Jake Russ
For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.
plyr is extremely efficient and user friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative screwdrivers to use.
I asked for some help on the manipulator Google group, and their feedback led me to data.table and dplyr, a new, and still in progress, package project by Hadley.
What follows is a speed comparison of these three packages incorporating all the feedback from the manipulator folks. They found it informative, so Tal asked me to write it up as a reproducible example.
Let’s start by making a data frame which fits my description above, but make it reproducible:
set.seed(42)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 4e+07
one <- data.frame(id = as.factor(seq(from = 1, to = 80000, by = 1)),
                  percent = round(runif(obs, min = 0, max = 1), digits = 2),
                  type = as.factor(sample(types, obs, replace = TRUE)))
print(object.size(one), units = "GB")
## 0.6 Gb
summary(one)
## id percent type
## 1 : 500 Min. :0.00 A:6672132
## 2 : 500 1st Qu.:0.25 B:6663570
## 3 : 500 Median :0.50 C:6668009
## 4 : 500 Mean :0.50 D:6668684
## 5 : 500 3rd Qu.:0.75 E:6660437
## 6 : 500 Max. :1.00 F:6667168
## (Other):39997000
I'll start the testing with plyr, using ddply, but I'll also show the difference between subsetting a data frame from within a ddply call and doing the subset first from outside the call. Then I offer a third way to use plyr's count function to achieve the same result.
library(plyr)
## Test 1 (plyr): Use ddply and subset one with [ ] style indexing from
## within the ddply call.
typeSubset <- c("A", "C", "E")
system.time(test1 <- ddply(one[one$type %in% typeSubset, ], .(id), summarise,
percent_total = sum(percent)))
## user system elapsed
## 104.51 21.23 125.81
## Test 2 (plyr): Use ddply but subset one outside of the ddply call
two <- subset(one, type %in% typeSubset)
system.time(test2 <- ddply(two, .(id), summarise, percent_total = sum(percent)))
## user system elapsed
## 101.20 46.14 147.64
## Test 3 (plyr): For a simple sum, an alternative is to use plyr's count
## function
system.time(test3 <- count(two, "id", "percent"))
## user system elapsed
## 5.90 0.22 6.12
Doing the subset outside of the ddply call did speed things up, but not as much as I originally thought it would. For my particular project, doing the subset outside of the ddply call reduced the run time to 12 hours. So largely this is still a “wrong tool” problem, rather than a “when to subset” problem.
Next, I'll try data.table, and for this test and the dplyr one below I'll operate on the data frame which has been pre-subset:
library(data.table)
## Test 4 (data.table): Speed test for package data.table
## Define the data table
three <- data.table(two, key = c("id"))
tables() # check that the key columns are correct
## NAME NROW MB COLS KEY
## [1,] three 20,000,578 310 id,percent,type id
## Total: 310MB
## Operate on it
system.time(test4 <- three[, list(percent_total = sum(percent)), by = key(three)])
## user system elapsed
## 0.17 0.01 0.19
dplyr is not currently available on CRAN but you can install it from github with:
devtools::install_github("assertthat")
devtools::install_github("dplyr")
library(dplyr)
## Test 5 (dplyr): Speed test for package dplyr
fourDf <- group_by(two, id)
system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))
## user system elapsed
## 1.49 0.03 1.52
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.01 Rcpp_0.10.4 data.table_1.8.8 plyr_1.8
## [5] knitr_1.4.1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 digest_0.6.3 evaluate_0.4.7 formatR_0.9
## [5] stringr_0.6.2 tools_3.0.1
Both data.table and dplyr were able to reduce the problem to less than a few seconds. If you're looking for pure speed, data.table is the clear winner. However, it is my understanding that data.table's syntax can be frustrating, so if you're already used to the 'Hadley ecosystem' of packages, dplyr is a formidable alternative, even if it is still in the early stages.
Have a nice day!
You can also do
fiveDf <- group_by(as.data.table(two), id)
system.time(test6 <- summarise(fiveDf, percent_total = sum(percent)))
to use dplyr commands with a data.table backend.
Can you explain in more detail when data.table syntax is frustrating? If what you mean is it has to be learnt, then sure. But what’s frustrating other than that?
I think the challenge is that you overload operators to mean different things inside and outside of [. This means that this:
dt[f(x)]
is not equivalent to
y <- f(x)
dt[y]
for certain f (e.g. !, unique).
That makes it difficult to perform a complex operation a step at a time – you either achieve your goal totally or fail. SQL has a similar problem.
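For example, a minimal made-up sketch of how ! behaves differently inside and outside [.data.table:
library(data.table)
dt <- data.table(id = c("a", "a", "b", "c"), v = 1:4, key = "id")

# Inside [, a character i joins on the key, and a leading ! turns it into a
# "not join": all rows whose id is not "a"
dt[!"a"]

# Outside [, ! is plain logical negation, so the same i cannot be precomputed:
# y <- !"a"   # Error: invalid argument type
# dt[y]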
As per my tweet, dplyr has over 11 functions that replicate part of [.data.table. This makes dplyr considerably more verbose, but each function corresponds to a simple verb, so you can string together complicated operations through a combination of simple and explicit primitives, checking your results as you go. This doesn't necessarily come at a performance cost, since you could do everything lazily and only force computation when the user needs results.
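As a rough sketch of that step-by-step style, using the data frame from the post (the intermediate names are just for illustration):
library(dplyr)

# Each verb is a separate, inspectable step
step1  <- filter(one, type %in% c("A", "C", "E"))
step2  <- group_by(step1, id)
result <- summarise(step2, percent_total = sum(percent))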
To be clear, I don't think this makes dplyr "better" than data.table, we're just optimising different features.
You’ve picked on one small corner of data.table here. I encourage readers of this article and these comments to take a look at data.table and decide for themselves. Perhaps start with the 19 independent reviews here:
https://crantastic.org/packages/data-table
Matthew,
I thank Hadley for taking the time to give a specific example, but as the author of the post, let me say that my statement was not referencing any particular problem with data.table. It was an offhanded summary of the feedback I received when I canvassed many folks for help with plyr. I admit I chose a poor way to phrase that, as it was not intended as a slight against data.table. I see now that it reads that way.
I’m actually indebted to you, since the use of data.table reduced my particular issue, which spawned this post, to under 30 seconds, when my previous approach took 13 excruciating hours. For that, I thank you.
Thanks for your comments, much appreciated. The trouble is that these offhand comments spread quickly on the internet. Trey Causey tweeted this article as “data.table wins, but syntax is suboptimal”, for example. For the record: the syntax of data.table is actually very simple and flexible. The essence of it is `DT[i, j, by]`. And you read it out loud as “from DT, subset by ‘i’, then do ‘j’ grouped by ‘by’”.
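For illustration, a minimal made-up example of that reading:
library(data.table)
DT <- data.table(id = rep(c("a", "b"), each = 3), x = 1:6)

# "From DT, subset rows where x > 1, then do sum(x), grouped by id"
DT[x > 1, list(total = sum(x)), by = id]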
One nit for me has been that data.table syntax (especially :=) is not well-understood by StatET in eclipse, marking every instance as an error.
Didn’t know that, thanks for highlighting. I’ve now asked Stephan if it’s at all possible to add := to the StatET definition.
It looks like he still hasn’t fixed it?
I just started doing a few tasks in data.table rather than plyr. It’s been working well and this article encourages me to explore data.table further. Thanks for sharing your work!
I’ve experimented with data.table. As a novice user of the R language, I’m looking to find those packages worth investing time in learning as part of a collection of most recent R best practice modules. I’m new to the R analysis process and would prefer not to invest time (at least up front) in learning legacy functions especially if there are more efficient and intuitive alternatives. I want to cut to the chase with packages that ease the steepness of my learning curve, and I definitely see data.table as a candidate for serving as my primary tool kit for querying/reshaping/transforming data.
However, with that said, some of the challenges of using the package were evident (1) in the way I need to manage versions of DTs, and (2) in formulating a generalized, standardized approach to slicing and dicing data into a desired format.
On the first point, say I need to maintain different states of subset/queried data in separate tables. I struggle to see how I can fork data states if I’m constantly forced to accept changes to my original table outside of copying the data table before performing manipulation. Perhaps, there is a way to do this that I missed in the documentation?
For the second point, why are there multiple ways to specify i, j, and k parameters? It’s confusing keeping track of how to use the API appropriately because it depends on my prior knowledge of much of R’s legacy functionality. As a new user, I see this and begin wondering what other simple, more patterned packages are available to achieve the similar outcomes. The way I see it, the efficiency gains in processing can never be realized if I can’t figure out if I’m using the right “door”. Give me a single entry. Perhaps this is what other writers are referring to when they speak of “syntax frustration”.
Outside of these areas, I’ve been very pleased with the package, as it does help me achieve the desired transformations. With some minor tweaks, I could definitely see this package being a dominant utility in generalized approaches to R statistical analysis.
This is a bit wordy to be able to respond to. I’ll try but I’ll have to guess…
On (1) I suspect you think that setkey() is necessary before grouping? It isn’t. Grouping doesn’t depend on anything prior or any state. You have a point if you want to binary search on columns other than the primary key, and we intend to add secondary keys (set2key) to address that. Unless you update by reference using :=, all queries return a new data.table, so I’m not sure what you mean.
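A minimal sketch of that distinction, with made-up data:
library(data.table)
dt <- data.table(id = sample(c("a", "b", "c"), 10, replace = TRUE), x = rnorm(10))

# Grouping needs no key and returns a new data.table; dt itself is untouched
res <- dt[, list(total = sum(x)), by = id]

# Only := modifies dt in place (update by reference)
dt[, x2 := x * 2]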
On (2) honestly I can’t guess what you mean here. What different ways of specifying i, j and k are there? If you know SQL then i=WHERE, j=SELECT, k=GROUP BY. That doesn’t vary. Have you run example(data.table) and worked through the examples at the prompt? That’s the quickest way to pick it up (by example) as suggested at the top of ?data.table.
I reran the code in this article and now dplyr is two times faster than data.table. Did the dplyr package change substantially since September?
Hello Jonas,
There has been a recent official announcement of the package:
http://blog.rstudio.org/2014/01/17/introducing-dplyr/
And it seems there has been a bunch of commits in the past several months:
https://github.com/hadley/dplyr/graphs/commit-activity
So the answer to your question is probably: yes.
Hi Jonas, that’s not quite the case. The issue is that `data.table` does not require `setkey` to aggregate. However `dplyr` requires `group_by` before `summarise`. So, it’s not quite fair to measure *just* the time to `summarise`. We’ve pointed out these things quite nicely here, in case you’re interested: http://arunsrinivasan.github.io/dplyr_benchmark/
Here is how it should be timed…
**Note 1:** I also add the time to create the data.table, which will normally not be the case because you can directly use `fread`.
**Note 2:** I have compared data.table version 1.8.10 (CRAN) to the development version of dplyr (on github) here.
# No key at all:
system.time({
  three <- as.data.table(two)
  test4 <- three[, list(percent_total = sum(percent)), by = "id"]
})
This takes approx. 6.333 seconds on my laptop.
library(dplyr)
system.time({
  fourDf <- group_by(two, id)
  test5 <- summarise(fourDf, percent_total = sum(percent))
})
This takes 8.71 seconds.
First of all I need to thank both the developers of dplyr and data.table. I use them both.
data.table: for simple things where speed is more important than transparency
dplyr: for the grammar when transparency is more important than speed
(mostly when multiple operations are used)
I wonder if something changed, as I can’t replicate the given example anymore:
> ## Test 5 (dplyr): Speed test for package dplyr
>
> system.time(fourDf <- group_by(two, id))
> system.time(test5 <- summarise(fourDf, percent_total = sum(percent)))
> str(test5)
‘data.frame’: 1 obs. of 1 variable:
$ percent_total: num 1e+07
This should have been a data frame with 80,000 rows, like test4:
> str(test4)
Classes ‘data.table’ and ‘data.frame’: 80000 obs. of 2 variables:
$ id : Factor w/ 80000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ percent_total: num 138 116 124 118 119 ...
- attr(*, "sorted")= chr "id"
- attr(*, ".internal.selfref")=<externalptr>
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] data.table_1.8.10 plyr_1.8 dplyr_0.1.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 Rcpp_0.10.6 tools_3.0.2
Just did some testing and actually am amazed by the performance on my home system for both data.table and dplyr. Performance is very similar and I am very grateful to both Hadley and Matthew. I hope that someday multiple cores can be used.
The test is executed in sequence for data.table and dplyr so both get equal CPU. Although multiple CPUs were available, only one was used. The system has 64 GB of RAM and is running Ubuntu 13.10. Max memory used was 37 GB.
install.packages("data.table")
install.packages("dplyr")
install.packages("microbenchmark")
require(data.table)
require(dplyr)
require(microbenchmark)
set.seed(42)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 4e+08
one <- data.frame(id = as.factor(seq(from = 1, to = 8e+05, by = 1)),
percent = round(runif(obs, min = 0, max = 1), digits = 2),
type = as.factor(sample(types, obs, replace = TRUE)))
typeSubset <- c("A", "C", "E")
NROW(one)
[1] 400000000
test_dt <- function(){
two <- subset(one, type %in% typeSubset)
three <- data.table(two, key = c("id"))
test4 <- three[, list(percent_total = sum(percent)),
by = key(three)]
}
test_dp <- function(){
two <- subset(one, type %in% typeSubset)
four <- group_by(two, id)
test5 <- four %.%
summarise(percent_total = sum(percent))
}
m_dt <- microbenchmark(testdt = test_dt(), times = 2)
m_dp <- microbenchmark(testdp = test_dp(), times = 2)
> m_dt
Unit: seconds
expr min lq median uq max neval
testdt 129.5227 129.5227 130.3159 131.109 131.109 2
> m_dp
Unit: seconds
expr min lq median uq max neval
testdp 133.1595 133.1595 134.7981 136.4368 136.4368 2
Hi Floris, Here are the benchmarks run on dplyr (commit 1527 from github) and data.table (commit 1306 from github):
Summary: data.table takes 26 seconds and dplyr 170 seconds.
—
I moved “two” outside in both functions as it just hinders measuring “just” the aggregation time accurately. Also, I’ve avoided the overhead of 1) creating keys – not necessary for grouping at all – and 2) creating a data.table with data.table(), which makes an unnecessary copy – we can use `setDT` instead. This allows us to measure *just* the time to aggregate pretty accurately. And here’s the code to replicate the results:
require(data.table)
require(dplyr)
require(microbenchmark)
set.seed(42)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 4e+08
one <- data.table(id = as.factor(seq(from = 1, to = 8e+05, by = 1)),
percent = round(runif(obs, min = 0, max = 1), digits = 2),
type = as.factor(sample(types, obs, replace = TRUE)))
setDF(one)
typeSubset <- c("A", "C", "E")
two <- subset(one, type %in% typeSubset)
test_dt <- function() {
test4 <- setDT(two)[, list(percent_total = sum(percent)), by = id]
setDF(two)
test4
}
test_dp <- function(){
four <- group_by(two, id) %.% summarise(percent_total = sum(percent))
}
microbenchmark(testdt = test_dt(), testdp = test_dp(), times = 3)
Unit: seconds
expr min lq median uq max neval
testdt 22.69444 24.43425 26.17406 26.70787 27.24168 3
testdp 146.19651 158.67427 171.15202 172.56525 173.97848 3
I believe this post should be updated. dplyr has improved substantially since its release, and it is still improving. On this particular test, it is faster than data.table. On my machine it is 30% faster.
system.time({three <- data.table(one, key = c("id")); test4 <- three[, list(percent_total = sum(percent)), by = key(three)]})
fourDf <- group_by(one, id)
system.time(test5 <- fourDf %>% summarise(percent_total = sum(percent)))
user system elapsed
2.94 0.07 3.03
As of today (dplyr_0.2.0.99, data.table_1.9.2), dplyr slightly lags behind data.table on this measure (because of more expensive indexing), but outperforms it otherwise. In practice, the performance differential is not large, and the choice is dictated more by syntactic convenience. I recommend data.table for its fread(), setnames() and a few other unique functions, and for relatively small transformations which will be readable and concise. For more complex code, dplyr is more maintainable. Moreover, it’s possible to use data tables in dplyr, and data.table functions play well with magrittr, so there’s no conflict here.
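As a rough sketch of that mix-and-match use (made-up data; behaviour may differ across versions):
library(data.table)
library(dplyr)

dt <- data.table(id = rep(c("a", "b"), each = 3), x = 1:6)

# dplyr verbs accept a data.table, and the pipe chains them
dt %>%
  group_by(id) %>%
  summarise(total = sum(x))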
Hi gappy3000, `data.table` has improved tremendously as well since this post. Like I said in my answer above, the timings should include `group_by` as well. `data.table’s` syntax includes the grouping and there’s no reason not to (which I see you’ve done). On my system that takes 2.6 (data.table) vs 7.6 seconds (dplyr) if I include `data.table(.)` creation. Without it `data.table` takes 1.2 seconds.
Note that with `setDT()`, `fread()` and `setDF()` (new in v1.9.3), it is not necessary to copy a data.frame to a data.table – that’s not timing just “aggregate” anyway.
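For illustration, a minimal sketch of that in-place conversion (made-up data):
library(data.table)

df <- data.frame(id = c("a", "a", "b"), x = 1:3)
setDT(df)                             # convert to a data.table by reference, no copy
df[, list(total = sum(x)), by = id]
setDF(df)                             # back to a plain data.frame, again without copying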
The points on readability, maintainability are all quite subjective and much more opinionated and therefore quite hard to argue on. DT[, list(.), by=.] is not hard to maintain, really! It resembles R even more than %>% I’d argue (which you’d deny :)).
On features, Hadley has recently tweeted here: https://twitter.com/hadleywickham/status/482706640110886912 that the support for data.tables isn’t that great. It also makes quite some copies (to adhere to dplyr philosophy).
Many features of data.table are not possible through the dplyr interface (yet?) – rolling joins, selecting only the necessary columns during a join (memory efficiency) or directly manipulating columns while joining, joins with `nomatch=0L`, joins with `mult="first"/"last"/"all"`, by-without-by, sub-assignment by reference, preserving order while summarising, join in `j` for each group.
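To illustrate one of those, a minimal rolling-join sketch with made-up data:
library(data.table)

trades <- data.table(time = c(1, 5, 10), trade_px = c(100, 101, 102), key = "time")
quotes <- data.table(time = c(2, 6, 11), key = "time")

# roll = TRUE carries the most recent trade forward to each quote time
trades[quotes, roll = TRUE]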
Thanks. In general I agree. I think that an accurate benchmarking of the two packages would require many more tests. For the time being, my anecdotal experience is: i) data.table is from 1x to 3x faster than both dplyr and pandas; ii) fread is irreplaceable and one of the fastest flat-file readers I know in any language; iii) I find dplyr code the most readable, as it integrates nicely with the chaining operator, and my manipulations usually involve more than data.table syntax allows; sometimes I mix and match (when execution is a bottleneck), but I like that dplyr separates the data format layer from the manipulation one; iv) I agree, simple-to-average complexity transformations are simple in data.table.
Thanks. I’ll reply in order: i) We’ve some preliminary benchmarks on half a billion rows and see substantial differences (only against dplyr though as of now). We really hope to get that out ASAP. ii) Thanks again. Matt is awesome and deserves all the praise! iii) I don’t quite follow which operation that dplyr performs is not possible using data.table, seriously. I was always under the impression that dplyr didn’t have all of data.table’s functionality. It would be great if you could explain a bit on that (perhaps on the data.table mailing list?). iv) Once again, I work with genomics data and my entire pipeline uses data.table (for its speed *and* syntax), and it involves some pretty complex operations (even from before I became a developer of data.table). It’s the reason I started to contribute to data.table in the first place! I’d have to disagree on that (that complex operations aren’t data.table’s forte) as well, therefore.
data.table saved me so much time. The syntax does require some getting used to, but it ain’t that big of a deal. Thanks for the post!
I now find the data.table syntax very simple and logical, but it was not this way at the very beginning. I think the data.table syntax seems “confusing” at first because many users are not familiar with SQL or haven’t taken the up-front time to understand the system behind data.table syntax. I quickly switched to data.table and find it to be a gorgeously simple tool to use now, but that’s because I spent the up-front time. When I have encouraged others to use it, they often appreciate the speed advantages but still get tripped up frequently when trying to solve novel problems, because they do not yet fully understand the way the syntax works. Typically they did not look at the FAQ or example(data.table) etc. materials; often it’s simply because they know others that use dplyr, so that’s what they’re used to. I find the data.table syntax to be quite intuitive, and the pervasiveness of resources such as this (http://blog.datacamp.com/data-table-cheat-sheet/) makes it much easier for beginners to quickly use it and start to get an intuition. I think data.table developers would be well served to make even better cheat sheets that really show off the full array of data.table tools (joins, reshaping, etc.) while also providing the simple “desk reference” that a lot of people need to remind them of the ways to tackle problems.
Doesn’t dplyr need R >= 3.1.2?
For me it’s failing to install with the error message:
ERROR: this R is version 3.1.1, package ‘dplyr’ requires R >= 3.1.2
Error: Command failed (1)
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Indeed, dplyr depends on R 3.1.2 at least:
https://cran.r-project.org/web/packages/dplyr/index.html
But that is a rather old release at this point; it is time for you to upgrade…