Guest post by Jake Russ
For a recent project I needed to make a simple sum calculation on a rather large data frame (0.8 GB, 4+ million rows, and ~80,000 groups). As an avid user of Hadley Wickham’s packages, my first thought was to use plyr. However, the job took plyr roughly 13 hours to complete.
plyr is extremely efficient and user-friendly for most problems, so it was clear to me that I was using it for something it wasn’t meant to do, but I didn’t know of any alternative screwdrivers to use.
I asked for some help on the manipulatr Google group, and their feedback led me to data.table and dplyr, a new and still in-progress package by Hadley.
What follows is a speed comparison of these three packages, incorporating all the feedback from the manipulatr folks. They found it informative, so Tal asked me to write it up as a reproducible example.