Using the {plyr} (1.2) package parallel processing backend with windows

Hadley Wickham has just announced the release of a new R package, “reshape2”, which is (as Hadley wrote) “a reboot of the reshape package”. Alongside it, Hadley announced the release of plyr 1.2.1 (now faster and with support for parallel computation!).
Both releases are exciting because of the significant speed increases they bring.

In the case of the new plyr package, though, an even more interesting addition is the parallel processing backend.

Reminder: what is the `plyr` package all about?

(as written in Hadley’s announcement)

plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

fit the same model to each patient subset of a data frame
quickly calculate summary statistics for each group
perform group-wise transformations like scaling or standardising

It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

totally consistent names, arguments and outputs
convenient parallelisation through the foreach package
input from and output to data.frames, matrices and lists
progress bars to keep track of long running operations
built-in error recovery, and informative error messages
labels that are maintained across all transformations

Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

You can find out more at http://had.co.nz/plyr/, including a 20 page introductory guide, http://had.co.nz/plyr/plyr-intro.pdf. You can ask questions about plyr (and data-manipulation in general) on the plyr mailing list. Sign up at http://groups.google.com/group/manipulatr
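
To make the split, apply and combine steps from the announcement concrete, here is a minimal sketch using only base R and the built-in mtcars data. The data set and the model are my own choice for illustration, not part of Hadley's announcement. It fits the same linear model to each group of cars:

```r
# Split-apply-combine with base R only, using the built-in mtcars data.
pieces <- split(mtcars, mtcars$cyl)        # split: one data frame per cylinder count

fits <- lapply(pieces, function(piece) {   # apply: fit the same model to each piece
  coef(lm(mpg ~ wt, data = piece))
})

result <- do.call(rbind, fits)             # combine: one row of coefficients per group
print(result)
```

With plyr, the three steps should collapse into a single call along the lines of ddply(mtcars, "cyl", function(piece) coef(lm(mpg ~ wt, data = piece))).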

What’s new in `plyr` (1.2.1)

The exciting news about the release of the new plyr version is the added support for parallel processing.

l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that, when TRUE, applies functions in parallel using a parallel backend registered with the foreach package.
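
Since .parallel = TRUE relies on whatever backend is registered with foreach, it can be worth checking that one is in place first. The sketch below uses getDoParRegistered() and getDoParWorkers() from the foreach package; backend_ready is just an illustrative variable name, and the check is skipped when foreach is not installed:

```r
# Check whether a foreach backend is registered before asking plyr to run in parallel.
backend_ready <- FALSE

if (requireNamespace("foreach", quietly = TRUE)) {
  backend_ready <- foreach::getDoParRegistered()
  cat("Backend registered:", backend_ready, "\n")
  cat("Workers available: ", foreach::getDoParWorkers(), "\n")
}

# Only request parallel execution when a backend is actually in place, e.g.
# llply(x, f, .parallel = backend_ready)
```

In a fresh session with no backend registered, I would expect this to report FALSE and a single worker.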

The new package also has some minor changes and bug fixes, all of which can be read here.

In his original announcement, Hadley gave an example of using the new parallel backend with the doMC package for Unix/Linux. On Windows (the OS I’m using) you should use the doSMP package instead (as David mentioned in his post earlier today). However, this package is currently released only for “REvolution R” and not yet for R 2.11 (see more about it here). Thanks to the kind help of Tao Shi, though, there is a solution for Windows users who want a parallel processing backend for plyr.

All you need to do is install the doSMP package, following the instructions in the post “Parallel Multicore Processing with R (on Windows)”, and then use it like this:

require(plyr) # make sure you have 1.2 or later installed
x <- seq_len(20)
wait <- function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
#  user  system elapsed
#     0       0       2
require(doSMP)
workers <- startWorkers(2) # My computer has 2 cores
registerDoSMP(workers)
system.time(llply(x, wait, .parallel = TRUE))
#  user  system elapsed
#  0.09    0.00    1.11

Update (03.09.2012): the above code will no longer work with updated versions of R (R 2.15 etc.)

Trying to run it will result in the error message:

Loading required package: doSMP
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘doSMP’

That is because trying to install the package gives the error message:

> install.packages("doSMP")
Installing package(s) into ‘D:/R/library’
(as ‘lib’ is unspecified)
Warning message:
package ‘doSMP’ is not available (for R version 2.15.0)

You can fix this by replacing the {doSMP} package with the {doParallel} and {foreach} packages. Here is how:

if(!require(foreach)) install.packages("foreach")
if(!require(doParallel)) install.packages("doParallel")
# require(doSMP) # will no longer work...
library(plyr) # provides llply
library(foreach)
library(doParallel)
workers <- makeCluster(2) # My computer has 2 cores
registerDoParallel(workers)

x <- seq_len(20)
wait <- function(i) Sys.sleep(0.3)
system.time(llply(x, wait)) # 6 sec
system.time(llply(x, wait, .parallel = TRUE)) # 3.53 sec
stopCluster(workers) # shut the workers down when done
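
Rather than hard-coding the number of workers, you could ask R how many cores the machine has. This is only a sketch, assuming the parallel package that ships with R 2.14 and later. Leaving one core free is a common habit rather than a requirement, the cap of 4 is there only to keep the demo light, and n_workers and squares are names made up for the example:

```r
library(parallel)

n_cores <- detectCores()                      # may return NA on some platforms
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free, keep at least one worker, and cap at 4 for this demo
n_workers <- max(1L, min(n_cores - 1L, 4L))

workers <- makeCluster(n_workers)
# registerDoParallel(workers) would go here, exactly as in the code above

# Quick check that the workers respond, using the parallel package only
squares <- parLapply(workers, 1:4, function(i) i^2)
print(unlist(squares))

stopCluster(workers)                          # release the workers when finished
```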

4 thoughts on “Using the {plyr} (1.2) package parallel processing backend with windows”


Zach says:
April 4, 2011 at 3:57 pm
When I run this example, I get the following warning:

Warning messages:
1: : … may be used in an incorrect context: ‘.fun(piece, …)’
2: : … may be used in an incorrect context: ‘.fun(piece, …)’

Should I worry about this? Thanks!

Emil Kirkegaard says:
December 23, 2015 at 4:24 am
I get the same, but results are correct. Looks like it is not happy about passing arguments to the function via dotdotdot.

Pingback: Article about plyr published in JSS, and the citation was added to the new plyr (version 1.5) | R-statistics blog

Boris Shor says:
December 3, 2012 at 6:45 pm
I am running Windows and the doSMP compatible code fails if I call an external function (in my case called “get.district”), complaining “Error in do.ply(i) : task 1 failed – “could not find function “get.district””

I think this is because doSMP is not exporting function names and objects in a way that plyr expects. See this stack overflow question: http://stackoverflow.com/questions/5559287/how-do-i-make-dosmp-play-nicely-with-plyr
