Using the {plyr} (1.2) package parallel processing backend with windows

Hadley Wickham has just announced the release of a new R package “reshape2” which is (as Hadley wrote) “a reboot of the reshape package”. Alongside, Hadley announced the release of plyr 1.2.1 (now faster and with support to parallel computation!).
Both releases are exciting due to a significant speed increase they have now gained.

Yet in case of the new plyr package, an even more interesting new feature added is the introduction of the parallel processing backend.

    Reminder what is the `plyr` package all about

    (as written in Hadley’s announcement)

    plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

    • fit the same model each patient subsets of a data frame
    • quickly calculate summary statistics for each group
    • perform group-wise transformations like scaling or standardising

    It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

    • totally consistent names, arguments and outputs
    • convenient parallelisation through the foreach package
    • input from and output to data.frames, matrices and lists
    • progress bars to keep track of long running operations
    • built-in error recovery, and informative error messages
    • labels that are maintained across all transformations

    Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

    You can find out more at http://had.co.nz/plyr/, including a 20 page introductory guide, http://had.co.nz/plyr/plyr-intro.pdf.  You can ask questions about plyr (and data-manipulation in general) on the plyr mailing list. Sign up at http://groups.google.com/group/manipulatr

    What’s new in `plyr` (1.2.1)

    The exiting news about the release of the new plyr version is the added support for parallel processing.

    l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE, applies functions in parallel using a parallel backend registered with the
    foreach package.

    The new package also has some minor changes and bug fixes, all can be read here.

    In the original announcement by Hadley, he gave an example of using the new parallel backend with the doMC package for unix/linux.  For windows (the OS I’m using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently only released for “REvolution R” and not released yet for R 2.11 (see more about it here).  But due to the kind help of Tao Shi there is a solution for windows users wanting to have parallel processing backend to plyr in windows OS.

    All you need is to install the doSMP package, according to the instructions in the post “Parallel Multicore Processing with R (on Windows)“, and then use it like this:


    require(plyr) # make sure you have 1.2 or later installed
    x <- seq_len(20) wait <- function(i) Sys.sleep(0.1) system.time(llply(x, wait)) # user system elapsed # 0 0 2 require(doSMP) workers <- startWorkers(2) # My computer has 2 cores registerDoSMP(workers) system.time(llply(x, wait, .parallel = TRUE)) # user system elapsed # 0.09 0.00 1.11

    Update (03.09.2012): the above code will no longer work with updated versions of R (R 2.15 etc.)

    Trying to run it will result in the error massage:

    Loading required package: doSMP
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
      there is no package called ‘doSMP’
    

    Because trying to install the package will give the error massage:

    > install.packages("doSMP")
    Installing package(s) into ‘D:/R/library’
    (as ‘lib’ is unspecified)
    Warning message:
    package ‘doSMP’ is not available (for R version 2.15.0)
    

    You can fix this be replacing the use of {doSMP} package with the {doParallel}+{foreach} packages. Here is how:

    if(!require(foreach)) install.packages("foreach")
    if(!require(doParallel)) install.packages("doParallel")
    # require(doSMP) # will no longer work...
    library(foreach)
    library(doParallel)
    workers <- makeCluster(2) # My computer has 2 cores
    registerDoParallel(workers)
    
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.3)
    system.time(llply(x, wait)) # 6 sec
    system.time(llply(x, wait, .parallel = TRUE)) # 3.53 sec
    

    Parallel Multicore Processing with R (on Windows)

    Parallel Processing backend for R under windows – installation tips and some examples.

    This post offers simple example and installation tips for “doSMP” the new Parallel Processing backend package for R under windows.
    * * *

    Update:
    The required packages are not yet now available on CRAN, but until they will get online, you can download them from here:
    REvolution foreach windows bundle
    (Simply unzip the folders inside your R library folder)

    * * *

    Recently, REvolution blog announced the release of “doSMP”, an R package which offers support for symmetric multicore processing (SMP) on Windows.
    This means you can now speed up loops in R code running iterations in parallel on a multi-core or multi-processor machine, thus offering windows users what was until recently available for only Linux/Mac users through the doMC package.

    Installation

    For now, doSMP is not available on CRAN, so in order to get it you will need to download the REvolution R distribution “R Community 3.2” (they will ask you to supply your e-mail, but I trust REvolution won’t do anything too bad with it…)
    If you already have R installed, and want to keep using it (and not the REvolution distribution, as was the case with me), you can navigate to the library folder inside the REvolution distribution it, and copy all the folders (package folders) from there to the library folder in your own R installation.

    If you are using R 2.11.0, you will also need to download (and install) the revoIPC package from here:
    revoIPC package – download link (required for running doSMP on windows)
    (Thanks to Tao Shi for making this available!)

    Usage

    Once you got the folders in place, you can then load the packages and do something like this:

    require(doSMP)
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers)
    
    # create a function to run in each itteration of the loop
    check <-function(n) {
    	for(i in 1:1000)
    	{
    		sme <- matrix(rnorm(100), 10,10)
    		solve(sme)
    	}
    }
    
    
    times <- 10	# times to run the loop
    
    # comparing the running time for each loop
    system.time(x <- foreach(j=1:times ) %dopar% check(j))  #  2.56 seconds  (notice that the first run would be slower, because of R's lazy loading)
    system.time(for(j in 1:times ) x <- check(j))  #  4.82 seconds
    
    # stop workers
    stopWorkers(workers)
    

    Points to notice:

    • You will only benefit from the parallelism if the body of the loop is performing time-consuming operations. Otherwise, R serial loops will be faster
    • Notice that on the first run, the foreach loop could be slow because of R's lazy loading of functions.
    • I am using startWorkers(2) because my computer has two cores, if your computer has more (for example 4) use more.
    • Lastly - if you want more examples on usage, look at the "ParallelR Lite User's Guide", included with REvolution R Community 3.2 installation in the "doc" folder

    Updates

    (15.5.10) :
    The new R version (2.11.0) doesn't work with doSMP, and will return you with the following error:

    Loading required package: revoIPC
    Error: package 'revoIPC' was built for i386-pc-intel32


    So far, a solution is not found, except using REvolution R distribution, or using R 2.10
    A thread on the subject was started recently to report the problem. Updates will be given in case someone would come up with better solutions.

    Thanks to Tao Shi, there is now a solution to the problem. You'll need to download the revoIPC package from here:
    revoIPC package - download link (required for running doSMP on windows)
    Install the package on your R distribution, and follow all of the other steps detailed earlier in this post. It will now work fine on R 2.11.0


    Update 2: Notice that I added, in the beginning of the post, a download link to all the packages required for running parallel foreach with R 2.11.0 on windows. (That is until they will be uploaded to CRAN)

    Update 3 (04.03.2011): doSMP is now officially on CRAN!