Category Archives: R programming

Managing a statistical analysis project – guidelines and best practices

In the past two years, a growing community of R users (and statisticians in general) have been participating in two major Question-and-Answer websites:

  1. The R tag page on Stackoverflow, and
  2. Stat over flow (which will soon move to a new domain, no worries, I’ll write about it once it happens)

In that time, several long (and fascinating) discussion threads where started, reflecting on tips and best practices for managing a statistical analysis project.  They are:

On the last thread in the list, the user chl, has started with trying to compile all the tips and suggestions together.  And with his permission, I am now republishing it here.  I encourage you to contribute from your own experience (either in the comments, or by answering to any of the threads I’ve linked to)

Continue reading

Dumping functions from the global environment into an R script file

Looking at a project you didn’t touch for years poses many challenges. The less documentation and organization you had in your files, the more time you’ll have to spend tracing back what you did back when the code was written.

I just opened up such a project, that was before I ever knew to split my .r files to “data.r”, “functions.r”, “do.r”. All I have are several versions of an old .RData file and many .r files with a mix of functions and commands (oh the shame!)

One idea I had for the tracing back was to take the latest version of .RData I had, and see what functions I had in it’s environment. simply typing ls() wouldn’t work. Also, I wanted to have a list of all the functions that where defined in my .RData environment. Thanks to the code recently published by Richie Cotton, I was able to create the “save.functions.from.env”. This function will go through all your defined functions and write them into “d:\\temp.r”.

I hope this might be useful to one of you in the future, here is the code to do it:

save.functions.from.env <- function(file = "d:\\temp.r")
{
	# This function will go through all your defined functions and write them into "d:\\temp.r"
	# let's get all the functions from the envoirnement:
	funs <- Filter(is.function, sapply(ls( ".GlobalEnv"), get))
 
	# Let's 
	for(i in seq_along(funs))
	{
		cat(	# number the function we are about to add
			paste("\n" , "#------ Function number ", i , "-----------------------------------" ,"\n"),
			append = T, file = file
			)
 
		cat(	# print the function into the file
			paste(names(funs)[i] , "<-", paste(capture.output(funs[[i]]), collapse = "\n"), collapse = "\n"),
			append = T, file = file
			)
 
		cat(
			paste("\n" , "#-----------------------------------------" ,"\n"),
			append = T, file = file
			)
	}
 
	cat( # writing at the end of the file how many new functions where added to it
		paste("# A total of ", length(funs), " Functions where written into", file),
		append = T, file = file
		)
	print(paste("A total of ", length(funs), " Functions where written into", file))
}
 
# save.functions.from.env() # this is how you run it

Update: Joshua Ulrich gave on stackoverflow another solution for this challenge:

	newEnv <- new.env()
	load("myFunctions.Rdata", newEnv)
	dump(c(lsf.str(newEnv)), file="normalCodeFile.R", envir=newEnv)

And also suggested to look into ?prompt (which creates documentation files for objects) and / or ?package.skeleton.

Using the {plyr} (1.2) package parallel processing backend with windows

Hadley Wickham has just announced the release of a new R package “reshape2” which is (as Hadley wrote) “a reboot of the reshape package”. Alongside, Hadley announced the release of plyr 1.2.1 (now faster and with support to parallel computation!).
Both releases are exciting due to a significant speed increase they have now gained.

Yet in case of the new plyr package, an even more interesting new feature added is the introduction of the parallel processing backend.

    Reminder what is the `plyr` package all about

    (as written in Hadley’s announcement)

    plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

    • fit the same model each patient subsets of a data frame
    • quickly calculate summary statistics for each group
    • perform group-wise transformations like scaling or standardising

    It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

    • totally consistent names, arguments and outputs
    • convenient parallelisation through the foreach package
    • input from and output to data.frames, matrices and lists
    • progress bars to keep track of long running operations
    • built-in error recovery, and informative error messages
    • labels that are maintained across all transformations

    Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

    You can find out more at http://had.co.nz/plyr/, including a 20 page introductory guide, http://had.co.nz/plyr/plyr-intro.pdf.  You can ask questions about plyr (and data-manipulation in general) on the plyr mailing list. Sign up at http://groups.google.com/group/manipulatr

    What’s new in `plyr` (1.2.1)

    The exiting news about the release of the new plyr version is the added support for parallel processing.

    l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE, applies functions in parallel using a parallel backend registered with the
    foreach package.

    The new package also has some minor changes and bug fixes, all can be read here.

    In the original announcement by Hadley, he gave an example of using the new parallel backend with the doMC package for unix/linux.  For windows (the OS I’m using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently only released for “REvolution R” and not released yet for R 2.11 (see more about it here).  But due to the kind help of Tao Shi there is a solution for windows users wanting to have parallel processing backend to plyr in windows OS.

    All you need is to install the doSMP package, according to the instructions in the post “Parallel Multicore Processing with R (on Windows)“, and then use it like this:


    require(plyr) # make sure you have 1.2 or later installed
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.1)
    system.time(llply(x, wait))
    # user system elapsed
    # 0 0 2
    require(doSMP)
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers)
    system.time(llply(x, wait, .parallel = TRUE))
    # user system elapsed
    # 0.09 0.00 1.11

    Update (03.09.2012): the above code will no longer work with updated versions of R (R 2.15 etc.)

    Trying to run it will result in the error massage:

    Loading required package: doSMP
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
      there is no package called ‘doSMP’

    Because trying to install the package will give the error massage:

    > install.packages("doSMP")
    Installing package(s) into ‘D:/R/library(as ‘lib’ is unspecified)
    Warning message:
    package ‘doSMP’ is not available (for R version 2.15.0)

    You can fix this be replacing the use of {doSMP} package with the {doParallel}+{foreach} packages. Here is how:

    if(!require(foreach)) install.packages("foreach")
    if(!require(doParallel)) install.packages("doParallel")
    # require(doSMP) # will no longer work...
    library(foreach)
    library(doParallel)
    workers <- makeCluster(2) # My computer has 2 cores
    registerDoParallel(workers)
     
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.3)
    system.time(llply(x, wait)) # 6 sec
    system.time(llply(x, wait, .parallel = TRUE)) # 3.53 sec

    The difference between “letters[c(1,NA)]” and “letters[c(NA,NA)]“

    In David Smith’s latest blog post (which, in a sense, is a continued response to the latest public attack on R), there was a comment by Barry that caught my eye. Barry wrote:

    Even I get caught out on R quirks after 20 years of using it. Compare letters[c(12,NA)] and letters[c(NA,NA)] for the most recent thing that made me bang my head against the wall.

    So I did, and here’s the output:

    > letters[c(12,NA)]
    [1] "l" NA 
    >  letters[c(NA,NA)] 
     [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
    >

    Interesting isn’t it?
    I had no clue why this had happened but luckily for us, Barry gave a follow-up reply with an explanation. And here is what he wrote:
    Continue reading

    Parallel Multicore Processing with R (on Windows)

    This post offers simple example and installation tips for “doSMP” the new Parallel Processing backend package for R under windows.
    * * *

    Update:
    The required packages are not yet now available on CRAN, but until they will get online, you can download them from here:
    REvolution foreach windows bundle
    (Simply unzip the folders inside your R library folder)

    * * *

    Recently, REvolution blog announced the release of “doSMP”, an R package which offers support for symmetric multicore processing (SMP) on Windows.
    This means you can now speed up loops in R code running iterations in parallel on a multi-core or multi-processor machine, thus offering windows users what was until recently available for only Linux/Mac users through the doMC package.

    Installation

    For now, doSMP is not available on CRAN, so in order to get it you will need to download the REvolution R distribution “R Community 3.2” (they will ask you to supply your e-mail, but I trust REvolution won’t do anything too bad with it…)
    If you already have R installed, and want to keep using it (and not the REvolution distribution, as was the case with me), you can navigate to the library folder inside the REvolution distribution it, and copy all the folders (package folders) from there to the library folder in your own R installation.

    If you are using R 2.11.0, you will also need to download (and install) the revoIPC package from here:
    revoIPC package – download link (required for running doSMP on windows)
    (Thanks to Tao Shi for making this available!)

    Usage

    Once you got the folders in place, you can then load the packages and do something like this:

    require(doSMP)
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers)
     
    # create a function to run in each itteration of the loop
    check <-function(n) {
    	for(i in 1:1000)
    	{
    		sme <- matrix(rnorm(100), 10,10)
    		solve(sme)
    	}
    }
     
     
    times <- 10	# times to run the loop
     
    # comparing the running time for each loop
    system.time(x <- foreach(j=1:times ) %dopar% check(j))  #  2.56 seconds  (notice that the first run would be slower, because of R's lazy loading)
    system.time(for(j in 1:times ) x <- check(j))  #  4.82 seconds
     
    # stop workers
    stopWorkers(workers)

    Points to notice:

    • You will only benefit from the parallelism if the body of the loop is performing time-consuming operations. Otherwise, R serial loops will be faster
    • Notice that on the first run, the foreach loop could be slow because of R’s lazy loading of functions.
    • I am using startWorkers(2) because my computer has two cores, if your computer has more (for example 4) use more.
    • Lastly – if you want more examples on usage, look at the “ParallelR Lite User’s Guide”, included with REvolution R Community 3.2 installation in the “doc” folder

    Updates

    (15.5.10) :
    The new R version (2.11.0) doesn’t work with doSMP, and will return you with the following error:

    Loading required package: revoIPC
    Error: package ‘revoIPC’ was built for i386-pc-intel32


    So far, a solution is not found, except using REvolution R distribution, or using R 2.10
    A thread on the subject was started recently to report the problem. Updates will be given in case someone would come up with better solutions.

    Thanks to Tao Shi, there is now a solution to the problem. You’ll need to download the revoIPC package from here:
    revoIPC package – download link (required for running doSMP on windows)
    Install the package on your R distribution, and follow all of the other steps detailed earlier in this post. It will now work fine on R 2.11.0


    Update 2: Notice that I added, in the beginning of the post, a download link to all the packages required for running parallel foreach with R 2.11.0 on windows. (That is until they will be uploaded to CRAN)

    Update 3 (04.03.2011): doSMP is now officially on CRAN!