The R Journal, Vol.2 Issue 2 is out

The second issue of the second volume of The R Journal is now available.

Download complete issue

Refereed articles may be downloaded individually using the links below. [Bibliography of refereed articles]

Table of Contents

Editorial 3

Contributed Research Articles

Solving Differential Equations in R
Karline Soetaert, Thomas Petzoldt and R. Woodrow Setzer
5
Source References
Duncan Murdoch
16
hglm: A Package for Fitting Hierarchical Generalized Linear Models
Lars Rönnegård, Xia Shen and Moudud Alam
20
dclone: Data Cloning in R
Péter Sólymos
29
stringr: modern, consistent string processing
Hadley Wickham
38
Bayesian Estimation of the GARCH(1,1) Model with Student-t Innovations
David Ardia and Lennart F. Hoogerheide
41
cudaBayesreg: Bayesian Computation in CUDA
Adelino Ferreira da Silva
48
binGroup: A Package for Group Testing
Christopher R. Bilder, Boan Zhang, Frank Schaarschmidt and Joshua M. Tebbs
56
The RecordLinkage Package: Detecting Errors in Data
Murat Sariyar and Andreas Borg
61
spikeslab: Prediction and Variable Selection Using Spike and Slab Regression
Hemant Ishwaran, Udaya B. Kogalur and J. Sunil Rao
68

From the Core

What’s New? 74

News and Notes

useR! 2010 77
Forthcoming Events: useR! 2011 79
Changes in R 81
Changes on CRAN 90
News from the Bioconductor Project 101
R Foundation News 102

WP-CodeBox: A better R syntax highlighter plugin for WordPress

Today I was informed of what I believe is a better WordPress plugin for R syntax highlighting, called WP-CodeBox.  This plugin doesn’t require any hacks to make it work (as opposed to the WP-Syntax plugin, which I wrote about in the past).  WP-CodeBox can be downloaded and installed on a WordPress site by searching for it in the “Add New” section of the plugins menu.

WP-CodeBox provides some nice features (some of them AJAX based) for displaying code in a post:

  1. The code box in a post can now be folded (top right of the code box), so long code can be hidden and not clutter the post.
  2. Another button (top left of the code box) lets the reader open the code in a new window, making it easy to copy and paste the code.
  3. The plugin’s options allow automatic row numbering of the code, control over “tab” length and some other features.

P.S.: Lastly, my thanks go to guangchuang yu, whose comment on my original post, and whose post on wp-codebox and R, introduced me to this better plugin.

P.P.S.: In case you blog on WordPress.com, there is also a solution for R syntax highlighting for WordPress.com bloggers.

A new version of ff released (version 2.2.0)

A few hours ago, Jens Oehlschlägel announced on the R-help mailing list the release of a new version of the ff package.

The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory – the effective virtual memory consumption per ff object.
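
Here is a minimal sketch of what this looks like in practice (my own example, not from the package documentation): the vector below is backed by a file on disk, yet it can be indexed much like an ordinary R vector.

    library(ff)

    # Allocate a large double vector backed by a file on disk;
    # only the pages that are actually accessed are mapped into RAM.
    x <- ff(vmode = "double", length = 1e7)
    x[1:5] <- rnorm(5)   # write to the first few elements
    x[1:5]               # read them back, just like a regular vector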

Here are the new features of ff, as Jens wrote in his announcement:

—-
Dear R community,

The next release of package ff is available on CRAN. With kind help of Brian Ripley it now supports the Win64 and Sun versions of R. It has three major functional enhancements:

a) new fast in-memory sorting and ordering functions (single-threaded)
b) ff now supports on-disk sorting and ordering of ff vectors and ffdf dataframes
c) ff integer vectors now can be used as subscripts of ff vectors and ffdf dataframes

a) is achieved by careful implementation of NA-handling and exploiting context information
b) although permanently stored, sorting and ordering of ff objects can be faster than the standard routines in R
c) applying an order to ff vectors and ffdf dataframes is substantially slower than in pure R because it involves disk-access AND sorting index positions (to avoid random access).

There is still room for improvement, however, the current status should already be useful. I run some comparisons with SAS (see end of mail):
– both could sort German census size (81e6 rows) on a 3GB notebook
– ff sorts and orders faster on single columns
– sorting big multicolumn-tables is faster in SAS
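
As a quick illustration of point (c) above (my own sketch, not part of Jens’ announcement), an ff integer vector can now be used directly as a subscript of another ff vector:

    library(ff)

    x   <- ff(rnorm(100))               # an ff double vector stored on disk
    idx <- ff(as.integer(c(5, 1, 42)))  # an ff integer vector
    x[idx]                              # subscripting with an ff vector (new in ff 2.2.0)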

Continue reading “A new version of ff released (version 2.2.0)”

Open source and money – why paying R developers might not always help the project

This post can be summed up by two conflicting ideas: “we can’t buy love” and “starting to pay for love could make it disappear”, while at the same time “we need money to live and love”. These two conflicting forces, as they relate to open source, are the topic of this post.

This post is directed at the community of R users but is relevant to people in all open source projects. It deals with the question of open source projects and funding. Specifically: should a community of open source developers and users, once it exists, want to start raising/donating money to the main code contributors?

The conflict arises because, on the one hand, we intuitively wish to repay the people who have helped us, but worry about the implications of behavioral studies suggesting that doing so might destroy the developers’ motivation to keep working without constantly getting paid, and that the shift from doing something for one reason (whatever it is) to doing it for money might not be easily reversed.
On the other hand, developers need to make a (good) living, and we (as a community) should strive for them to be well paid.
How can these two be reconciled?

This article won’t offer decisive conclusions – my hope is to invite discussion on the matter (from both amateurs and professionals in the fields of open source and behavioral economics), so as to give people more ideas on which to base their opinions.

Update: this post was substantially updated from its original version, thanks to responses both in the comments and, especially, in the e-mails. I apologize for writing a post that needed so many corrections, and at the same time I am grateful to all the people who took the time to shed light on the places where I was wrong.

* * * *

Motivation: R has issues – how do we get them fixed?

In the past two weeks there has been a raging debate regarding the future of R (hint: “what is R“). Without going deeper into the topic (I already wrote about it here, where you too can go and respond), I’ll sum up the issue with a quote from Ross Ihaka (one of the two founders of R) who recently wrote:

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

After this, several discussion threads were started around the web (for example: 0, 1, 2, 3, 4, 5, 6), but then a comment was made on the R-help mailing list by Jaroslaw Piskorski, who wrote:

A few days ago Tal Galili posted a message about some controversies concerning the future of R. Having read the discussions, especially those following Ross Ihaka’s post, I have come to the conclusion, that, as usual, the problem is money. I doubt there would be discussions about dropping R in its present form if the R-Foundation were properly funded and could hire computer scientists, programmers and statisticians. If a commercial company is able to provide big-database and multicore solutions, then so would a properly founded R-Foundation.

To which my response is: I strongly disagree with this statement.
That is, I do agree that money could help with things, and it could be part of the solution. But I doubt that money is the core of this problem, or that the problem would be solved if we could only now hire “computer scientists, programmers and statisticians” (although that could be part of the solution).

And the reason I am doubtful stems from two sources:

Continue reading “Open source and money – why paying R developers might not always help the project”

Using the {plyr} (1.2) package parallel processing backend with windows

Hadley Wickham has just announced the release of a new R package, “reshape2”, which is (as Hadley wrote) “a reboot of the reshape package”. Alongside it, Hadley announced the release of plyr 1.2.1 (now faster and with support for parallel computation!).
Both releases are exciting due to the significant speed increase they have gained.

Yet in the case of the new plyr package, an even more interesting new feature is the introduction of a parallel processing backend.

    Reminder: what the `plyr` package is all about

    (as written in Hadley’s announcement)

    plyr is a set of tools for a common set of problems: you need to __split__ up a big data structure into homogeneous pieces, __apply__ a function to each piece and then __combine__ all the results back together. For example, you might want to:

    • fit the same model to each patient subset of a data frame
    • quickly calculate summary statistics for each group
    • perform group-wise transformations like scaling or standardising

    It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

    • totally consistent names, arguments and outputs
    • convenient parallelisation through the foreach package
    • input from and output to data.frames, matrices and lists
    • progress bars to keep track of long running operations
    • built-in error recovery, and informative error messages
    • labels that are maintained across all transformations

    Considerable effort has been put into making plyr fast and memory efficient, and in many cases plyr is as fast as, or faster than, the built-in functions.

    You can find out more at http://had.co.nz/plyr/, including a 20 page introductory guide, http://had.co.nz/plyr/plyr-intro.pdf.  You can ask questions about plyr (and data-manipulation in general) on the plyr mailing list. Sign up at http://groups.google.com/group/manipulatr
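
    As a tiny illustration of the split/apply/combine idea described above (my own example, not taken from the announcement), here is ddply() computing a per-group summary on the built-in iris data:

    library(plyr)

    # Split iris by Species, apply a summary function to each piece,
    # and combine the results back into a single data frame.
    ddply(iris, "Species", function(d) data.frame(mean_sepal_length = mean(d$Sepal.Length)))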

    What’s new in `plyr` (1.2.1)

    The exciting news about the release of the new plyr version is the added support for parallel processing.

    l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE, applies functions in parallel using a parallel backend registered with the
    foreach package.

    The new package also has some minor changes and bug fixes, all of which can be read about here.

    In the original announcement, Hadley gave an example of using the new parallel backend with the doMC package on Unix/Linux.  For Windows (the OS I’m using) you should use the doSMP package (as David mentioned in his post earlier today). However, this package is currently only released for “REvolution R” and has not yet been released for R 2.11 (see more about it here).  But thanks to the kind help of Tao Shi, there is a solution for Windows users who want a parallel processing backend for plyr on Windows.

    All you need to do is install the doSMP package, according to the instructions in the post “Parallel Multicore Processing with R (on Windows)“, and then use it like this:


    require(plyr) # make sure you have 1.2 or later installed
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.1)
    system.time(llply(x, wait))
    #    user  system elapsed
    #       0       0       2

    require(doSMP)
    workers <- startWorkers(2) # My computer has 2 cores
    registerDoSMP(workers)
    system.time(llply(x, wait, .parallel = TRUE))
    #    user  system elapsed
    #    0.09    0.00    1.11

    Update (03.09.2012): the above code will no longer work with newer versions of R (R 2.15 etc.)

    Trying to run it will result in the error message:

    Loading required package: doSMP
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
      there is no package called ‘doSMP’
    

    Because trying to install the package will give the error message:

    > install.packages("doSMP")
    Installing package(s) into ‘D:/R/library’
    (as ‘lib’ is unspecified)
    Warning message:
    package ‘doSMP’ is not available (for R version 2.15.0)
    

    You can fix this by replacing the use of the {doSMP} package with the {doParallel}+{foreach} packages. Here is how:

    if(!require(foreach)) install.packages("foreach")
    if(!require(doParallel)) install.packages("doParallel")
    if(!require(plyr)) install.packages("plyr")
    # require(doSMP) # will no longer work...
    library(foreach)
    library(doParallel)
    library(plyr) # needed for llply()
    workers <- makeCluster(2) # My computer has 2 cores
    registerDoParallel(workers) # register the parallel backend with foreach
    
    x <- seq_len(20)
    wait <- function(i) Sys.sleep(0.3)
    system.time(llply(x, wait)) # 6 sec
    system.time(llply(x, wait, .parallel = TRUE)) # 3.53 sec
    
    stopCluster(workers) # release the workers when done