The present and future of the R blogosphere (~7 minute video from useR2011)

This is (roughly) the lightning talk I gave at useR2011. If you are a reader of R-bloggers.com, this talk is not likely to tell you anything new. However, if you have a friend, colleague or student who is a new useR, this talk will offer them a decent introduction to what the R blogosphere is all about.

The talk is a call for people of the R community to participate more in reading, writing and interacting with blogs.

I recorded this talk at the request of Chel Hee Lee, so that it may be used in the recent useR conference in Korea (2011).

The talk (briefly) goes through:

  1. The widespread influence of the R blogosphere
  2. What R bloggers write about
  3. How to encourage a blogger you enjoy reading to keep writing
  4. How to start your own R blog (just go to wordpress.com)
  5. Basic tips about writing a blog
  6. One piece of advice about marketing your R blog (add it to R-bloggers.com)
  7. And two thoughts about the future of R blogging (more bloggers and readers, and more interactive online visualization)

My apologies for any of the glitches in my English. For more talks about R, you can visit the R user groups blog. I hope more speakers from useR 2011 will consider uploading their talks online.

Comparison of ave, ddply and data.table

A guest post by Paul Hiemstra.
————

Fortran and C programmers often say that interpreted languages like R are nice and all, but lack in terms of speed. How fast something works in R greatly depends on how it is implemented, i.e. which packages and functions one uses. A prime example, which shows up regularly on the R-help list, is letting a vector grow as you perform an analysis. In pseudo-code this might look like:

dum = NULL
for(i in 1:100000) {
   new_outcome = sqrt(i)   # stand-in for the real computation
   dum = c(dum, new_outcome)
}

The problem here is that dum is continuously growing in size. On every iteration R has to allocate a new, larger block of memory and copy the whole vector into it, which is terribly slow. Preallocating dum to the length it is supposed to have greatly improves performance. Alternatively, using apply-type functions, or functions from the plyr package, prevents these kinds of problems. But even among more advanced methods there are large differences between implementations.
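To make the difference concrete, here is a minimal sketch comparing the growing vector with a preallocated one and with an apply-type call. The function names (grow, prealloc, applied) and the use of sqrt(i) as the "outcome" are assumptions made for illustration only; timings will vary by machine, so none are quoted here.

```r
n <- 1e4  # kept small so that the slow version still finishes quickly

# 1. Growing the vector: every c() call copies everything collected so far
grow <- function(n) {
  dum <- NULL
  for (i in seq_len(n)) dum <- c(dum, sqrt(i))
  dum
}

# 2. Preallocating: the vector is created once and filled in place
prealloc <- function(n) {
  dum <- numeric(n)
  for (i in seq_len(n)) dum[i] <- sqrt(i)
  dum
}

# 3. An apply-type function handles the allocation for us
applied <- function(n) vapply(seq_len(n), function(i) sqrt(i), numeric(1))

system.time(r1 <- grow(n))
system.time(r2 <- prealloc(n))
system.time(r3 <- applied(n))

# All three approaches produce the same result
stopifnot(isTRUE(all.equal(r1, r2)), isTRUE(all.equal(r2, r3)))
```

Increasing n makes the gap between the first version and the other two grow quickly, since the amount of copying grows roughly with the square of n.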

Take the next example. We create a dataset which has two columns, one column with values (e.g. amount of rainfall) and the other with a category (e.g. monitoring station id). We would like to know what the mean value is per category. One way is to use for loops, but I’ll skip that one for now. Three possibilities exist that I know of: ddply (plyr), ave (base R) and data.table. The piece of code at the end of this post compares these three methods. The outcome in terms of speed (elapsed time in seconds) is:

   datsize noClasses  tave tddply tdata.table
1    1e+05        10 0.091  0.035       0.011
2    1e+05        50 0.102  0.050       0.012
3    1e+05       100 0.105  0.065       0.012
4    1e+05       200 0.109  0.101       0.010
5    1e+05       500 0.113  0.248       0.012
6    1e+05      1000 0.123  0.438       0.012
7    1e+05      2500 0.146  0.956       0.013
8    1e+05     10000 0.251  3.525       0.020
9    1e+06        10 0.905  0.393       0.101
10   1e+06        50 1.003  0.473       0.100
11   1e+06       100 1.036  0.579       0.105
12   1e+06       200 1.052  0.826       0.106
13   1e+06       500 1.079  1.508       0.109
14   1e+06      1000 1.092  2.652       0.111
15   1e+06      2500 1.167  6.051       0.117
16   1e+06     10000 1.338 23.224       0.132

It is quite obvious that ddply performs very badly when the number of unique categories is large. The ave function performs better. However, the data.table option is by far the best one, easily outperforming both alternatives. Hadley Wickham (author of plyr) responded to this:

This is a drawback of the way that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), but I’m still thinking about how to overcome this fundamental limitation of the ddply approach.
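To show what the three approaches actually compute, here is a minimal sketch on a small toy dataset (this is not the benchmark itself). It assumes the plyr and data.table packages are installed; the object names (m_ave, m_ddply, m_dt) are made up for illustration, and the ddply call uses the summarise variant mentioned in the quote above.

```r
library(plyr)
library(data.table)

set.seed(1)
expdata <- data.frame(value = runif(1000),
                      cat   = sample.int(10, 1000, replace = TRUE))

# ave: returns one value per row, namely the mean of that row's category
m_ave <- with(expdata, ave(value, cat))

# ddply with summarise: one row per category, sorted by category
m_ddply <- ddply(expdata, .(cat), summarise, value = mean(value))

# data.table: one row per category, in order of first appearance
m_dt <- data.table(expdata)[, list(value = mean(value)), by = cat]

# The three results agree once they are aligned by category
stopifnot(isTRUE(all.equal(m_ddply$value, m_dt[order(cat)]$value)))
stopifnot(isTRUE(all.equal(m_ave, m_ddply$value[match(expdata$cat, m_ddply$cat)])))
```

Note that ave returns a vector as long as the original data, whereas ddply and data.table return one row per category, so the three are not doing exactly the same amount of work.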

I hope this comparison is of use to readers. And remember, think before complaining that R is slow :) .

Paul (e-mail: p.h.hiemstra@gmail.com)

P.S. This blog post is based on discussions on the R-help and manipulatr mailing lists:
http://www.mail-archive.com/r-help@r-project.org/msg142797.html
http://groups.google.com/group/manipulatr/browse_thread/thread/5e8dfed85048df99

R code to perform the comparison

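library(plyr)        # ddply (no longer attached by ggplot2)
library(reshape2)    # melt
library(ggplot2)
library(data.table)
theme_set(theme_bw())
datsize = c(1e5, 1e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 1e4)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
# For each combination, time the three ways of computing the mean per category
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                      cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)

  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  # mean() must be applied to the value column, not to the whole data frame
  t2 = system.time(res2 <- ddply(expdata, .(cat), function(d) mean(d$value)))
  t3 = system.time(res3 <- expdataDT[, mean(value), by = cat])
  # Keep the elapsed (wall-clock) time in seconds
  return(data.frame(tave = t1[["elapsed"]], tddply = t2[["elapsed"]],
                    tdata.table = t3[["elapsed"]]))
}, .progress = 'text')

res

# Plot log(elapsed seconds) against the number of classes, one panel per data size.
# Each `+` has to end a line; a line starting with `+` is parsed as a new expression.
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) +
  geom_line()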

Calling R lovers and bloggers – to work together on “The R Programming wikibook”

This post is a call for both R community members and R-bloggers to come and help make The R Programming wikibook amazing.

The R Programming wikibook is not just another one of the many free books about statistics/R, it is a community project which aims to create a cross-disciplinary practical guide to the R programming language.  Here is how you can join:

Continue reading

Engineering Data Analysis (with R and ggplot2) – a Google Tech Talk given by Hadley Wickham

It appears that just days ago, Google Tech Talk released a new, one-hour-long video of a presentation (from June 6, 2011) given by one of the R community’s more influential contributors, Hadley Wickham.

This seems to be one of the better talks to send a programmer friend who is interested in getting into R.

Talk abstract

Data analysis, the process of converting data into knowledge, insight and understanding, is a critical part of statistics, but there’s surprisingly little research on it. In this talk I’ll introduce some of my recent work, including a model of data analysis. I’m a passionate advocate of the idea that data analysis should be carried out using a programming language, and I’ll justify this by discussing some of the requirements of good data analysis (reproducibility, automation and communication). With these in mind, I’ll introduce you to a powerful set of tools for better understanding data: the statistical programming language R, and the ggplot2 domain specific language (DSL) for visualisation.

The video

More resources

How to upgrade R on Windows 7

Background – time to upgrade to R 2.13.0

The news of the release of R 2.13.0 is out, and the R blogosphere is buzzing. Bloggers are posting excitedly about the new R compiler package, which brings with it the hope of speeding up our R code by up to 4 times, and even about a JIT compiler for R. So it is time to upgrade, and bloggers are here to help. Some wrote about how to upgrade R on Linux and Mac OS X (based on posts by Paolo). Now it is my turn, with suggestions on how to upgrade R on Windows 7.

Upgrading R on Windows – the two strategies

The classic description of how to upgrade R can be found in the R project FAQ page (and also the FAQ on how to install R on Windows).

There are basically two strategies for upgrading R on Windows. The first is to install a new R version and copy-paste all the packages to the new R installation folder. The second is to have a global R package folder, synced each time to the most current R installation (thus saving us the time of copying the package library each time we upgrade R).
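The idea behind the second strategy can be sketched with base R alone. This is only an illustration under stated assumptions, not the script used in this post: the folder name is hypothetical, and a temporary directory stands in for what would in practice be a fixed location outside the R installation folder.

```r
# Hypothetical shared library folder. A temporary directory is used here so
# that the sketch is safe to run; in practice this would be a fixed path.
global_lib <- file.path(tempdir(), "R-global-library")
dir.create(global_lib, recursive = TRUE, showWarnings = FALSE)

# Put the shared folder first on the library search path, so that packages
# are installed into it and loaded from it
.libPaths(c(global_lib, .libPaths()))
.libPaths()

# Packages currently installed in the shared folder (none yet in this sketch)
rownames(installed.packages(lib.loc = global_lib))

# After installing a new R version, packages in the shared folder that were
# built under the old version could be refreshed with (needs internet access):
# update.packages(lib.loc = global_lib, ask = FALSE, checkBuilt = TRUE)
```

A call to .libPaths() only lasts for the current session. To make the setting permanent, the path would have to be set at startup, for example through the R_LIBS_USER environment variable or a startup file.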

I described the second strategy in detail in a post I wrote a year ago titled: “How to upgrade R on windows XP – another strategy” which explains how to upgrade R using the simple two-liner code:

source("http://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
New.R.RunMe()

P.S.: If this is the first time you are upgrading R using this method, then first run the following two lines on your old R installation (before running the above code in the new R installation):

source("http://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
Old.R.RunMe()

The above code should be enough. However, there are some common pitfalls you might encounter when upgrading R on Windows 7. Below I outline the ones I know about, and how they can be solved.

Continue reading

Article about plyr published in JSS, and the citation was added to the new plyr (version 1.5)

The plyr package (by Hadley Wickham) is one of the few R packages I can claim to have used in all of my statistical projects. So whenever a new version of plyr comes out I tend to be excited about it (as I was when version 1.2 came out with support for parallel processing).

So it is no surprise that the new release of plyr 1.5 got me curious. While going through the news file with the new features and bug fixes, I noticed that Hadley had also (quietly) released another version of plyr 6 days ago, prior to 1.5, numbered 1.4.1. That version included only one addition, but a very important one: a citation reference to use when citing the plyr package. Here is how to use it:

install.packages("plyr") # so to upgrade to the latest release
citation("plyr")

The output gives both a simple text version and a BibTeX entry for LaTeX users. Here it is (note the URL, where you can download the article and read it yourself):

To cite plyr in publications use:
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data
Analysis. Journal of Statistical Software, 40(1), 1-29. URL
http://www.jstatsoft.org/v40/i01/.

I hope more R contributors and users will make use of the citation() function in the future.
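The same function works without any package name, in which case it returns the citation for R itself. A minimal sketch using only base R (the object name r_cite is just for illustration):

```r
# Citation for R itself; pass a package name to get a package's citation
r_cite <- citation()
print(r_cite)

# Extract only the BibTeX entry, ready to paste into a .bib file
toBibtex(r_cite)
```

Calling toBibtex(citation("plyr")) would give the corresponding BibTeX entry for plyr, provided the package is installed.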

Beeswarm Boxplot (and plotting it with R)

(The image above is called a “Beeswarm Boxplot”; the code for producing this image is provided at the end of this post)

The above plot is implemented under different names in different software packages. This “Scatter Dot Beeswarm Box Violin – plot” (for lack of an agreed-upon term) is a one-dimensional scatter plot which is like a “stripchart”, but with closely packed, non-overlapping points; the positions of the points correspond to the frequency, in a similar way to the violin plot. The plot can be superimposed on a boxplot to give a very rich description of the underlying distribution.
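As a rough illustration of the idea (this is not the code used to produce the image in this post), here is a minimal sketch that assumes the beeswarm package from CRAN is installed. It uses the built-in ToothGrowth dataset as a stand-in and overlays the points on a standard boxplot.

```r
library(beeswarm)

# Draw the boxplot first, without outlier symbols, since the swarm shows
# every observation anyway
boxplot(len ~ dose, data = ToothGrowth,
        outline = FALSE, xlab = "dose", ylab = "len")

# Add the closely packed, non-overlapping points on top of the boxes
bs <- beeswarm(len ~ dose, data = ToothGrowth,
               pch = 16, col = "grey30", add = TRUE)
```

Drawing the boxplot first and adding the swarm afterwards keeps the points visible regardless of how the boxes are filled.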

This plot has been implemented in various statistical packages, and in this post I will list the few I have come by so far. If you know of an implementation I’ve missed, please tell me about it in the comments.

Continue reading

Book review: 25 Recipes for Getting Started with R

Recently I was asked by O’Reilly publishing to review Paul Teetor’s new introductory book on R.  After giving the book some attention and appreciating its delivery of the material, I was happy to write and post this review.  Also, I’m very happy to see that a major publishing house like O’Reilly is producing more and more R books; great news indeed.

And now for the book review:

Executive summary: a book that offers a well designed gentle introduction for people with some background in statistics wishing to learn how to get common (basic) tasks done with R.

Information

By: Paul Teetor
Publisher: O’Reilly Media
Released: January 2011
Pages: 58 (est.)

Format

The book “25 Recipes for Getting Started with R” offers an interesting take on how to bring R to the general (statistically oriented) public.

Continue reading
