The difference between "letters[c(1,NA)]" and "letters[c(NA,NA)]"

In David Smith’s latest blog post (which, in a sense, is a continued response to the latest public attack on R), there was a comment by Barry that caught my eye. Barry wrote:

Even I get caught out on R quirks after 20 years of using it. Compare letters[c(12,NA)] and letters[c(NA,NA)] for the most recent thing that made me bang my head against the wall.

So I did, and here’s the output:

> letters[c(12,NA)]
[1] "l" NA
>  letters[c(NA,NA)]

Interesting isn’t it?
I had no clue why this had happened but luckily for us, Barry gave a follow-up reply with an explanation. And here is what he wrote:
Continue reading The difference between "letters[c(1,NA)]" and "letters[c(NA,NA)]"

Parallel Multicore Processing with R (on Windows)

This post offers simple example and installation tips for “doSMP” the new Parallel Processing backend package for R under windows.
* * *

The required packages are not yet now available on CRAN, but until they will get online, you can download them from here:
REvolution foreach windows bundle
(Simply unzip the folders inside your R library folder)

* * *

Recently, REvolution blog announced the release of “doSMP”, an R package which offers support for symmetric multicore processing (SMP) on Windows.
This means you can now speed up loops in R code running iterations in parallel on a multi-core or multi-processor machine, thus offering windows users what was until recently available for only Linux/Mac users through the doMC package.


For now, doSMP is not available on CRAN, so in order to get it you will need to download the REvolution R distribution “R Community 3.2” (they will ask you to supply your e-mail, but I trust REvolution won’t do anything too bad with it…)
If you already have R installed, and want to keep using it (and not the REvolution distribution, as was the case with me), you can navigate to the library folder inside the REvolution distribution it, and copy all the folders (package folders) from there to the library folder in your own R installation.

If you are using R 2.11.0, you will also need to download (and install) the revoIPC package from here:
revoIPC package – download link (required for running doSMP on windows)
(Thanks to Tao Shi for making this available!)


Once you got the folders in place, you can then load the packages and do something like this:

workers <- startWorkers(2) # My computer has 2 cores
# create a function to run in each itteration of the loop
check <-function(n) {
	for(i in 1:1000)
		sme <- matrix(rnorm(100), 10,10)
times <- 10	# times to run the loop
# comparing the running time for each loop
system.time(x <- foreach(j=1:times ) %dopar% check(j))  #  2.56 seconds  (notice that the first run would be slower, because of R's lazy loading)
system.time(for(j in 1:times ) x <- check(j))  #  4.82 seconds
# stop workers

Points to notice:

  • You will only benefit from the parallelism if the body of the loop is performing time-consuming operations. Otherwise, R serial loops will be faster
  • Notice that on the first run, the foreach loop could be slow because of R’s lazy loading of functions.
  • I am using startWorkers(2) because my computer has two cores, if your computer has more (for example 4) use more.
  • Lastly – if you want more examples on usage, look at the “ParallelR Lite User’s Guide”, included with REvolution R Community 3.2 installation in the “doc” folder


(15.5.10) :
The new R version (2.11.0) doesn’t work with doSMP, and will return you with the following error:

Loading required package: revoIPC
Error: package ‘revoIPC’ was built for i386-pc-intel32

So far, a solution is not found, except using REvolution R distribution, or using R 2.10
A thread on the subject was started recently to report the problem. Updates will be given in case someone would come up with better solutions.

Thanks to Tao Shi, there is now a solution to the problem. You’ll need to download the revoIPC package from here:
revoIPC package – download link (required for running doSMP on windows)
Install the package on your R distribution, and follow all of the other steps detailed earlier in this post. It will now work fine on R 2.11.0

Update 2: Notice that I added, in the beginning of the post, a download link to all the packages required for running parallel foreach with R 2.11.0 on windows. (That is until they will be uploaded to CRAN)

Update 3 (04.03.2011): doSMP is now officially on CRAN!

An article attacking R gets responses from the R blogosphere – some reflections

In this post I reflect on the current state of the R blogosphere, and share my hopes for it’s future.

* * *


I am very grateful to Dr. AnnMaria De Mars for writing her post “The Next Big Thing”.
In her post, Dr. De Mars attacked R by accusing it of being “an epic fail” (in being user-friendly) and “NOT the next big thing”. Of course one should look at Dr. De Mars claims in their context. She is talking about particular aspects in which R fails (the lacking of a mature GUI for non-statisticians), and had her own (very legitimate) take on where to look for “the next big thing”. All in all, her post was decent, and worth contemplating upon respectfully (even if one, me for example, doesn’t agree with all of Dr. De Mars claims.)

R bloggers are becoming a community

But Dr. De Mars post is (very) important for a different reason. Not because her claims are true or false, but because her writing angered people who love and care for R (whether legitimately or not, it doesn’t matter). Anger, being a very powerful emotion, can reveal interesting things. In our case, it just showed that R bloggers are connected to each other.

So far there are 69 R bloggers who wrote in reply to Dr. De Mars post (some more kind then others), they are:

  • R and the Next Big Thing by David Smith
  • This is good news, since it shows that R has a community of people (not “just people”) who write about it.
    In one of the posts, someone commented about how R current stage reminds him of how linux was in 1998, and how he believes R will grow to be amazingly dominant in the next 10 years.
    In the same way, I feel the R blogosphere is just now starting to “wake up” and become aware that it exists. Already 6 bloggers found they can write not just about R code, but also reply to does who “attack” R (in their view). Imagine how the R blogosphere might look in a few years from now…

    I would like to end with a more general note about the importance of R bloggers collaboration to the R ecosystem.

    Continue reading An article attacking R gets responses from the R blogosphere – some reflections

    "The next big thing", R, and Statistics in the cloud

    A friend just e-mailed me about a blog post by Dr. AnnMaria De Mars titled “The Next Big Thing”.

    In it Dr. De Mars wrote (I allowed myself to emphasize some parts of the text):

    Contrary to what some people seem to think, R is definitely not the next big thing, either. I am always surprised when people ask me why I think that, because to my mind it is obvious. […]
    for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.

    Here are my two cents on the subject:
    Continue reading "The next big thing", R, and Statistics in the cloud

    Repeated measures ANOVA with R (functions and tutorials)

    Repeated measures ANOVA is a common task for the data analyst.

    There are (at least) two ways of performing “repeated measures ANOVA” using R but none is really trivial, and each way has it’s own complication/pitfalls (explanation/solution to which I was usually able to find through searching in the R-help mailing list).

    So for future reference, I am starting this page to document links I find to tutorials, explanations (and troubleshooting) of “repeated measure ANOVA” done with R

    Functions and packages

    (I suggest using the tutorials supplied bellow for how to use these functions)

    • aov {stats} – offers SS type I repeated measures anova, by a call to lm for each stratum. A short example is given in the ?aov help file
    • Anova {car} – Calculates type-II or type-III analysis-of-variance tables for model objects produced by lm, and for various other object. The ?Anova help file offers an example for how to use this for repeated measures
    • ezANOVA {ez} – This function provides easy analysis of data from factorial experiments, including purely within-Ss designs (a.k.a. “repeated measures”), purely between-Ss designs, and mixed within-and-between-Ss designs, yielding ANOVA results and assumption checks. It is a wrapper of the Anova {car} function, and is easier to use. The ez package also offers the functions ezPlot and ezStats to give plot and statistics of the ANOVA analysis. The ?ezANOVA help file gives a good demonstration for the functions use (My thanks goes to Matthew Finkbe for letting me know about this cool package)
    • friedman.test {stats} – Performs a Friedman rank sum test with unreplicated blocked data. That is, a non-parametric one-way repeated measures anova. I also wrote a wrapper function to perform and plot a post-hoc analysis on the friedman test results
    • Non parametric multi way repeated measures anova – I believe such a function could be developed based on the Proportional Odds Model, maybe using the {repolr} or the {ordinal} packages. But I still didn’t come across any function that implements these models (if you do – please let me know in the comments).
    • Repeated measures, non-parametric, multivariate analysis of variance – as far as I know, such a method is not currently available in R.  There is, however, the Analysis of similarities (ANOSIM) analysis which provides a way to test statistically whether there is a significantdifference between two or more groups of sampling units.  Is is available in the {vegan} package through the “anosim” function.  There is also a tutorial and a relevant published paper.

    Good Tutorials


    Unbalanced design
    Unbalanced design doesn’t work when doing repeated measures ANOVA with aov, it just doesn’t. This situation occurs if there are missing values in the data or that the data is not from a fully balanced design. The way this will show up in your output is that you will see the between subject section showing withing subject variables.

    A solution for this might be to use the Anova function from library car with parameter type=”III”. But before doing that, first make sure you understand the difference between SS type I, II and III. Here is a good tutorial for helping you out with that.
    By the way, these links are also useful in case you want to do a simple two way ANOVA for unbalanced design

    I will “later” add R-help mailing list discussions that I found helpful on the subject.

    If you come across good resources, please let me know about them in the comments.

    Jeroen Ooms's ggplot2 web interface – a new version released (V0.2)

    Good news.

    Jeroen Ooms released a new version of his (amazing) online ggplot2 web interface: is a web interface for Hadley Wickham’s R package ggplot2. It is used as a tool for rapid prototyping, exploratory graphical analysis and education of statistics and R. The interface is written completely in javascript, therefore there is no need to install anything on the client side: a standard browser will do.

    The new version has a lot of cool new features, like advanced data import, integration with Google docs, converting variables from numeric to factor to dates and vice versa, and a lot of new geom’s. Some of which you can watch in his new video demo of the application:

    The application is on:

    p.s: other posts about this (including videos explaining how some of this was done) can be views on the category page: R and the web

    Correlation scatter-plot matrix for ordered-categorical data

    When analyzing a questionnaire, one often wants to view the correlation between two or more Likert questionnaire item’s (for example: two ordered categorical vectors ranging from 1 to 5).

    When dealing with several such Likert variable’s, a clear presentation of all the pairwise relation’s between our variable can be achieved by inspecting the (Spearman) correlation matrix (easily achieved in R by using the “cor.test” command on a matrix of variables).
    Yet, a challenge appears once we wish to plot this correlation matrix. The challenge stems from the fact that the classic presentation for a correlation matrix is a scatter plot matrix – but scatter plots don’t (usually) work well for ordered categorical vectors since the dots on the scatter plot often overlap each other.

    There are four solution for the point-overlap problem that I know of:

    1. Jitter the data a bit to give a sense of the “density” of the points
    2. Use a color spectrum to represent when a point actually represent “many points”
    3. Use different points sizes to represent when there are “many points” in the location of that point
    4. Add a LOWESS (or LOESS) line to the scatter plot – to show the trend of the data

    In this post I will offer the code for the  a solution that uses solution 3-4 (and possibly 2, please read this post comments). Here is the output (click to see a larger image):

    And here is the code to produce this plot:

    Continue reading Correlation scatter-plot matrix for ordered-categorical data