The news of the new release of R 2.13.0 is out, and the R blogosphere is buzzing. Bloggers are posting excitedly about the new R compiler package, which brings with it the hope of speeding up our R code by up to 4 times, and even a JIT compiler for R. So it is time to upgrade, and bloggers are here to help. Some wrote how to upgrade R on Linux and Mac OS X (based on posts by Paolo). And it is now my turn, with suggestions on how to upgrade R on Windows 7.

There are basically two strategies for upgrading R on Windows. The first is to install a new R version and copy all the packages to the new R installation folder. The second is to have a global R package folder, synced each time to the most current R installation (thus saving us the time of copying the package library each time we upgrade R).

P.S.: If this is the first time you are upgrading R using this method, then first run the following two lines on your old R installation (before running the above code in the new R installation):

The above code should be enough. However, there are some common pitfalls you might encounter when upgrading R on Windows 7. Below I outline the ones I know about, and how they can be solved.

In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.

A similar article was later written and was (maybe) published in “computational statistics”.

Both articles give some nice background on known methods like k-means and methods for hierarchical clustering, and then go on to present examples of using these methods (with the clustergram) to analyse some datasets.

Personally, I understand the clustergram to be a type of parallel coordinates plot where each observation is given a vector. The vector contains the observation’s location according to how many clusters the dataset was split into. The scale of the vector is the scale of the first principal component of the data.

Clustergram in R (a basic function)

After finding out about this method of visualization, I was haunted by the curiosity to play with it a bit. Therefore, and since I didn’t find any implementation of the graph in R, I went about writing the code to implement it.

The code only works for kmeans, but it shows how such a plot can be produced, and it could later be modified to offer methods that connect with different clustering algorithms.

How does the function work: The function I present here gets a data.frame/matrix with a row for each observation, and the variable dimensions in the columns. The function assumes the data is scaled. The function then goes about calculating the cluster centers for our data, for a varying number of clusters. For each number of clusters, the cluster centers are multiplied by the first loading of the principal components of the original data. This gives a weighted mean of each cluster center's dimensions, which might give a decent representation of that cluster (this method has the known limitations of using the first component of a PCA for dimensionality reduction, but I won’t go into that in this post). Finally, all of our data points are ordered according to the first-component value of their respective cluster, and plotted against the number of clusters (thus creating the clustergram).
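To make the description above a little more concrete, here is a minimal sketch of just the computation step (no plotting). It is not the clustergram function itself: the helper name cluster_positions is made up for this illustration, and I am assuming prcomp for the PCA.

```r
# Sketch only: the y-position of each observation for one value of k.
# `cluster_positions` is a hypothetical helper name, not part of clustergram().
cluster_positions <- function(Data, k) {
  Data <- as.matrix(Data)
  # First loading of the principal components of the original data
  pc1 <- prcomp(Data)$rotation[, 1]
  cl <- kmeans(Data, centers = k)
  # Project each cluster center onto the first loading: one value per cluster
  center_pos <- as.vector(cl$centers %*% pc1)
  # Every observation gets the position of the cluster it was assigned to
  center_pos[cl$cluster]
}

set.seed(250)
Data <- scale(iris[, -5])
y_by_k <- sapply(2:8, function(k) cluster_positions(Data, k))
colnames(y_by_k) <- 2:8
dim(y_by_k)  # one row per observation, one column per number of clusters
```

Each column of y_by_k holds the values that would go on the Y axis for that number of clusters. Connecting each row across the columns gives the lines of the plot.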

My thanks go to Hadley Wickham for offering some good tips on how to prepare the graph.

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # notice I am scaling the vectors
clustergram(Data, k.range = 2:8, line.width = 0.004) # notice how I am using line.width. Play with it on your problem, according to the scale of Y.

Here is the output:

Looking at the image we can notice a few interesting things. We notice that one of the clusters formed (the lower one) stays as is no matter how many clusters we allow (except for one observation that goes away and then back). We can also see that the second split is a solid one (in the sense that it splits the first cluster into two clusters which are not “close” to each other, and that about half the observations go to each of the new clusters). And then notice how moving to 5 clusters makes almost no difference. Lastly, notice how when going for 8 clusters, we are practically left with 4 clusters (remember that this is according to the mean of the cluster centers weighted by the loading of the first component of the PCA on the data).

If I were to take something from this graph, I would say I have a strong tendency to use 3-4 clusters on this data.

But wait, did our clustering algorithm do a stable job? Let’s try running the algorithm 6 more times (each run will have a different starting point for the clusters)

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
set.seed(500)
Data <- scale(iris[,-5]) # notice I am scaling the vectors
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data, k.range = 2:8, line.width = .004, add.center.points = T)

The result:

Repeating the analysis offers even more insights. First, it would appear that up to 3 clusters, the algorithm gives rather stable results. From 4 onwards we get various outcomes at each iteration. In some of the cases, we got 3 clusters when we asked for 4 or even 5 clusters.

Reviewing the new plots, I would prefer to go with the 3 clusters option, noting how the two “upper” clusters might have similar properties while the lower cluster is quite distinct from the other two.

By the way, the Iris data set is composed of three types of flowers. I imagine kmeans did a decent job in distinguishing the three.
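One simple way to check this impression (a sketch only, with nstart = 20 being my own choice rather than anything used in the plots above) is to cross-tabulate the kmeans labels against the known species:

```r
set.seed(250)
Data <- scale(iris[, -5])
fit <- kmeans(Data, centers = 3, nstart = 20)
# Rows: k-means cluster labels (arbitrary numbering); columns: true species
agreement <- table(cluster = fit$cluster, species = iris$Species)
agreement
```

A cluster that lines up with one species shows up as a row with most of its counts in a single column.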

Limitation of the method (and a possible way to overcome it?!)

It is worth noting that the current way the algorithm is built has a fundamental limitation: The plot is good for detecting a situation where there are several clusters but each of them is clearly “bigger” than the one before it (on the first principal component of the data).

For example, let’s create a dataset with 3 clusters, each one is taken from a normal distribution with a higher mean:

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
set.seed(250)
Data <-rbind(cbind(rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3)),
cbind(rnorm(100,1, sd=0.3),rnorm(100,1, sd=0.3),rnorm(100,1, sd=0.3)),
cbind(rnorm(100,2, sd=0.3),rnorm(100,2, sd=0.3),rnorm(100,2, sd=0.3)))
clustergram(Data, k.range=2:5 , line.width= .004, add.center.points=T)

The resulting plot for this is the following:

The image shows a clear distinction between three ranks of clusters. There is no doubt (for me) from looking at this image, that three clusters would be the correct number of clusters.

But what if the clusters were different but didn’t have an ordering to them? For example, look at the following 4 dimensional data:

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
set.seed(250)
Data <-rbind(cbind(rnorm(100,1, sd=0.3),rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3)),
cbind(rnorm(100,0, sd=0.3),rnorm(100,1, sd=0.3),rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3)),
cbind(rnorm(100,0, sd=0.3),rnorm(100,1, sd=0.3),rnorm(100,1, sd=0.3),rnorm(100,0, sd=0.3)),
cbind(rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3),rnorm(100,0, sd=0.3),rnorm(100,1, sd=0.3)))
clustergram(Data, k.range=2:8 , line.width= .004, add.center.points=T)

In this situation, it is not clear from the location of the clusters on the Y axis that we are dealing with 4 clusters. But what is interesting, is that through the growing number of clusters, we can notice that there are 4 “strands” of data points moving more or less together (until we reached 4 clusters, at which point the clusters started breaking up). Another hope for handling this might be using the color of the lines in some way, but I haven’t yet figured out how.

Clustergram with ggplot2

Hadley Wickham has kindly played with recreating the clustergram using the ggplot2 engine. You can see the result here: http://gist.github.com/439761 And this is what he wrote about it in the comments:

I’ve broken it down into three components:

* run the clustering algorithm and get predictions (many_kmeans and all_hclust)

* produce the data for the clustergram (clustergram)

* plot it (plot.clustergram)

I don’t think I have the logic behind the y-position adjustment quite right though.

Conclusions (some rules of thumb and questions for the future)

At first glance, it would appear that the clustergram can be of use. I can imagine using this graph to quickly run various clustering algorithms and then compare them to each other and review their stability (in the way I just demonstrated in the example above).

The three rules of thumb I have noticed by now are:

Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them with a higher number of clusters (do they re-group together?)

Observe the strands of the datapoints. Even if the cluster centers are not ordered, the lines for each item might (this needs more research and thinking) tend to move together, hinting at the real number of clusters

Run the plot multiple times to observe the stability of the cluster formation (and location)

Yet there is more work to be done and questions to seek answers to:

The code needs to be extended to offer methods to various clustering algorithms.

How can the colors of the lines be used better?

How can this be done using other graphical engines (ggplot2/lattice?) – (Update: look at Hadley’s reply in the comments)

What to do in case the first principal component doesn’t capture enough of the data? (Maybe plot this graph for all the relevant components. But then, how do you draw conclusions from it?)

What other uses/conclusions can be made based on this graph?
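On the question of whether the first principal component captures enough of the data, a quick check before trusting a clustergram could look like this (a minimal sketch, assuming prcomp and the scaled iris data from the examples above):

```r
Data <- scale(iris[, -5])
pca <- prcomp(Data)
# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

If the first value is small, the Y axis of the clustergram reflects only a small part of the structure in the data, and the plot should be read with more caution.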

I am looking forward to reading your input/ideas in the comments (or in reply posts).

After Andrew Gelman recently lamented the lack of an easy upgrade process for R, a Stack Overflow thread (by JD Long) invited R users to share their strategies for easily upgrading R.

Upgrading strategy – moving to a global R library

In that thread, Dirk Eddelbuettel suggested another idea for upgrading R. His idea is to use a folder for R’s packages which is outside the standard directory tree of the installation (a different strategy than the one offered in the R FAQ).

The idea of this upgrading strategy is to save us steps in upgrading. So when you wish to upgrade R, instead of doing the following three steps:

download new R and install

copy the “library” content from the old R to the new R

upgrade all of the packages (in the library folder) to the new version of R.

You could instead just have steps 1 and 3, and skip step 2 (thus, saving us time…).

For example, under Windows XP, you might have R installed on: C:\Program Files\R\R-2.11.0 But (in this alternative model for upgrading) you will have your packages library on a “global library folder” (global in the sense of independent of a specific R version): C:\Program Files\R\library

So in order to use this strategy, you will need to do the following steps (all of them are performed in the R code provided later in the post):

In the OLD R installation (the first time you move to the new system of managing the upgrade):

Create a new global library folder (if it doesn’t exist)

Copy to the new “global library folder” all of your packages from the old R installation

After you move to this system – the steps 1 and 2 would not need to be repeated. (hence the advantage)

In the NEW R installation:

Create a new global library folder (if it doesn’t exist – in case this is your first R installation)

Permanently point to the Global library folder whenever R starts

(Optional) Delete from the “Global library folder” all the packages that already exist in the local library folder of the new R install (no need to have doubles)

Update all packages. (Make sure you pick a mirror where the packages are up-to-date. Sometimes you need to choose another mirror.)

Thanks to help from Dirk, David Winsemius and Uwe Ligges, I was able to write the following R code to perform all the tasks I described
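As a rough illustration of the steps listed above (this is a sketch and not the original script; the helper names ensure_global_library and copy_packages are made up, and the demonstration runs on temporary folders rather than a real R installation):

```r
# Sketch only: helper names and paths are hypothetical.

# Create the global library folder if it does not exist yet
ensure_global_library <- function(global_lib) {
  if (!dir.exists(global_lib)) dir.create(global_lib, recursive = TRUE)
  invisible(global_lib)
}

# Copy every package folder from one library into the global library,
# skipping packages that are already there
copy_packages <- function(from_lib, global_lib) {
  ensure_global_library(global_lib)
  pkgs <- list.dirs(from_lib, full.names = FALSE, recursive = FALSE)
  existing <- list.dirs(global_lib, full.names = FALSE, recursive = FALSE)
  new_pkgs <- setdiff(pkgs, existing)
  for (pkg in new_pkgs) {
    file.copy(file.path(from_lib, pkg), global_lib, recursive = TRUE)
  }
  invisible(new_pkgs)
}

# Demonstration on temporary folders, so no real R installation is touched
old_lib <- file.path(tempdir(), "old_R_library")
global_lib <- file.path(tempdir(), "global_R_library")
dir.create(file.path(old_lib, "fakepkg"), recursive = TRUE)
writeLines("Package: fakepkg", file.path(old_lib, "fakepkg", "DESCRIPTION"))
copied <- copy_packages(old_lib, global_lib)

# On a real system one would then point R at the global folder and update, e.g.:
# .libPaths(c("C:/Program Files/R/library", .libPaths()))
# update.packages(ask = FALSE, checkBuilt = TRUE)
```

Calling .libPaths() this way only lasts for the current session. To make it permanent, the same call would have to be placed in a startup file that R reads on launch.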

When analyzing a questionnaire, one often wants to view the correlation between two or more Likert questionnaire items (for example: two ordered categorical vectors ranging from 1 to 5).

When dealing with several such Likert variables, a clear presentation of all the pairwise relations between our variables can be achieved by inspecting the (Spearman) correlation matrix (easily achieved in R by using the “cor” command with method = "spearman" on a matrix of variables; “cor.test” tests one pair at a time). Yet, a challenge appears once we wish to plot this correlation matrix. The challenge stems from the fact that the classic presentation for a correlation matrix is a scatter plot matrix, but scatter plots don’t (usually) work well for ordered categorical vectors since the dots on the scatter plot often overlap each other.
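For instance, here is a minimal sketch on made-up Likert data (the items q1, q2 and q3 are simulated for this illustration only):

```r
set.seed(1)
# Hypothetical Likert data: three items scored 1 to 5 for 200 respondents
q1 <- sample(1:5, 200, replace = TRUE)
q2 <- pmin(5, pmax(1, q1 + sample(-1:1, 200, replace = TRUE)))  # related to q1
q3 <- sample(1:5, 200, replace = TRUE)                           # unrelated
likert <- data.frame(q1, q2, q3)

# Spearman correlation matrix of all pairs of items
cor_mat <- cor(likert, method = "spearman")
round(cor_mat, 2)

# cor.test works on one pair at a time and adds a p-value
cor.test(likert$q1, likert$q2, method = "spearman", exact = FALSE)
```

The exact = FALSE argument avoids the warning about exact p-values that would otherwise appear, since Likert data is full of ties.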

There are four solutions for the point-overlap problem that I know of:

Jitter the data a bit to give a sense of the “density” of the points

Use a color spectrum to represent when a point actually represents “many points”

Use different point sizes to represent when there are “many points” in the location of that point

Add a LOWESS (or LOESS) line to the scatter plot – to show the trend of the data
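Solutions 3 and 4 from the list above can be sketched in a few lines of base R. This is only an illustration on simulated data, not the code of this post, and the size scaling is an arbitrary choice of mine:

```r
set.seed(1)
# Hypothetical pair of Likert items scored 1 to 5
x <- sample(1:5, 300, replace = TRUE)
y <- pmin(5, pmax(1, x + sample(-1:1, 300, replace = TRUE)))

# Count how many observations fall on each (x, y) combination
counts <- as.data.frame(table(x = x, y = y), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts$x <- as.numeric(counts$x)
counts$y <- as.numeric(counts$y)

# Solution 3: point size grows with the number of overlapping observations
plot(counts$x, counts$y, cex = 4 * sqrt(counts$Freq / max(counts$Freq)),
     pch = 19, xlab = "Item 1", ylab = "Item 2",
     xlim = c(0.5, 5.5), ylim = c(0.5, 5.5))

# Solution 4: add a LOWESS trend line computed on the raw (unaggregated) data
lines(lowess(x, y), col = "red", lwd = 2)
```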

In this post I will offer the code for a solution that uses solutions 3-4 (and possibly 2, please read the comments on this post). Here is the output:

In this post I will provide R code that implements the combination of a repeated running quantile with the LOESS smoother to create a type of “quantile LOESS” (i.e. “Local Quantile Regression”).

This method is useful when the need arises to fit a robust and resistant (this needs to be verified) smoothed line for a quantile (an example of such a case is provided at the end of this post).

If you wish to use the function in your own code, simply run inside your R console the following line:

In recent years, a growing need has arisen in different fields, for the development of computational systems for automated analysis of large amounts of data (high-throughput). Dealing with non-standard noise structure and outliers, that could have been detected and corrected in manual analysis, must now be built into the system with the aid of robust methods. […] we use a non-standard mix of robust and resistant methods: LOWESS and repeated running median.

The motivation for this technique came from “Path data” (of mice) which is

prone to suffer from noise and outliers. During progression a tracking system might lose track of the animal, inserting (occasionally very large) outliers into the data. During lingering, and even more so during arrests, outliers are rare, but the recording noise is large relative to the actual size of the movement. The statistical implications are that the two types of behavior require different degrees of smoothing and resistance. An additional complication is that the two interchange many times throughout a session. As a result, the statistical solution adopted needs not only to smooth the data, but also to recognize, adaptively, when there are arrests. To the best of our knowledge, no single existing smoothing technique has yet been able to fulfill this dual task. We elaborate on the sources of noise, and propose a mix of LOWESS (Cleveland, 1977) and the repeated running median (RRM; Tukey, 1977) to cope with these challenges

If all we wanted to do was to perform a moving average (running average) on the data, using R, we could simply use the rollmean function from the zoo package. But since we also wanted to allow quantile smoothing, we turned to the rollapply function.
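To show the idea in a self-contained way, here is a minimal sketch that uses only base R in place of zoo's rollapply. The function names running_quantile and quantile_loess, and the default window width and span, are assumptions made for this illustration and not the function of this post:

```r
# Sketch only: `running_quantile` and `quantile_loess` are hypothetical names,
# and base R is used here in place of zoo's rollapply.

# Quantile of y within a sliding window of `width` consecutive points
running_quantile <- function(y, width, prob) {
  n_windows <- length(y) - width + 1
  sapply(seq_len(n_windows), function(i) {
    quantile(y[i:(i + width - 1)], probs = prob, names = FALSE)
  })
}

quantile_loess <- function(x, y, width = 25, prob = 0.9, span = 0.5) {
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  # Median x of each window, paired with the chosen quantile of y in that window
  x_mid <- running_quantile(x, width, 0.5)
  y_q <- running_quantile(y, width, prob)
  # Smooth the running quantile with LOESS
  fit <- loess(y_q ~ x_mid, span = span)
  data.frame(x = x_mid, y = predict(fit))
}

set.seed(1)
x <- runif(500, 0, 10)
y <- sin(x) + rnorm(500, sd = 0.5)
smooth90 <- quantile_loess(x, y, prob = 0.9)

plot(x, y, col = "grey")
lines(smooth90$x, smooth90$y, col = "blue", lwd = 2)
```

Replacing the quantile call with mean would give the plain running-average version mentioned above.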

R function for performing Quantile LOESS

Here is the R function that implements the LOESS smoothed repeated running quantile (with a simple option for using the average instead of a quantile):

In this post I showcase a nice bar-plot and a balloon-plot listing recommended nutritional supplements, according to how much evidence exists for their benefits. Scroll down to see it (and click here for the data behind it).

The gorgeous blog “Information Is Beautiful” recently published an eye-candy post showing a “balloon race” image (see a static version of the image here) illustrating how much evidence exists for the benefits of various nutritional supplements (such as green tea, vitamins, herbs, pills and so on). The higher the bubble's Y axis score (e.g. the bubble size) for the supplement, the greater the evidence there is for its effectiveness (but only for the conditions listed alongside the supplement).

There are two reasons this should be of interest to us:

This shows a fun plot, that R currently doesn’t know how to do (at least I wasn’t able to find an implementation for it). So if anyone thinks of an easy way for making one – please let me know.

The data for the graph is openly (and freely) provided to all of us on this Google Doc.

The advantage of having the data on a Google Doc is that we can see when the data is updated. But more than that, it means we can easily extract the data into R and have our way with it (thanks to David Smith’s post on the subject).

For example, I was wondering what are ALL of the top recommended Nutritional supplements, an answer that is not trivial to get from the plot that was in the original post.

In this post I will supply two plots that present the data: A barplot (that in retrospect didn’t prove to be good enough) and a balloon-plot for a table (that seems to me to be much better).

Barplot

The R code to produce the barplot of Nutritional supplements efficacy score (by evidence for its effectiveness on the listed condition).

# loading the data
supplements.data.0 <-read.csv("http://spreadsheets.google.com/pub?key=0Aqe2P9sYhZ2ndFRKaU1FaWVvOEJiV2NwZ0JHck12X1E&output=csv")
supplements.data<- supplements.data.0[supplements.data.0[,2]>2,]# let's only look at "good" supplements
supplements.data<- supplements.data[!is.na(supplements.data[,2]),]# and we don't want any missing data
supplement.score<- supplements.data[, 2]
ss <-order(supplement.score, decreasing =F)# sort our data
supplement.score<- supplement.score[ss]
supplement.name<- supplements.data[ss, 1]
supplement.benefits<- supplements.data[ss, 4]
supplement.score.col <- factor(as.character(supplement.score))
levels(supplement.score.col) <- c("red", "orange", "blue", "dark green")
supplement.score.col <- as.character(supplement.score.col)
# mar: c(bottom, left, top, right) The default is c(5, 4, 4, 2) + 0.1.
par(mar = c(5,9,4,13)) # taking care of the plot margins
bar.y <- barplot(supplement.score, names.arg = supplement.name, las = 1, horiz = T, col = supplement.score.col, xlim = c(0,6.2),
main = c("Nutritional supplements efficacy score","(by evidence for its effectiveness on the listed condition)", "(2010)"))
axis(4, labels = supplement.benefits, at = bar.y, las = 1) # Add right axis
abline(h = bar.y, col = supplement.score.col, lty = 2) # add some lines so to easily follow each bar

Also, the nice thing is that if the guys at Information Is Beautiful update their data, we can easily rerun the code and see the updated list of recommended supplements.

Balloon plot

So after some web surfing I came across an implementation of a balloon plot in R (thanks to the R graph gallery). There were two problems with using the command out of the box. The first one was that the colors were non-informative (easily fixed). The second one was that the X labels were overlapping one another. Since there is no “las” parameter in the function, I just opened the function up, found where this was plotted and changed it manually (a bit messy, but that’s what you have to do sometimes…)

Here is the result:

And here is the R code to produce the balloon plot of nutritional supplements efficacy score (by evidence for its effectiveness on the listed condition). (It’s just a copy of the function with a tiny bit of editing in line 146, and then using it.)

Daniel Malter just shared on the R mailing list (link to the thread) his code for performing the Siegel-Tukey (nonparametric) test for equality in variability. Excited about the find, I contacted Daniel asking if I could republish his code here, and he kindly replied “yes”. From here on I copy his note in full.