A new Q&A website for Data-Analysis (based on StackOverFlow engine) – is waiting for you

The bottom line of this post is for you to go to:
Stack Exchange Q&A site proposal: Statistical Analysis
And commit yourself to using the website for asking and answering questions.
144 people have already committed to using the website; we need 356 more… 🙂
If you are looking for the reasons to do so – read on…

What is the StackOverFlow Q&A website about?

StackOverFlow.com (“SO” for short) is a programming Q&A site that’s free. Free to ask questions, free to answer questions, free to read. Free. And fast.

For the R community, SO offers a growing database of R-related questions and answers (click the link to check them out).

You might be asking yourself what’s so special about SO compared with other available resources such as the R mailing lists, R blogs, the R wiki and so on.
That is a great question.

The answer is that SO succeeds in doing a great job synthesizing aspects of Wikis, Blogs, Forums, and Digg/Reddit to offer a very powerful Q&A website.

In SO, new questions are like forum/blog posts (a main text with comments/answers). After someone answers a question, other users can give the answer a thumbs-up or a thumbs-down (like Digg/Reddit). And all content can be edited, like a wiki page, by the users (provided the user has enough “karma points”).
You also get badges (“awards”) for a bunch of actions (like coming to the website every day for a month, giving an answer that got X thumbs-ups, and so on). The awards allow someone asking a question to see how good a reputation the person answering has (in terms of acceptance/appreciation of their answers by other SO members).
They also offer a small (but effective) ego-boost for the person who gives answers.

So if StackOverFlow is so great – what is this new website you wrote about in the title?

Well, StackOverFlow has one limitation. It deals ONLY with programming questions. Other questions like:

  • Which of the following three graphics best displays this data set? Why?
  • Can you give an example of where I might prefer to use a z-test vs a t-test?
  • What is the relationship between Bayesian and neural networks?

will not be answered, and the threads will get closed as being “off topic”. Why? Because such questions deal with statistics, data analysis, data mining and data visualization, but by no means with programming.

So there is no StackOverFlow-like Q&A website for data analysis… Until now!

In the past few weeks, Rob Hyndman and other users have put much effort into pushing for the creation of a new website, based on the StackOverFlow engine, to allow for statistics-related Q&A.
His proposal for the new website is almost complete. All it needs is for you (yes, you) to go to the following link:
Stack Exchange Q&A site proposal: Statistical Analysis
And commit yourself to the website (that is, click the button called “commit” to declare that you are interested in reading, asking and answering questions on such a website).

Once 379 more people commit – the website will go online!

Hope to see you there.

Clustergram: visualization and diagnostics for cluster analysis (R code)

About Clustergrams

In 2002, Matthias Schonlau published an article in “The Stata Journal” named “The Clustergram: A graph for visualizing hierarchical and non-hierarchical cluster analyses”. As explained in the abstract:

In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases.
This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.

A similar article was later written and was (maybe) published in “Computational Statistics”.

Both articles give some nice background on known methods like k-means and methods for hierarchical clustering, and then go on to present examples of using these methods (with the clustergram) to analyse some datasets.

Personally, I understand the clustergram to be a type of parallel coordinates plot where each observation is given a vector. The vector contains the observation’s location according to how many clusters the dataset was split into. The scale of the vector is the scale of the first principal component of the data.

Clustergram in R (a basic function)

After finding out about this method of visualization, I was haunted by the curiosity to play with it a bit. Therefore, and since I didn’t find any implementation of the graph in R, I went about writing the code to implement it.

The code only works for kmeans, but it shows how such a plot can be produced, and it could later be modified to offer methods that connect with different clustering algorithms.

How does the function work? The function I present here gets a data.frame/matrix with a row for each observation and a column for each variable (dimension).
The function assumes the data is scaled.
The function then calculates the cluster centers for our data, for a varying number of clusters.
For each number of clusters, the cluster centers are multiplied by the loadings of the first principal component of the original data. This gives a weighted mean of each cluster center’s dimensions that might give a decent representation of that cluster (this method has the known limitations of using the first component of a PCA for dimensionality reduction, but I won’t go into that in this post).
Finally, all of our data points are ordered according to their respective cluster’s first-component value, and plotted against the number of clusters (thus creating the clustergram).
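The steps described above can be sketched in a few lines of base R. This is a minimal illustration, not the clustergram() function itself: the names k_range, pc1_loading and y_by_k are made up for this sketch, and it skips the small per-line offsets (controlled by line.width in the real function) that keep overlapping lines visible.

```r
# Sketch only: compute the clustergram's Y values for a range of k (assumes scaled data)
Data <- scale(iris[, -5])
k_range <- 2:5                      # hypothetical range, chosen for illustration

# Loadings of the first principal component of the original data
pc1_loading <- prcomp(Data)$rotation[, 1]

set.seed(1)                         # k-means starts from random centers
y_by_k <- sapply(k_range, function(k) {
  km <- kmeans(Data, centers = k)
  # Weight each cluster center's coordinates by the first-component loadings
  center_score <- as.vector(km$centers %*% pc1_loading)
  # Every observation gets the score of the cluster it was assigned to
  center_score[km$cluster]
})
colnames(y_by_k) <- paste0("k=", k_range)

# One line per observation: x = number of clusters, y = its cluster's score
matplot(k_range, t(y_by_k), type = "l", lty = 1, col = "grey40",
        xlab = "Number of clusters (k)",
        ylab = "Cluster center weighted by PC1 loadings")
```

Each row of y_by_k traces one observation across the different numbers of clusters, which is the “vector per observation” idea from the parallel coordinates description.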

My thanks go to Hadley Wickham for offering some good tips on how to prepare the graph.

Here is the code (example follows)

The R function can be downloaded from here
Corrections and remarks can be added in the comments below, or on the github code page.

Example on the iris dataset

The iris data set is a favorite example of many R bloggers when writing about R accessors, data exporting, data importing, and different visualization techniques.
So it seemed only natural to experiment on it here.

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # notice I am scaling the vectors
clustergram(Data, k.range = 2:8, line.width = 0.004) # notice how I am using line.width.  Play with it on your problem, according to the scale of Y.

Here is the output:

Looking at the image we can notice a few interesting things. We notice that one of the clusters formed (the lower one) stays as is no matter how many clusters we allow (except for one observation that goes away and then back).
We can also see that the second split is a solid one (in the sense that it splits the first cluster into two clusters which are not “close” to each other, and that about half the observations go to each of the new clusters).
And then notice how moving to 5 clusters makes almost no difference.
Lastly, notice how when going to 8 clusters, we are practically left with 4 clusters (remember – this is according to the mean of the cluster centers weighted by the loadings of the first component of the PCA on the data).

If I were to take something from this graph, I would say I have a strong tendency to use 3-4 clusters on this data.

But wait, did our clustering algorithm do a stable job?
Let’s try running the algorithm 6 more times (each run will have a different starting point for the clusters)

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
Data <- scale(iris[,-5]) # notice I am scaling the vectors
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2)) # a 3x2 grid, one panel per run
for(i in 1:6) clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = TRUE)

This results in:

Repeating the analysis offers even more insights.
First, it would appear that up to 3 clusters, the algorithm gives rather stable results.
From 4 onwards we get various outcomes at each iteration.
In some of the cases, we got 3 clusters when we asked for 4 or even 5 clusters.

Reviewing the new plots, I would prefer to go with the 3 clusters option, noting how the two “upper” clusters might have similar properties while the lower cluster is quite distinct from the other two.

By the way, the iris data set is composed of three types of flowers. I imagine kmeans has done a decent job of distinguishing the three.
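One way to check that impression, rather than imagine it, is to cross-tabulate the k-means assignment against the known species. The sketch below is an illustration added here, not part of the clustergram code; the seed and the nstart value are arbitrary choices.

```r
# Sketch: compare a 3-cluster k-means solution with the known iris species
set.seed(42)                                  # arbitrary seed, for reproducibility
Data <- scale(iris[, -5])
km <- kmeans(Data, centers = 3, nstart = 25)  # nstart = 25 is an arbitrary choice

# Rows are k-means clusters, columns are the true species
tab <- table(cluster = km$cluster, species = iris$Species)
print(tab)
```

A row that is dominated by a single species means that cluster matches that species well; rows that mix two species show where k-means struggles to tell them apart.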

Limitation of the method (and a possible way to overcome it?!)

It is worth noting that the current way the algorithm is built has a fundamental limitation: the plot is good for detecting a situation where there are several clusters but each of them is clearly “bigger” than the one before it (on the first principal component of the data).

For example, let’s create a dataset with 3 clusters, each one is taken from a normal distribution with a higher mean:

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
Data <- rbind(
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),
				cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3))
				) # closing the rbind() call
clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = TRUE)

The resulting plot for this is the following:

The image shows a clear distinction between three ranks of clusters. There is no doubt (for me) from looking at this image, that three clusters would be the correct number of clusters.

But what if the clusters were different but didn’t have an ordering to them?
For example, look at the following 4 dimensional data:

source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # Making sure we can source code from github
Data <- rbind(
				cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3))
				) # closing the rbind() call
clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = TRUE)

In this situation, it is not clear from the location of the clusters on the Y axis that we are dealing with 4 clusters.
What is interesting, though, is that as the number of clusters grows, we can notice that there are 4 “strands” of data points moving more or less together (until we reach 4 clusters, at which point the clusters start breaking up).
Another hope for handling this might be using the color of the lines in some way, but I haven’t yet figured out how.

Clustergram with ggplot2

Hadley Wickham has kindly played with recreating the clustergram using the ggplot2 engine. You can see the result here:
And this is what he wrote about it in the comments:

I’ve broken it down into three components:
* run the clustering algorithm and get predictions (many_kmeans and all_hclust)
* produce the data for the clustergram (clustergram)
* plot it (plot.clustergram)
I don’t think I have the logic behind the y-position adjustment quite right though.

Conclusions (some rules of thumb and questions for the future)

At first glance, it would appear that the clustergram can be of use. I can imagine using this graph to quickly run various clustering algorithms and then compare them to each other and review their stability (in the way I demonstrated in the example above).

The three rules of thumb I have noticed so far are:

  1. Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them with a higher number of clusters (do they re-group together?)
  2. Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together, hinting at the real number of clusters
  3. Run the plot multiple times to observe the stability of the cluster formation (and location)

Yet there is more work to be done and questions to seek answers to:

  • The code needs to be extended to offer methods for various clustering algorithms.
  • How can the colors of the lines be used better?
  • How can this be done using other graphical engines (ggplot2/lattice?) – (Update: look at Hadley’s reply in the comments)
  • What to do in case the first principal component doesn’t capture enough of the data? (maybe plot this graph for all the relevant components, but then, how do you draw conclusions from it?)
  • What other uses/conclusions can be made based on this graph?
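On the question above about the first principal component: before trusting a clustergram, it may help to check how much of the variance that component actually captures. A minimal sketch with base R’s prcomp(), using the scaled iris data from the earlier example (the variable names are just for illustration):

```r
# Sketch: how much of the total variance does each principal component capture?
Data <- scale(iris[, -5])
pca <- prcomp(Data)

# Variance of each component divided by the total variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Share captured by the first component alone
pc1_share <- var_explained[1]
```

If pc1_share is low, the Y axis of the clustergram summarizes only a small part of the data, and cluster centers that look close on the plot may in fact be far apart in the other dimensions.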

I am looking forward to reading your input/ideas in the comments (or in reply posts).

June 20, online Registration deadline for useR! 2010

useR! 2010 is coming. I am going to give two talks there (I will write more about that soon), but in the meantime, please note that the online registration deadline is approaching.

This was published on the R-help mailing list today:


The final registration deadline for the R User Conference is June 20,
2010, one week away.  Later registration will not be possible on site!

Conference webpage:  http://www.R-project.org/useR-2010
Conference program: http://www.R-project.org/useR-2010/program.html


The conference is scheduled for July 21-23, 2010, and will take place at
the campus of the National Institute of Standards and Technology (NIST) in
Gaithersburg, Maryland, USA.


Could we run a statistical analysis on iPhone/iPad using R?

Updates (17.07.10 + 13.09.10 + 03.05.11)

03.05.2011: “Satisfaction blog” wrote about the idea to use iPhone with RStudio – great job Julyan!

I have just come across David Smith’s post on the REvolution blog, pointing to instructions on the R wiki for how to install R on the iPhone!
I didn’t try it myself, since it requires jailbreaking the iPhone and I don’t have an iPhone. But it is still interesting to know of.

The blog “Computational Mathematics” recently published a post about a package on Cydia to ease R installation on iPhone, you can read it here: R on the iPhone.

Preface – I don’t use Mac

I don’t use Mac! Not that there is anything wrong with that, but I don’t use Mac…

Yet at the same time, wonderful people like my wife, my brother, my thesis advisor and even my mother-in-law all use Mac. So one can’t help but wonder if I might be missing out on something.

Still, for a Windows user like me it is a bit difficult to understand the hype around the iPhone 4 release:

Such releases tend to look to me more like this spoof video about the release of the Apple “i”.

So while not using Apple’s products, I have a deep respect for the impact they have made on people’s lives. Which raises the question: could you use R on an iPhone (or an iPad)?

Can R be run on iPhone/iPad ?

This question (and the motivation for this post) was raised in an R help mailing list thread a week ago.

After receiving permission from the thread’s author, I am republishing the content that was presented there in the hope it might be of interest to other R community members.

And here is what “Marc Schwartz” wrote:
Continue reading “Could we run a statistical analysis on iPhone/iPad using R?”

Syncing files across computers using DropBox


In the past few months I have been using DropBox for syncing my work files between my home and work computer. It has saved me from numerous mistakes and from sending the files to myself via e-mail.

Recently I found this service highly useful for sharing files with 4 other people with whom I am working on a data analysis project. Being so happy with it (and also hoping to gain more storage space by inviting friends to use it), I thought of sharing my experience here with other R users who might benefit from this cool (free) service.

What is Dropbox?

Dropbox is a software/Web2.0 file hosting service which enables users to synchronize files and folders between computers across the internet.
This is done by installing the software and then picking a “shared folder” on your computer. From that moment on, that folder will be synced with any computer you choose to install the software on (for example, your home/work computer, your laptop, and so on).

DropBox also enables users to share some of their folders with other DropBox users. This seamless integration of the service with your OS file system (Windows, Mac or Linux) is what makes this service so comfortable, by allowing me to work with co-workers and have the same “project tree” of folders, all of which are always synced.

You could also share a file “online” by getting a link to it which you could share with others. So, for example, you could write some R code, share it online, and call it later with source(). This is the easiest way I know of to do this.
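As a small illustration of the source() idea, the sketch below writes a tiny script to a temporary file and then sources it. The file name, the function double_it and the URL in the comment are all hypothetical; with a real shared link you would pass that link to source() instead of a local path (whether https links work directly depends on your R version).

```r
# Sketch: put a helper function in a script file, then load it with source()
script_path <- file.path(tempdir(), "my_helpers.R")   # hypothetical file name
writeLines("double_it <- function(x) 2 * x", script_path)

# source() runs the script, so double_it() becomes available in this session
source(script_path)
result <- double_it(21)

# With a shared link, the call has the same shape (hypothetical URL):
# source("https://example.com/my_helpers.R")
```

Anyone who has the link can run the same line and get the same functions, which is convenient when several people work on one analysis.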

As a “cloud computing” service, Dropbox offers both free and paid plans. The free version (which I use) offers 2GB of “shared storage” (unless you invite other users, in which case you get some extra storage space, which is one of my motivations for writing this post).

Dropbox has other non-trivial uses allowing one to:

The service’s major competitors are Box.net, Sugarsync and Mozy, none of which I have had the chance to try.

How to start?

Simply go to: DropBox.com
Sign up, install the software, use the new shared folder, and let me know if it helped you 🙂

How to get Extra space?

You can:

  • Earn another 750MB of space by connecting your Dropbox to your Twitter/Facebook account and sending a status update about Dropbox. To get this bonus, head over to the “Get extra space free!” page.
  • Refer a friend to open a Dropbox account (every friend joining earns you another 250MB of space). This bonus is capped at a total of 8GB of added space (after that, you won’t be given any more extra space).
  • Upgrade: pay $10 a month and get an extra 50GB.

Helping the blind use R – by exporting R console to Word

Update (2016-01-30): This post is quite old (from 2010). These days it should be easier to make your R output readable by using the knitr package. It allows you to take an R script file and create an HTML output from it using the stitch_rhtml function.

You should also read the article by Jonathan R. Godfrey: Statistical Software from a Blind Person’s Perspective. And have a look at his BrailleR R package.

Preface – R seems a natural fit for the blind statistician

For blind people who wish to do statistics, R can be ideal. R’s command-line interface offers straightforward statistical scripting in the form of a question (what is the mean of x) followed by an answer (0.2). This is in contrast to the point-and-click dialog boxes and jumping result windows that GUI statistical systems offer.

But there are still more hurdles to face before R can offer a perfect solution to the blind.
In this post I would like to address just one such problem – reading R console output.

Directing R console output to Word – to allow blind people to easily navigate in it

Recently, a question was posed on the R-help mailing list by a guy named Faiz, a blind new user of R. Faiz wants to direct R output into Word so that he can read it. Here is what he wrote:

I would like to read the results of the commands type in the terminal window in Microsoft Word. As a blind user my options are somewhat limited and are time consuming if I want to see the results of the commands that I have type earlier. for example if my first two commands were
and I have typed ten more commands after the first two commands it is not easy for me to see that what was the result of mean(x)
but if I can somehow divert the results of the commands to Microsoft Word it is comparatively easy for me to see what was the result of mean(x) and what were the results of other commands. One another advantage of diverting R’s output to Microsoft Word for me is that from there they can be easily copied into assignments as well.

Faiz later elaborated more on his issue:

I am using Windows XP, and using a screen reader called JAWS. When I type something at the console, I hear once what I have typed, and then the focus is on the next line. Then if I press the up arrow key I get to hear the function I just typed, not its output. For example if I type mean(x) and then I press enter I will hear “[5]” if it is the mean of x. Then I will hear “>”. Now if I want to find out what was the mean of x by pressing the up arrow key, I will only hear mean(x) and I will not hear [5].
My screen reader does provide options to use different cursors to read command lines.
but if I have typed median(x) sd(x) var(x) length(x) after typing mean(x), it takes a long time before I can move my cursor to the location where I can hear the mean of x. If the results of the commands can be diverted to MS Word it becomes comparatively easy for me to quickly move forward and backward in the document.

Any ideas and suggestions are appreciated.

Since I recently reviewed how one could export R output to MS-Word with R2wd, it was only fitting to try and apply R2wd to this problem.
I went looking for how to direct the R console into a txt file, so I could later dump it into Word. I found two commands that each gave me half of what I wanted. sink() allows me to direct R output to a txt file, and savehistory() can save the command history into a txt file. But I needed something that combines the two and captures all of the R console (commands and output) into a file.
Failing to locate one, I turned to the R mailing list. Among the kind people trying to help (thank you David Winsemius, Bert Gunter and Duncan Murdoch), Greg Snow came through in supplying the help (not surprisingly…).
Greg directed me to a function he wrote called txtStart() (from the TeachingDemos package), which operates in a similar way to sink(), only it also captures the R commands that were used – exactly what I was looking for!
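To make the gap concrete, here is a minimal sketch of what sink() on its own gives you. It uses only base R; the file is a temporary one and the variable names are made up for the example. The captured file holds the output of the command but not the command itself, which is the half that txtStart() adds.

```r
# Sketch: sink() diverts the printed output to a file, but not the commands
out_file <- tempfile(fileext = ".txt")   # temporary file for the captured text
x <- c(1, 2, 3, 4)

sink(out_file)     # from here on, printed output goes to the file
print(mean(x))
sink()             # stop diverting; output returns to the console

# Read back what was captured: the result is there, the call mean(x) is not
captured <- readLines(out_file)
print(captured)
```

For a screen-reader user this is only half useful: the file shows the numbers, but nothing says which command produced them.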

Based on this, I devised two functions that can be used to redirect R output into Word.

Here is how to use them:

# Step 1: reading the functions needed for this task, from the file I uploaded to www.r-statistics.com
# Example:
# Step 2 - start capturing
txtStart.2wd()	# start capturing text.  If you are missing any packages - this function will prompt you to install them
				# IF the installation fails - consider changing your mirror location
# Step 3 - run R code

For me, this worked…

If you would like R to automatically run, at startup, the code needed to get the two functions txtStart.2wd and txtStop.2wd, you can run this in your R console (once is enough):

# Start of code

Bringing R to the blind: there is much more work ahead!

Until this point, it didn’t cross my mind to ask how R can be used by the blind. But once this question was raised, it brought with it many more questions.
Can R be adjusted to be easily read by known aids for sight-impaired people? (I am sure Linux users here will have much to add)
Can people in the community think of writing functions to turn R output into text that is easier for the blind to read?
For example, the summary() command is wonderful for me. But I am trying to imagine what it would look like in the “eyes” of a person who can’t see. Surely there could be some way to turn the wide summary format into a long format.
Perhaps there is room for a more general approach to the question of how to help blind people use R.
And is there a need? How many blind people choose to pursue studying statistics (or disciplines for which they would need to know statistics/R)?
I hope to read your thoughts on the matter.
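As a small illustration of the wide-to-long idea for summary(), here is one possible sketch. The function name summary_long is made up for this example, and it only handles a numeric vector; it is a starting point rather than a finished tool.

```r
# Sketch: turn the wide output of summary() into one "name: value" line per statistic
summary_long <- function(x) {
  s <- summary(x)                       # named values: Min., 1st Qu., Median, ...
  paste0(names(s), ": ", as.numeric(s))
}

x <- c(1, 2, 3, 4, 10)
long_lines <- summary_long(x)
writeLines(long_lines)                  # each statistic is printed on its own line
```

Read aloud line by line, each value comes right after its label, instead of a row of labels followed by a row of numbers.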

On a personal note: my father was on the verge of blindness prior to his cataract surgery. I saw first hand what the life of the sight-impaired can look like. Giving help to people in that situation is a great MITZVA (Hebrew for “good deed”).

useR-2010 is looking for a T-shirt design

Katharine Mullen has just published on the R mailing list a call for designeRs who might be willing to create the aRt for the T-shirt that will be given out at useR 2010.

I consider such contests to be one of those good-for-the-community things, and hope regular useRs, R bloggers, and companies that are based on R will consider spreading the word and participating (and maybe even offering more bonuses to the designers).

If you design something and put it on picasa or flickr, please tag it with “useR2010Tshirt” (and consider leaving a comment with a link to the design), so there could later be a follow up on your work. Even if you don’t “win” you will get positive “karma points” from the community 🙂 .

Here are the competition details, as published in the mailing list:
Continue reading “useR-2010 is looking for a T-shirt design”