Archive for the ‘statistics’ Category

You are currently browsing the archives for the statistics category.

Want to join the closed BETA of a new Statistical Analysis Q&A site – NOW is the time!

The bottom line of this post is for you to go to:
Stack Exchange Q&A site proposal: Statistical Analysis
And commit yourself to using the website for asking and answering questions.

(And also consider giving the contender, MetaOptimize a visit)

* * * *

Statistical analysis Q&A website is about to go into BETA

A month ago I invited readers of this blog to commit to using a new Q&A website for Data-Analysis (based on StackOverFlow engine), once it will open (the site was originally proposed by Rob Hyndman).
And now, a month later, I am happy to write that over 500 people have shown interest in the website, and choose to commit themselves. This means we we have reached 100% completion of the website proposal process, and in the next few days we will move to the next step.

The next step is that the website will go into closed BETA for about a week. If you want to be part of this – now is the time to join (<--- call for action people).
From being part in some other closed BETA of similar projects, I can attest that the enthusiasm of the people trying to answer questions in the BETA is very impressive, so I strongly recommend the experience.

If you won't make it by the time you see this post, then no worries - about a week or so after the website will go online, it will be open to the wide public.

(p.s: thanks Romunov for pointing out to me that the BETA is about to open)

p.s: MetaOptimize

I would like to finish this post with mentioning MetaOptimize. This is a Q&A website which is of a more “machine learning” then a “statistical” community. It also started out some short while ago, and already it has around 700 users who have submitted ~160 questions with ~520 answers given. From my experience on the site so far, I have enjoyed the high quality of the questions and answers.
When I first came by the website, I feared that supporting this website will split the R community of users between this website and the area 51 StackExchange website.
But after a lengthy discussion (published recently as a post) with MetaOptimize founder, Joseph Turian, I came to have a more optimistic view of the competition of the two websites. Where at first I was afraid, I am now hopeful that each of the two website will manage to draw a tiny bit of different communities of people (that would otherwise wouldn’t be present in the other website) – thus offering all of us a wider variety of knowledge to tap into.

See you there…


StackOverFlow and MetaOptimize are battling to be the #1 “Statistical Analysis Q&A website” – to whom would you signup?

A new statistical analysis Q&A website launched

While the proposal for a statistical analysis Q&A website on area51 (stackexchange) is taking it’s time, and the website is still collecting people who will commit to it,
Joseph Turian, who seems a nice guy from his various comments online, seem to feel this website is not what the community needs and that we shouldn’t hold up on our questions for the website to go online. Therefore, Joseph is pushing with all his might his newest creation “MetaOptimize QA“, a StackOverFlow like website for (long list follows): machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization.
With all the bells and whistles that the OSQA framework (an open source stackoverflow clone, and more, system) can offer (you know, rankings, badges and so on).

Is this new website better then the area51 website? Will all the people go to just one of the two websites. or will we end up with two places that attracts more people then we had to begin with? These are the questions that come to mind when faced with the story in front of us.

My own suggestion is to try both websites (the stackoverflow statistical analysis website to come and “MetaOptimize QA“) and let time tell.

More info on this story bellow.

MetaOptimize online impact so far

The need for such a Q&A site is clearly evident. With just several days after being promoted online, MetaOptimize has claimed the eyes of almost 300 users, submitting 59 questions and 129 answers.
Already many bloggers in the statistical community have contributed their voices with encouraging posts, here is just a collection of the post I was able to find with some googling:

But is it goos to have two websites?

But wait, didn’t we just start pushing forward another statistical Q&A website two weeks ago?  I am talking about the Stack Exchange Q&A site proposal: Statistical Analysis.

So what should we (the community of statistical minded people) to do the next time we have a question?

Should we wait for Stack Exchange offer for a new website to start?  Or should we start using MetaOptimize?

Update: after lengthy e-mail exchange with Joseph (the person who founded MetaOptimize), I decided to erase what I originally wrote as my doubts, and instead give a Q&A session that him and I have had in the e-mails exchange.  It is a bit edited from what was originally, and some of the content will probably get updated – so if you are into this subject, check in again in a few hours :)


Honestly, I am split in two (and Joseph, I do hope you’ll take this in a positive way, since personally I feel confident you are a good guy).  I very strongly believe in the need and value of such a Q&A website.  Yet I am wondering how I feel about such a website being hosted as MetaOptimize and outside the hands of the stackoverflow guys.
On the one hand, open source lovers (like myself) tend to like decentralization and reliance on OSS (open source software) solutions (such as the one OSQA framework offers).  On the other hand, I do believe that the stackoverflow people  have (much) more experience in handling such websites then Joseph.  I can very easily trust them to do regular database backups, share the websites database dumps with the general community, smoothly test and upgrade to provide new features, and generally speaking perform in a more  experienced way with the online Q&A community.
It doesn’t mean that Joseph won’t do a great job, personally I hope he will.

Q&A session with Joseph Turian (MetaOptimize founder)

Tal: Let’s start with the easy question, should I worry about technical issues in the website (like, for example, backups)?

Joseph:

The OSQA team (backed by DZone) have got my back. They have been very helpful since day one to all OSQA users, and have given me a lot of support. Thanks, especially Rick and Hernani!

They provide email and chat support for OSQA users.

I will commit to putting up regular automatic database dumps, whenever the OSQA team implements it:
http://meta.osqa.net/questions/3120/how-do-i-offer-database-dumps
If, in six months, they don’t have this feature as part of their core, and someone (e.g. you) emails me reminding me that they want a dump, I will manually do a database dump and strip the user table.

Also, I’ve got a scheduled daily database dump that is mirrored to Amazon S3.

Tal: Why did you start MetaOptimize instead of supporting the area51 proposal?
Joseph:

  1. On Area51, people asked to have AI merged with ML, and ML merged with statistical analysis, but their requests seemed to be ignored. This seemed like a huge disservice to these communities.
  2. Area 51 didn’t have academics in ML + NLP. I know from experience it’s hard to get them to buy in to new technology. So why would I risk my reputation getting them to sign up for Area 51, when I know that I will get a 1% conversion? They aren’t early adopters interested in the process, many are late adopters who won’t sign up for something until they have too.
  3. If the Area 51 sites had a strong newbie bent, which is what it seemed like the direction was going, then the academic experts definitely wouldn’t waste their time. It would become a support
    community for newbies, without core expert discussion. So basically, I know that I and a lot of my colleagues wanted the site I built. And I felt like area 51 was shaping the communities really incorrectly in several respects, and was also taking a while.  I could have fought an institutional process and maybe gotten half the results above and it took a few months, or I could just build the site and invite my friends, and shape the community correctly.

Besides that, there are also personal motives:

  • I wanted the recognition for having a good vision for the community, and driving forward something they really like.
  • I wanted to experiment with some NLP and ML extensions for the Q+A software, to help organize the information better. Not possible on a closed platform.

Tal: Me (and maybe some other people) fear that this might fork the people in the field to two websites, instead of bringing them together. What are your thoughts about that?
Joseph:
How am I forking the community? I’m bringing a bunch of people in who wouldn’t have even been part of the Area 51 community.
Area 51 was going to fork it into five communities: stat analysis, ML, NLP, AI, and data mining. And then a lot fewer people would have been involved.

Tal: What are the things that people who support your website are saying?
Joseph:
Here are some quotes about my site:

Philip Resnick (UMD): “Looking at the questions being asked, the people responding, and the quality of the discussion, I can already see this becoming the go-to place for those ‘under the hood’ details
you rarely see in the textbooks or conference papers. This site is going to save a lot of people an awful lot of time and frustration.”

Aria Haghighi (Berkeley): “Both NLP and ML have a lot of folk wisdom about what works and what doesn’t. A site like this is crucial for facilitating the sharing and validation of this collective knowledge.”

Alexandre Passos (Unicamp): “Really thank you for that. As a machine learning phd student from somewhere far from most good research centers (I’m in brazil, and how many brazillian ML papers have you
seen in NIPS/ICML recently?), I struggle a lot with this folk wisdom. Most professors around here haven’t really interacted enough with the international ML community to be up to date”
(http://news.ycombinator.com/item?id=1476247)

Ryan McDonald (Google): “A tool like this will help disseminate and archive the tricks and best practices that are common in NLP/ML, but are rarely written about at length in papers.”

esoom on Reddit: “This is awesome. I’m really impressed by the quality of some of the answers, too. Within five minutes of skimming the site, I learned a neat trick that isn’t widely discussed in the literature.”
(http://www.reddit.com/r/MachineLearning/comments/ckw5k/stackoverflow_for_machine_learning_and_natural/c0tb3gc)

Tal: In order to be fair to area51 work, they have gotten wonderful responses for the “statistical analysis” proposal as well (see it here)
I have also contacted area51 directly and asked them and invited them to come and join the discussion. I’ll update this post with their reply.

So what’s next?

I don’t know.
If the Stack Exchange website where to launch today, I would probably focus on using it and hint to the site for MetaOptimize (for the reasons I just mentioned, and also for some that Rob Hyndman maintained when he first wrote on the subject).
If the stack exchange version of the website where to start in a few weeks, I would probably sit on the fence and see if people are using it.  I suspect that by that time, there wouldn’t be many people left to populate it (but I could always be wrong).
And what if the website where to start in a week, what then?  I have no clue.

Good question.
My current feeling is that I am glad to let this play out.
It seems this is a good case study for some healthy competition between platforms and models (OSQA vs stackoverflow/area51-system) – one that I hope will generate more good features from both companies. And also will make both parties work hard to get people to participate.
It also seems that this situation is getting many people in our field to be approached with the same idea (Q&A website). After Joseph input on the subject, I am starting to think that maybe at the end of the day this will benefit all of us. Instead of forking one community into two, maybe what we’ll end up with is getting more (experienced) people online (into two locations) that would otherwise would have stayed in the shadows.

The verdict is still out, but I am a bit more optimistic than I was when first writing this post. I’ll update this post after getting more input from people.

And as always – I would love to know your thoughts on the subject.


Visualization of regression coefficients (in R)

Update (07.07.10): The function in this post has a more mature version in the “arm” package. See at the end of this post for more details.
* * * *

Imagine you want to give a presentation or report of your latest findings running some sort of regression analysis. How would you do it?

This was exactly the question Wincent Rong-gui HUANG has recently asked on the R mailing list.

One person, Bernd Weiss, responded by linking to the chapter “Plotting Regression Coefficients” on an interesting online book (I have never heard of before) called “Using Graphs Instead of Tables” (I should add this link to the free statistics e-books list…)

Letter in the conversation, Achim Zeileis, has surprised us (well, me) saying the following

I’ve thought about adding a plot() method for the coeftest() function in the “lmtest” package. Essentially, it relies on a coef() and a vcov() method being available – and that a central limit theorem holds. For releasing it as a general function in the package the code is still too raw, but maybe it’s useful for someone on the list. Hence, I’ve included it below.

(I allowed myself to add some bolds in the text)

So for the convenience of all of us, I uploaded Achim’s code in a file for easy access. Here is an example of how to use it:

source("http://www.r-statistics.com/wp-content/uploads/2010/07/coefplot.r.txt")
 
data("Mroz", package = "car")
fm <- glm(lfp ~ ., data = Mroz, family = binomial)
coefplot(fm, parm = -1)

Here is the resulting graph:

I hope Achim will get around to improve the function so he might think it worthy of joining his“lmtest” package. I am glad he shared his code for the rest of us to have something to work with in the meantime :)

* * *

Update (07.07.10):
Thanks to a comment by David Atkins, I found out there is a more mature version of this function (called coefplot) inside the {arm} package. This version offers many features, one of which is the ability to easily stack several confidence intervals one on top of the other.

It works for baysglm, glm, lm, polr objects and a default method is available which takes pre-computed coefficients and associated standard errors from any suitable model.

Example:
(Notice that the Poisson model in comparison with the binomial models does not make much sense, but is enough to illustrate the use of the function)

library("arm")
data("Mroz", package = "car")
M1<-      glm(lfp ~ ., data = Mroz, family = binomial)
M2<- bayesglm(lfp ~ ., data = Mroz, family = binomial)
M3<-      glm(lfp ~ ., data = Mroz, family = binomial(probit))
coefplot(M2, xlim=c(-2, 6),            intercept=TRUE)
coefplot(M1, add=TRUE, col.pts="red",  intercept=TRUE)
coefplot(M3, add=TRUE, col.pts="blue", intercept=TRUE, offset=0.2)

(hat tip goes to Allan Engelhardt for help improving the code, and for Achim Zeileis in extending and improving the narration for the example)

Resulting plot

* * *
Lastly, another method worth mentioning is the Nomogram, implemented by Frank Harrell’a rms package.


Contest: Road Traffic Prediction for Intelligent GPS Navigation

About prize baring contests

Competition with prizes are an amazing thing. If you are not sure of that, I urge you to listened to Peter Diamandis talk about his experience with the X prize (start listening at minute 11:40):

At short – prizes can give up to 1 to 50 ratio of return on investment of the people giving funding to the prize. The money is spent only when results are achieved. And there is a lot of value in terms of public opinion and publicity. And the best of all (for the promoter of the competition) – prizes encourage people to take risks (at their own expense) in order to get results done.

All of that said, I look at prize baring competition as something worth spreading, especially in cases where the results of the winning team will be shared with the public.

About the IEEE ICDM Contest

The IEEE ICDM Contest (“Road Traffic Prediction for Intelligent GPS Navigation”), seems to be one of those cases. Due to a polite request, I am republishing here the details of this new competition, in the hope that some of my R colleagues will bring the community some pride :)
(more…)


The difference between “letters[c(1,NA)]” and “letters[c(NA,NA)]“

In David Smith’s latest blog post (which, in a sense, is a continued response to the latest public attack on R), there was a comment by Barry that caught my eye. Barry wrote:

Even I get caught out on R quirks after 20 years of using it. Compare letters[c(12,NA)] and letters[c(NA,NA)] for the most recent thing that made me bang my head against the wall.

So I did, and here’s the output:

> letters[c(12,NA)]
[1] "l" NA 
>  letters[c(NA,NA)] 
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>

Interesting isn’t it?
I had no clue why this had happened but luckily for us, Barry gave a follow-up reply with an explanation. And here is what he wrote:
(more…)


Parallel Multicore Processing with R (on Windows)

This post offers simple example and installation tips for “doSMP” the new Parallel Processing backend package for R under windows.
* * *

Update:
The required packages are not yet available on CRAN, but until they will get online, you can download them from here:
REvolution foreach windows bundle
(Simply unzip the folders inside your R library folder)

* * *

Recently, REvolution blog announced the release of “doSMP”, an R package which offers support for symmetric multicore processing (SMP) on Windows.
This means you can now speed up loops in R code running iterations in parallel on a multi-core or multi-processor machine, thus offering windows users what was until recently available for only Linux/Mac users through the doMC package.

Installation

For now, doSMP is not available on CRAN, so in order to get it you will need to download the REvolution R distribution “R Community 3.2” (they will ask you to supply your e-mail, but I trust REvolution won’t do anything too bad with it…)
If you already have R installed, and want to keep using it (and not the REvolution distribution, as was the case with me), you can navigate to the library folder inside the REvolution distribution it, and copy all the folders (package folders) from there to the library folder in your own R installation.

If you are using R 2.11.0, you will also need to download (and install) the revoIPC package from here:
revoIPC package – download link (required for running doSMP on windows)
(Thanks to Tao Shi for making this available!)

Usage

Once you got the folders in place, you can then load the packages and do something like this:

require(doSMP)
workers <- startWorkers(2) # My computer has 2 cores
registerDoSMP(workers)
 
# create a function to run in each itteration of the loop
check <-function(n) {
	for(i in 1:1000)
	{
		sme <- matrix(rnorm(100), 10,10)
		solve(sme)
	}
}
 
 
times <- 10	# times to run the loop
 
# comparing the running time for each loop
system.time(x <- foreach(j=1:times ) %dopar% check(j))  #  2.56 seconds  (notice that the first run would be slower, because of R's lazy loading)
system.time(for(j in 1:times ) x <- check(j))  #  4.82 seconds
 
# stop workers
stopWorkers(workers)

Points to notice:

  • You will only benefit from the parallelism if the body of the loop is performing time-consuming operations. Otherwise, R serial loops will be faster
  • Notice that on the first run, the foreach loop could be slow because of R’s lazy loading of functions.
  • I am using startWorkers(2) because my computer has two cores, if your computer has more (for example 4) use more.
  • Lastly – if you want more examples on usage, look at the “ParallelR Lite User’s Guide”, included with REvolution R Community 3.2 installation in the “doc” folder

Updates (15.5.10)

The new R version (2.11.0) doesn’t work with doSMP, and will return you with the following error:

Loading required package: revoIPC
Error: package ‘revoIPC’ was built for i386-pc-intel32

So far, a solution is not found, except using REvolution R distribution, or using R 2.10
A thread on the subject was started recently to report the problem. Updates will be given in case someone would come up with better solutions.

Thanks to Tao Shi, there is now a solution to the problem. You’ll need to download the revoIPC package from here:
revoIPC package – download link (required for running doSMP on windows)
Install the package on your R distribution, and follow all of the other steps detailed earlier in this post. It will now work fine on R 2.11.0

Update 2: Notice that I added, in the beginning of the post, a download link to all the packages required for running parallel foreach with R 2.11.0 on windows. (That is until they will be uploaded to CRAN)


Repeated measures ANOVA with R (functions and tutorials)

Repeated measures ANOVA is a common task for the data analyst.

There are (at least) two ways of performing “repeated measures ANOVA” using R but none is really trivial, and each way has it’s own complication/pitfalls (explanation/solution to which I was usually able to find through searching in the R-help mailing list).

So for future reference, I am starting this page to document links I find to tutorials, explanations (and troubleshooting) of “repeated measure ANOVA” done with R

Functions and packages

(I suggest using the tutorials supplied bellow for how to use these functions)

  • aov {stats} – offers SS type I repeated measures anova, by a call to lm for each stratum. A short example is given in the ?aov help file
  • Anova {car} – Calculates type-II or type-III analysis-of-variance tables for model objects produced by lm, and for various other object. The ?Anova help file offers an example for how to use this for repeated measures
  • ezANOVA {ez} – This function provides easy analysis of data from factorial experiments, including purely within-Ss designs (a.k.a. “repeated measures”), purely between-Ss designs, and mixed within-and-between-Ss designs, yielding ANOVA results and assumption checks. It is a wrapper of the Anova {car} function, and is easier to use. The ez package also offers the functions ezPlot and ezStats to give plot and statistics of the ANOVA analysis. The ?ezANOVA help file gives a good demonstration for the functions use (My thanks goes to Matthew Finkbe for letting me know about this cool package)
  • friedman.test {stats} – Performs a Friedman rank sum test with unreplicated blocked data. That is, a non-parametric one-way repeated measures anova. I also wrote a wrapper function to perform and plot a post-hoc analysis on the friedman test results
  • Non parametric multi way repeated measures anova – I believe such a function could be developed based on the Proportional Odds Model, maybe using the {repolr} or the {ordinal} packages. But I still didn’t come across any function that implements these models (if you do – please let me know in the comments).

Good Tutorials

Troubelshooting

Unbalanced design
Unbalanced design doesn’t work when doing repeated measures ANOVA with aov, it just doesn’t. This situation occurs if there are missing values in the data or that the data is not from a fully balanced design. The way this will show up in your output is that you will see the between subject section showing withing subject variables.

A solution for this might be to use the Anova function from library car with parameter type=”III”. But before doing that, first make sure you understand the difference between SS type I, II and III. Here is a good tutorial for helping you out with that.
By the way, these links are also useful in case you want to do a simple two way ANOVA for unbalanced design

I will “later” add R-help mailing list discussions that I found helpful on the subject.

If you come across good resources, please let me know about them in the comments.


Correlation scatter-plot matrix for ordered-categorical data

When analyzing a questionnaire, one often wants to view the correlation between two or more Likert questionnaire item’s (for example: two ordered categorical vectors ranging from 1 to 5).

When dealing with several such Likert variable’s, a clear presentation of all the pairwise relation’s between our variable can be achieved by inspecting the (Spearman) correlation matrix (easily achieved in R by using the “cor.test” command on a matrix of variables).
Yet, a challenge appears once we wish to plot this correlation matrix. The challenge stems from the fact that the classic presentation for a correlation matrix is a scatter plot matrix – but scatter plots don’t (usually) work well for ordered categorical vectors since the dots on the scatter plot often overlap each other.

There are four solution for the point-overlap problem that I know of:

  1. Jitter the data a bit to give a sense of the “density” of the points
  2. Use a color spectrum to represent when a point actually represent “many points”
  3. Use different points sizes to represent when there are “many points” in the location of that point
  4. Add a LOWESS (or LOESS) line to the scatter plot – to show the trend of the data

In this post I will offer the code for the  a solution that uses solution 3-4 (and possibly 2, please read this post comments). Here is the output (click to see a larger image):

And here is the code to produce this plot:

(more…)


Quantile LOESS – Combining a moving quantile window with LOESS (R function)

In this post I will provide R code that implement’s the combination of repeated running quantile with the LOESS smoother to create a type of “quantile LOESS” (e.g: “Local Quantile Regression”).

This method is useful when the need arise to fit robust and resistant (Need to be verified) a smoothed line for a quantile (an example for such a case is provided at the end of this post).

If you wish to use the function in your own code, simply run inside your R console the following line:

source("http://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")

Background

I came a cross this idea in an article titled “High throughput data analysis in behavioral genetics” by Anat Sakov, Ilan Golani, Dina Lipkind and my advisor Yoav Benjamini. From the abstract:

In recent years, a growing need has arisen in different fields, for the development of computational systems for automated analysis of large amounts of data (high-throughput). Dealing with non-standard noise structure and outliers, that could have been detected and corrected in manual analysis, must now be built into the system with the aid of robust methods. [...] we use a non-standard mix of robust and resistant methods: LOWESS and repeated running median.

The motivation for this technique came from “Path data” (of mice) which is

prone to suffer from noise and outliers. During progression a tracking system might lose track of the animal, inserting (occasionally very large) outliers into the data. During lingering, and even more so during arrests, outliers are rare, but the recording noise is large relative to the actual size of the movement. The statistical implications are that the two types of behavior require different degrees of smoothing and resistance. An additional complication is that the two interchange many times throughout a session. As a result, the statistical solution adopted needs not only to smooth the data, but also to recognize, adaptively, when there are arrests. To the best of our knowledge, no single existing smoothing technique has yet been able to fulfill this dual task. We elaborate on the sources of noise, and propose a mix of LOWESS (Cleveland, 1977) and the repeated running median (RRM; Tukey, 1977) to cope with these challenges

If all we wanted to do was to perform moving average (running average) on the data, using R, we could simply use the rollmean function from the zoo package.
But since we wanted also to allow quantile smoothing, we turned to use the rollapply function.

R function for performing Quantile LOESS

Here is the R function that implements the LOESS smoothed repeated running quantile (with implementation for using this with a simple implementation for using average instead of quantile):

(more…)


Nutritional supplements efficacy score – Graphing plots of current studies results (using R)

In this post I showcase a nice bar-plot and a balloon-plot listing recommended Nutritional supplements , according to how much evidence exists for thier benefits, scroll down to see it(and click here for the data behind it)
* * * *
The gorgeous blog “Information Is Beautiful” recently publish an eye candy post showing a “balloon race” image (see a static version of the image here) illustrating how much evidence exists for the benefits of various Nutritional supplements (such as: green tea, vitamins, herbs, pills and so on) . The higher the bubble in the Y axis score (e.g: the bubble size) for the supplement the greater the evidence there is for its effectiveness (But only for the conditions listed along side the supplement).

There are two reasons this should be of interest to us:

  1. This shows a fun plot, that R currently doesn’t know how to do (at least I wasn’t able to find an implementation for it). So if anyone thinks of an easy way for making one – please let me know.
  2. The data for the graph is openly (and freely) provided to all of us on this Google Doc.

The advantage of having the data on a google doc means that we can see when the data will be updated. But more then that, it means we can easily extract the data into R and have our way with it (Thanks to David Smith’s post on the subject)

For example, I was wondering what are ALL of the top recommended Nutritional supplements, an answer that is not trivial to get from the plot that was in the original post.

In this post I will supply two plots that present the data: A barplot (that in retrospect didn’t prove to be good enough) and a balloon-plot for a table (that seems to me to be much better).

Barplot
(You can click the image to enlarge it)

The R code to produce the barplot of Nutritional supplements efficacy score (by evidence for its effectiveness on the listed condition).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
 
# loading the data
supplements.data.0 < - read.csv("http://spreadsheets.google.com/pub?key=0Aqe2P9sYhZ2ndFRKaU1FaWVvOEJiV2NwZ0JHck12X1E&output=csv")
supplements.data <- supplements.data.0[supplements.data.0[,2] >2,] # let's only look at "good" supplements
supplements.data < - supplements.data[!is.na(supplements.data[,2]),] # and we don't want any missing data
 
supplement.score <- supplements.data[, 2]
ss <- order(supplement.score, decreasing  = F)	# sort our data
supplement.score <- supplement.score[ss]
supplement.name <- supplements.data[ss, 1]
supplement.benefits <- supplements.data[ss, 4]
supplement.score.col <- factor(as.character(supplement.score))
	levels(supplement.score.col) <-  c("red", "orange", "blue", "dark green")
	supplement.score.col <- as.character(supplement.score.col)
 
# mar: c(bottom, left, top, right) The default is c(5, 4, 4, 2) + 0.1.
par(mar = c(5,9,4,13))	# taking care of the plot margins
bar.y <- barplot(supplement.score, names.arg= supplement.name, las = 1, horiz = T, col = supplement.score.col, xlim = c(0,6.2),
				main = c("Nutritional supplements efficacy score","(by evidence for its effectiveness on the listed condition)", "(2010)"))
axis(4, labels = supplement.benefits, at = bar.y, las = 1) # Add right axis
abline(h = bar.y, col = supplement.score.col , lty = 2) # add some lines so to easily follow each bar

Also, the nice things is that if the guys at Information Is Beautiful will update there data, we could easily run the code and see the updated list of recommended supplements.

Balloon plot
So after some web surfing I came around an implementation of a balloon plot in R (Thanks to R graph gallery)
There where two problems with using the command out of the box. The first one was that the colors where non informative (easily fixed), the second one was that the X labels where overlapping one another. Since there is no “las” parameter in the function, I just opened the function up, found where this was plotted and changed it manually (a bit messy, but that’s what you have to do sometimes…)

Here are the result (you can click the image for a larger image):

And here is The R code to produce the Balloon plot of Nutritional supplements efficacy score (by evidence for its effectiveness on the listed condition).
(it’s just the copy of the function with a tiny bit of editing in line 146, and then using it)

(more…)