fun | R-statistics blog

R-Node: a web front-end to R with Protovis

Update (April 6, 2010): R-Node now has its own website, with a dedicated Google group (you can join it here).

* * * *

The integration of R into online web services is (for me) one of the more exciting prospects in R’s future. That is why I was very excited to come across Jamie Love’s recent creation: R-Node.

What is R-Node

R-Node is an (open source) web front-end to R (the statistical analysis package).

Using this front-end, you can connect from any web browser to an R instance running on a remote (or local) server and interact with it, sending commands and receiving the responses. In particular, graphing commands such as plot() and hist() will execute in the browser, drawing the graph as an SVG image.

You can see a live demonstration of this interface by visiting:
http://69.164.204.238:2904/
And using the following user/password login info:
User: pvdemouser
Password: svL35NmPwMnt
(This link was originally posted here)

Here are some screenshots:

In the second screenshot you see the results of the R command ‘plot(x, y)’ (with the reimplementation of plot doing the actual plotting), and in the fourth screenshot you see a similar plot command along with a subsequent best fit line (data points calculated with ‘lowess()’) drawn in.

Once in, you can try out R by typing something like:

x <- rnorm(100)
plot(x, main="Random numbers")
l <- lowess(x)
lines (l$y)

The plot and lines commands will bring up a graph. You can escape out of it, download the graph as an SVG file, and change the graph type (e.g. do: plot(x, type="o") ).
Many R commands will work, though only hist(), plot() and lines() work for graphing.
Please don't type the R command q(): it will quit the server, so it stops working for everyone! Also, as everyone shares the same session for now, using more distinctive variable names than 'x' and 'l' will help you.
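As a rough sketch of the advice above, here is a session that sticks to the three supported graphing commands and uses distinctive variable names. The names tg_demo_values and tg_demo_smooth are made up for illustration, and the snippet was written for plain R, so R-Node's own reimplementation of these commands may not accept every argument:

```r
# Hypothetical, deliberately distinctive names: every visitor shares one R session
tg_demo_values <- rnorm(200, mean = 10, sd = 2)

# Histogram of the simulated values
hist(tg_demo_values, main = "200 random normal values")

# Point-and-line plot, then a lowess smooth drawn on top
plot(tg_demo_values, type = "o", main = "Values with a lowess line")
tg_demo_smooth <- lowess(tg_demo_values)
lines(tg_demo_smooth$y)
```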

Currently there is only limited error checking but the code continues to be improved and developed. You can download it from:
http://gitorious.org/r-node

How do you imagine yourself using something like this? Feel free to share with me and everyone else in the comments.

The more technical details of R-Node are covered in the full post.

Fun interpretive dances for common statistical plots

My wife is a big lover of dance (especially Dance In Israel), and while reading through the NY Times article “To Impress, Tufts Prospects Turn to YouTube”, she found me a pearl: a woman performing interpretive dances for math/statistical plots. It includes short dances for scatter plots, boxplots, barplots and a few others. Enjoy:

http://www.youtube.com/watch?v=CNPXUWsMdIo

Is it harder to advertise to the more educated? Correlation in US States data will not be enough to answer us…

“Chitika research” published a fun small dataset today (you can download it from here) in a post titled “The Educated are Harder to Advertise To”.

In this post I have three goals in mind:

Suggesting another plot instead of the one used in the original post.
Emphasizing the “Correlation does not imply causation” rule.
Inviting other R lovers (like myself) to find fun things to do with this (and similar) datasets.

The Data

The data set comprises 51 rows, one for each US state, with two variables (columns):

CTR: the “Click Through Rate”, taken from Chitika's database and collected over two random days in January (31,667,158 impressions in total). It covers the full range of Internet users (they don’t have traditional demographic data; every impression is completely anonymous).
Percent of the population who graduated college.

Super basic analysis and plot

This data shows a stunning -0.63 correlation between the two measurements, hinting that “The Educated are Harder to Advertise To” (as the original post suggested). The data can be easily visualized using a scatter plot:

Created using just a few lines of R code:

# Read the tab-separated file; columns 2 and 3 hold CTR and college-graduate share
aa <- read.table("https://www.r-statistics.com/wp-content/uploads/2010/02/State_CTR_Date.txt", sep = "\t", header = TRUE)
# Rescale both variables to percentages
aa[,2:3] <- aa[,2:3] * 100
plot(aa[,2] ~ aa[,3], sub = paste("Correlation: ", round(cor(aa[,2], aa[,3]), 2)),
	main = "Scatter plot of %CTR VS %College_Grad per State",
	xlab = "%College_Grad per State",
	ylab = "%CTR per State"
	)
# Add the least-squares regression line
abline(lm(aa[,2] ~ aa[,3]), col = "blue")
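To get a feel for how uncertain a correlation of -0.63 estimated from only 51 rows is, one option is an approximate 95% confidence interval based on the Fisher z-transformation. The sketch below uses only the two numbers quoted in this post; the variable names are mine, and the calculation assumes roughly bivariate-normal data, which has not been checked here:

```r
# Values quoted in the post
r_obs <- -0.63   # observed correlation between CTR and college-graduate share
n_rows <- 51     # number of rows in the data set

# Fisher z-transformation: atanh(r) is approximately normal with SE 1/sqrt(n - 3)
z_obs <- atanh(r_obs)
z_se <- 1 / sqrt(n_rows - 3)

# Build the interval on the z scale, then transform back to the correlation scale
ci_95 <- tanh(z_obs + c(-1, 1) * qnorm(0.975) * z_se)
print(round(ci_95, 2))
```

The interval comes out at roughly -0.77 to -0.43: clearly away from zero, but still wide. With the data loaded, cor.test(aa[,2], aa[,3]) reports an interval computed in the same way.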

My conclusion from the analysis

I was asked in the comments (by Eyal) to add my own conclusions to the analysis. Does higher intelligence imply lower chances of clicking ads? My answer (given the present data) is simply "I don't know". The only real conclusion I can draw from the data is that it might be worth checking this effect in a more rigorous way (which I am sure is already being done).

What should we have done in order to know? When doing scientific research, we often ask ourselves how sure we are of our results. The rule of thumb for this type of question is called "the pyramid of evidence". It is a way to organize the various ways of getting "information" about the world into a hierarchy of reliability. Here is a picture of this pyramid:

(Credit: image source)

We can see that the most reliable source is a systematic review of randomized controlled trials. In our case, that would mean running controlled experiments where you take groups of people with different levels of "intelligence" (how would you measure that?) and check their CTR (click-through rate) on banner ads. This should be done in various ways, correcting for various confounders, and later the results and conclusions (from several such experiments) should be systematically reviewed by experts on the subject.

All of this should be done in order to make a real assessment of the underlying question: how do smarts affect banner clicking?
And the reason we need all of this work is because of what is said in the title of the next section:

Correlation does not imply causation

As the Wikipedia article puts it:

"Correlation does not imply causation" is a phrase used in science and statistics to emphasize that correlation between two variables does not automatically imply that one causes the other (though it does not remove the fact that correlation can still be a hint, whether powerful or otherwise). The opposite belief, correlation proves causation, is a logical fallacy by which two events that occur together are claimed to have a cause-and-effect relationship.

But a much clearer explanation of it was given by the following XKCD comic strip:
Correlation on XKCD
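The same point can be shown with a small simulation. In the sketch below (entirely made-up data, with variable names chosen only for illustration), a hidden third variable drives two measurements that have no direct effect on each other. They still end up strongly correlated, and the correlation all but vanishes once the hidden variable is accounted for:

```r
set.seed(42)
n_units <- 500

# A hidden confounder that influences both measurements
confounder <- rnorm(n_units)

# Neither variable depends on the other; both depend on the confounder
var_a <- 2 * confounder + rnorm(n_units)
var_b <- -1.5 * confounder + rnorm(n_units)

# Raw correlation: strongly negative, despite no causal link between a and b
raw_cor <- cor(var_a, var_b)

# Partial correlation: correlate what is left after regressing each on the confounder
resid_a <- resid(lm(var_a ~ confounder))
resid_b <- resid(lm(var_b ~ confounder))
partial_cor <- cor(resid_a, resid_b)

print(round(c(raw = raw_cor, partial = partial_cor), 2))
```

In real data, of course, the confounder is usually not measured (or not even known), which is exactly why a correlation across states cannot settle the question on its own.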

Next step: other resources to play with

The motivation for my post comes from this digg post, which tries to hint at how religiousness is connected to "negative" things such as crime, poverty and so on. That post was based on the following links:

http://www.gallup.com/poll/114022/state-states-importance-religion.aspx#2
http://www.top50states.com/average-iq-score.html
http://www.census.gov/cgi-bin/saipe/national.cgi?year=2008&ascii=
http://www.census.gov/compendia/statab/cats/law_enforcement_courts_prisons/crimes_and_crime_rates.html
http://www.infoplease.com/ipa/a0923080.html
http://www.fraserinstitute.org/researchandpublications/publications/7071.aspx
http://www.gallup.com/poll/122333/political-ideologt-conservative-label-prevails-south.aspx#2
http://www.ahiphiwire.org/wellbeing/display.aspx?doc_code=RWBStateRanks

Anyone who is motivated can extract that data and combine it with the data provided here.

In conclusion: this simple dataset, combined with other data resources, provides an opportunity for various fun demonstrations of pairwise correlation plots and of nice spatial plots (of states colored by their matching variable). It is a good opportunity to emphasize (to students, friends and the like) that "Correlation does not imply causation!".
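As a starting point for such pairwise correlation plots, here is a minimal sketch using state.x77, a state-level data set that ships with base R. It is not the data from the links above (its figures are from the 1970s and cover 50 states), and the choice of columns is arbitrary, but the same two calls would work on any merged state-level table:

```r
# state.x77 comes with base R (datasets package): one row per state, 1970s figures
state_vars <- state.x77[, c("Income", "Illiteracy", "Murder", "HS Grad")]

# Correlation matrix of the selected variables
state_cor <- round(cor(state_vars), 2)
print(state_cor)

# Matrix of pairwise scatter plots
pairs(state_vars, main = "Pairwise scatter plots, state.x77")
```

Every strong correlation in that matrix comes with the same caveat as the CTR data: it is a hint worth investigating, not evidence of cause and effect.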
And finally - If you are an R lover/blogger and feel like playing with this - please let me know 🙂 .