visualization

Interactive Graphics with the iplots Package (from “R in Action”)

Posted in R, visualization on January 24th, 2012 by Tal Galili – Be the first to comment

The following introductory post is intended for new users of R. It covers interactive visualization in R using the iplots package.

This is a guest article by Dr. Robert I. Kabacoff, the founder of one of the first online R tutorial websites: Quick-R. Kabacoff recently published the book “R in Action”, which provides a detailed walk-through of the R language, using varied examples to illustrate R’s features (data manipulation, statistical methods, graphics, and so on). In previous guest posts by Kabacoff we introduced data.frame objects in R and dealt with the aggregation and restructuring of data (using base R functions and the reshape package).

Readers of this blog get a 38% discount on the “R in Action” book (as well as on all other eBooks, pBooks and MEAPs from the Manning publishing house) by using the code rblogg38 at checkout.

Let us now talk about Interactive Graphics with the iplots Package:

read more »

Diagram for a Bernoulli process (using R)

Posted in R, statistics, visualization on November 10th, 2011 by Tal Galili – 3 Comments

A Bernoulli process is a sequence of Bernoulli trials (the realization of n binary random variables), taking two values (0/1, Heads/Tails, Boy/Girl, etc…). It is often used in teaching introductory probability/statistics classes about the binomial distribution.

When visualizing a Bernoulli process, it is common to use a binary tree diagram in order to show the progression of the process, as well as the various consequences of the trial. We might also include the number of “successes”, and the probability for reaching a specific terminal node.
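The quantities shown on such a tree can also be tabulated directly. The following is a minimal sketch in base R, not the diagram code itself; `bernoulli_leaves` is a hypothetical helper name introduced only for this illustration. It lists every terminal node of an n-stage process together with its number of successes and the probability of reaching it.

```r
# Enumerate every terminal node (leaf) of an n-stage Bernoulli tree.
# `bernoulli_leaves` is a hypothetical helper, not part of the {diagram} package.
bernoulli_leaves <- function(n, p = 0.5) {
  # All 2^n sequences of outcomes (0 = failure, 1 = success), one row per leaf
  paths <- expand.grid(rep(list(0:1), n))
  names(paths) <- paste0("trial_", seq_len(n))
  successes <- rowSums(paths)
  # Probability of reaching a leaf: p for each success, (1 - p) for each failure
  prob <- p^successes * (1 - p)^(n - successes)
  cbind(paths, successes = successes, prob = prob)
}

leaves <- bernoulli_leaves(3, p = 0.8)
print(leaves)

# Summing the leaf probabilities by number of successes gives the binomial pmf
by_successes <- tapply(leaves$prob, leaves$successes, sum)
all.equal(as.numeric(by_successes), dbinom(0:3, size = 3, prob = 0.8))
```

Grouping the leaves by their number of successes is what links the tree to the binomial distribution: several distinct paths share the same count, and their probabilities add up.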

I wanted to be able to create such a diagram using R. For this purpose I wrote some code which uses the {diagram} R package. The final function should allow one to create diagrams of different sizes, while allowing flexibility with regard to the text used in the tree.

Here is an example of the simplest use of the function:

source("http://www.r-statistics.com/wp-content/uploads/2011/11/binary.tree_.for_.binomial.game_.r.txt") # loading the function
binary.tree.for.binomial.game(2) # creating a tree for B(2,0.5)

The resulting diagram will look like this:

The same can be done to create larger trees. For example, here is the code for a 4-stage Bernoulli process:

source("http://www.r-statistics.com/wp-content/uploads/2011/11/binary.tree_.for_.binomial.game_.r.txt") # loading the function
binary.tree.for.binomial.game(4) # creating a tree for B(4,0.5)

The resulting diagram will look like this:

The function can also be tweaked in order to describe a more specific story. For example, the following code describes a 3 stage Bernoulli process where an unfair coin is tossed 3 times (with probability of it giving heads being 0.8):

source("http://www.r-statistics.com/wp-content/uploads/2011/11/binary.tree_.for_.binomial.game_.r.txt") # loading the function
binary.tree.for.binomial.game(3, 0.8, first_box_text = c("Tossing an unfair coin", "(3 times)"), left_branch_text = c("Failure", "Playing again"), right_branch_text = c("Success", "Playing again"), 
    left_leaf_text = c("Failure", "Game ends"), right_leaf_text = c("Success", 
        "Game ends"), cex = 0.8, rescale_radx = 1.2, rescale_rady = 1.2, 
    box_color = "lightgrey", shadow_color = "darkgrey", left_arrow_text = c("Tails \n(P = 0.2)"), 
    right_arrow_text = c("Heads \n(P = 0.8)"), distance_from_arrow = 0.04)

The resulting diagram is:

If you come up with neat examples of using the code, happen to find a bug, or have any other reason to write, you are welcome to leave a comment.

(note: the images above are licensed under CC BY-SA)

Engineering Data Analysis (with R and ggplot2) – a Google Tech Talk given by Hadley Wickham

Posted in R, R links, visualization on June 17th, 2011 by Tal Galili – 1 Comment

It appears that just days ago, Google Tech Talks released a new, one-hour-long video of a presentation (from June 6, 2011) given by one of the R community’s more influential contributors, Hadley Wickham.

This seems to be one of the better talks to send a programmer friend who is interested in getting into R.

Talk abstract

Data analysis, the process of converting data into knowledge, insight and understanding, is a critical part of statistics, but there’s surprisingly little research on it. In this talk I’ll introduce some of my recent work, including a model of data analysis. I’m a passionate advocate of the view that data analysis should be carried out using a programming language, and I’ll justify this by discussing some of the requirements of good data analysis (reproducibility, automation and communication). With these in mind, I’ll introduce you to a powerful set of tools for better understanding data: the statistical programming language R, and the ggplot2 domain specific language (DSL) for visualisation.

The video

More resources

Beeswarm Boxplot (and plotting it with R)

Posted in R, visualization on March 10th, 2011 by Tal Galili – 11 Comments

(The image above is called a “Beeswarm Boxplot”; the code for producing it is provided at the end of this post.)

The above plot is implemented under different names in different software packages. This “Scatter Dot Beeswarm Box Violin plot” (for lack of an agreed-upon term) is a one-dimensional scatter plot, similar to a “stripchart” but with closely packed, non-overlapping points; the positions of the points correspond to their frequency, much as in a violin plot. The plot can be superimposed on a boxplot to give a very rich description of the underlying distribution.
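As a rough stand-in for this idea, here is a minimal sketch in base R that overlays a strip chart on a boxplot. It assumes random jitter is an acceptable substitute for true packing: the points are spread sideways at random, so they are not guaranteed to avoid overlap the way a real beeswarm layout does. The built-in `iris` data set is used only for illustration.

```r
# Boxplots of sepal length per species; outliers are suppressed here because
# every observation is drawn by stripchart() below
boxplot(Sepal.Length ~ Species, data = iris, outline = FALSE,
        ylim = range(iris$Sepal.Length),
        ylab = "Sepal length", main = "Boxplot with jittered points")

# Overlay the raw observations, spread horizontally with random jitter
set.seed(1)
stripchart(Sepal.Length ~ Species, data = iris,
           vertical = TRUE, method = "jitter", jitter = 0.15,
           pch = 16, col = "steelblue", add = TRUE)
```

A dedicated implementation would place each point deterministically next to its neighbours, so that the width of the cloud reflects the local density of the data.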

This plot has been implemented in various statistical packages; in this post I list the few I have come across so far. If you know of an implementation I’ve missed, please tell me about it in the comments.

read more »

How to label all the outliers in a boxplot

Posted in R, visualization on January 27th, 2011 by Tal Galili – 22 Comments

In this post I present a function that helps label outlier observations when plotting a boxplot using R.

An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point located outside the fences (“whiskers”) of the boxplot (e.g., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
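The fence rule can be sketched in a few lines of base R. This is an illustration only, not the function presented in this post. Note that `boxplot()` computes its box from hinges rather than from `quantile()`, so the two approaches can occasionally disagree on borderline points; the labelling step therefore relies on `boxplot.stats()`.

```r
set.seed(482)
y <- rnorm(100)

# Fences computed by hand: 1.5 * IQR beyond the lower and upper quartiles
q <- unname(quantile(y, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
manual_idx <- which(y < lower_fence | y > upper_fence)

# The points boxplot() itself would draw as outliers
out <- boxplot.stats(y)$out
idx <- which(y %in% out)

# Draw the boxplot and write each outlier's index next to it
boxplot(y)
if (length(idx) > 0) {
  text(rep(1, length(idx)), y[idx], labels = idx, pos = 4, cex = 0.8)
}
```

With many outliers the labels drawn this way start to overlap, which is the limitation discussed next.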

Identifying these points in R is very simple when dealing with only one boxplot and a few outliers. It can easily be done using the “identify” function in R. For example, running the code below will plot a boxplot of a hundred observations sampled from a normal distribution, and will then let you click on an outlier point to have its label (in this case, its index number) plotted beside the point:

set.seed(482)
y <- rnorm(100)
boxplot(y)
identify(rep(1, length(y)), y, labels = seq_along(y))

However, this solution is not scalable when dealing with:

  • Many outliers
  • Overlapping data-points, and
  • Multiple boxplots in the same graphic window

For such cases I recently wrote the function “boxplot.with.outlier.label” (which you can download from here). This function operates in a similar way to “boxplot” (with a formula), with the added option of defining “label_name”. When outliers are present, the function marks each of them using the label_name variable. It can handle interaction terms and will also try to space the labels so that they won’t overlap (my thanks go to Greg Snow for his function “spread.labs” from the {TeachingDemos} package, and for helpful comments on the R-help mailing list).

Here is some example code you can try out for yourself:

source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt") # Load the function
# sample some points and labels for us:
set.seed(492)
y <- rnorm(2000)
x1 <- sample(letters[1:2], 2000, replace = TRUE)
x2 <- sample(letters[1:2], 2000, replace = TRUE)
lab_y <- sample(letters[1:4], 2000, replace = TRUE)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x2*x1, lab_y)

Here is the resulting graph:

You can also run the following code to see how it handles simpler cases:

# plot a boxplot without interactions:
boxplot.with.outlier.label(y~x1, lab_y, ylim = c(-5,5))
# plot a boxplot of y only
boxplot.with.outlier.label(y, lab_y, ylim = c(-5,5))
boxplot.with.outlier.label(y, lab_y, spread_text = FALSE) # here the labels will overlap (because I turned spread_text off)

Here is the output of the last example, showing how the plot looks when we allow for the text to overlap.

Regarding package dependencies: note that this function requires you to first install the packages {TeachingDemos} (by Greg Snow) and {plyr} (by Hadley Wickham).

Updates:

  • 19.04.2011 – I’ve added support to the boxplot “names” and “at” parameters.
  • 31.10.2011 – I’ve fixed a reported bug (my thanks go to Josh O’Brien for the heads up). There is now also support for two arguments that make it easy to change the distance of the labels/segments from the outliers.

You are very much invited to leave your comments if you find a bug, think of ways to improve the function, or simply enjoyed it and would like to share it with me.

R GUI now offers interactive graphics – Deducer 0.4-2 connects with iplots

Posted in R, visualization on October 24th, 2010 by Tal Galili – 2 Comments

Earlier today, Ian Fellows announced the release of Deducer 0.4-2 and DeducerExtras 1.2 to CRAN (I copy his announcement here):

Deducer 0.4-2 contains a few bug fixes, and an interface to the iplots package. With the new iplots interface it is now possible to do interactive plots with Deducer. An introductory example screen cast (by Ian) is available on the tube:

DeducerExtras 1.2 contains a few new dialogs including ‘load data from package’, and ‘t-test power’.

Additionally, a new Windows R/JGR/Deducer installer is available which installs R-2.12.0, JGR with its launcher, Deducer, DeducerExtras, and DeducerPlugInScaling. It is available on the Deducer website:

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.WindowsInstallation

Rose plot using Deducer’s ggplot2 plot builder

Posted in R, visualization on August 16th, 2010 by Tal Galili – Be the first to comment

The (excellent!) LearnR blog had a post today about making a rose plot in ggplot2.

Following today’s announcement by Ian Fellows of the new version of Deducer (0.4), which offers strong support for ggplot2 through a GUI plot builder, Ian also sent an e-mail showing how to create a rose plot using the new ggplot2 GUI included in that version. After the template is made, the plot can be generated with 4 clicks of the mouse.

Here is a video tutorial (Ian published) to show how this can be used:

The generated template file is available at:
http://neolab.stat.ucla.edu/cranstats/rose.ggtmpl

I am excited about the work Ian is doing, and hope to see more people publish use cases with Deducer.

ggplot2 plot builder is now on CRAN! (through Deducer 0.4 GUI for R)

Posted in R, statistics, visualization on August 16th, 2010 by Tal Galili – 9 Comments

Ian Fellows, a hard-working contributor to the R community (and a cool guy), announced today the release of Deducer (0.4) to CRAN (scheduled to update in the next day or so).
This major update also includes the release of a new plug-in package (DeducerExtras), containing additional dialogs and functionality.

Following is the e-mail he sent out with all the details and demo videos.

read more »

New versions for ggplot2 (0.8.8) and plyr (1.0) were released today

Posted in R, visualization on July 6th, 2010 by Tal Galili – Be the first to comment

As rich in packages as CRAN is, several R packages succeed in standing out for their widespread use (and quality). Hadley Wickham’s ggplot2 and plyr are two such packages.
And today (through Twitter) Hadley updated the rest of us with the news:

just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats

Going to the CRAN website shows that plyr has gone through the more major update of the two, with the previous update taking place on 2009-06-23. Now, over a year later, we are presented with plyr version 1, which includes new functions, new features, some bug fixes and much-anticipated speed improvements.
ggplot2 has made a tiny leap from version 0.8.7 to 0.8.8, and was previously last updated on 2010-03-03.

I, and I am sure many other R users, are very thankful for the amazing work that Hadley Wickham is doing (both on his code, and in helping other useRs on the help lists). So Hadley, thank you!

Here is the complete change-log list for both packages:
read more »

Visualization of regression coefficients (in R)

Posted in R, statistics, visualization on July 2nd, 2010 by Tal Galili – Be the first to comment

Update (07.07.10): The function in this post has a more mature version in the “arm” package.  (more details are available at the end of this post.)

Update (04.01.12): There is a new package called Coefplot that offers a more general solution for plotting coeffs. (more details are available at the end of this post.)
* * * *

Imagine you want to give a presentation or report of your latest findings running some sort of regression analysis. How would you do it?

This was exactly the question Wincent Rong-gui HUANG recently asked on the R mailing list.

One person, Bernd Weiss, responded by linking to the chapter “Plotting Regression Coefficients” in an interesting online book (one I had never heard of before) called “Using Graphs Instead of Tables” (I should add this link to the free statistics e-books list…).

Later in the conversation, Achim Zeileis surprised us (well, me) by saying the following:

I’ve thought about adding a plot() method for the coeftest() function in the “lmtest” package. Essentially, it relies on a coef() and a vcov() method being available – and that a central limit theorem holds. For releasing it as a general function in the package the code is still too raw, but maybe it’s useful for someone on the list. Hence, I’ve included it below.

(I allowed myself to add some bolds in the text)
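The idea in the quote, that only a coef() and a vcov() method are needed, can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not Achim’s code: the model, the built-in `mtcars` data and the variable names are chosen only for the example, and the intervals are normal-approximation (Wald) intervals.

```r
# Any fitted model with coef() and vcov() methods will do; mtcars ships with R
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

est <- coef(fit)[-1]               # point estimates, intercept dropped
se  <- sqrt(diag(vcov(fit)))[-1]   # standard errors from the covariance matrix
z   <- qnorm(0.975)                # normal quantile, relying on the CLT
lower <- est - z * se
upper <- est + z * se

# One row per coefficient: a dot for the estimate, a segment for the interval
pos <- seq_along(est)
plot(est, pos, xlim = range(lower, upper), yaxt = "n", pch = 16,
     xlab = "Estimate (95% interval)", ylab = "")
segments(lower, pos, upper, pos)
axis(2, at = pos, labels = names(est), las = 1)
abline(v = 0, lty = 2)             # reference line at zero
```

Because nothing here depends on the model class beyond those two methods, the same lines should work for other fits that provide them, such as `glm` objects.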

So for the convenience of all of us, I uploaded Achim’s code in a file for easy access. Here is an example of how to use it:

source("http://www.r-statistics.com/wp-content/uploads/2010/07/coefplot.r.txt")
 
data("Mroz", package = "carData") # Mroz moved from {car} to {carData} as of car 3.0
fm

Here is the resulting graph:

I hope Achim will get around to improving the function so that he might think it worthy of joining his “lmtest” package. I am glad he shared his code for the rest of us to have something to work with in the meantime :)

* * *

Update (07.07.10):
Thanks to a comment by David Atkins, I found out there is a more mature version of this function (called coefplot) inside the {arm} package. This version offers many features, one of which is the ability to easily stack several confidence intervals one on top of the other.

It works for bayesglm, glm, lm and polr objects, and a default method is available which takes pre-computed coefficients and associated standard errors from any suitable model.

Example:
(Notice that the Poisson model in comparison with the binomial models does not make much sense, but is enough to illustrate the use of the function)

library("arm")
data("Mroz", package = "carData") # Mroz moved from {car} to {carData} as of car 3.0
M1

(A hat tip goes to Allan Engelhardt for help improving the code, and to Achim Zeileis for extending and improving the narration of the example.)

Resulting plot

* * *
Another method worth mentioning is the nomogram, implemented in Frank Harrell’s {rms} package.

* * *

Update (04.01.12):

The package {Coefplot}, by Jared Lander, plots coefficients from lm and glm models as well as from models generated by RevoScaleR’s rxLinMod and rxLogit functions. The package is built on top of ggplot2 graphics; you can see an example of its use here.