R-bloggers: an example of how interest networks propel viral events

A guest post by Jeff Hemsley, who has co-authored with Karine Nahon a new book titled Going Viral.

In Going Viral (Polity Press, 2013) we explore the topic of virality, the process of sharing messages that results in a fast, broad spread of information. What does that have to do with R, or the R-bloggers community? First and foremost, we use the R-bloggers community as an example of the role of interest networks (see description below) in driving viral events. But we also used R as our go-to tool for the research that went into the book. Even the cover art, pictured here, was created with R, using the igraph package. Included below is an excerpt from chapter 4 that includes the section on interest networks and R-bloggers.


100 most read R posts for 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages

R-bloggers.com is now three years young. The site is an (unofficial) online journal of the R statistical programming environment, written by bloggers who agreed to contribute their R articles to the site.

Last year, I posted on the top 24 R posts of 2011. In this post I wish to celebrate R-bloggers’ third birthmonth by sharing with you:

  1. Links to the top 100 most read R posts of 2012
  2. Statistics on “how well” R-bloggers did this year
  3. My wishlist for the R community for 2013 (blogging about R, guest posts, and sponsors)

1. Top 100 R posts of 2012

R-bloggers’ success is thanks to the content submitted by the more than 400 R bloggers who have joined R-bloggers. The R community currently has around 245 active R bloggers (links to the blogs are clearly visible in the right navigation bar on the R-bloggers homepage). In the past year, these bloggers wrote around 3200 posts about R!

Here is a list of the top visited posts on the site in 2012 (you can see the number of unique visitors in parentheses, while the list is ordered by the number of total page views):


Printing nested tables in R – bridging between the {reshape} and {tables} packages

This post shows how to print a prettier nested pivot table, created using the {reshape} package (similar to what you would get with Microsoft Excel), so you could print it either in the R terminal or as a LaTeX table. This task is done by bridging between the cast_df object produced by the {reshape} package, and the tabular function introduced by the new {tables} package.

Here is an example of the type of output we wish to produce in the R terminal:

       ozone       solar.r        wind         temp
 month mean  sd    mean    sd     mean   sd    mean  sd
 5     23.62 22.22 181.3   115.08 11.623 3.531 65.55 6.855
 6     29.44 18.21 190.2    92.88 10.267 3.769 79.10 6.599
 7     59.12 31.64 216.5    80.57  8.942 3.036 83.90 4.316
 8     59.96 39.68 171.9    76.83  8.794 3.226 83.97 6.585
 9     31.45 24.14 167.4    79.12 10.180 3.461 76.90 8.356

Or in a LaTeX document:

Motivation: creating pretty nested tables

In a recent post we learned how to use the {reshape} package (by Hadley Wickham) in order to aggregate and reshape data (in R) using the melt and cast functions.

The cast function is wonderful but it has one problem: the format of the output. As opposed to a pivot table in (for example) MS Excel, the output of a nested table created by cast is very “flat”. That is, there is only one row for the header, and only one column for the row names. So for both the R terminal and an Sweave document, when we deal with more complex reshaping/aggregating, the result is not something you would be proud to send to a journal.

The opportunity: the {tables} package

The good news is that Duncan Murdoch has recently released a new package to CRAN called {tables}. The {tables} package can compute and display complex tables of summary statistics and turn them into nice looking tables in Sweave (LaTeX) documents. To use the full power of this package, you are invited to read through its detailed (and well written) 23-page vignette. However, some of us might prefer to keep using the syntax of the {reshape} package, while also benefiting from the great formatting offered by the new {tables} package. For this purpose, I devised a function that bridges between cast_df (from {reshape}) and the tabular function (from {tables}).

The bridge: between the {tables} and the {reshape} packages

The code for the function is available on my github (link: tabular.cast_df.r on github) and it seems to work fine as far as I can see (though I wouldn’t run it on larger data files, since it relies on melting a cast_df object).

Here is an example of how to load and use the function:

######################
# Loading the functions
######################
# Making sure we can source code from github
source("https://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt")
 
# Reading in the function for using tabular on a cast_df object:
source_https("https://raw.github.com/talgalili/R-code-snippets/master/tabular.cast_df.r")
 
 
 
######################
# example:
######################
 
############
# Loading and preparing some data
require(reshape)
names(airquality) <- tolower(names(airquality))
airquality2 <- airquality
airquality2$temp2 <- ifelse(airquality2$temp > median(airquality2$temp), "hot", "cold")
aqm <- melt(airquality2, id=c("month", "day","temp2"), na.rm=TRUE)
colnames(aqm)[4] <- "variable2"	# because otherwise the function is having problem when relying on the melt function of the cast object
head(aqm,3)
#  month day temp2 variable2 value
#1     5   1  cold     ozone    41
#2     5   2  cold     ozone    36
#3     5   3  cold     ozone    12
 
############
# Running the example:
tabular.cast_df(cast(aqm, month ~ variable2, c(mean,sd)))
tabular(cast(aqm, month ~ variable2, c(mean,sd))) # notice how we turned tabular to be an S3 method that can deal with a cast_df object
Hmisc::latex(tabular(cast(aqm, month ~ variable2, c(mean,sd)))) # this is what we would have used for an Sweave document
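The comment above about turning tabular into an S3 method relies on R's S3 dispatch: a generic calls UseMethod(), and R picks the method whose suffix matches the class of the first argument. Here is a minimal sketch of that mechanism. The names describe and toy are hypothetical and used only for illustration; this is not the code in tabular.cast_df.r.

```r
# A hypothetical generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "no special handling"

# Method picked automatically for objects whose class vector includes "cast_df"
describe.cast_df <- function(x, ...) "handled by the cast_df method"

# A toy stand-in for the object returned by cast()
# (assumption: for dispatch, only the class attribute matters)
toy <- structure(data.frame(month = 5:6), class = c("cast_df", "data.frame"))

describe(toy)           # dispatches to describe.cast_df
describe(data.frame())  # falls back to describe.default
```

The same pattern is what lets a single call such as tabular(...) accept a cast_df object once a matching method has been defined.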

And here are the results in the terminal:

>
> tabular.cast_df(cast(aqm, month ~ variable2, c(mean,sd)))
 
       ozone       solar.r        wind         temp
 month mean  sd    mean    sd     mean   sd    mean  sd
 5     23.62 22.22 181.3   115.08 11.623 3.531 65.55 6.855
 6     29.44 18.21 190.2    92.88 10.267 3.769 79.10 6.599
 7     59.12 31.64 216.5    80.57  8.942 3.036 83.90 4.316
 8     59.96 39.68 171.9    76.83  8.794 3.226 83.97 6.585
 9     31.45 24.14 167.4    79.12 10.180 3.461 76.90 8.356
> tabular(cast(aqm, month ~ variable2, c(mean,sd))) # notice how we turned tabular to be an S3 method that can deal with a cast_df object
 
       ozone       solar.r        wind         temp
 month mean  sd    mean    sd     mean   sd    mean  sd
 5     23.62 22.22 181.3   115.08 11.623 3.531 65.55 6.855
 6     29.44 18.21 190.2    92.88 10.267 3.769 79.10 6.599
 7     59.12 31.64 216.5    80.57  8.942 3.036 83.90 4.316
 8     59.96 39.68 171.9    76.83  8.794 3.226 83.97 6.585
 9     31.45 24.14 167.4    79.12 10.180 3.461 76.90 8.356

And in an Sweave document:

Here is an example for the Rnw file that produces the above table:
cast_df to tabular.Rnw

I will finish by saying that the tabular function offers more flexibility than the function I provided. If you find any bugs or have suggestions for improvement, you are invited to leave a comment here or inside the code on github.

(Link-tip goes to Tony Breyal for putting together a solution for sourcing r code from github.)

Interactive Graphics with the iplots Package (from “R in Action”)

The following introductory post is intended for new users of R. It deals with interactive visualization using R through the iplots package.

This is a guest article by Dr. Robert I. Kabacoff, the founder of (one of) the first online R tutorial websites: Quick-R. Kabacoff has recently published the book “R in Action”, providing a detailed walk-through for the R language based on various examples for illustrating R’s features (data manipulation, statistical methods, graphics, and so on…). In previous guest posts by Kabacoff we introduced data.frame objects in R and dealt with the Aggregation and Restructuring of data (using base R functions and the reshape package).

For readers of this blog, there is a 38% discount off the “R in Action” book (as well as all other eBooks, pBooks and MEAPs at Manning publishing house), simply by using the code rblogg38 when reaching checkout.

Let us now talk about Interactive Graphics with the iplots Package:

Interactive Graphics with the iplots Package

The base installation of R provides limited interactivity with graphs. You can modify graphs by issuing additional program statements, but there’s little that you can do to modify them or gather new information from them using the mouse. However, there are contributed packages that greatly enhance your ability to interact with the graphs you create—playwith, latticist, iplots, and rggobi. In this article, we’ll focus on functions provided by the iplots package. Be sure to install it before first use.

While playwith and latticist allow you to interact with a single graph, the iplots package takes interaction in a different direction. This package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots, and histograms that can be linked together and color brushed. This means that you can select and identify observations using the mouse, and highlighting observations in one graph will automatically highlight the same observations in all other open graphs. You can also use the mouse to obtain information about graphic objects such as points, bars, lines, and box plots.

The iplots package is implemented through Java and the primary functions are listed in table 1.

Table 1 iplot functions

Function    Description
ibar()      Interactive bar chart
ibox()      Interactive box plot
ihist()     Interactive histogram
imap()      Interactive map
imosaic()   Interactive mosaic plot
ipcp()      Interactive parallel coordinates plot
iplot()     Interactive scatter plot

To understand how iplots works, execute the code provided in listing 1.

Listing 1 iplots demonstration

library(iplots)
attach(mtcars)
cylinders <- factor(cyl)
gears <- factor(gear)
transmission <- factor(am)
ihist(mpg)
ibar(gears)
iplot(mpg, wt)
ibox(mtcars[c("mpg", "wt", "qsec", "disp", "hp")])
ipcp(mtcars[c("mpg", "wt", "qsec", "disp", "hp")])
imosaic(transmission, cylinders)
detach(mtcars)

Six windows containing graphs will open. Rearrange them on the desktop so that each is visible (each can be resized if necessary). A portion of the display is provided in figure 1.

Figure 1 An iplots demonstration created by listing 1. Only four of the six windows are displayed to save room. In these graphs, the user has clicked on the three-gear bar in the bar chart window.

Now try the following:

  • Click on the three-gear bar in the Barchart (gears) window. The bar will turn red. In addition, all cars with three-gear engines will be highlighted in the other graph windows.
  • Mouse down and drag to select a rectangular region of points in the Scatter plot (wt vs mpg) window. These points will be highlighted and the corresponding observations in every other graph window will also turn red.
  • Hold down the Ctrl key and move the mouse pointer over a point, bar, box plot, or line in one of the graphs. Details about that object will appear in a pop-up window.
  • Right-click on any object and note the options that are offered in the context menu. For example, you can right-click on the Boxplot (mpg) window and change the graph to a parallel coordinates plot (PCP).
  • You can drag to select more than one object (point, bar, and so on) or use Shift-click to select noncontiguous objects. Try selecting both the three- and five-gear bars in the Barchart (gears) window.

The functions in the iplots package allow you to explore the variable distributions and relationships among variables in subgroups of observations that you select interactively. This can provide insights that would be difficult and time-consuming to obtain in other ways. For more information on the iplots package, visit the project website at http://rosuda.org/iplots/.

Summary

In this article, we explored one of the several packages for dynamically interacting with graphs, iplots. This package allows you to interact directly with data in graphs, leading to a greater intimacy with your data and expanded opportunities for developing insights.


This article first appeared as chapter 16.4.4 of the “R in Action” book, and is published with permission from Manning publishing house. Other books in this series which you might be interested in are (see the beginning of this post for a discount code):

Merging two data.frame objects while preserving the rows’ order

Update (2017-02-03): the dplyr package offers a great solution for this issue; see the document Two-table verbs for more details.

Merging two data.frame objects in R is very easily done by using the merge function. While being very powerful, the merge function does not (as of yet) offer to return a merged data.frame that preserves the original order of one of the two merged data.frame objects.
In this post I describe this problem, and offer some easy to use code to solve it.

Let us start with a simple example:

    x <- data.frame(
           ref = c( 'Ref1', 'Ref2' )
         , label = c( 'Label01', 'Label02' )
         )
    y <- data.frame(
          id = c( 'A1', 'C2', 'B3', 'D4' )
        , ref = c( 'Ref1', 'Ref2' , 'Ref3','Ref1' )
        , val = c( 1.11, 2.22, 3.33, 4.44 )
        )
 
#######################
# having a look at the two data.frame objects:
> x
   ref   label
1 Ref1 Label01
2 Ref2 Label02
> y
  id  ref  val
1 A1 Ref1 1.11
2 C2 Ref2 2.22
3 B3 Ref3 3.33
4 D4 Ref1 4.44

If we now merge the two objects, we will find that the order of the rows is different from the original order of the “y” object. This is true whether we use “sort=T” or “sort=F”. Notice that the original order was an ascending order of the “val” variable:

> merge( x, y, by='ref', all.y = T, sort= T)
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
> merge( x, y, by='ref', all.y = T, sort=F )
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33

This is explained in the help page of ?merge:

The rows are by default lexicographically sorted on the common columns, but for ‘sort = FALSE’ are in an unspecified order.

Or put differently: sort=FALSE doesn’t preserve the order of either of the two entered data.frame objects (x or y); instead it gives us an unspecified (potentially random) order.

However, it can so happen that we want to make sure the resulting merged data.frame is ordered according to the order of one of the two original objects. To ensure that, we can add an extra “id” (row index number) sequence to the data.frame we wish to sort on. Then, we can merge the two data.frame objects, sort by the sequence, and delete the sequence (this was previously mentioned on the R-help mailing list by Bart Joosen).
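Spelled out step by step in base R, the idea looks roughly like this. It is a minimal sketch on the same x and y as above; the helper column name row_index is an arbitrary choice made for this illustration, not part of any package.

```r
# Same example data as above (stringsAsFactors = FALSE keeps the keys as plain strings)
x <- data.frame(ref = c("Ref1", "Ref2"),
                label = c("Label01", "Label02"),
                stringsAsFactors = FALSE)
y <- data.frame(id = c("A1", "C2", "B3", "D4"),
                ref = c("Ref1", "Ref2", "Ref3", "Ref1"),
                val = c(1.11, 2.22, 3.33, 4.44),
                stringsAsFactors = FALSE)

# Step 1: remember the original row order of y in a helper column
y$row_index <- seq_len(nrow(y))

# Step 2: merge as usual (the resulting row order is unspecified)
merged <- merge(x, y, by = "ref", all.y = TRUE, sort = FALSE)

# Step 3: sort by the helper column, then drop it
merged <- merged[order(merged$row_index), ]
merged$row_index <- NULL
rownames(merged) <- NULL

merged  # rows now follow the original order of y: A1, C2, B3, D4
```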

Following is a function that implements this logic, followed by an example for its use:

############## function:
	merge.with.order <- function(x,y, ..., sort = T, keep_order)
	{
		# this function works just like merge, only that it adds the option to return the merged data.frame ordered by x (1) or by y (2)
		add.id.column.to.data <- function(DATA)
		{
			data.frame(DATA, id... = seq_len(nrow(DATA)))
		}
		# add.id.column.to.data(data.frame(x = rnorm(5), x2 = rnorm(5)))
		order.by.id...and.remove.it <- function(DATA)
		{
			# gets in a data.frame with the "id..." column.  Orders by it and returns it
			if(!any(colnames(DATA)=="id...")) stop("The function order.by.id...and.remove.it only works with data.frame objects which includes the 'id...' order column")
 
			ss_r <- order(DATA$id...)
			ss_c <- colnames(DATA) != "id..."
			DATA[ss_r, ss_c]
		}
 
		# tmp <- function(x) x==1; 1	# why we must check what to do if it is missing or not...
		# tmp()
 
		if(!missing(keep_order))
		{
			if(keep_order == 1) return(order.by.id...and.remove.it(merge(x=add.id.column.to.data(x),y=y,..., sort = FALSE)))
			if(keep_order == 2) return(order.by.id...and.remove.it(merge(x=x,y=add.id.column.to.data(y),..., sort = FALSE)))
			# if you didn't get "return" by now - issue a warning, then fall through to the regular merge below
			warning("The function merge.with.order only accepts NULL/1/2 values for the keep_order variable")
		}
		# keep_order is missing (or has an unsupported value): behave just like merge
		return(merge(x=x,y=y,..., sort = sort))
	}
 
######### example:
>     merge( x, y, by='ref', all.y = T, sort=F )
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
>     merge.with.order( x, y, by='ref', all.y = T, sort=F ,keep_order = 1)
   ref   label id  val
1 Ref1 Label01 A1 1.11
2 Ref1 Label01 D4 4.44
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
>     merge.with.order( x, y, by='ref', all.y = T, sort=F ,keep_order = 2) # yay - works as we wanted it to...
   ref   label id  val
1 Ref1 Label01 A1 1.11
3 Ref2 Label02 C2 2.22
4 Ref3    <NA> B3 3.33
2 Ref1 Label01 D4 4.44

Here is a description for how to use the keep_order parameter:

keep_order can accept the numbers 1 or 2, in which case it will make sure the resulting merged data.frame will be ordered according to the original order of rows of the data.frame entered to x (if keep_order=1) or to y (if keep_order=2). If keep_order is missing, merge will continue working as usual. If keep_order gets some input other than 1 or 2, it will issue a warning that it doesn’t accept these values, but will continue working as merge normally would. Notice that the parameter “sort” is practically overridden when using keep_order (with the value 1 or 2).

The same code can be used to modify the original merge.data.frame function in base R so as to allow the use of keep_order; here is a link to the patched merge.data.frame function (on github). If you can think of any ways to improve the function (or happen to notice a bug), please let me know either on github or in the comments. (Also, hearing that you found the function useful will be fun to know about :) )

Update: Thanks to KY’s comment, I noticed the ?join function in the {plyr} package. This function is similar to merge (with fewer features, yet faster), and also automatically keeps the order of the x (first) data.frame used for merging, as explained in the ?join help page:

Unlike merge, (join) preserves the order of x no matter what join type is used. If needed, rows from y will be added to the bottom. Join is often faster than merge, although it is somewhat less featureful – it currently offers no way to rename output or merge on different variables in the x and y data frames.
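To keep the row order of y from the earlier example, y would be passed as the first argument to join. The following is a small sketch, assuming the {plyr} package is installed (the call is skipped otherwise); the data are the same toy x and y used earlier in this post.

```r
x <- data.frame(ref = c("Ref1", "Ref2"),
                label = c("Label01", "Label02"),
                stringsAsFactors = FALSE)
y <- data.frame(id = c("A1", "C2", "B3", "D4"),
                ref = c("Ref1", "Ref2", "Ref3", "Ref1"),
                val = c(1.11, 2.22, 3.33, 4.44),
                stringsAsFactors = FALSE)

# Only run the join when plyr is available on this machine
if (requireNamespace("plyr", quietly = TRUE)) {
  # y goes first, so its row order is the one that is preserved;
  # type = "left" keeps every row of y and fills unmatched labels with NA
  joined <- plyr::join(y, x, by = "ref", type = "left")
  print(joined)
}
```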

Aggregation and Restructuring data (from “R in Action”)

The following introductory post is intended for new users of R. It deals with the restructuring of data: what it is and how to perform it using base R functions and the {reshape} package.

This is a guest article by Dr. Robert I. Kabacoff, the founder of (one of) the first online R tutorial websites: Quick-R. Kabacoff has recently published the book “R in Action”, providing a detailed walk-through for the R language based on various examples for illustrating R’s features (data manipulation, statistical methods, graphics, and so on…). The previous guest post by Kabacoff introduced data.frame objects in R.

For readers of this blog, there is a 38% discount off the “R in Action” book (as well as all other eBooks, pBooks and MEAPs at Manning publishing house), simply by using the code rblogg38 when reaching checkout.

Let us now talk about the Aggregation and Restructuring of data in R:

Aggregation and Restructuring

R provides a number of powerful methods for aggregating and reshaping data. When you aggregate data, you replace groups of observations with summary statistics based on those observations. When you reshape data, you alter the structure (rows and columns) determining how the data is organized. This article describes a variety of methods for accomplishing these tasks.

We’ll use the mtcars data frame that’s included with the base installation of R. This dataset, extracted from Motor Trend magazine (1974), describes the design and performance characteristics (number of cylinders, displacement, horsepower, mpg, and so on) for 32 automobiles. To learn more about the dataset, see help(mtcars).

Transpose

The transpose (reversing rows and columns) is perhaps the simplest method of reshaping a dataset. Use the t() function to transpose a matrix or a data frame. In the latter case, row names become variable (column) names. An example is presented in the next listing.

Listing 1 Transposing a dataset

> cars <- mtcars[1:5,1:4]
> cars
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
> t(cars)
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
mpg         21            21       22.8           21.4              18.7
cyl          6             6        4.0            6.0               8.0
disp       160           160      108.0          258.0             360.0
hp         110           110       93.0          110.0             175.0

Listing 1 uses a subset of the mtcars dataset in order to conserve space on the page. You’ll see a more flexible way of transposing data when we look at the reshape package later in this article.
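One detail worth knowing when transposing a data frame: t() returns a matrix, not a data frame. If you need a data frame again, you can convert the result back. A small sketch using the same subset as Listing 1 (the object names tcars and tcars_df are just illustrative):

```r
cars <- mtcars[1:5, 1:4]

# t() coerces the data frame to a matrix before transposing it
tcars <- t(cars)
is.matrix(tcars)       # TRUE
is.data.frame(tcars)   # FALSE

# Convert back when data-frame behaviour is needed
tcars_df <- as.data.frame(tcars)
is.data.frame(tcars_df)

# The former column names are now the row names
rownames(tcars_df)
```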

Aggregating data

It’s relatively easy to collapse data in R using one or more by variables and a defined function. The format is

aggregate(x, by, FUN)

where x is the data object to be collapsed, by is a list of variables that will be crossed to form the new observations, and FUN is the scalar function used to calculate summary statistics that will make up the new observation values.

As an example, we’ll aggregate the mtcars data by number of cylinders and gears, returning means on each of the numeric variables (see the next listing).

Listing 2 Aggregating data

> options(digits=3)
> attach(mtcars)
> aggdata <-aggregate(mtcars, by=list(cyl,gear), FUN=mean, na.rm=TRUE)
> aggdata
  Group.1 Group.2  mpg cyl disp  hp drat   wt qsec  vs   am gear carb
1       4       3 21.5   4  120  97 3.70 2.46 20.0 1.0 0.00    3 1.00
2       6       3 19.8   6  242 108 2.92 3.34 19.8 1.0 0.00    3 1.00
3       8       3 15.1   8  358 194 3.12 4.10 17.1 0.0 0.00    3 3.08
4       4       4 26.9   4  103  76 4.11 2.38 19.6 1.0 0.75    4 1.50
5       6       4 19.8   6  164 116 3.91 3.09 17.7 0.5 0.50    4 4.00
6       4       5 28.2   4  108 102 4.10 1.83 16.8 0.5 1.00    5 2.00
7       6       5 19.7   6  145 175 3.62 2.77 15.5 0.0 1.00    5 6.00
8       8       5 15.4   8  326 300 3.88 3.37 14.6 0.0 1.00    5 6.00

In these results, Group.1 represents the number of cylinders (4, 6, or 8) and Group.2 represents the number of gears (3, 4, or 5). For example, cars with 4 cylinders and 3 gears have a mean of 21.5 miles per gallon (mpg).

When you’re using the aggregate() function, the by variables must be in a list (even if there’s only one). You can declare a custom name for the groups from within the list, for instance, using by=list(Group.cyl=cyl, Group.gears=gear).
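As a small sketch of the named-groups variant, here is the same aggregation restricted to three numeric columns and written without attach(). The choice of columns (mpg, hp, wt) is arbitrary and only keeps the output narrow.

```r
# Aggregate three columns of mtcars by cylinders and gears,
# giving the grouping columns descriptive names inside the list
aggdata <- aggregate(mtcars[c("mpg", "hp", "wt")],
                     by = list(Group.cyl = mtcars$cyl,
                               Group.gears = mtcars$gear),
                     FUN = mean, na.rm = TRUE)

aggdata  # one row per observed cylinder/gear combination
```

Referring to mtcars$cyl and mtcars$gear directly avoids relying on attach(), and leaving the grouping variables out of the first argument avoids averaging cyl and gear themselves.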

The function specified can be any built-in or user-provided function. This gives the aggregate command a great deal of power. But when it comes to power, nothing beats the reshape package.

The reshape package

The reshape package is a tremendously versatile approach to both restructuring and aggregating datasets. Because of this versatility, it can be a bit challenging to learn.

We’ll go through the process slowly and use a small dataset so that it’s clear what’s happening. Because reshape isn’t included in the standard installation of R, you’ll need to install it one time, using install.packages("reshape").

Basically, you’ll “melt” data so that each row is a unique ID-variable combination. Then you’ll “cast” the melted data into any shape you desire. During the cast, you can aggregate the data with any function you wish. The dataset you’ll be working with is shown in table 1.

Table 1 The original dataset (mydata)

ID  Time  X1  X2
1   1     5   6
1   2     3   5
2   1     6   1
2   2     2   4

In this dataset, the measurements are the values in the last two columns (5, 6, 3, 5, 6, 1, 2, and 4). Each measurement is uniquely identified by a combination of ID variables (in this case ID, Time, and whether the measurement is on X1 or X2). For example, the measured value 5 in the first row is uniquely identified by knowing that it’s from observation (ID) 1, at Time 1, and on variable X1.
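If you want to follow along, the dataset can be typed in directly. This is a sketch that assumes lower-case column names (id, time, x1, x2), chosen so that they match the id=c("id", "time") argument used in the melt() call below:

```r
# The four observations of table 1, with lower-case column names (an assumption)
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))
mydata
```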

Melting

When you melt a dataset, you restructure it into a format where each measured variable is in its own row, along with the ID variables needed to uniquely identify it. If you melt the data from table 1, using the following code

library(reshape)
md <- melt(mydata, id=(c("id", "time")))

you end up with the structure shown in table 2.

Table 2 The melted dataset

ID  Time  Variable  Value
1   1     X1        5
1   2     X1        3
2   1     X1        6
2   2     X1        2
1   1     X2        6
1   2     X2        5
2   1     X2        1
2   2     X2        4

Note that you must specify the variables needed to uniquely identify each measurement (ID and Time) and that the variable indicating the measurement variable names (X1 or X2) is created for you automatically.
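To make the melted structure concrete, the same long layout can be built by hand in base R. This is only an illustration of what the result looks like, not how the reshape package implements melt(); the lower-case column names are an assumption.

```r
# Wide data: one row per id/time combination (lower-case names assumed)
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# Long ("melted") layout: the ID variables are repeated once per measured
# variable, and the measurements are stacked into a single value column
md_by_hand <- data.frame(id       = rep(mydata$id, times = 2),
                         time     = rep(mydata$time, times = 2),
                         variable = rep(c("x1", "x2"), each = nrow(mydata)),
                         value    = c(mydata$x1, mydata$x2))
md_by_hand  # eight rows, one per measurement, as in table 2
```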

Now that you have your data in a melted form, you can recast it into any shape, using the cast() function.

Casting

The cast() function starts with melted data and reshapes it using a formula that you provide and an (optional) function used to aggregate the data. The format is

newdata <- cast(md, formula, FUN)

where md is the melted data, formula describes the desired end result, and FUN is the (optional) aggregating function. The formula takes the form

rowvar1 + rowvar2 + …  ~  colvar1 + colvar2 + …

In this formula, rowvar1 + rowvar2 + … define the set of crossed variables that define the rows, and colvar1 + colvar2 + … define the set of crossed variables that define the columns. See the examples in figure 1.

Figure 1 Reshaping data with the melt() and cast() functions

Because the formulas on the right side (d, e, and f) don’t include a function, the data is reshaped. In contrast, the examples on the left side (a, b, and c) specify the mean as an aggregating function. Thus the data are not only reshaped but aggregated as well. For example, (a) gives the means on X1 and X2 averaged over time for each observation. Example (b) gives the mean scores of X1 and X2 at Time 1 and Time 2, averaged over observations. In (c) you have the mean score for each observation at Time 1 and Time 2, averaged over X1 and X2.
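The aggregated results described for (a) and (b) can be checked by hand with base R's tapply(). This is a sketch on a hand-built long version of the data with assumed lower-case names; it reproduces the numbers, not the reshape syntax.

```r
# Melted (long) data, entered by hand
md <- data.frame(id       = c(1, 1, 2, 2, 1, 1, 2, 2),
                 time     = c(1, 2, 1, 2, 1, 2, 1, 2),
                 variable = rep(c("x1", "x2"), each = 4),
                 value    = c(5, 3, 6, 2, 6, 5, 1, 4))

# As in (a): mean of each variable per observation, averaged over time
by_id <- tapply(md$value, list(id = md$id, variable = md$variable), mean)
by_id

# As in (b): mean of each variable per time point, averaged over observations
by_time <- tapply(md$value, list(time = md$time, variable = md$variable), mean)
by_time
```

For instance, observation 1 has x1 values 5 and 3, so its mean over time is 4.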

As you can see, the flexibility provided by the melt() and cast() functions is amazing. There are many times when you’ll have to reshape or aggregate your data prior to analysis. For example, you’ll typically need to place your data in what’s called long format resembling table 2 when analyzing repeated measures data (data where multiple measures are recorded for each observation).

Summary

Chapter 5 of R in Action reviews many of the dozens of mathematical, statistical, and probability functions that are useful for manipulating data. In this article, we have briefly explored several ways of aggregating and restructuring data.

 

This article first appeared as chapter 5.6 of the “R in Action” book, and is published with permission from Manning publishing house. Other books in this series which you might be interested in are (see the beginning of this post for a discount code):

Top 20 R posts of 2011 (and some R-bloggers statistics)

R-bloggers.com is now two years young. The site is an (unofficial) online R journal written by bloggers who agreed to contribute their R articles to the site.
In this post I wish to celebrate R-bloggers’ second birthmonth by sharing with you:

  1. Links to the top 20 posts of 2011
  2. Statistics on “how well” R-bloggers did this year
  3. An invitation for sponsors/supporters to help keep the site alive

1. Top 20 R posts of 2011

R-bloggers’ success is largely owed to the content submitted by the R bloggers themselves.  The R community currently has almost 300 active R bloggers (links to the blogs are clearly visible in the right navigation bar on the R-bloggers homepage).  In the past year, these bloggers wrote over 2800 posts about R.

Here is a list of the top visited posts on the site in 2011:

  1. How much of r is written in r
  2. Cpu and gpu trends over time
  3. Select operations on r data frames
  4. Getting started with sweave r latex eclipse statet texlipse
  5. Delete rows from r data frame
  6. Amanda cox on how the new york times graphics department uses r
  7. Hipster programming languages
  8. Opendata r google easy maps
  9. New r generated video has stackoverflow posting behavior changed over time
  10. SNA visualising an email box with r
  11. 100 prisoners 100 lines of code
  12. Google ai challenge languages used by the best programmers
  13. Basics on markov chain for parents
  14. Top 10 algorithms in data mining
  15. A million random digits review of reviews
  16. Character occurrence in passwords
  17. Setting graph margins in r using the par function and lots of cow milk
  18. The new r compiler package in r 2 13 0 some first experiments
  19. Tutorial principal components analysis pca in r
  20. Making guis using c and r with the help of r net

2. Statistics – how well did R-bloggers do this year

There are several metrics one can consider when evaluating the success of a website. I’ll present a few of them here and will begin by talking about the visitors to the site.

This year, the site was visited by over 665,000 “Unique Visitors.”  There was a total of over 1.4 million visits and over 2.8 million page-views.  People have surfed the site from over 200 countries, with the greatest number of visitors coming from the United States (~40%) and then followed by the United Kingdom (6.9%), Germany (6.6%), Canada (4.7%), France (3.3%), and other countries.

The site has received between 15,000 and 45,000 visits a week in the past few months, and I suspect this number will remain stable in the next few months (unless something very interesting happens).

I believe this number will stay constant thanks to visitors’ loyalty: 55% of the site’s visits came from returning users.

Another indicator of reader loyalty is the number of subscribers to R-bloggers as counted by FeedBurner, which includes both RSS readers and e-mail subscribers.  The number of subscribers is estimated to be between 5600 and 5900.

Thus, I am very happy to see that R-bloggers continues to succeed in offering a real service to the global R users community.

3. Invitation to sponsor/advertise on R-bloggers

This year I was sadly accused by Google AdSense of click fraud (which I did not commit, though I have no way of proving my innocence).  Therefore, I can no longer use Google AdSense to cover R-bloggers’ high monthly bills, and I have turned to direct sponsorship of R-bloggers.

If you are interested in sponsoring/placing-ads/supporting R-bloggers, then you are welcome to contact me.

Happy new year!
Yours,
Tal Galili

data.frame objects in R (via “R in Action”)

The following introductory post is intended for new users of R.  It deals with R data frames: what they are, and how to create, view, and update them.

This is a guest article by Dr. Robert I. Kabacoff, the founder of (one of) the first online R tutorial websites: Quick-R.  Kabacoff has recently published the book “R in Action”, which provides a detailed walk-through of the R language, using various examples to illustrate R’s features (data manipulation, statistical methods, graphics, and so on…)

For readers of this blog, there is a 38% discount off the “R in Action” book (as well as all other eBooks, pBooks and MEAPs at Manning publishing house), simply by using the code rblogg38 when reaching checkout.

Let us now talk about data frames:

Data Frames


A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). It’s similar to the datasets you’d typically see in SAS, SPSS, and Stata. Data frames are the most common data structure you’ll deal with in R.

The patient dataset in table 1 consists of numeric and character data.

Table 1: A patient dataset

PatientID  AdmDate     Age  Diabetes  Status
1          10/15/2009  25   Type1     Poor
2          11/01/2009  34   Type2     Improved
3          10/21/2009  28   Type1     Excellent
4          10/28/2009  52   Type1     Poor

Because there are multiple modes of data, you can’t contain this data in a matrix. In this case, a data frame would be the structure of choice.

A data frame is created with the data.frame() function:

mydata <- data.frame(col1, col2, col3,…)

where col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function.

The following listing makes this clear.

Listing 1 Creating a data frame

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

Each column must have only one mode, but you can put columns of different modes together to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames.
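As a minimal sketch of the "one mode per column" rule, the code below rebuilds the patient data frame and inspects the class of each column. The stringsAsFactors = FALSE argument and the renamed copy (renamed, ageYears) are assumptions added for illustration; they are not part of the original listing.

```r
# Rebuild the patient data frame from Listing 1
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")

# stringsAsFactors = FALSE keeps text columns as character vectors
# regardless of the R version (an assumption made for this sketch)
patientdata <- data.frame(patientID, age, diabetes, status,
                          stringsAsFactors = FALSE)

# Each column has exactly one class; different columns may differ
col_classes <- sapply(patientdata, class)
print(col_classes)

# names() can also rename columns after creation (hypothetical new name)
renamed <- patientdata
names(renamed)[2] <- "ageYears"
print(names(renamed))
```

In R versions before 4.0, omitting stringsAsFactors = FALSE would turn the two text columns into factors, which is why it is set explicitly here.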

There are several ways to identify the elements of a data frame. You can use the subscript notation or you can specify column names. Using the patientdata data frame created earlier, the following listing demonstrates these approaches.

Listing 2 Specifying elements of a data frame

> patientdata[1:2]
  patientID age
1         1  25
2         2  34
3         3  28
4         4  52
> patientdata[c("diabetes", "status")]
  diabetes    status
1    Type1      Poor
2    Type2  Improved
3    Type1 Excellent
4    Type1      Poor
> patientdata$age    #age variable in the patient data frame
[1] 25 34 28 52

The $ notation in the third example is used to indicate a particular variable from a given data frame. For example, if you want to cross-tabulate diabetes type by status, you could use the following code:

> table(patientdata$diabetes, patientdata$status)
 
        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0
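The bracket notation also accepts a row index before the comma, which is handy for subsetting. The sketch below selects patients older than 30 and keeps two columns; the cutoff of 30 and the object name older are arbitrary choices made for illustration.

```r
# Rebuild the patient data frame used in the listings above
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status,
                          stringsAsFactors = FALSE)

# [rows, columns]: a logical vector picks rows, a character vector picks columns
older <- patientdata[patientdata$age > 30, c("patientID", "status")]
print(older)
```

Leaving the row position empty (for example patientdata[, "age"]) selects all rows for the named column.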

It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts are available. You can use either the attach() and detach() or with() functions to simplify your code.

attach, detach, and with

The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable. Using a sample (mtcars) data frame, you could use the following code to obtain summary statistics for automobile mileage (mpg), and plot this variable against engine displacement (disp), and weight (wt):

summary(mtcars$mpg)
plot(mtcars$mpg, mtcars$disp)
plot(mtcars$mpg, mtcars$wt)

This could also be written as

attach(mtcars)
  summary(mpg)
  plot(mpg, disp)
  plot(mpg, wt)
detach(mtcars)

The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The statement is optional but is good programming practice and should be included routinely.
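One way to see what attach() and detach() do to the search path is to inspect search() before and after. This is only a sketch; the variable names attached_during, attached_after, and mean_mpg are made up for illustration.

```r
attach(mtcars)
# While attached, "mtcars" appears on the search path
attached_during <- "mtcars" %in% search()
# ...so its columns can be referenced by bare name
mean_mpg <- mean(mpg)
detach(mtcars)
# After detach() the entry is gone, but the data frame itself is untouched
attached_after <- "mtcars" %in% search()

print(c(during = attached_during, after = attached_after))
print(nrow(mtcars))
```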

The limitations with this approach are evident when more than one object can have the same name. Consider the following code:

> mpg <- c(25, 36, 47)
> attach(mtcars)
 
The following object(s) are masked _by_ ‘.GlobalEnv’: mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  ‘x’ and ‘y’ lengths differ
> mpg
[1] 25 36 47

Here we already have an object named mpg in our environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn’t what you want. The plot statement fails because mpg has 3 elements and wt has 32 elements. The attach() and detach() functions are best used when you’re analyzing a single data frame and you’re unlikely to have multiple objects with the same name. In any case, be vigilant for warnings that say that objects are being masked.
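The length mismatch behind that error can be checked directly. The following sketch compares the masked global vector with the data frame column; global_len and frame_len are names invented for this example.

```r
mpg <- c(25, 36, 47)   # a global object that shares its name with a column
attach(mtcars)         # R reports that mpg is masked by the global environment

global_len <- length(mpg)         # the bare name resolves to the global vector
frame_len <- length(mtcars$mpg)   # the $ notation always reaches the column

detach(mtcars)
rm(mpg)                # clean up so later code sees no stray mpg object

print(c(global = global_len, frame = frame_len))
```

Referring to the column explicitly as mtcars$mpg sidesteps the masking problem at the cost of more typing.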

An alternative approach is to use the with() function. You could write the previous example as

with(mtcars, {
  print(summary(mpg))  # explicit print: only the last value in {} is auto-printed
  plot(mpg, disp)
  plot(mpg, wt)
})

In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts here. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional.
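For a single statement, the call can be written without braces. A minimal sketch (cor() is used here only as a second one-line example and is not part of the original text):

```r
# One statement, no braces needed; print() makes the output explicit
# when the code is run from a script rather than typed at the prompt
print(with(mtcars, summary(mpg)))

# Any single expression works the same way, for example a correlation
print(with(mtcars, cor(mpg, wt)))
```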

The limitation of the with() function is that assignments will only exist within the function brackets. Consider the following:

> with(mtcars, {
   stats <- summary(mpg)
   stats
  })
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
> stats
Error: object ‘stats’ not found

If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one (<-). It will save the object to the global environment outside of the with() call. This can be demonstrated with the following code:

> with(mtcars, {
   nokeepstats <- summary(mpg)
   keepstats <<- summary(mpg)
})
> nokeepstats
Error: object ‘nokeepstats’ not found
> keepstats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
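Another option, not shown in the book excerpt, relies on the fact that with() returns the value of its last expression, so the result can be captured with an ordinary assignment. A minimal sketch:

```r
# The value of the last expression inside {} becomes the value of with()
keepstats <- with(mtcars, {
  stats <- summary(mpg)   # local to the with() call
  stats                   # returned to the caller
})

print(keepstats)
```

This avoids <<- altogether, which some readers may prefer because the assignment target is visible at the call site.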

Most books on R recommend using with() over attach(). I think that ultimately the choice is a matter of preference and should be based on what you’re trying to achieve and your understanding of the implications.

Case identifiers

In the patient data example, patientID is used to identify individuals in the dataset. In R, case identifiers can be specified with the row.names option of the data.frame() function. For example, the statement

patientdata <- data.frame(patientID, age, diabetes, status,
   row.names=patientID)

specifies patientID as the variable to use in labeling cases on various printouts and graphs produced by R.
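As a small sketch of the effect, the row names can be inspected and then used to look up a case by its identifier. The stringsAsFactors = FALSE argument is an addition made here so the result is the same across R versions.

```r
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")

patientdata <- data.frame(patientID, age, diabetes, status,
                          row.names = patientID,
                          stringsAsFactors = FALSE)

# Row names are stored as character strings
print(rownames(patientdata))

# A character row index looks a case up by its identifier
status_of_3 <- patientdata["3", "status"]
print(status_of_3)
```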

Summary

One of the most challenging tasks in data analysis is data preparation. R provides various structures for holding data and many methods for importing data from both keyboard and external sources. One of those structures is data frames, which we covered here. Your ability to specify elements of these structures via the bracket notation is particularly important in selecting, subsetting, and transforming data.

R offers a wealth of functions for accessing external data. This includes data from flat files, web files, statistical packages, spreadsheets, and databases. Note that you can also export data from R into these external formats. We showed you how to use either the attach() and detach() or with() functions to simplify your code.

This article first appeared as chapter 2.2.4 of the “R in Action” book, and is published with permission from Manning publishing house.

UseR! 2011 slides and videos – on one page

Links to slides and talks from useR 2011 – all organized in one page.

I was recently reminded that the wonderful team at Warwick University made sure to put online many of the slides (and some videos) of talks from the recent useR 2011 conference.  You can browse through the talks by going between the timetables (which will be the most up to date if more slides are added later), but I thought it might be more convenient for some of you to have the links to all the talks (with slides/videos) in one place.

I am grateful to all of the wonderful people who put their time into making such an amazing event happen (organizers, speakers, attendees), and also to the many speakers who made sure to share their talks/slides online for all of us to reference.  I hope this open-slides trend will continue in the upcoming useR conferences…

Below are all the links:

Tuesday 16th August

09:50 – 10:50

  • Kaleidoscope Ia, MS.03, Chair: Dieter Menne
    • Claudia Beleites: Spectroscopic Data in R and Validation of Soft Classifiers: Classifying Cells and Tissues by Raman Spectroscopy [Slides]
    • Jonathan Rosenblatt: Revisiting Multi-Subject Random Effects in fMRI [Slides]
    • Zoe Hoare: Putting the R into Randomisation [Slides]
  • Kaleidoscope Ib, MS.01, Chair: Simon Urbanek
    • Markus Gesmann: Using the Google Visualisation API with R [Slides]
  • Kaleidoscope Ic, MS.02, Chair: Achim Zeileis
    • David Smith: The R Ecosystem [Slides]
    • E. James Harner: Rc2: R collaboration in the cloud [Slides]

11:15 – 12:35

  • Portfolio Management, B3.02, Chair: Patrick Burns
    • Jagrata Minardi: R in the Practice of Risk Management Today [Slides]
  • Bioinformatics and High-Throughput Data, B3.03, Chair: Hervé Pagès
    • Thierry Onkelinx: AFLP: generating objective and repeatable genetic data [Slides]
  • High Performance Computing, MS.03, Chair: Stefan Theussl
    • Willem Ligtenberg: GPU computing and R [Slides]
    • Manuel Quesada: OBANSoft: integrated software for Bayesian statistics and high performance computing with R [Slides]
  • Reporting Technologies and Workflows, MS.01, Chair: Martin Mächler
    • Andreas Leha: The Emacs Org-mode: Reproducible Research and Beyond [Slides]
  • Teaching, MS.02, Chair: Jay G. Kerns
    • Ian Holliday: Teaching Statistics to Psychology Students using Reproducible Computing package RC and supporting Peer Review Framework [Slides]
    • Achim Zeileis: Automatic generation of exams in R [Slides]

14:00 – 14:45

  • Invited Talk, MS.01/MS.02, Chair: David Firth
    • Ulrike Grömping: Design of Experiments in R [Slides] [Video]

14:45 – 15:30

  • Invited Talk, MS.01/MS.02, Chair: David Firth
    • Jonathan Rougier: Nomograms for visualising relationships between three variables [Slides] [Video]

16:00 – 17:00

  • Modelling Systems and Networks, B3.02, Chair: Jonathan Rougier
    • Rachel Oxlade: An S4 Object structure for emulation – the approximation of complex functions [Slides]
    • Christophe Dutang: Computation of generalized Nash equilibria [Slides]
  • Visualisation, MS.04, Chair: Antony Unwin
    • Andrej Blejec: animatoR: dynamic graphics in R [Slides]
    • Richard M. Heiberger: Graphical Syntax for Structables and their Mosaic Plots [Slides]
  • Dimensionality Reduction and Variable Selection, MS.01, Chair: Matthias Schmid
    • Marie Chavent: ClustOfVar: an R package for the clustering of variables [Slides]
    • Jürg Schelldorfer: Variable Screening and Parameter Estimation for High-Dimensional Generalized Linear Mixed Models Using l1-Penalization [Slides]
    • Benjamin Hofner: gamboostLSS: boosting generalized additive models for location, scale and shape [Slides]
  • Business Management, MS.02, Chair: Enrico Branca
    • Marlene S. Marchena: SCperf: An inventory management package for R [Slides]
    • Pairach Piboonrungroj: Using R to test transaction cost measurement for supply chain relationship: A structural equation model [Slides]
    • Fabrizio Ortolani: Integrating R and Excel for automatic business forecasting

17:05 – 18:05

Lightning Talks (see below)

Lightning Talks

  • Community and Communication, MS.02, Chair: Ashley Ford
    • George Zhang: China R user conference [Slides]
    • Tal Galili: Blogging and R – present and future [Link]
    • Markus Schmidberger: Get your R application onto a powerful and fully-configured Cloud Computing environment in less than 5 minutes. [Slides]
    • Eirini Koutoumanou: Teaching R to Non Package Literate Users [Slides]
    • Randall Pruim: Teaching Statistics using the mosaic Package [Slides]
  • Statistics and Programming, MS.01, Chair: Elke Thönnes
    • Toby Dylan Hocking: Fast, named capture regular expressions in R2.14 [Slides]
    • John C. Nash: Developments in optimization tools for R [Slides]
    • Christophe Dutang: A Unified Approach to fit probability distributions [Slides]
  • Package Showcase, MS.03, Chair: Jennifer Rogers
    • James Foadi: cRy: statistical applications in macromolecular crystallography [Slides]
    • Emilio López: Six Sigma is possible with R [Slides]
    • Jonathan Clayden: Medical image processing with TractoR [Slides]
    • Richard A. Bilonick: Using merror 2.0 to Analyze Measurement Error and Determine Calibration Curves [Slides]

Wednesday 17th August

09:00 – 09:50

  • Invited Talk, MS.01/MS.02, Chair: Ioannis Kosmidis
    • Lee E. Edlefsen: Scalable Data Analysis in R [Slides] [Video]

11:15 – 12:35

  • Spatio-Temporal Statistics, B3.02, Chair: Julian Stander
    • Nikolaus Umlauf: Structured Additive Regression Models: An R Interface to BayesX [Slides]
  • Molecular and Cell Biology, B3.03, Chair: Andrea Foulkes
    • Matthew Nunes: Summary statistics selection for ABC inference in R [Slides]
    • Maarten van Iterson: Power and minimal sample size for multivariate analysis of microarrays [Slides]
  • Mixed Effect Models, MS.03, Chair: Douglas Bates
    • Ulrich Halekoh: Kenward-Roger modification of the F-statistic for some linear mixed models fitted with lmer [Slides]
    • Marco Geraci: lqmm: Estimating Quantile Regression Models for Independent and Hierarchical Data with R [Slides]
    • Kenneth Knoblauch: Mixed-effects Maximum Likelihood Difference Scaling [Slides]
  • Programming, MS.01, Chair: Uwe Ligges
    • Ray Brownrigg: Tricks and Traps for Young Players [Slides]
    • Friedrich Schuster: Software design patterns in R [Slides]
    • Patrick Burns: Random input testing with R [Slides]
  • Data Mining Applications, MS.02, Chair: Przemysław Biecek
    • Stephan Stahlschmidt: Predicting the offender’s age
    • Daniel Chapsky: Leveraging Online Social Network Data and External Data Sources to Predict Personality [Slides]

14:45 – 15:30

  • Invited Talk, MS.01/MS.02, Chair: John Aston
    • Brandon Whitcher: Quantitative Medical Image Analysis [Slides] [Video]

16:00 – 17:00

  • Development of R, B3.02, Chair: John C. Nash
    • Andrew R. Runnalls: Interpreter Internals: Unearthing Buried Treasure with CXXR [Slides]
  • Geospatial Techniques, B3.03, Chair: Roger Bivand
    • Binbin Lu: Converting a spatial network to a graph in R [Slides]
    • Rainer M Krug: Spatial modelling with the R-GRASS Interface [Slides]
    • Daniel Nüst: sos4R – Accessing SensorWeb Data from R [Slides]
  • Genomics and Bioinformatics, MS.03, Chair: Ramón Diaz-Uriarte
    • Sebastian Gibb: MALDIquant: Quantitative Analysis of MALDI-TOF Proteomics Data [Slides]
  • Regression Modelling, MS.01, Chair: Cristiano Varin
    • Bettina Grün: Beta Regression: Shaken, Stirred, Mixed, and Partitioned [Slides]
    • Rune Haubo B. Christensen: Regression Models for Ordinal Data: Introducing R-package ordinal [Slides]
    • Giuseppe Bruno: Multiple choice models: why not the same answer? A comparison among LIMDEP, R, SAS and Stata [Slides]
  • R in the Business World, MS.02, Chair: David Smith
    • Derek McCrae Norton: Odysseus vs. Ajax: How to build an R presence in a corporate SAS environment [Slides]

17:05 – 18:05

  • Hydrology and Soil Science, B3.02, Chair: Thomas Petzoldt
    • Wayne Jones: GWSDAT (GroundWater Spatiotemporal Data Analysis Tool) [Slides]
    • Pierre Roudier: Visualisation and modelling of soil data using the aqp package [Slides]
  • Biostatistical Modelling, B3.03, Chair: Holger Hoefling
    • Annamaria Guolo: Higher-order likelihood inference in meta-analysis using R [Slides]
    • Cristiano Varin: Gaussian copula regression using R [Slides]
  • Psychometrics, MS.03, Chair: Yves Rosseel
    • Florian Wickelmaier: Multinomial Processing Tree Models in R [Slides]
    • Basil Abou El-Komboz: Detecting Invariance in Psychometric Models with the psychotree Package [Slides]
  • Multivariate Data, MS.01, Chair: Peter Dalgaard
    • John Fox: Tests for Multivariate Linear Models with the car Package [Slides]
    • Julie Josse: missMDA: a package to handle missing values in and with multivariate exploratory data analysis methods [Slides]
    • António Pedro Duarte Silva: MAINT.DATA: Modeling and Analysing Interval Data in R [Slides]
  • Interfaces, MS.02, Chair: Matthew Shotwell
    • Xavier de Pedro Puente: Web 2.0 for R scripts and workflows: Tiki and PluginR [Slides]
    • Sheri Gilley: A new task-based GUI for R [Slides]

Thursday 18th August

09:00 – 09:45

  • Invited Talk, MS.01/MS.02, Chair: Julia Brettschneider
    • Wolfgang Huber: Genomes and phenotypes [Slides] [Video]

09:50 – 10:50

  • Financial Models, B3.02, Chair: Giovanni Petris
    • Peter Ruckdeschel: (Robust) Online Filtering in Regime Switching Models and Application to Investment Strategies for Asset Allocation [Slides]
  • Ecology and Ecological Modelling, B3.03, Chair: Karline Soetaert
    • Christian Kampichler: Using R for the Analysis of Bird Demography on a Europe-wide Scale [Slides]
    • John C. Nash: An effort to improve nonlinear modeling practice [Slides]
  • Generalized Linear Models, MS.03, Chair: Kenneth Knoblauch
    • Ioannis Kosmidis: brglm: Bias reduction in generalized linear models [Slides]
    • Merete K. Hansen: The binomTools package: Performing model diagnostics on binomial regression models [Slides]
  • Reporting Data, MS.01, Chair: Martyn Plummer
    • Sina Rüeger: uniPlot – A package to uniform and customize R graphics [Slides]
    • Alexander Kowarik: sparkTable: Generating Graphical Tables for Websites and Documents with R [Slides]
    • Isaac Subirana: compareGroups package, updated and improved [Slides]
  • Process Optimization, MS.02, Chair: Tobias Verbeke
    • Emilio López: Six Sigma Quality Using R: Tools and Training [Slides]
    • Thomas Roth: Process Performance and Capability Statistics for Non-Normal Distributions in R [Slides]

11:15 – 12:35

  • Inference, B3.02, Chair: Peter Ruckdeschel
    • Henry Deng: Density Estimation Packages in R [Slides]
  • Population Genetics and Genetics Association Studies, B3.03, Chair: Martin Morgan
    • Benjamin French: Simple haplotype analyses in R [Slides]
  • Neuroscience, MS.03, Chair: Brandon Whitcher
    • Karsten Tabelow: Statistical Parametric Maps for Functional MRI Experiments in R: The Package fmri [Slides]
  • Data Management, MS.01, Chair: Barry Rowlingson
    • Susan Ranney: It’s a Boy! An Analysis of Tens of Millions of Birth Records Using R [Slides]
    • Joanne Demmler: Challenges of working with a large database of routinely collected health data: Combining SQL and R [Slides]
  • Interactive Graphics in R, MS.02, Chair: Paul Murrell
    • Richard Cotton: Easy Interactive ggplots [Slides]

14:00 – 15:00

  • Kaleidoscope IIIa, MS.03, Chair: Adrian Bowman
    • Thomas Petzoldt: Using R for systems understanding – a dynamic approach [Slides]
    • David L. Miller: Using multidimensional scaling with Duchon splines for reliable finite area smoothing [Slides]
    • Alastair Sanderson: Studying galaxies in the nearby Universe, using R and ggplot2 [Slides]
  • Kaleidoscope IIIb, MS.02, Chair: Frank Harrell
    • Paul Murrell: Vector Image Processing [Slides]

 

Diagram for a Bernoulli process (using R)

A Bernoulli process is a sequence of Bernoulli trials (the realization of n binary random variables), taking two values (0/1, Heads/Tails, Boy/Girl, etc…). It is often used in teaching introductory probability/statistics classes about the binomial distribution.

When visualizing a Bernoulli process, it is common to use a binary tree diagram in order to show the progression of the process, as well as the various consequences of the trial. We might also include the number of “successes”, and the probability for reaching a specific terminal node.
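The numbers shown at the terminal nodes of such a tree can be computed directly. The following is a rough sketch, independent of the plotting function discussed in this post: it enumerates every path of a 3-trial process with success probability 0.8 (values chosen to match the unfair-coin example later on) and checks the totals against dbinom(). The names paths, leaf_prob, and prob_by_successes are made up for illustration.

```r
n <- 3     # number of trials (tree depth)
p <- 0.8   # probability of success on each trial

# Every root-to-leaf path: one row per leaf, 0 = failure, 1 = success
paths <- expand.grid(rep(list(c(0, 1)), n))

# Number of successes and probability of reaching each terminal node
successes <- rowSums(paths)
leaf_prob <- p^successes * (1 - p)^(n - successes)

# Adding up leaves with the same number of successes gives the binomial pmf
prob_by_successes <- tapply(leaf_prob, successes, sum)
print(prob_by_successes)

# Should agree with the built-in binomial probabilities
print(all.equal(as.numeric(prob_by_successes), dbinom(0:n, n, p)))
```

The tree has 2^n leaves, so this brute-force enumeration is only sensible for small n, which is also the range where a diagram stays readable.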

I wanted to be able to create such a diagram using R. For this purpose I wrote some code that uses the {diagram} R package. The final function should allow one to create diagrams of different sizes, while allowing flexibility with regard to the text used in the tree.

Here is an example of the simplest use of the function:

source("https://www.r-statistics.com/wp-content/uploads/2011/11/binary.tree_.for_.binomial.game_.r.txt") # loading the function
binary.tree.for.binomial.game(2) # creating a tree for B(2,0.5)

The resulting diagram will look like this:

The same can be done for creating larger trees. For example, here is the code for a 4 stage Bernoulli process:

source("https://www.r-statistics.com/wp-content/uploads/2011/11/binary.tree_.for_.binomial.game_.r.txt") # loading the function
binary.tree.for.binomial.game(4) # creating a tree for B(4,0.5)

The resulting diagram will look like this:

The function can also be tweaked in order to describe a more specific story. For example, the following code describes a 3 stage Bernoulli process where an unfair coin is tossed 3 times (with probability of it giving heads being 0.8):

source("https://www.r-statistics.com/wp-content/uploads/2011/11/binary.tree_.for_.binomial.game_.r.txt") # loading the function

binary.tree.for.binomial.game(3, 0.8, first_box_text = c("Tossing an unfair coin", "(3 times)"), left_branch_text = c("Failure", "Playing again"), right_branch_text = c("Success", "Playing again"),
    left_leaf_text = c("Failure", "Game ends"), right_leaf_text = c("Success",
        "Game ends"), cex = 0.8, rescale_radx = 1.2, rescale_rady = 1.2,
    box_color = "lightgrey", shadow_color = "darkgrey", left_arrow_text = c("Tails \n(P = 0.2)"),
    right_arrow_text = c("Heads \n(P = 0.8)"), distance_from_arrow = 0.04)

The resulting diagram is:

If you make up neat examples of using the code (or happen to find a bug), or for any other reason – you are welcome to leave a comment.

(note: the images above are licensed under CC BY-SA)