Category Archives: R

100 most read R posts for 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages

R-bloggers.com is now three years young. The site is an (unofficial) online journal of the R statistical programming environment, written by bloggers who agreed to contribute their R articles to the site.

Last year, I posted on the top 24 R posts of 2011. In this post I wish to celebrate R-bloggers’ third birthmounth by sharing with you:

  1. Links to the top 100 most read R posts of 2012
  2. Statistics on “how well” R-bloggers did this year
  3. My wishlist for the R community for 2013 (blogging about R, guest posts, and sponsors)

1. Top 100 R posts of 2012

R-bloggers’ success is thanks to the content submitted by the over 400 R bloggers who have joined r-bloggers.  The R community currently has around 245 active R bloggers (links to the blogs are clearly visible in the right navigation bar on the R-bloggers homepage).  In the past year, these bloggers wrote around 3200 posts about R!

Here is a list of the top visited posts on the site in 2012 (you can see the number of unique visitors in parentheses, while the list is ordered by the number of total page views):

Continue reading

Generation of E-Learning Exams in R for Moodle, OLAT, etc.

(Guest post by Achim Zeileis)
Development of the R package exams for automatic generation of (statistical) exams in R started in 2006 and version 1 was published in JSS by Grün and Zeileis (2009). It was based on standalone Sweave exercises, that can be combined into exams, and then rendered into different kinds of PDF output (exams, solutions, self-study materials, etc.). Now, a major revision of the package has been released that extends the capabilities and adds support for learning management systems. It is still based on the same type of
Sweave files for each exercise but can also render them into output formats like HTML (with various options for displaying mathematical content) and XML specifications for online exams in learning management systems such as Moodle or OLAT. Supplementary files such as graphics or data are
handled automatically. Here, I give a brief overview of the new capabilities. A detailed discussion is in the working paper by Zeileis, Umlauf, and Leisch (2012) that is also contained in the package as a vignette.
Continue reading

Comparing Shiny with gWidgetsWWW2.rapache

(A guest post by John Verzani)

A few days back the RStudio blog announced Shiny, a new product for easily creating interactive web applications (http://www.rstudio.com/shiny/). I wanted to compare this new framework to one I’ve worked on, gWidgetsWWW2.rapache – a version of the gWidgets API for use with Jeffrey Horner’s rapache module for the Apache web server (available at GitHub). The gWidgets API has a similar aim to make it easy for R users to create interactive applications.

I don’t want to worry here about deployment of apps, just the writing side. The shiny package uses websockets to transfer data back and forth from browser to server. Though this may cause issues with wider deployment, the industrious RStudio folks have a hosting program in beta for internet-wide deployment. For local deployment, no problems as far as I know – as long as you avoid older versions of internet explorer.

Now, Shiny seems well suited for applications where the user can parameterize a resulting graphic, so that was the point of comparison. Peter Dalgaard’s tcltk package ships with a classic demo tkdensity.R. I use that for inspiration below. That GUI allows the user a few selections to modify a density plot of a random sample.
Continue reading

How to load the {rJava} package after the error “JAVA_HOME cannot be determined from the Registry”

In case you tried loading a package that depends on the {rJava} package (by Simon Urbanek), you might came across the following error:

Loading required package: rJava
library(rJava)
Error : .onLoad failed in loadNamespace() for ‘rJava’, details:
call: fun(libname, pkgname)
error: JAVA_HOME cannot be determined from the Registry

The error tells us that there is no entry in the Registry that tells R where Java is located. It is most likely that Java was not installed (or that the registry is corrupt).

This error is often resolved by installing a Java version (i.e. 64-bit Java or 32-bit Java) that fits to the type of R version that you are using (i.e. 64-bit R or 32-bit R). This problem can easily effect Windows 7 users, since they might have installed a version of Java that is different than the version of R they are using.

Note that it is necessary to ‘manually download and install’ the 64 bit version of JAVA. By default, the download page gives a 32 bit version .

You can pick the exact version of Java you wish to install from this link. If you might (for some reason) work on both versions of R, you can install both version of Java (Installing the “Java Runtime Environment” is probably good enough for your needs).
(Source: Uwe Ligges)

Other possible solutions is trying to re-install rJava.

If that doesn’t work, you could also manually set the directory of your Java location by setting it before loading the library:

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7') # for 64-bit version
Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7') # for 32-bit version
library(rJava)

(Source: “nograpes” from Stackoverflow, which also describes the find.java in the rJava:::.onLoad function)

data.table version 1.8.1 – now allowed numeric columns and big-number (via bit64) in keys!

This is a guest post written by Branson Owen, an enthusiastic R and data.table user.

Wow, a long time desired feature of data.table finally came true in version 1.8.1! data.table now allowed numeric columns and big number (via bit64) in keys! This is quite a big thing to me and I believe to many other R users too. Now I can hardly think any weakiness of data.table. Oh, did I mention it also started to support character column in the keys (rather than coerce to factor)?

For people who are not familiar with but interested in data.table package, data.table is an enhanced data.frame for high-speed indexing, ordered joins, assignment, grouping and list columns in a short and flexible syntax. You can take a look at some task examples here:

News from datatable-help mailing list:

* New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. They are about 4 times faster than match() on the example in ?chmatch.

* New function set(DT,i,j,value) allows fast assignment to elements of DT.

 
M = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(M)
DT = as.data.table(M)
system.time(for (i in 1:1000) DF[i,1L] <- i) # 591.000s
system.time(for (i in 1:1000) DT[i,V1:=i]) # 1.158s
system.time(for (i in 1:1000) M[i,1L] <- i) # 0.016s
system.time(for (i in 1:1000) set(DT,i,1L,i)) # 0.027s

* Numeric columns (type ‘double’) are now allowed in keys and ad hoc by. Other types which use ‘double’ (such as POSIXct and bit64) can now be fully supported.

For advanced and creative users, it also officially supported list columns awhile ago (rather than support it by accident). For example, your column could be a list of vectors, where each of the vector has different length. This can allow very flexible and creative ways to manipulate data.

The code example below use “function column”, i.e. a list of functions

> DT = data.table(ID=1:4,A=rnorm(4),B=rnorm(4),fn=list(min,max))
> str(DT)
Classes ‘data.table’ and 'data.frame': 4 obs. of 4 variables:
$ ID: int 1 2 3 4
$ A : num -0.7135 -2.5217 0.0265 1.0102
$ B : num -0.4116 0.4032 0.1098 0.0669
$ fn:List of 4
..$ :function (..., na.rm = FALSE)
..$ :function (..., na.rm = FALSE)
..$ :function (..., na.rm = FALSE)
..$ :function (..., na.rm = FALSE)
 
> DT[,fn[[1]](A,B),by=ID]
ID V1
[1,] 1 -0.71352508
[2,] 2 0.40322625
[3,] 3 0.02648949
[4,] 4 1.01022266

[ref] https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1302&group_id=240&atid=978

 

Speed up your R code using a just-in-time (JIT) compiler

This post is about speeding up your R code using the JIT (just in time) compilation capabilities offered by the new (well, now a year old) {compiler} package. Specifically, dealing with the practical difference between enableJIT and the cmpfun functions.

If you do not want to read much, you can just skip to the example part.

As always, I welcome any comments to this post, and hope to update it when future JIT solutions will come along.

Continue reading

Do more with dates and times in R with lubridate 1.1.0

This is a guest post by Garrett Grolemund (mentored by Hadley Wickham)

Lubridate is an R package that makes it easier to work with dates and times. The newest release of lubridate (v 1.1.0) comes with even more tools and some significant changes over past versions. Below is a concise tour of some of the things lubridate can do for you. At the end of this post, I list some of the differences between lubridate (v 0.2.4) and lubridate (v 1.1.0). If you are an old hand at lubridate, please read this section to avoid surprises!

Lubridate was created by Garrett Grolemund and Hadley Wickham.

Parsing dates and times

Getting R to agree that your data contains the dates and times you think it does can be a bit tricky. Lubridate simplifies that. Identify the order in which the year, month, and day appears in your dates. Now arrange “y”, “m”, and “d” in the same order. This is the name of the function in lubridate that will parse your dates. For example,

library(lubridate)
ymd("20110604"); mdy("06-04-2011"); dmy("04/06/2011")
## "2011-06-04 UTC"
## "2011-06-04 UTC"
## "2011-06-04 UTC"

Parsing functions automatically handle a wide variety of formats and separators, which simplifies the parsing process.

If your date includes time information, add h, m, and/or s to the name of the function. ymd_hms() is probably the most common date time format. To read the dates in with a certain time zone, supply the official name of that time zone in the tz argument.

arrive < - ymd_hms("2011-06-04 12:00:00", tz = "Pacific/Auckland")
## "2011-06-04 12:00:00 NZST"
leave <- ymd_hms("2011-08-10 14:00:00", tz = "Pacific/Auckland")
## "2011-08-10 14:00:00 NZST"

Setting and Extracting information

Extract information from date times with the functions second(), minute(), hour(), day(), wday(), yday(), week(), month(), year(), and tz(). You can also use each of these to set (i.e, change) the given information. Notice that this will alter the date time. wday() and month() have an optional label argument, which replaces their numeric output with the name of the weekday or month.

second(arrive)
## 0
second(arrive) < - 25
arrive
## "2011-06-04 12:00:25 NZST"
second(arrive) <- 0
wday(arrive)
## 7
wday(arrive, label = TRUE)
## Sat

Time Zones

There are two very useful things to do with dates and time zones. First, display the same moment in a different time zone. Second, create a new moment by combining a given clock time with a new time zone. These are accomplished by with_tz() and force_tz().

For example, I spent last summer researching in Auckland, New Zealand. I arranged to meet with my advisor, Hadley, over skype at 9:00 in the morning Auckland time. What time was that for Hadley who was back in Houston, TX?

meeting < - ymd_hms("2011-07-01 09:00:00", tz = "Pacific/Auckland")
## "2011-07-01 09:00:00 NZST"
with_tz(meeting, "America/Chicago")
## "2011-06-30 16:00:00 CDT"

So the meetings occurred at 4:00 Hadley’s time (and the day before no less). Of course, this was the same actual moment of time as 9:00 in New Zealand. It just appears to be a different day due to the curvature of the Earth.

What if Hadley made a mistake and signed on at 9:00 his time? What time would it then be my time?

mistake < - force_tz(meeting, "America/Chicago")
## "2011-07-01 09:00:00 CDT"
with_tz(mistake, "Pacific/Auckland")
## "2011-07-02 02:00:00 NZST"

His call would arrive at 2:00 am my time! Luckily he never did that.

Continue reading

Printing nested tables in R – bridging between the {reshape} and {tables} packages

This post shows how to print a prettier nested pivot table, created using the {reshape} package (similar to what you would get with Microsoft Excel), so you could print it either in the R terminal or as a LaTeX table. This task is done by bridging between the cast_df object produced by the {reshape} package, and the tabular function introduced by the new {tables} package.

Here is an example of the type of output we wish to produce in the R terminal:

1
2
3
4
5
6
7
       ozone       solar.r        wind         temp       
 month mean  sd    mean    sd     mean   sd    mean  sd   
 5     23.62 22.22 181.3   115.08 11.623 3.531 65.55 6.855
 6     29.44 18.21 190.2    92.88 10.267 3.769 79.10 6.599
 7     59.12 31.64 216.5    80.57  8.942 3.036 83.90 4.316
 8     59.96 39.68 171.9    76.83  8.794 3.226 83.97 6.585
 9     31.45 24.14 167.4    79.12 10.180 3.461 76.90 8.356

Or in a latex document:

Motivation: creating pretty nested tables

In a recent post we learned how to use the {reshape} package (by Hadley Wickham) in order to aggregate and reshape data (in R) using the melt and cast functions.

The cast function is wonderful but it has one problem – the format of the output. As opposed to a pivot table in (for example) MS excel, the output of a nested table created by cast is very “flat”. That is, there is only one row for the header, and only one column for the row names. So for both the R terminal, or an Sweave document, when we deal with a more complex reshaping/aggregating, the result is not something you would be proud to send to a journal.

The opportunity: the {tables} package

The good news is that Duncan Murdoch have recently released a new package to CRAN called {tables}. The {tables} package can compute and display complex tables of summary statistics and turn them into nice looking tables in Sweave (LaTeX) documents. For using the full power of this package, you are invited to read through its detailed (and well written) 23 pages Vignette. However, some of us might have preferred to keep using the syntax of the {reshape} package, while also benefiting from the great formatting that is offered by the new {tables} package. For this purpose, I devised a function that bridges between cast_df (from {reshape}) and the tabular function (from {tables}).

The bridge: between the {tables} and the {reshape} packages

The code for the function is available on my github (link: tabular.cast_df.r on github) and it seems to works fine as far as I can see (though I wouldn’t run it on larger data files since it relies on melting a cast_df object.)

Here is an example for how to load and use the function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
######################
# Loading the functions
######################
# Making sure we can source code from github
source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt")
 
# Reading in the function for using tabular on a cast_df object:
source_https("https://raw.github.com/talgalili/R-code-snippets/master/tabular.cast_df.r")
 
 
 
######################
# example:
######################
 
############
# Loading and preparing some data
require(reshape)
names(airquality) <- tolower(names(airquality))
airquality2 <- airquality
airquality2$temp2 <- ifelse(airquality2$temp > median(airquality2$temp), "hot", "cold")
aqm <- melt(airquality2, id=c("month", "day","temp2"), na.rm=TRUE)
colnames(aqm)[4] <- "variable2"	# because otherwise the function is having problem when relying on the melt function of the cast object
head(aqm,3)
#  month day temp2 variable2 value
#1     5   1  cold     ozone    41
#2     5   2  cold     ozone    36
#3     5   3  cold     ozone    12
 
############
# Running the example:
tabular.cast_df(cast(aqm, month ~ variable2, c(mean,sd)))
tabular(cast(aqm, month ~ variable2, c(mean,sd))) # notice how we turned tabular to be an S3 method that can deal with a cast_df object
Hmisc::latex(tabular(cast(aqm, month ~ variable2, c(mean,sd)))) # this is what we would have used for an Sweave document

And here are the results in the terminal:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
> 
> tabular.cast_df(cast(aqm, month ~ variable2, c(mean,sd)))
 
       ozone       solar.r        wind         temp       
 month mean  sd    mean    sd     mean   sd    mean  sd   
 5     23.62 22.22 181.3   115.08 11.623 3.531 65.55 6.855
 6     29.44 18.21 190.2    92.88 10.267 3.769 79.10 6.599
 7     59.12 31.64 216.5    80.57  8.942 3.036 83.90 4.316
 8     59.96 39.68 171.9    76.83  8.794 3.226 83.97 6.585
 9     31.45 24.14 167.4    79.12 10.180 3.461 76.90 8.356
> tabular(cast(aqm, month ~ variable2, c(mean,sd))) # notice how we turned tabular to be an S3 method that can deal with a cast_df object
 
       ozone       solar.r        wind         temp       
 month mean  sd    mean    sd     mean   sd    mean  sd   
 5     23.62 22.22 181.3   115.08 11.623 3.531 65.55 6.855
 6     29.44 18.21 190.2    92.88 10.267 3.769 79.10 6.599
 7     59.12 31.64 216.5    80.57  8.942 3.036 83.90 4.316
 8     59.96 39.68 171.9    76.83  8.794 3.226 83.97 6.585
 9     31.45 24.14 167.4    79.12 10.180 3.461 76.90 8.356

And in an Sweave document:

Here is an example for the Rnw file that produces the above table:
cast_df to tabular.Rnw

I will finish with saying that the tabular function offers more flexibility then the one offered by the function I provided. If you find any bugs or have suggestions of improvement, you are invited to leave a comment here or inside the code on github.

(Link-tip goes to Tony Breyal for putting together a solution for sourcing r code from github.)