packages | R-statistics blog

R 3.0.2 and RStudio 0.98 are released!

R 3.0.2 (codename “Frisbee Sailing”) was released yesterday. The full list of new features and bug fixes is provided below.

Also, RStudio v0.98 (in a “secret” preview) was announced two days ago with MANY new features, including:

Amazing new debugging tools(!)
An engine for creating R presentations. You can see a detailed example for using the new R presentation capabilities in THIS LINK (linking to a presentation on Creating beautiful trees of clusterings with R using the dendextend R package, here is a link to the Rpres source file).
Various other enhancements (I like the new code folding for markdown headings/sections) and bug-fixes. Follow THIS LINK for a complete list of new features in this recent RStudio release.

Upgrading to R 3.0.2

You can download the latest version from here. Or, if you are using Windows, you can upgrade to the latest version using the installr package (also available on CRAN and github). Simply run the following code:

# installing/loading the package:
if(!require(installr)) {
install.packages("installr"); require(installr)} #load / install+load installr

# to_checkMD5sums = FALSE works around a slight bug in the MD5 file on R 3.0.2.
# This issue is already resolved in the installr version on github, and will be released to CRAN in about a month.
updateR(to_checkMD5sums = FALSE)

I try to keep the installr package updated and useful. If you have any suggestions or remarks on the package, you’re invited to leave a comment below.

If you use the global library system (as I do), you can run the following in the new version of R:

source("https://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
New.R.RunMe()

P.S. You can also use the installr package to quickly install the new RStudio by using:

# installing/loading the package:
if(!require(installr)) {
install.packages("installr"); require(installr)} #load / install+load installr

install.RStudio()


Tailor Your Tables with stargazer: New Features for LaTeX and Text Output

Guest post by Marek Hlavac

Since its first introduction on this blog, stargazer, a package for turning R statistical output into beautiful LaTeX and ASCII text tables, has made a great deal of progress. Compared to available alternatives (such as apsrtable or texreg), the latest version (4.0) of stargazer supports the broadest range of model objects. In particular, it can create side-by-side regression tables from statistical model objects created by packages AER, betareg, dynlm, eha, ergm, gee, gmm, lme4, MASS, mgcv, nlme, nnet, ordinal, plm, pscl, quantreg, relevent, rms, robustbase, spdep, stats, survey, survival and Zelig. You can install stargazer from CRAN in the usual way:

install.packages("stargazer")

New Features: Text Output and Confidence Intervals

In this blog post, I would like to draw attention to two new features of stargazer that make the package even more useful:

stargazer can now produce ASCII text output, in addition to LaTeX code. As a result, users can now create beautiful tables that can easily be inserted into Microsoft Word documents, published on websites, or sent via e-mail. Sharing your regression results has never been easier. Users can also use this feature to preview their LaTeX tables before they use the stargazer-generated code in their .tex documents.
In addition to standard errors, stargazer can now report confidence intervals at user-specified confidence levels (with a default of 95 percent). This possibility might be especially appealing to researchers in public health and biostatistics, as the reporting of confidence intervals is very common in these disciplines.

In the reproducible example presented below, I demonstrate these two new features in action.

Reproducible Example

I begin by creating model objects for two Ordinary Least Squares (OLS) models (using the lm() command) and a probit model (using glm() ). Note that I use data from attitude, one of the standard data frames that should be provided with your installation of R.

## 2 OLS models

linear.1 <- lm(rating ~ complaints + privileges + learning + raises + critical, data=attitude)
linear.2 <- lm(rating ~ complaints + privileges + learning, data=attitude)

## create an indicator dependent variable, and run a probit model

attitude$high.rating <- (attitude$rating > 70)
probit.model <- glm(high.rating ~ learning + critical + advance, data=attitude, family = binomial(link = "probit"))

I then use stargazer to create a ‘traditional’ LaTeX table with standard errors. With the sole exception of the argument no.space – which I use to save space by removing all empty lines in the table – both the command call and the resulting table should look familiar from earlier versions of the package:

stargazer(linear.1, linear.2, probit.model, title="Regression Results", align=TRUE, dep.var.labels=c("Overall Rating","High Rating"), covariate.labels=c("Handling of Complaints","No Special Privileges", "Opportunity to Learn","Performance-Based Raises","Too Critical","Advancement"), omit.stat=c("LL","ser","f"), no.space=TRUE)
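The two new features described above (ASCII text output and confidence intervals) can be sketched as follows. This is a minimal illustration rather than the author's own example: it assumes stargazer 4.0 or later is installed, refits the two OLS models so that it runs on its own, and uses the `type`, `ci` and `ci.level` arguments with an arbitrarily chosen 90 percent level.

```r
# Refit the two OLS models from the example so this sketch is self-contained.
linear.1 <- lm(rating ~ complaints + privileges + learning + raises + critical,
               data = attitude)
linear.2 <- lm(rating ~ complaints + privileges + learning, data = attitude)

# Assumption: stargazer (>= 4.0) is installed; otherwise the table is skipped.
if (requireNamespace("stargazer", quietly = TRUE)) {
  library(stargazer)
  stargazer(linear.1, linear.2,
            type = "text",      # ASCII text table instead of LaTeX code
            ci = TRUE,          # report confidence intervals, not standard errors
            ci.level = 0.90,    # 90 percent level, chosen only for illustration
            title = "Regression Results")
}
```

Because the text table is printed straight to the console, it can also serve as a quick preview before switching back to LaTeX output.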


R 3.0.1 is released

R 3.0.1 (codename “Good Sport”) was released last week. As mentioned earlier by David, this version improves serialization performance with big objects, improves reliability for parallel programming and fixes a few minor bugs.

Upgrading to R 3.0.1

You can download the latest version from here. Or, if you are using Windows, you can upgrade to the latest version using the installr package (also available on CRAN and github). Simply run the following code:

# installing/loading the package:
if(!require(installr)) {
install.packages("installr"); require(installr)} #load / install+load installr

# to_checkMD5sums = FALSE works around a slight bug in the MD5 file on R 3.0.1.
# Soon this should get resolved and you could go back to using updateR(), install.R() or the menu upgrade system.
updateR(to_checkMD5sums = FALSE)

I try to keep the installr package updated and useful. If you have any suggestions or remarks on the package, you’re invited to leave a comment below.

If you use the global library system (as I do), you can run the following in the new version of R:

source("https://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
New.R.RunMe()


Generation of E-Learning Exams in R for Moodle, OLAT, etc.

(Guest post by Achim Zeileis)
Development of the R package exams for automatic generation of (statistical) exams in R started in 2006 and version 1 was published in JSS by Grün and Zeileis (2009). It was based on standalone Sweave exercises that can be combined into exams and then rendered into different kinds of PDF output (exams, solutions, self-study materials, etc.). Now, a major revision of the package has been released that extends the capabilities and adds support for learning management systems. It is still based on the same type of Sweave files for each exercise but can also render them into output formats like HTML (with various options for displaying mathematical content) and XML specifications for online exams in learning management systems such as Moodle or OLAT. Supplementary files such as graphics or data are handled automatically. Here, I give a brief overview of the new capabilities. A detailed discussion is in the working paper by Zeileis, Umlauf, and Leisch (2012) that is also contained in the package as a vignette.

How to load the {rJava} package after the error "JAVA_HOME cannot be determined from the Registry"

If you have tried loading a package that depends on the {rJava} package (by Simon Urbanek), you might have come across the following error:

Loading required package: rJava
library(rJava)
Error : .onLoad failed in loadNamespace() for ‘rJava’, details:
call: fun(libname, pkgname)
error: JAVA_HOME cannot be determined from the Registry

The error tells us that there is no entry in the Registry that tells R where Java is located. It is most likely that Java was not installed (or that the registry is corrupt).

This error is often resolved by installing a Java version (i.e. 64-bit Java or 32-bit Java) that matches the type of R version that you are using (i.e. 64-bit R or 32-bit R). This problem can easily affect Windows 7 users, since they might have installed a version of Java that differs from the version of R they are using.

Note that it is necessary to ‘manually download and install’ the 64-bit version of Java. By default, the download page gives a 32-bit version.

You can pick the exact version of Java you wish to install from this link. If you work (for some reason) with both versions of R, you can install both versions of Java (installing the “Java Runtime Environment” is probably good enough for your needs).
(Source: Uwe Ligges)
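To decide which Java version to install, it helps to know whether the running R session is 32-bit or 64-bit. The following is a small sketch using only base R; the variable names `r_bits` and `java_home` are made up for this illustration.

```r
# Pointer size reveals the architecture: 4 bytes means 32-bit R, 8 bytes means 64-bit R.
r_bits <- 8 * .Machine$sizeof.pointer
cat("R architecture:", R.version$arch, "-", r_bits, "bit\n")

# JAVA_HOME is an empty string unless it was set in the environment
# or through Sys.setenv() earlier in the session.
java_home <- Sys.getenv("JAVA_HOME")
if (nzchar(java_home)) {
  cat("JAVA_HOME is set to:", java_home, "\n")
  cat("That directory exists:", file.exists(java_home), "\n")
} else {
  cat("JAVA_HOME is not set in this session.\n")
}
```

If the reported architecture is 64-bit, the 64-bit Java installer is the one to pick, and vice versa.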

Another possible solution is to try re-installing rJava.

If that doesn’t work, you could also manually set the directory of your Java location by setting it before loading the library:

# Run only the line that matches your version of R (the second call would override the first):
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7') # for 64-bit version
# Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7') # for 32-bit version
library(rJava)

(Source: “nograpes” from Stack Overflow, who also describes the find.java function inside rJava:::.onLoad)

Aggregation and Restructuring data (from “R in Action”)


The following introductory post is intended for new users of R. It deals with the restructuring of data: what it is and how to perform it using base R functions and the {reshape} package.

This is a guest article by Dr. Robert I. Kabacoff, the founder of (one of) the first online R tutorial websites: Quick-R. Kabacoff has recently published the book “R in Action”, providing a detailed walk-through for the R language based on various examples for illustrating R’s features (data manipulation, statistical methods, graphics, and so on…). The previous guest post by Kabacoff introduced data.frame objects in R.

For readers of this blog, there is a 38% discount off the “R in Action” book (as well as all other eBooks, pBooks and MEAPs at Manning publishing house), simply by using the code rblogg38 when reaching checkout.

Let us now talk about the Aggregation and Restructuring of data in R:

Aggregation and Restructuring

R provides a number of powerful methods for aggregating and reshaping data. When you aggregate data, you replace groups of observations with summary statistics based on those observations. When you reshape data, you alter the structure (rows and columns) determining how the data is organized. This article describes a variety of methods for accomplishing these tasks.

We’ll use the mtcars data frame that’s included with the base installation of R. This dataset, extracted from Motor Trend magazine (1974), describes the design and performance characteristics (number of cylinders, displacement, horsepower, mpg, and so on) for 32 automobiles. To learn more about the dataset, see help(mtcars).

Transpose

The transpose (reversing rows and columns) is perhaps the simplest method of reshaping a dataset. Use the t() function to transpose a matrix or a data frame. In the latter case, row names become variable (column) names. An example is presented in the next listing.

Listing 1 Transposing a dataset

> cars <- mtcars[1:5,1:4]
> cars
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
> t(cars)
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
mpg         21            21       22.8           21.4              18.7
cyl          6             6        4.0            6.0               8.0
disp       160           160      108.0          258.0             360.0
hp         110           110       93.0          110.0             175.0

Listing 1 uses a subset of the mtcars dataset in order to conserve space on the page. You’ll see a more flexible way of transposing data when we look at the reshape package later in this article.
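One detail worth knowing when transposing a data frame: t() returns a matrix, not a data frame. The short sketch below (the object names `tcars` and `tcars.df` are chosen just for this illustration) shows how to check this and how to convert the result back.

```r
cars <- mtcars[1:5, 1:4]

# t() coerces the data frame to a matrix before transposing it.
tcars <- t(cars)
is.matrix(tcars)
dim(tcars)        # rows and columns are swapped relative to cars

# Convert back to a data frame if later steps expect one.
tcars.df <- as.data.frame(tcars)
names(tcars.df)   # the original row names are now the variable names
```

Because a matrix holds a single data type, transposing a data frame that mixes numeric and character columns turns every value into a character string.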

Aggregating data

It’s relatively easy to collapse data in R using one or more by variables and a defined function. The format is

aggregate(x, by, FUN)

where x is the data object to be collapsed, by is a list of variables that will be crossed to form the new observations, and FUN is the scalar function used to calculate summary statistics that will make up the new observation values.

As an example, we’ll aggregate the mtcars data by number of cylinders and gears, returning means on each of the numeric variables (see the next listing).

Listing 2 Aggregating data

> options(digits=3)
> attach(mtcars)
> aggdata <-aggregate(mtcars, by=list(cyl,gear), FUN=mean, na.rm=TRUE)
> aggdata
  Group.1 Group.2  mpg cyl disp  hp drat   wt qsec  vs   am gear carb
1       4       3 21.5   4  120  97 3.70 2.46 20.0 1.0 0.00    3 1.00
2       6       3 19.8   6  242 108 2.92 3.34 19.8 1.0 0.00    3 1.00
3       8       3 15.1   8  358 194 3.12 4.10 17.1 0.0 0.00    3 3.08
4       4       4 26.9   4  103  76 4.11 2.38 19.6 1.0 0.75    4 1.50
5       6       4 19.8   6  164 116 3.91 3.09 17.7 0.5 0.50    4 4.00
6       4       5 28.2   4  108 102 4.10 1.83 16.8 0.5 1.00    5 2.00
7       6       5 19.7   6  145 175 3.62 2.77 15.5 0.0 1.00    5 6.00
8       8       5 15.4   8  326 300 3.88 3.37 14.6 0.0 1.00    5 6.00

In these results, Group.1 represents the number of cylinders (4, 6, or 8) and Group.2 represents the number of gears (3, 4, or 5). For example, cars with 4 cylinders and 3 gears have a mean of 21.5 miles per gallon (mpg).

When you’re using the aggregate() function, the by variables must be in a list (even if there’s only one). You can declare a custom name for the groups from within the list, for instance, using by=list(Group.cyl=cyl, Group.gears=gear).

The function specified can be any built-in or user-provided function. This gives the aggregate command a great deal of power. But when it comes to power, nothing beats the reshape package.
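As a small sketch of both points (custom group names and a user-provided function), the code below aggregates three columns of mtcars without calling attach(). The helper `spread` and the object names are made up for this illustration.

```r
# Keep only a few numeric columns so the output stays narrow.
vars <- mtcars[c("mpg", "hp", "wt")]

# Names given inside the list become the names of the grouping columns.
groups <- list(Group.cyl = mtcars$cyl, Group.gears = mtcars$gear)

# Built-in function: group means.
agg.mean <- aggregate(vars, by = groups, FUN = mean, na.rm = TRUE)

# User-provided function: the spread (max minus min) within each group.
spread <- function(x) max(x) - min(x)
agg.spread <- aggregate(vars, by = groups, FUN = spread)

print(agg.mean)
print(agg.spread)
```

Referring to the grouping variables as mtcars$cyl and mtcars$gear avoids attach(), which can mask objects of the same name in the workspace.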

The reshape package

The reshape package is a tremendously versatile approach to both restructuring and aggregating datasets. Because of this versatility, it can be a bit challenging to learn.

We’ll go through the process slowly and use a small dataset so that it’s clear what’s happening. Because reshape isn’t included in the standard installation of R, you’ll need to install it one time, using install.packages("reshape").

Basically, you’ll “melt” data so that each row is a unique ID-variable combination. Then you’ll “cast” the melted data into any shape you desire. During the cast, you can aggregate the data with any function you wish. The dataset you’ll be working with is shown in table 1.

Table 1 The original dataset (mydata)

ID	Time	X1	X2
1	1	5	6
1	2	3	5
2	1	6	1
2	2	2	4

In this dataset, the measurements are the values in the last two columns (5, 6, 3, 5, 6, 1, 2, and 4). Each measurement is uniquely identified by a combination of ID variables (in this case ID, Time, and whether the measurement is on X1 or X2). For example, the measured value 5 in the first row is uniquely identified by knowing that it’s from observation (ID) 1, at Time 1, and on variable X1.

Melting

When you melt a dataset, you restructure it into a format where each measured variable is in its own row, along with the ID variables needed to uniquely identify it. If you melt the data from table 1, using the following code

library(reshape)
md <- melt(mydata, id=(c("id", "time")))

you end up with the structure shown in table 2.

Table 2 The melted dataset

ID	Time	Variable	Value
1	1	X1	5
1	2	X1	3
2	1	X1	6
2	2	X1	2
1	1	X2	6
1	2	X2	5
2	1	X2	1
2	2	X2	4

Note that you must specify the variables needed to uniquely identify each measurement (ID and Time) and that the variable indicating the measurement variable names (X1 or X2) is created for you automatically.
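To try the melt step yourself, the data from table 1 first has to exist as a data frame. The sketch below builds it by hand; note that it names the ID columns in lowercase (id, time) so that they match the melt() call, whereas the table headers are capitalized. It assumes the reshape package is installed and skips the melt step otherwise.

```r
# The dataset from table 1, with lowercase ID column names to match the melt() call.
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     X1   = c(5, 3, 6, 2),
                     X2   = c(6, 5, 1, 4))

# Assumption: the reshape package is installed.
if (requireNamespace("reshape", quietly = TRUE)) {
  library(reshape)
  md <- melt(mydata, id = c("id", "time"))
  print(md)   # one row per measurement, with 'variable' and 'value' columns
}
```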

Now that you have your data in a melted form, you can recast it into any shape, using the cast() function.

Casting

The cast() function starts with melted data and reshapes it using a formula that you provide and an (optional) function used to aggregate the data. The format is

newdata <- cast(md, formula, FUN)

where md is the melted data, formula describes the desired end result, and FUN is the (optional) aggregating function. The formula takes the form

rowvar1 + rowvar2 + … ~ colvar1 + colvar2 + …

In this formula, rowvar1 + rowvar2 + … define the set of crossed variables that define the rows, and colvar1 + colvar2 + … define the set of crossed variables that define the columns. See the examples in figure 1.

Figure 1 Reshaping data with the melt() and cast() functions

Because the formulas on the right side (d, e, and f) don’t include a function, the data is reshaped. In contrast, the examples on the left side (a, b, and c) specify the mean as an aggregating function. Thus the data are not only reshaped but aggregated as well. For example, (a) gives the means on X1 and X2 averaged over time for each observation. Example (b) gives the mean scores of X1 and X2 at Time 1 and Time 2, averaged over observations. In (c) you have the mean score for each observation at Time 1 and Time 2, averaged over X1 and X2.
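Since the figure is an image, here is a sketch of cast() calls that are consistent with the descriptions of (a) through (f) above. The formulas are inferred from that text rather than copied from the figure, the object names are made up, and the code assumes the reshape package is installed.

```r
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     X1   = c(5, 3, 6, 2),
                     X2   = c(6, 5, 1, 4))

# Assumption: the reshape package is installed.
if (requireNamespace("reshape", quietly = TRUE)) {
  library(reshape)
  md <- melt(mydata, id = c("id", "time"))

  # With aggregation (mean)
  cast.a <- cast(md, id ~ variable, mean)    # X1 and X2 per id, averaged over time
  cast.b <- cast(md, time ~ variable, mean)  # X1 and X2 per time, averaged over ids
  cast.c <- cast(md, id ~ time, mean)        # per id and time, averaged over X1 and X2

  # Without aggregation (reshaping only)
  cast.d <- cast(md, id + time ~ variable)   # back to the original layout
  cast.e <- cast(md, id + variable ~ time)   # one column per time point
  cast.f <- cast(md, id ~ variable + time)   # one row per id, fully wide

  print(cast.a)
  print(cast.f)
}
```

For instance, in (a) the X1 value for ID 1 is the mean of 5 and 3, which is 4.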

As you can see, the flexibility provided by the melt() and cast() functions is amazing. There are many times when you’ll have to reshape or aggregate your data prior to analysis. For example, you’ll typically need to place your data in what’s called long format resembling table 2 when analyzing repeated measures data (data where multiple measures are recorded for each observation).

Summary

Chapter 5 of R in Action reviews many of the dozens of mathematical, statistical, and probability functions that are useful for manipulating data. In this article, we have briefly explored several ways of aggregating and restructuring data.

This article first appeared as chapter 5.6 of the “R in Action” book, and is published with permission from Manning publishing house. Other books in this series which you might be interested in are (see the beginning of this post for a discount code):

Machine Learning in Action by Peter Harrington

Gnuplot in Action (Understanding Data with Graphs) by Philipp K. Janert

Article about plyr published in JSS, and the citation was added to the new plyr (version 1.5)

The plyr package (by Hadley Wickham) is one of the few R packages that I can claim to have used in all of my statistical projects. So whenever a new version of plyr comes out I tend to be excited about it (as I was when version 1.2 came out with support for parallel processing).

So it is no surprise that the new release of plyr 1.5 got me curious. While going through the news file with the new features and bug fixes, I noticed that Hadley had also (quietly) released, 6 days ago, another version of plyr prior to 1.5, numbered 1.4.1. That version included only one addition, but a very important one: a new citation reference to use when citing the plyr package. Here is how to use it:

install.packages("plyr") # so to upgrade to the latest release
citation("plyr")

The output gives both a simple text version as well as a BibTeX entry for LaTeX users. Here it is (notice the download link, so you can read the article yourself):

To cite plyr in publications use:
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data
Analysis. Journal of Statistical Software, 40(1), 1-29. URL
http://www.jstatsoft.org/v40/i01/.

I hope to see more R contributors and users make use of the citation() function in the future.
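The same mechanism works for R itself and for any installed package. The sketch below uses only base R for the main part; the plyr call is guarded, since it assumes plyr is installed. The object names `r_cite` and `r_bib` are made up for this illustration.

```r
# citation() with no argument returns the recommended reference for R itself.
r_cite <- citation()
print(r_cite)

# toBibtex() extracts just the BibTeX entry, ready to paste into a .bib file.
r_bib <- toBibtex(r_cite)
print(r_bib)

# Assumption: plyr is installed; otherwise this step is skipped.
if (requireNamespace("plyr", quietly = TRUE)) {
  print(toBibtex(citation("plyr")))
}
```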

A competition to recommend "relevant" R packages – and the future of R

Update: the competition was just launched.
* * *

What is the competition about?

Drew Conway and John Myles White have collected data from (52) R users about the packages they have installed. The data is now available on github for download and the contest will be run on the kaggle platform.

For more details, head over to dataists.

And for fun, here is the dependency graph for R packages they have assembled so far:

A graphical visualization of packages’ “suggestion” relationships. Affectionately referred to as the R Flying Spaghetti Monster. More info below.

A tiny bit more on R bloggers virality


A new version of ff released (version 2.2.0)

A few hours ago, Jens Oehlschlägel announced on the R-help mailing list the release of a new version of the ff package.

The ff package provides data structures that are stored on disk but behave (almost) as if they were in RAM by transparently mapping only a section (pagesize) in main memory – the effective virtual memory consumption per ff object.

Here are the new features of ff, as Jens wrote in his announcement:

—-
Dear R community,

The next release of package ff is available on CRAN. With kind help of Brian Ripley it now supports the Win64 and Sun versions of R. It has three major functional enhancements:

a) new fast in-memory sorting and ordering functions (single-threaded)
b) ff now supports on-disk sorting and ordering of ff vectors and ffdf dataframes
c) ff integer vectors now can be used as subscripts of ff vectors and ffdf dataframes

a) is achieved by careful implementation of NA-handling and exploiting context information
b) although permanently stored, sorting and ordering of ff objects can be faster than the standard routines in R
c) applying an order to ff vectors and ffdf dataframes is substantially slower than in pure R because it involves disk-access AND sorting index positions (to avoid random access).

There is still room for improvement, however, the current status should already be useful. I run some comparisons with SAS (see end of mail):
– both could sort German census size (81e6 rows) on a 3GB notebook
– ff sorts and orders faster on single columns
– sorting big multicolumn-tables is faster in SAS
