data.frame objects in R (via “R in Action”)

The following introductory post is intended for new users of R. It deals with R data frames: what they are, and how to create, view, and update them.

This is a guest article by Dr. Robert I. Kabacoff, the founder of one of the first online R tutorial websites: Quick-R. Kabacoff has recently published the book “R in Action”, which provides a detailed walk-through of the R language, using various examples to illustrate R’s features (data manipulation, statistical methods, graphics, and so on).

For readers of this blog, there is a 38% discount off the “R in Action” book (as well as all other eBooks, pBooks and MEAPs at Manning publishing house), simply by using the code rblogg38 when reaching checkout.

Let us now talk about data frames:

Data Frames


A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). It’s similar to the datasets you’d typically see in SAS, SPSS, and Stata. Data frames are the most common data structure you’ll deal with in R.

The patient dataset in table 1 consists of numeric and character data.

Table 1: A patient dataset

PatientID  AdmDate     Age  Diabetes  Status
1          10/15/2009  25   Type1     Poor
2          11/01/2009  34   Type2     Improved
3          10/21/2009  28   Type1     Excellent
4          10/28/2009  52   Type1     Poor

Because there are multiple modes of data, you can’t store this data in a matrix. In this case, a data frame would be the structure of choice.
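To see why a matrix is a poor fit, here is a small sketch of what happens when numeric and character columns are forced into one. The variable names (id, dia) are made up for this illustration, and stringsAsFactors = FALSE is passed explicitly because the default differs between R versions.

```r
# A matrix holds a single mode, so mixing numbers and strings
# coerces every value to character.
id  <- c(1, 2, 3, 4)
dia <- c("Type1", "Type2", "Type1", "Type1")

m <- cbind(id, dia)
matrix_mode <- mode(m)        # "character": the IDs are now strings

# A data frame keeps each column's own mode.
# stringsAsFactors = FALSE is set explicitly here; before R 4.0.0
# the default would have turned dia into a factor.
df <- data.frame(id, dia, stringsAsFactors = FALSE)
column_modes <- sapply(df, mode)
column_modes
```

The coercion is silent, which is why it is worth checking the mode of a matrix built from mixed inputs.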

A data frame is created with the data.frame() function:

mydata <- data.frame(col1, col2, col3,…)

where col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function.

The following listing makes this clear.

Listing 1 Creating a data frame

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes status
1         1  25    Type1 Poor
2         2  34    Type2 Improved
3         3  28    Type1 Excellent
4         4  52    Type1 Poor

Each column must have only one mode, but you can put columns of different modes together to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames.
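As a quick check of the "one mode per column" rule, the sketch below rebuilds the patient data frame and inspects each column. It also shows the names() function mentioned earlier. The replacement column labels are invented for this example, and stringsAsFactors = FALSE is an assumption added so the character columns stay character in older R versions.

```r
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status,
                          stringsAsFactors = FALSE)

# Each column reports exactly one class
col_classes <- sapply(patientdata, class)
col_classes

# names() reads or replaces the column names; work on a copy so the
# original data frame keeps its names (new labels are illustrative only)
renamed <- patientdata
names(renamed) <- c("id", "age_years", "diabetes_type", "status")
names(renamed)
```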

There are several ways to identify the elements of a data frame. You can use the subscript notation or you can specify column names. Using the patientdata data frame created earlier, the following listing demonstrates these approaches.

Listing 2 Specifying elements of a data frame

> patientdata[1:2]
  patientID age
1         1  25
2         2  34
3         3  28
4         4  52
> patientdata[c("diabetes", "status")]
  diabetes status
1    Type1 Poor
2    Type2 Improved
3    Type1 Excellent
4    Type1 Poor
> patientdata$age    #age variable in the patient data frame
[1] 25 34 28 52

The $ notation in the third example is used to indicate a particular variable from a given data frame. For example, if you want to cross-tabulate diabetes type by status, you could use the following code:

> table(patientdata$diabetes, patientdata$status)
 
        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0
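Listing 2 selects whole columns. The same bracket notation also accepts a row subscript before the comma, which may be handy when you want a single case or the cases meeting a condition. The following is a minimal sketch along those lines; the object names older and age_vec are made up for the example, and stringsAsFactors = FALSE is an added assumption.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor"),
  stringsAsFactors = FALSE
)

patientdata[2, ]                  # second row, all columns
cell <- patientdata[2, "age"]     # a single value (row 2, column age)

# Rows where a condition holds, restricted to two columns
older <- patientdata[patientdata$age > 30, c("patientID", "status")]
older

# [[ ]] extracts one column as a vector, like $
age_vec <- patientdata[["age"]]
```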

It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts are available. You can use either the attach() and detach() or with() functions to simplify your code.

attach, detach, and with

The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable. Using a sample (mtcars) data frame, you could use the following code to obtain summary statistics for automobile mileage (mpg), and plot this variable against engine displacement (disp), and weight (wt):

summary(mtcars$mpg)
plot(mtcars$mpg, mtcars$disp)
plot(mtcars$mpg, mtcars$wt)

This could also be written as

attach(mtcars)
  summary(mpg)
  plot(mpg, disp)
  plot(mpg, wt)
detach(mtcars)

The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The statement is optional but is good programming practice and should be included routinely.
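If you want to convince yourself of what attach() and detach() do to the search path, the base function search() lists it. The sketch below assumes a fresh session with no object named mpg in the workspace; the variable names are illustrative.

```r
attach(mtcars)
attached_during <- "mtcars" %in% search()  # TRUE while attached
mean_mpg <- mean(mpg)                      # mpg is found via the search path
detach(mtcars)

attached_after <- "mtcars" %in% search()   # FALSE once detached
still_there <- exists("mtcars")            # the data frame itself is untouched
```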

The limitations with this approach are evident when more than one object can have the same name. Consider the following code:

> mpg <- c(25, 36, 47)
> attach(mtcars)
 
The following object(s) are masked _by_ ‘.GlobalEnv’: mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  ‘x’ and ‘y’ lengths differ
> mpg
[1] 25 36 47

Here we already have an object named mpg in our environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn’t what you want. The plot statement fails because mpg has 3 elements and wt has 32 elements. The attach() and detach() functions are best used when you’re analyzing a single data frame and you’re unlikely to have multiple objects with the same name. In any case, be vigilant for warnings that say that objects are being masked.
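One way to be vigilant is to check for name clashes before attaching. The helper below, masked_names(), is hypothetical (it is not part of base R) and is only a sketch: it compares the column names of a data frame with the objects already in the global environment.

```r
# Hypothetical helper: which column names of df already exist as
# objects in env (the global environment by default)?
masked_names <- function(df, env = globalenv()) {
  intersect(names(df), ls(env))
}

mpg <- c(25, 36, 47)               # same setup as the example above
conflicts_found <- masked_names(mtcars)
conflicts_found                    # includes "mpg"

rm(mpg)                            # remove the clashing object
conflicts_after <- masked_names(mtcars)
```

An empty result suggests that attach() would not be masked by anything in your workspace at that moment; it says nothing about objects you create later.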

An alternative approach is to use the with() function. You could write the previous example as

with(mtcars, {
  print(summary(mpg))  # print() is needed: only the last value inside {} is auto-printed
  plot(mpg, disp)
  plot(mpg, wt)
})

In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts here. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional.
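To illustrate the point about name conflicts, the sketch below repeats the clashing mpg vector from the attach() example and shows that with() still finds the column inside mtcars first. The variable names n_in_with and n_global are made up for this example.

```r
mpg <- c(25, 36, 47)                    # global object with a clashing name

# Single statement, so the {} brackets are omitted.
# Inside with(), mpg refers to mtcars$mpg, not the global vector.
n_in_with <- with(mtcars, length(mpg))  # 32

n_global <- length(mpg)                 # 3: the global object is unchanged
rm(mpg)
```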

The limitation of the with() function is that assignments will only exist within the function brackets. Consider the following:

> with(mtcars, {
   stats <- summary(mpg)
   stats
  })
   Min. 1st Qu. Median Mean 3rd Qu. Max.
  10.40 15.43 19.20 20.09 22.80 33.90
> stats
Error: object ‘stats’ not found

If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one (<-). When with() is called from the top level, as it is here, <<- saves the object to the global environment, outside of the with() call. This can be demonstrated with the following code:

> with(mtcars, {
   nokeepstats <- summary(mpg)
   keepstats <<- summary(mpg)
})
> nokeepstats
Error: object ‘nokeepstats’ not found
> keepstats
   Min. 1st Qu. Median Mean 3rd Qu. Max.
    10.40 15.43 19.20 20.09 22.80 33.90
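An alternative to <<- that some people may find easier to follow is to assign the value returned by with() itself, since with() returns the value of the last expression it evaluates. This is a sketch of that approach rather than the book's method; the name results and its list elements are made up for the example.

```r
# with() returns its last value, so a regular assignment works
keepstats <- with(mtcars, summary(mpg))

# For several results, return them together in a list
results <- with(mtcars, {
  list(stats = summary(mpg),
       r     = cor(mpg, wt))   # correlation of mileage and weight
})
names(results)
```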

Most books on R recommend using with() over attach(). I think that ultimately the choice is a matter of preference and should be based on what you’re trying to achieve and your understanding of the implications.

Case identifiers

In the patient data example, patientID is used to identify individuals in the dataset. In R, case identifiers can be specified with the row.names option of the data.frame() function. For example, the statement

patientdata <- data.frame(patientID, age, diabetes, status,
   row.names=patientID)

specifies patientID as the variable to use in labeling cases on various printouts and graphs produced by R.
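Row names also let you look up a case by its label rather than its position. The sketch below sorts the patients by age to show that the labels travel with the rows. The name sorted is made up for the example, and stringsAsFactors = FALSE is an added assumption; note that row names must be unique.

```r
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")
patientdata <- data.frame(patientID, age, diabetes, status,
                          row.names = patientID,
                          stringsAsFactors = FALSE)

# Reorder by age; each row keeps its original label
sorted <- patientdata[order(patientdata$age), ]
rownames(sorted)

# A character subscript selects by label, not by position
status_3 <- sorted["3", "status"]
```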

Summary

One of the most challenging tasks in data analysis is data preparation. R provides various structures for holding data and many methods for importing data from both keyboard and external sources. One of those structures is data frames, which we covered here. Your ability to specify elements of these structures via the bracket notation is particularly important in selecting, subsetting, and transforming data.

R offers a wealth of functions for accessing external data. This includes data from flat files, web files, statistical packages, spreadsheets, and databases. Note that you can also export data from R into these external formats. We showed you how to use either the attach() and detach() or with() functions to simplify your code.

This article first appeared as chapter 2.2.4 of the “R in Action” book, and is published with permission from Manning publishing house.

UseR! 2011 slides and videos – on one page

Links to slides and talks from useR 2011 – all organized in one page.

I was recently reminded that the wonderful team at Warwick University put online many of the slides (and some videos) of talks from the recent useR 2011 conference. You can browse the talks through the conference timetables (which will be the most up to date if more slides are added later), but I thought it might be more convenient for some of you to have the links to all the talks (with slides/videos) in one place.

I am grateful to all of the wonderful people who put their time into making such an amazing event (organizers, speakers, attendees), and also to the many speakers who made sure to share their talks/slides online for all of us to reference. I hope this open-slides trend will continue in the upcoming useR conferences…

Below are all the links:

Tuesday 16th August

09:50 – 10:50

  • Kaleidoscope Ia, MS.03, Chair: Dieter Menne
    • Claudia Beleites: Spectroscopic Data in R and Validation of Soft Classifiers: Classifying Cells and Tissues by Raman Spectroscopy [Slides]
    • Jonathan Rosenblatt: Revisiting Multi-Subject Random Effects in fMRI [Slides]
    • Zoe Hoare: Putting the R into Randomisation [Slides]
  • Kaleidoscope Ib, MS.01, Chair: Simon Urbanek
    • Markus Gesmann: Using the Google Visualisation API with R [Slides]
  • Kaleidoscope Ic, MS.02, Chair: Achim Zeileis
    • David Smith: The R Ecosystem [Slides]
    • E. James Harner: Rc2: R collaboration in the cloud [Slides]

11:15 – 12:35

  • Portfolio Management, B3.02, Chair: Patrick Burns
    • Jagrata Minardi: R in the Practice of Risk Management Today [Slides]
  • Bioinformatics and High-Throughput Data, B3.03, Chair: Hervé Pagès
    • Thierry Onkelinx: AFLP: generating objective and repeatable genetic data [Slides]
  • High Performance Computing, MS.03, Chair: Stefan Theussl
    • Willem Ligtenberg: GPU computing and R [Slides]
    • Manuel Quesada: OBANSoft: integrated software for Bayesian statistics and high performance computing with R [Slides]
  • Reporting Technologies and Workflows, MS.01, Chair: Martin Mächler
    • Andreas Leha: The Emacs Org-mode: Reproducible Research and Beyond [Slides]
  • Teaching, MS.02, Chair: Jay G. Kerns
    • Ian Holliday: Teaching Statistics to Psychology Students using Reproducible Computing package RC and supporting Peer Review Framework [Slides]
    • Achim Zeileis: Automatic generation of exams in R [Slides]

14:00 – 14:45

  • Invited Talk, MS.01/MS.02, Chair: David Firth
    • Ulrike Grömping: Design of Experiments in R [Slides] [Video]

14:45 – 15:30

  • Invited Talk, MS.01/MS.02, Chair: David Firth
    • Jonathan Rougier: Nomograms for visualising relationships between three variables [Slides] [Video]

16:00 – 17:00

  • Modelling Systems and Networks, B3.02, Chair: Jonathan Rougier
    • Rachel Oxlade: An S4 Object structure for emulation – the approximation of complex functions [Slides]
    • Christophe Dutang: Computation of generalized Nash equilibria [Slides]
  • Visualisation, MS.04, Chair: Antony Unwin
    • Andrej Blejec: animatoR: dynamic graphics in R [Slides]
    • Richard M. Heiberger: Graphical Syntax for Structables and their Mosaic Plots [Slides]
  • Dimensionality Reduction and Variable Selection, MS.01, Chair: Matthias Schmid
    • Marie Chavent: ClustOfVar: an R package for the clustering of variables [Slides]
    • Jürg Schelldorfer: Variable Screening and Parameter Estimation for High-Dimensional Generalized Linear Mixed Models Using l1-Penalization [Slides]
    • Benjamin Hofner: gamboostLSS: boosting generalized additive models for location, scale and shape [Slides]
  • Business Management, MS.02, Chair: Enrico Branca
    • Marlene S. Marchena: SCperf: An inventory management package for R [Slides]
    • Pairach Piboonrungroj: Using R to test transaction cost measurement for supply chain relationship: A structural equation model [Slides]
    • Fabrizio Ortolani: Integrating R and Excel for automatic business forecasting

17:05 – 18:05

Lightning Talks (see below)

Lightning Talks

  • Community and Communication, MS.02, Chair: Ashley Ford
    • George Zhang: China R user conference [Slides]
    • Tal Galili: Blogging and R – present and future [Link]
    • Markus Schmidberger: Get your R application onto a powerful and fully-configured Cloud Computing environment in less than 5 minutes. [Slides]
    • Eirini Koutoumanou: Teaching R to Non Package Literate Users [Slides]
    • Randall Pruim: Teaching Statistics using the mosaic Package [Slides]
  • Statistics and Programming, MS.01, Chair: Elke Thönnes
    • Toby Dylan Hocking: Fast, named capture regular expressions in R2.14 [Slides]
    • John C. Nash: Developments in optimization tools for R [Slides]
    • Christophe Dutang: A Unified Approach to fit probability distributions [Slides]
  • Package Showcase, MS.03, Chair: Jennifer Rogers
    • James Foadi: cRy: statistical applications in macromolecular crystallography [Slides]
    • Emilio López: Six Sigma is possible with R [Slides]
    • Jonathan Clayden: Medical image processing with TractoR [Slides]
    • Richard A. Bilonick: Using merror 2.0 to Analyze Measurement Error and Determine Calibration Curves [Slides]

Wednesday 17th August

09:00 – 09:50

  • Invited Talk, MS.01/MS.02, Chair: Ioannis Kosmidis
    • Lee E. Edlefsen: Scalable Data Analysis in R [Slides] [Video]

11:15 – 12:35

  • Spatio-Temporal Statistics, B3.02, Chair: Julian Stander
    • Nikolaus Umlauf: Structured Additive Regression Models: An R Interface to BayesX [Slides]
  • Molecular and Cell Biology, B3.03, Chair: Andrea Foulkes
    • Matthew Nunes: Summary statistics selection for ABC inference in R [Slides]
    • Maarten van Iterson: Power and minimal sample size for multivariate analysis of microarrays [Slides]
  • Mixed Effect Models, MS.03, Chair: Douglas Bates
    • Ulrich Halekoh: Kenward-Roger modification of the F-statistic for some linear mixed models fitted with lmer [Slides]
    • Marco Geraci: lqmm: Estimating Quantile Regression Models for Independent and Hierarchical Data with R [Slides]
    • Kenneth Knoblauch: Mixed-effects Maximum Likelihood Difference Scaling [Slides]
  • Programming, MS.01, Chair: Uwe Ligges
    • Ray Brownrigg: Tricks and Traps for Young Players [Slides]
    • Friedrich Schuster: Software design patterns in R [Slides]
    • Patrick Burns: Random input testing with R [Slides]
  • Data Mining Applications, MS.02, Chair: Przemysław Biecek
    • Stephan Stahlschmidt: Predicting the offender’s age
    • Daniel Chapsky: Leveraging Online Social Network Data and External Data Sources to Predict Personality [Slides]

14:45 – 15:30

  • Invited Talk, MS.01/MS.02, Chair: John Aston
    • Brandon Whitcher: Quantitative Medical Image Analysis [Slides] [Video]

16:00 – 17:00

  • Development of R, B3.02, Chair: John C. Nash
    • Andrew R. Runnalls: Interpreter Internals: Unearthing Buried Treasure with CXXR [Slides]
  • Geospatial Techniques, B3.03, Chair: Roger Bivand
    • Binbin Lu: Converting a spatial network to a graph in R [Slides]
    • Rainer M Krug: Spatial modelling with the R-GRASS Interface [Slides]
    • Daniel Nüst: sos4R – Accessing SensorWeb Data from R [Slides]
  • Genomics and Bioinformatics, MS.03, Chair: Ramón Diaz-Uriarte
    • Sebastian Gibb: MALDIquant: Quantitative Analysis of MALDI-TOF Proteomics Data [Slides]
  • Regression Modelling, MS.01, Chair: Cristiano Varin
    • Bettina Grün: Beta Regression: Shaken, Stirred, Mixed, and Partitioned [Slides]
    • Rune Haubo B. Christensen: Regression Models for Ordinal Data: Introducing R-package ordinal [Slides]
    • Giuseppe Bruno: Multiple choice models: why not the same answer? A comparison among LIMDEP, R, SAS and Stata [Slides]
  • R in the Business World, MS.02, Chair: David Smith
    • Derek McCrae Norton: Odysseus vs. Ajax: How to build an R presence in a corporate SAS environment [Slides]

17:05 – 18:05

  • Hydrology and Soil Science, B3.02, Chair: Thomas Petzoldt
    • Wayne Jones: GWSDAT (GroundWater Spatiotemporal Data Analysis Tool) [Slides]
    • Pierre Roudier: Visualisation and modelling of soil data using the aqp package [Slides]
  • Biostatistical Modelling, B3.03, Chair: Holger Hoefling
    • Annamaria Guolo: Higher-order likelihood inference in meta-analysis using R [Slides]
    • Cristiano Varin: Gaussian copula regression using R [Slides]
  • Psychometrics, MS.03, Chair: Yves Rosseel
    • Florian Wickelmaier: Multinomial Processing Tree Models in R [Slides]
    • Basil Abou El-Komboz: Detecting Invariance in Psychometric Models with the psychotree Package [Slides]
  • Multivariate Data, MS.01, Chair: Peter Dalgaard
    • John Fox: Tests for Multivariate Linear Models with the car Package [Slides]
    • Julie Josse: missMDA: a package to handle missing values in and with multivariate exploratory data analysis methods [Slides]
    • António Pedro Duarte Silva: MAINT.DATA: Modeling and Analysing Interval Data in R [Slides]
  • Interfaces, MS.02, Chair: Matthew Shotwell
    • Xavier de Pedro Puente: Web 2.0 for R scripts and workflows: Tiki and PluginR [Slides]
    • Sheri Gilley: A new task-based GUI for R [Slides]

Thursday 18th August

09:00 – 09:45

  • Invited Talk, MS.01/MS.02, Chair: Julia Brettschneider
    • Wolfgang Huber: Genomes and phenotypes [Slides] [Video]

09:50 – 10:50

  • Financial Models, B3.02, Chair: Giovanni Petris
    • Peter Ruckdeschel: (Robust) Online Filtering in Regime Switching Models and Application to Investment Strategies for Asset Allocation [Slides]
  • Ecology and Ecological Modelling, B3.03, Chair: Karline Soetaert
    • Christian Kampichler: Using R for the Analysis of Bird Demography on a Europe-wide Scale [Slides]
    • John C. Nash: An effort to improve nonlinear modeling practice [Slides]
  • Generalized Linear Models, MS.03, Chair: Kenneth Knoblauch
    • Ioannis Kosmidis: brglm: Bias reduction in generalized linear models [Slides]
    • Merete K. Hansen: The binomTools package: Performing model diagnostics on binomial regression models [Slides]
  • Reporting Data, MS.01, Chair: Martyn Plummer
    • Sina Rüeger: uniPlot – A package to uniform and customize R graphics [Slides]
    • Alexander Kowarik: sparkTable: Generating Graphical Tables for Websites and Documents with R [Slides]
    • Isaac Subirana: compareGroups package, updated and improved [Slides]
  • Process Optimization, MS.02, Chair: Tobias Verbeke
    • Emilio López: Six Sigma Quality Using R: Tools and Training [Slides]
    • Thomas Roth: Process Performance and Capability Statistics for Non-Normal Distributions in R [Slides]

11:15 – 12:35

  • Inference, B3.02, Chair: Peter Ruckdeschel
    • Henry Deng: Density Estimation Packages in R [Slides]
  • Population Genetics and Genetics Association Studies, B3.03, Chair: Martin Morgan
    • Benjamin French: Simple haplotype analyses in R [Slides]
  • Neuroscience, MS.03, Chair: Brandon Whitcher
    • Karsten Tabelow: Statistical Parametric Maps for Functional MRI Experiments in R: The Package fmri [Slides]
  • Data Management, MS.01, Chair: Barry Rowlingson
    • Susan Ranney: It’s a Boy! An Analysis of Tens of Millions of Birth Records Using R [Slides]
    • Joanne Demmler: Challenges of working with a large database of routinely collected health data: Combining SQL and R [Slides]
  • Interactive Graphics in R, MS.02, Chair: Paul Murrell
    • Richard Cotton: Easy Interactive ggplots [Slides]

14:00 – 15:00

  • Kaleidoscope IIIa, MS.03, Chair: Adrian Bowman
    • Thomas Petzoldt: Using R for systems understanding – a dynamic approach [Slides]
    • David L. Miller: Using multidimensional scaling with Duchon splines for reliable finite area smoothing [Slides]
    • Alastair Sanderson: Studying galaxies in the nearby Universe, using R and ggplot2 [Slides]
  • Kaleidoscope IIIb, MS.02, Chair: Frank Harrell
    • Paul Murrell: Vector Image Processing [Slides]