In this post, we visually explore relations between two predictor variables and an outcome using contour plots. We use the contour function in Base R to produce contour plots that are well suited for initial investigations into three-dimensional data. We then develop visualizations using ggplot2 to gain more control over the graphical output. We also describe several data transformations needed to accomplish this visual exploration.

The mtcars dataset provided with Base R contains results from Motor Trend road tests of 32 cars that took place between 1973 and 1974. We focus on the following three variables: wt (weight, in 1,000 lbs), hp (gross horsepower), and qsec (time in seconds required to travel a quarter mile). qsec is a measure of acceleration, with shorter times representing faster acceleration. It is reasonable to believe that weight and horsepower are jointly related to acceleration, possibly in a nonlinear fashion.
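The code below relies on a handful of add-on packages that are loaded up front. This setup block is implied by, but not shown in, the original post; all packages are on CRAN.

```r
library(ggplot2)      # plotting
library(reshape2)     # melt(): wide-to-long reshaping
library(stringr)      # str_locate(), str_sub()
library(directlabels) # direct.label(): labeling contour lines
```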

head(mtcars)

##                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant            18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To start, we look at a simple scatter plot of weight by horsepower, with each data point colored according to quartiles of acceleration. We first create a new variable to represent quartiles of acceleration using the cut and quantile functions.

# include.lowest = TRUE keeps the minimum qsec value from becoming NA
mtcars$quart <- cut(mtcars$qsec, quantile(mtcars$qsec), include.lowest = TRUE)
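As a quick sanity check (assuming the standard mtcars data), tabulating the new variable shows the 32 cars split into four quartile groups; note that `include.lowest = TRUE` is needed so the car with the minimum qsec is not dropped as NA.

```r
# Cut qsec at its quartiles; include.lowest = TRUE keeps the minimum in range
q <- quantile(mtcars$qsec)
quart <- cut(mtcars$qsec, q, include.lowest = TRUE)
table(quart)  # roughly eight cars per group
```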

From here, we use ggplot to visualize the data. We selected colors that are sequential and colorblind friendly using ColorBrewer and supplied them manually via scale_colour_manual() in the ggplot call below. Labels were also added manually to improve interpretation.

ggplot(mtcars, aes(x = wt, y = hp, color = factor(quart))) +
  geom_point(shape = 16, size = 5) +
  theme(legend.position = c(0.80, 0.85),
        legend.background = element_rect(colour = "black"),
        panel.background = element_rect(fill = "black")) +
  labs(x = "Weight (1,000lbs)", y = "Horsepower") +
  scale_colour_manual(values = c("#fdcc8a", "#fc8d59", "#e34a33", "#b30000"),
                      name = "Quartiles of qsec",
                      labels = c("14.5-16.9s", "17.0-17.7s", "17.8-18.9s", "19.0-22.9s"))

This plot provides a first look at the interrelationships of the three variables of interest. To get a different representation of these relations, we use contour plots.

Preparing the Data for Contour Plots in Base R

The contour function requires three-dimensional data as input. We are interested in estimating acceleration for all possible combinations of weight and horsepower using the available data, thereby generating three-dimensional data. To compute the estimates, a loess model with two predictors is fit to the data using the following call:

data.loess <- loess(qsec ~ wt * hp, data = mtcars)

The model contained within the resulting loess object is then used to output the three-dimensional dataset needed for plotting. We do that by generating a sequence of values with uniform spacing over the range of wt and hp. An arbitrarily chosen distance of 0.3 between sequence elements was used to give a relatively fine resolution to the data. Using the predict function, the loess model object is used to estimate a qsec value for each combination of values in the two sequences. These estimates are stored in a matrix where each element of the wt sequence is represented by a row and each element of the hp sequence is represented by a column.

# Create a sequence of incrementally increasing (by 0.3 units) values for both wt and hp
xgrid <- seq(min(mtcars$wt), max(mtcars$wt), 0.3)
ygrid <- seq(min(mtcars$hp), max(mtcars$hp), 0.3)

# Generate a dataframe with every possible combination of wt and hp
data.fit <- expand.grid(wt = xgrid, hp = ygrid)

# Feed the dataframe into the loess model and receive a matrix output with estimates of
# acceleration for each combination of wt and hp
mtrx3d <- predict(data.loess, newdata = data.fit)

# Abbreviated display of final matrix
mtrx3d[1:4, 1:4]

##            hp
## wt          hp= 52.0 hp= 52.3 hp= 52.6 hp= 52.9
##    wt=1.513 19.04237 19.03263 19.02285 19.01302
##    wt=1.813 19.25566 19.24637 19.23703 19.22764
##    wt=2.113 19.55298 19.54418 19.53534 19.52645
##    wt=2.413 20.06436 20.05761 20.05077 20.04383

We then visualize the resulting three dimensional data using the contour function.

contour(x = xgrid, y = ygrid, z = mtrx3d, xlab = "Weight (1,000lbs)", ylab = "Horsepower")

Preparing the Data for Contour Plots in ggplot2

To use ggplot2, we reshape the data into "long" format using the melt function from the reshape2 package, then name the resulting columns for clarity. An unfortunate side effect of the predict call used to populate the initial 3d dataset is that the row and column names of the resulting matrix are character strings of the form "variable=value". The character portion of these values needs to be removed and the remaining values converted to numeric. We use str_locate (from the stringr package) to find the "=" character, str_sub (also from stringr) to extract only the numerical portion of each string, and as.numeric to convert the results to the appropriate class.

# Transform data to long form
mtrx.melt <- melt(mtrx3d, id.vars = c("wt", "hp"), measure.vars = "qsec")
names(mtrx.melt) <- c("wt", "hp", "qsec")

# Return data to numeric form
mtrx.melt$wt <- as.numeric(str_sub(mtrx.melt$wt, str_locate(mtrx.melt$wt, "=")[1, 1] + 1))
mtrx.melt$hp <- as.numeric(str_sub(mtrx.melt$hp, str_locate(mtrx.melt$hp, "=")[1, 1] + 1))

head(mtrx.melt)

##      wt hp     qsec
## 1 1.513 52 19.04237
## 2 1.813 52 19.25566
## 3 2.113 52 19.55298
## 4 2.413 52 20.06436
## 5 2.713 52 20.65788
## 6 3.013 52 20.88378

Using ggplot2 to Create Contour Plots

With the data transformed into “long” form, we can make contour plots with ggplot2. With the most basic parameters in place, we see:

plot1 <- ggplot(mtrx.melt, aes(x = wt, y = hp, z = qsec)) +
  stat_contour()

The resulting plot is not very descriptive and has no indication of the values of qsec.

Contour plot with plot region colored using a continuous outcome variable (qsec).

To improve the plot's descriptive value, we add color to the contour plot based on values of qsec.

plot2 <- ggplot(mtrx.melt, aes(x = wt, y = hp, z = qsec)) +
  stat_contour(geom = "polygon", aes(fill = ..level..)) +
  geom_tile(aes(fill = qsec)) +
  stat_contour(bins = 15) +
  xlab("Weight (1,000lbs)") +
  ylab("Horsepower") +
  guides(fill = guide_colorbar(title = "¼ Mi. Time (s)"))

Contour plot with plot region colored using discrete levels

Another option is to add colored regions between contour lines. In this case, we split the range of qsec into 10 equal-width intervals using the cut function.

# Create ten segments to be colored in
mtrx.melt$equalSpace <- cut(mtrx.melt$qsec, 10)

# Sort the segments in ascending order
breaks <- levels(unique(mtrx.melt$equalSpace))

# Plot
plot3 <- ggplot() +
  geom_tile(data = mtrx.melt, aes(wt, hp, qsec, fill = equalSpace)) +
  geom_contour(color = "white", alpha = 0.5) +
  theme_bw() +
  xlab("Weight (1,000lbs)") +
  ylab("Horsepower") +
  scale_fill_manual(values = c("#35978f", "#80cdc1", "#c7eae5", "#f5f5f5",
                               "#f6e8c3", "#dfc27d", "#bf812d", "#8c510a",
                               "#543005", "#330000"),
                    name = "¼ Mi. Time (s)", breaks = breaks, labels = breaks)

## Warning in max(vapply(evaled, length, integer(1))): no non-missing
## arguments to max; returning -Inf

Note: in the lower right-hand corner of the graph above, there is a region where increasing weight is associated with decreasing ¼ mile times, which is not characteristic of the true relation between weight and acceleration. This is due to extrapolation performed by the predict function when creating predictions of qsec for combinations of weight and horsepower that do not exist in the raw data. This cannot be avoided using the methods described above, but a well-placed rectangle (geom_rect) or placing the legend over the offending area can conceal the region (see example below).
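A minimal sketch of the rectangle approach, building on plot3 from the previous block. The corner coordinates here are illustrative guesses for this particular plot, not values from the original post.

```r
# Cover the extrapolated lower-right corner with a neutral rectangle.
# xmin/xmax/ymin/ymax are eyeballed for this plot; adjust to taste.
plot3_masked <- plot3 +
  annotate("rect", xmin = 4.6, xmax = 5.5, ymin = 50, ymax = 120,
           fill = "white", colour = "black")
```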

Contour plot with contour lines colored using a continuous outcome variable (qsec)

Instead of coloring the whole plot region, it may be more desirable to color just the contour lines. This can be achieved by mapping the contour level to colour within stat_contour, rather than filling regions with scale_fill_manual. We also chose to move the legend into the area of extrapolation.

plot4 <- ggplot() +
  theme_bw() +
  xlab("Weight (1,000lbs)") +
  ylab("Horsepower") +
  stat_contour(data = mtrx.melt, aes(x = wt, y = hp, z = qsec, colour = ..level..),
               breaks = round(quantile(mtrx.melt$qsec, seq(0, 1, 0.1)), 0), size = 1) +
  scale_color_continuous(name = "¼ Mi. Time (s)") +
  theme(legend.justification = c(1, 0), legend.position = c(1, 0))

Contour plot with contour lines colored using a continuous outcome variable and overlaying scatterplot of weight and horsepower.

We can also overlay the raw data from mtcars onto the previous plot.

plot5 <- plot4 +
  geom_point(data = mtcars, aes(x = wt, y = hp), shape = 1, size = 2.5, color = "red")

Contour plot with contour lines colored using a continuous outcome variable and labeled using direct.label()

With color-coded contour lines, as seen in the previous example, it may be difficult to differentiate the values of qsec that each line represents. Although we supplied a legend in the preceding plot, using direct.label from the directlabels package can clarify the values of qsec.

plot6 <- direct.label(plot5, "bottom.pieces")

We hope that these examples were of help to you and that you are better able to visualize your data as a result.

For questions, corrections, or suggestions for improvement, contact John at JBellettiere@ucsd.edu or using @JohnBellettiere via Twitter.

If you are using **Windows**, you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr")  # install
setInternet2(TRUE)            # only for R versions older than 3.3.0
installr::updateR()           # updating R

Running updateR() will detect whether a new R version is available and, if so, download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package. If you only see the option to upgrade to an older version of R, change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

*I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.*

- `R CMD INSTALL` and hence `install.packages()` gave an internal error installing a package called description from a tarball on a case-insensitive file system.
- `match(x, t)` (and hence `x %in% t`) failed when `x` was of length one and either `x` and `t` were `character` and differed only in their `Encoding`, or `x` and `t` were `complex` with `NA`s or `NaN`s. (PR#16885.)
- `unloadNamespace(ns)` also works again when `ns` is a 'namespace', as from `getNamespace()`.
- `rgamma(1, Inf)` or `rgamma(1, 0, 0)` no longer give `NaN` but the correct limit.
- `length(baseenv())` is correct now.
- `pretty(d, ..)` for date-time `d` rarely failed when `"halfmonth"` time steps were tried (PR#16923) and on 'inaccurate' platforms such as 32-bit Windows or a configuration with `--disable-long-double`; see comment #15 of PR#16761.
- In `text.default(x, y, labels)`, the rarely(?) used default for `labels` is now correct also for the case of a 2-column matrix `x` and missing `y`.
- `as.factor(c(a = 1L))` preserves `names()` again, as in **R** < 3.1.0.
- `strtrim(""[0], 0[0])` now works.
- Use of `Ctrl-C` to terminate a reverse incremental search started by `Ctrl-R` in the `readline`-based Unix terminal interface is now supported for `readline` >= 6.3 (`Ctrl-G` always worked). (PR#16603)
- `diff(<difftime>)` now keeps the `"units"` attribute, as subtraction already did. (PR#16940)
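The `diff(<difftime>)` fix above can be seen directly from the console; the dates below are arbitrary examples, and the behavior shown assumes a patched R.

```r
d  <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-12"))
dt <- diff(d)   # differences of 4 and 7 days (a "difftime")
diff(dt)        # second difference now keeps the "units" attribute ("days")
```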

By running the following 3 lines of code:

install.packages("heatmaply")
library(heatmaply)
heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))

You will get this output in your browser (or RStudio console):

You can see more examples in the online vignette on CRAN. **For issue reports or feature requests, please visit the GitHub repo.**

A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers is encoded as a grid of colored cells. The rows and columns of the grid are ordered to highlight patterns and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing-value patterns, and more.
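For comparison, a static base-R analogue of such a heatmap can be drawn with stats::heatmap; this is a minimal sketch, and the column scaling and dendrogram ordering shown are heatmap defaults, not heatmaply settings.

```r
# Rows and columns are reordered by hierarchical clustering; values are
# scaled within columns so variables on different scales are comparable.
heatmap(as.matrix(mtcars), scale = "column", margins = c(5, 8))
```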

Interactive heatmaps allow the inspection of specific values by hovering the mouse over a cell, as well as zooming into a region of the heatmap by dragging a rectangle around the relevant area.

This work is based on ggplot2 and the plotly.js engine. It produces heatmaps similar to d3heatmap, with the advantages of speed (plotly.js can handle larger matrices), the ability to zoom from the dendrogram (thanks to the dendextend R package), and the possibility of new features in the future (such as color sidebars).

The heatmaply package is designed to offer features and a user interface familiar from heatmap, gplots::heatmap.2 and other functions for static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way. heatmaply includes the following features:

- Shows the row/column/value under the mouse cursor (and includes a legend on the side)
- Drag a rectangle over the heatmap image, or the dendrograms, in order to zoom in (the dendrogram coloring relies on integration with the dendextend package)
- Works from the R console, in RStudio, with R Markdown, and with Shiny

The package is similar to the d3heatmap package (developed by the brilliant Joe Cheng), but is based on the plotly R package. Performance-wise it can handle larger matrices. Furthermore, since it is based on ggplot2 + plotly, it is expected to gain more features in the future (as it is more easily extendable by non-JavaScript experts as well). I chose to build heatmaply on top of plotly.js since it is a free, open-source JavaScript library that can translate ggplot2 figures into self-contained interactive JavaScript objects (which can be viewed in your browser or RStudio).

The default color palette for the heatmap is based on the beautiful viridis package. Also, by using the dendextend package (see the open-access two-page bioinformatics paper), you can customize dendrograms before sending them to heatmaply (via Rowv and Colv).

You can see some more eye candy in the online vignette on CRAN.

**For issue reports or feature requests, please visit the GitHub repo.**


- `nchar(x, *)`'s argument `keepNA`, governing how the result for `NA`s in `x` is determined, gets a new default `keepNA = NA` which returns `NA` where `x` is `NA`, except for `type = "width"` which still returns `2`, the formatting / printing width of `NA`.
- All builds have support for https: URLs in the default methods for `download.file()`, `url()` and code making use of them. Unfortunately that cannot guarantee that any particular https: URL can be accessed. For example, server and client have to successfully negotiate a cryptographic protocol (TLS/SSL, ...) and the server's identity has to be verifiable *via* the available certificates. Different access methods may allow different protocols or use private certificate bundles: we encountered a https: CRAN mirror which could be accessed by one browser but not by another nor by `download.file()` on the same Linux machine.
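The new `keepNA` default for `nchar()` can be checked directly from the console; the values shown assume R >= 3.3.0.

```r
x <- c("abc", NA)
nchar(x)                  # 3 NA : NA input now gives NA by default
nchar(x, type = "width")  # 3  2 : except for type = "width" (width of "NA")
```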

- The `print` method for `methods()` gains a `byclass` argument.
- New functions `validEnc()` and `validUTF8()` give access to the validity checks for inputs used by `grep()` and friends.
- Experimental new functionality for S3 method checking, notably `isS3method()`. Also, the names of the **R** 'language elements' are exported as the character vector `tools::langElts`.
- `str(x)` now displays `"Time-Series"` also for matrix (multivariate) time-series, i.e. when `is.ts(x)` is true.
- (Windows only) The GUI menu item to install local packages now accepts '*.tar.gz' files as well as '*.zip' files (but defaults to the latter).
- New programmeR's utility function `chkDots()`.
- `D()` now signals an error when given invalid input, rather than silently returning `NA`. (Request of John Nash.)
- `formula` objects are slightly more "first class": e.g., `formula()` or `new("formula", y ~ x)` are now valid. Similarly for `"table"`, `"ordered"` and `"summary.table"`. Packages defining S4 classes with the above S3/S4 classes as slots should be reinstalled.
- New function `strrep()` for repeating the elements of a character vector.
- `rapply()` preserves attributes on the list when `how = "replace"`.
- New S3 generic function `sigma()` with methods for extracting the estimated standard deviation aka "residual standard deviation" from a fitted model.
- `news()` now displays **R** and package news files within the HTML help system if it is available. If no news file is found, a visible `NULL` is returned to the console.
- `as.raster(x)` now also accepts `raw` arrays `x`, assuming values in `0:255`.
- Subscripting of matrix/array objects of type `"expression"` is now supported.
- `type.convert("i")` now returns a factor instead of a complex value with zero real part and missing imaginary part.
- Graphics devices `cairo_pdf()` and `cairo_ps()` now allow non-default values of the cairographics 'fallback resolution' to be set. This now defaults to 300 on all platforms: that is the default documented by cairographics, but apparently was not used by all system installations.
- `file()` gains an explicit `method` argument rather than implicitly using `getOption("url.method", "default")`.
- Thanks to a patch from Tomas Kalibera, `x[x != 0]` is now typically faster than `x[which(x != 0)]` (in the case where `x` has no NAs, the two are equivalent).
- `read.table()` now always uses the names for a named `colClasses` argument (previously names were only used when `colClasses` was too short). (In part, wish of PR#16478.)
- (Windows only) `download.file()` with default `method = "auto"` and a ftps:// URL chooses `"libcurl"` if that is available.
- The out-of-the-box Bioconductor mirror has been changed to one using https://: use `chooseBioCmirror()` to choose a http:// mirror if required.
- The data frame and formula methods for `aggregate()` gain a `drop` argument.
- `available.packages()` gains a `repos` argument.
- The undocumented switching of methods for `url()` on https: and ftps: URLs is confined to `method = "default"` (and documented).
- `smoothScatter()` gains a `ret.selection` argument.
- `qr()` no longer has a `...` argument to pass additional arguments to methods.
- `[` has a method for class `"table"`.
- It is now possible (again) to `replayPlot()` a display list snapshot that was created by `recordPlot()` in a different **R** session. It is still not a good idea to use snapshots as a persistent storage format for **R** plots, but it is now not completely silly to use a snapshot as a format for transferring an R plot between two R sessions. The underlying changes mean that packages providing graphics devices (e.g., Cairo, RSvgDevice, cairoDevice, tikzDevice) will need to be reinstalled. Code for restoring snapshots was contributed by Jeroen Ooms and JJ Allaire. Some testing code is available at https://github.com/pmur002/R-display-list.
- `tools::undoc(dir = D)` and `codoc(dir = D)` now also work when `D` is a directory whose `normalizePath()`ed version does not end in the package name, e.g. from a symlink.
- `abbreviate()` has more support for multi-byte character sets – it no longer removes bytes within characters and knows about Latin vowels with accents. It is still only really suitable for (most) European languages, and still warns on non-ASCII input. `abbreviate(use.classes = FALSE)` is now implemented, and that is more suitable for non-European languages.
- `match(x, table)` is faster (sometimes by an order of magnitude) when `x` is of length one and `incomparables` is unchanged, thanks to Peter Haverty. (PR#16491)
- More consistent, partly not back-compatible, behavior of `NA` and `NaN` coercion to complex numbers, with operations less often resulting in complex `NA` (`NA_complex_`).
- `lengths()` considers methods for `length` and `[[` on `x`, so it should work automatically on any objects for which appropriate methods on those generics are defined.
- The logic for selecting the default screen device on OS X has been simplified: it is now `quartz()` if that is available, even if environment variable DISPLAY has been set by the user. The choice can easily be overridden *via* environment variable R_INTERACTIVE_DEVICE.
- On Unix-like platforms which support the `getline` C library function, `system(*, intern = TRUE)` no longer truncates (output) lines longer than 8192 characters, thanks to Karl Millar. (PR#16544)
- `rank()` gains a `ties.method = "last"` option, for convenience (and symmetry).
- `regmatches(invert = NA)` can now be used to extract both non-matched and matched substrings.
- `data.frame()` gains argument `fix.empty.names`; `as.data.frame.list()` gets new `cut.names`, `col.names` and `fix.empty.names` arguments.
- `plot(x ~ x, *)` now warns that it is the same as `plot(x ~ 1, *)`.
- `recordPlot()` has new arguments `load` and `attach` to allow package names to be stored as part of a recorded plot, and `replayPlot()` has a new argument `reloadPkgs` to load/attach any package names that were stored as part of a recorded plot.
- S4 dispatch works within calls to `.Internal()`. This means explicit S4 generics are no longer needed for `unlist()` and `as.vector()`.
- Only font family names starting with "Hershey" (and not "Her" as before) are given special treatment by the graphics engine.
- S4 values are automatically coerced to vector (via `as.vector`) when subassigned into atomic vectors.
- `findInterval()` gets a `left.open` option.
- The version of LAPACK included in the sources has been updated to 3.6.0, including those 'deprecated' routines which were previously included. *Ca* 40 double-complex routines have been added at the request of a package maintainer. As before, the details of what is included are in 'src/modules/lapack/README', and this now gives information on earlier additions.
- `tapply()` has been made considerably more efficient without changing functionality, thanks to proposals from Peter Haverty and Suharto Anggono. (PR#16640)
- `match.arg(arg)` (the one-argument case) is faster; so is `sort.int()`. (PR#16640)
- The `format` method for `object_size` objects now also accepts "binary" units such as `"KiB"` and, e.g., `"Tb"`. (Partly from PR#16649.)
- Profiling now records calls of the form `foo::bar` and some similar cases directly, rather than as calls to `<Anonymous>`. Contributed by Winston Chang.
- New string utilities `startsWith(x, prefix)` and `endsWith(x, suffix)`. These also provide speedups for some `grepl("^...", *)` uses (related to proposals in PR#16490).
- Reference class finalizers run at exit, as well as on garbage collection.
- Avoid a parallel dependency on stats for port choice and random number seeds. (PR#16668)
- The radix sort algorithm and implementation from data.table (`forder`) replaces the previous radix (counting) sort and adds a new method for `order()`. Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see `?sort`).
- The `order()` function gains a `method` argument for choosing between `"shell"` and `"radix"`.
- New function `grouping()` returns a permutation that stably rearranges data so that identical values are adjacent. The return value includes extra partitioning information on the groups. The implementation came included with the new radix sort.
- `rhyper(nn, m, n, k)` no longer returns `NA` when one of the three parameters exceeds the maximal integer.
- `switch()` now warns when no alternatives are provided.
- `parallel::detectCores()` now has default `logical = TRUE` on all platforms; as this was the default on Windows, this change only affects Sparc Solaris. Option `logical = FALSE` is now supported on Linux and recent versions of OS X (for the latter, thanks to a suggestion of Kyaw Sint).
- `hist()` for `"Date"` or `"POSIXt"` objects would sometimes give misleading labels on the breaks, as they were set to the day before the start of the period being displayed. The display format has been changed, and the shift of the start day has been made conditional on `right = TRUE` (the default). (PR#16679)
- **R** now uses a new version of the logo (donated to the R Foundation by RStudio). It is defined in '.svg' format, so it will resize without unnecessary degradation when displayed on HTML pages; there is also a vector PDF version. Thanks to Dirk Eddelbuettel for producing the corresponding X11 icon.
- New function `.traceback()` returns the stack trace which `traceback()` prints.
- `lengths()` dispatches internally.
- `dotchart()` gains a `pt.cex` argument to control the size of points separately from the size of plot labels. Thanks to Michael Friendly and Milan Bouchet-Valat for ideas and patches.
- `as.roman(ch)` now correctly deals with more diverse character vectors `ch`; also, arithmetic with the resulting roman numbers works in more cases. (PR#16779)
- `prcomp()` gains a new option `rank.`, allowing one to directly aim for fewer than `min(n, p)` principal components. The `summary()` and its `print()` method have been amended, notably for this case.
- `gzcon()` gains a new option `text`, which marks the connection as text-oriented (so e.g. `pushBack()` works). It is still always opened in binary mode.
- The `import()` namespace directive now accepts an argument `except` which names symbols to exclude from the imports. The `except` expression should evaluate to a character vector (after substituting symbols for strings). See Writing R Extensions.
- New convenience function `Rcmd()` in package tools for invoking `R CMD` tools from within **R**.
- New functions `makevars_user()` and `makevars_site()` in package tools to determine the location of the user and site-specific 'Makevars' files for customizing package compilation.
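A few of the new string utilities listed above can be tried directly (requires R >= 3.3.0):

```r
strrep("ab", 3)                            # "ababab"
startsWith(c("foo.R", "bar.txt"), "foo")   # TRUE FALSE
endsWith(c("foo.R", "bar.txt"), ".R")      # TRUE FALSE
```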

- `R CMD check` has a new option `--ignore-vignettes` for use with non-Sweave vignettes whose VignetteBuilder package is not available.
- `R CMD check` now by default checks code usage (*via* codetools) with only the base package attached. Functions from default packages other than base which are used in the package code but not imported are reported as undefined globals, with a suggested addition to the `NAMESPACE` file.
- `R CMD check --as-cran` now also checks DOIs in package 'CITATION' and Rd files.
- `R CMD Rdconv` and `R CMD Rd2pdf` each have a new option `--RdMacros=pkglist` which allows Rd macros to be specified before processing.

- The previously included versions of `zlib`, `bzip2`, `xz` and PCRE have been removed, so suitable external (usually system) versions are required (see the 'R Installation and Administration' manual).
- The unexported and undocumented Windows-only devices `cairo_bmp()`, `cairo_png()` and `cairo_tiff()` have been removed. (These devices should be used as e.g. `bmp(type = "cairo")`.)
- (Windows only) Function `setInternet2()` has no effect and will be removed in due course. The choice between methods `"internal"` and `"wininet"` is now made by the `method` arguments of `url()` and `download.file()`, and their defaults can be set *via* options. The out-of-the-box default remains `"wininet"` (as it has been since **R** 3.2.2).
- `[<-` with an S4 value into a list currently embeds the S4 object into its own list, such that the end result is roughly equivalent to using `[[<-`. That behavior is deprecated. In the future, the S4 value will be coerced to a list with `as.list()`.
- Package tools' functions `package.dependencies()`, `pkgDepends()`, etc. are deprecated now, mostly in favor of `package_dependencies()`, which is both more flexible and efficient.

- Support for very old versions of `valgrind` (e.g., 3.3.0) has been removed.
- The included `libtool` script (generated by `configure`) has been updated to version 2.4.6 (from 2.2.6a).
- `libcurl` version 7.28.0 or later with support for the `https` protocol is required for installation (except on Windows).
- BSD networking is now required (except on Windows), and so `capabilities("http/ftp")` is always true.
- `configure` uses `pkg-config` for PNG, TIFF and JPEG where this is available. This should work better with multiple installs and with those using static libraries.
- The minimum supported version of OS X is 10.6 ('Snow Leopard'): even that has been unsupported by Apple since 2012.
- The `configure` default on OS X is `--disable-R-framework`: enable this if you intend to install under '/Library/Frameworks' and use with `R.app`.
- The minimum preferred version of PCRE has since **R** 3.0.0 been 8.32 (released in Nov 2012). Versions 8.10 to 8.31 are now deprecated (with warnings from `configure`), but will still be accepted until **R** 3.4.0.
- `configure` looks for C functions `__cospi`, `__sinpi` and `__tanpi` and uses these if `cospi` *etc* are not found. (OS X is the main instance.)
- (Windows) R is now built using `gcc` 4.9.3. This build will require recompilation of at least those packages that include C++ code, and possibly others. A build of R-devel using the older toolchain will be temporarily available for comparison purposes. During the transition, the environment variable R_COMPILED_BY has been defined to indicate which toolchain was used to compile R (and hence, which should be used to compile code in packages). The `COMPILED_BY` variable described below will be a permanent replacement for this.
- (Windows) A `make` and `R CMD config` variable named `COMPILED_BY` has been added. This indicates which toolchain was used to compile R (and hence, which should be used to compile code in packages).

- The `make` macro `AWK`, which used to be made available to files such as 'src/Makefile', is no longer set.

- The API call `logspace_sum` introduced in **R** 3.2.0 is now remapped as an entry point to `Rf_logspace_sum`, and its first argument has gained a `const` qualifier. (PR#16470) Code using it will need to be reinstalled. Similarly, entry point `log1pexp`, also defined in 'Rmath.h', is remapped there to `Rf_log1pexp`.
- `R_GE_version` has been increased to `11`.
- New API call `R_orderVector1`, a faster one-argument version of `R_orderVector`.
- When **R** headers such as 'R.h' and 'Rmath.h' are called from C++ code in packages, they include the C++ versions of system headers such as '<cmath>' rather than the legacy headers such as '<math.h>'. (Headers 'Rinternals.h' and 'Rinterface.h' already did, and inclusion of system headers can still be circumvented by defining `NO_C_HEADERS`, including as from this version for those two headers.) The manual has long said that **R** headers should **not** be included within an `extern "C"` block, and almost all the packages affected by this change were doing so.
- Including header 'S.h' from C++ code would fail on some platforms, and so gives a compilation error on all.
- The deprecated header 'Rdefines.h' is now compatible with defining `R_NO_REMAP`.
- The connections API now includes a function `R_GetConnection()` which allows packages implementing connections to convert R `connection` objects to `Rconnection` handles used in the API. Code which previously used the low-level R-internal `getConnection()` entry point should switch to the official API.

- C-level
`asChar(x)`

is fixed for when`x`

is not a vector, and it returns`"TRUE"`

/`"FALSE"`

instead of`"T"`

/`"F"`

for logical vectors. - The first arguments of
`.colSums()`

etc (with an initial dot) are now named`x`

rather than`X`

(matching`colSums()`

): thus error messages are corrected. - A
`coef()`

method for class`"maov"`

has been added to allow`vcov()`

to work with multivariate results. (PR#16380) `method = "libcurl"`

connections signal errors rather than retrieving HTTP error pages (where the ISP reports the error).`xpdrows.data.frame()`

was not checking for unique row names; in particular, this affected assignment to non-existing rows via numerical indexing. (PR#16570)`tail.matrix()`

did not work for zero rows matrices, and could produce row “labels” such as`"[1e+05,]"`

.- Data frames with a column named
`"stringsAsFactors"`

now format and print correctly. (PR#16580) `cor()`

is now guaranteed to return a value with absolute value less than or equal to 1. (PR#16638)- Array subsetting now keeps
`names(dim(.))`

. - Blocking socket connection selection recovers more gracefully on signal interrupts.
- The
`data.frame`

method of`rbind()`

construction`row.names`

works better in borderline integer cases, but may change the names assigned. (PR#16666) - (X11 only)
`getGraphicsEvent()`

miscoded buttons and missed mouse motion events. (PR#16700) `methods(round)`

now also lists`round.POSIXt`

.`tar()`

now works with the default`files = NULL`

. (PR#16716)- Jumps to outer contexts, for example in error recovery, now make intermediate jumps to contexts where
`on.exit()`

actions are established instead of trying to run all`on.exit()`

actions before jumping to the final target. This unwinds the stack gradually, releases resources held on the stack, and significantly reduces the chance of a segfault when running out of C stack space. Error handlers established using`withCallingHandlers()`

and`options("error")`

specifications are ignored when handling a C stack overflow error as attempting one of these would trigger a cascade of C stack overflow errors. (These changes resolve PR#16753.) - The spacing could be wrong when printing a complex array. (Report and patch by Lukas Stadler.)
`pretty(d, n, min.n, *)`

for date-time objects`d`

works again in border cases with large`min.n`

, returns a`labels`

attribute also for small-range dates and in such cases its returned length is closer to the desired`n`

. (PR#16761) Additionally, it finally does cover the range of`d`

, as it always claimed.`tsp(x) <- NULL`

did not handle correctly objects inheriting from both`"ts"`

and`"mts"`

. (PR#16769)`install.packages()`

could give false errors when`options("pkgType")`

was`"binary"`

. (Reported by Jose Claudio Faria.)- A bug fix in
**R**3.0.2 fixed problems with`locator()`

in X11, but introduced problems in Windows. Now both should be fixed. (PR#15700) `download.file()`

with`method = "wininet"`

incorrectly warned of download file length difference when reported length was unknown. (PR#16805)`diag(NULL, 1)`

crashed because of missed type checking. (PR#16853)


The US primaries are coming on fast, with almost 120 days left until the conventions. After building a Shiny app for the Israeli elections, I decided to update features in the app and try out plotly in the Shiny framework.

As a casual voter, trying to gauge the true temperature of the political landscape from the overwhelming abundance of polling is a heavy task. Polling data is continuously published during the state primaries, and the variety of pollsters makes it hard to keep track of what is going on. The app self-updates using data published publicly by realclearpolitics.com.

The app keeps track of polling trends and delegate counts daily for you. You can create a personal analysis, from the granular-level data all the way to distributions, using interactive ggplot2 and plotly graphs, and check out the general election polling to peek into the near future.

The app can be accessed in a couple of places. I set up an AWS instance to host the app for real-time use, and there is the GitHub repository, the maintained home of the app, meant for members of the R community who want to host Shiny locally.

(github repo: yonicd/Elections)

```
# changing locale to run on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_TIME", "C")

# check to see if libraries need to be installed
libs <- c("shiny","shinyAce","plotly","ggplot2","rvest","reshape2",
          "zoo","stringr","scales","plyr","dplyr")
x <- sapply(libs, function(x) if (!require(x, character.only = TRUE)) install.packages(x))
rm(x, libs)

# run App
shiny::runGitHub("yonicd/Elections", subdir = "USA2016/shiny")

# reset to original locale on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_ALL")
```

(see next section for details)

- Current Polling
- Election Analysis
- General Elections
- Polling Database

- The top row depicts the current accumulation of delegates by party and candidate in a step plot, with a horizontal reference line for the threshold needed per party to receive the nomination. The accumulation does not include superdelegates, since it is uncertain which way they will vote. Currently this dataset is updated offline due to its somewhat static nature, and because the way the data is posted online forces the use of Selenium drivers. An action button will be added so users can refresh the data as needed.
- The bottom row is a 7-day moving average of all polling results published at the state and national level. The ribbon around the moving average is the moving standard deviation over the same window. This is helpful for picking up changes in uncertainty about how the voting public perceives the candidates. It can be seen that candidates with lower polling averages and increased variance trend up, while the opposite is true for the leading candidates, for whom voter uncertainty is a bad thing.
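The bottom-row computation can be sketched in a few lines of base R (assumptions: a simple data frame of daily results stands in for the polls scraped from realclearpolitics.com; the `roll()` helper is illustrative, not the app's code):

```r
# rolling statistic over a trailing window of `width` observations
roll <- function(x, width, fun) {
  out <- rep(NA_real_, length(x))
  for (i in width:length(x)) out[i] <- fun(x[(i - width + 1):i])
  out
}

# hypothetical polling series standing in for the scraped data
set.seed(1)
polls <- data.frame(
  Date    = seq(as.Date("2016-01-01"), by = "day", length.out = 60),
  Results = 40 + cumsum(rnorm(60))
)

polls$ma7 <- roll(polls$Results, 7, mean)  # 7-day moving average
polls$sd7 <- roll(polls$Results, 7, sd)    # moving SD, drawn as the ribbon
```

In ggplot2 the ribbon is then a `geom_ribbon(aes(ymin = ma7 - sd7, ymax = ma7 + sd7))` around a `geom_line()` of the moving average.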

- An interactive polling analysis layout where the user can filter elections, parties, pollsters and dates, and create different types of plots using any variables as the x and y axes.
- The default layer is the long-term trend (estimated with a loess smoother) of polling results, by party and candidate.

The user can choose to filter the plots by state, party, candidate and pollster. Next there is a slider to choose how many days before the conventions to show in the plot; this was used instead of a calendar to create a uniform timeline that is cleaner than arbitrary dates. Since there are a lot of states left, and no one keeps track of which ones remain, an extra filter was added to keep just the states with open primaries.

The new feature added is the option to go fully interactive and try out plotly! Its integration with ggplot2 is great, and new features are being added to the package all the time.

The base graphics are built with ggplot2, so the options above the graph give the user control over nearly all the options needed to build a plot. The user can choose from the following variables: **Date, Days Left to Convention, Month, Weekday, Week in Month, Party, Candidate, State, Pollster, Results, Final Primary Result, Pollster Error, Sample Type (Registered/Likely Voter), Sample Size**. There is an extra column in the Polling Database tab that gives the source URL of the poll that was conducted, for anyone who wants to dig deeper into the data.

These are used to define the following plot attributes:

Plot Type | Axes | Grouping | Plot Facets |
---|---|---|---|
Point | X axis variable | Split Y by colors using a different variable | Row Facet |
Bar | Discrete/Continuous | | Column Facet |
Line | Rotation of X tick labels | | |
Step | Y axis variable | | |
Boxplot | | | |
Density | | | |

- Create facets to display subsets of the data in different panels (two more variables to cut the data). There are two types of facets to choose from:
- Wrap: Wrap 1d ribbon of panels into 2d
- Grid: Layout panels in a grid (matrix)
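As a sketch of the two facet types (using the built-in mtcars data in place of the app's poll.shiny data frame):

```r
library(ggplot2)

# base plot; in the app this is the ggplot object built from the GUI choices
p <- ggplot(mtcars, aes(x = wt, y = hp)) + geom_point()

p + facet_wrap(~ cyl)     # Wrap: a 1d ribbon of panels wrapped into 2d
p + facet_grid(am ~ cyl)  # Grid: panels laid out in a matrix (rows ~ columns)
```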

An example of the distribution of polling results in the open primaries over the last two months:

Zooming in on this trend, we can see the state-level polling

An analysis showing the convergence of polling errors for Sanders and Clinton over the primary season. Initially Sanders was underestimated by the pollsters, and over time the public sentiment has shifted; currently the pollsters have captured the public sentiment toward the primary outcomes. This can be seen as a ceiling for the Sanders campaign:

- If you are an R user and know ggplot2 syntax, there is an additional editor console, below the plot, where you can create advanced plots freehand. Just add to the final object from the GUI, called `p` (the data.frame is `poll.shiny`), e.g. `p + geom_point()`. Note that all aesthetics must be given explicitly; they are not defined in the original `ggplot()` definition. It is also possible to use any library you want, just add it to the top of the code; the end object must be a ggplot. This also works great with plotly, so do not worry if you are in interactive mode.

```
#new layer
p+geom_smooth(aes(x=DaysLeft,y=Results,fill=Candidate))+
scale_x_reverse()+scale_fill_discrete(name="Candidate")
```

- You can also remove the original layer using the function `remove_geom(ggplot_object, geom_layer)`, e.g. `p = remove_geom(p, "point")` will remove the `geom_point` layer from the original graph

```
#new layer
p=p+geom_smooth(aes(x=DaysLeft,y=Results,fill=Candidate))+
scale_x_reverse()+scale_fill_discrete(name="Candidate")
remove_geom(p,"point") #leaving only the trend on the plot
```

- Finally the plots can be downloaded to your local computer using the download button.

- A peek into the public's sentiment on cross-party polling: Democratic candidate vs. Republican candidate. The plots are set up to show the Republican spread (Republican candidate – Democratic candidate) on the y-axis.
- The top plot is a long-term overview of the spread distributions with boxplots, while the bottom plot shows a daily account of the spread per candidate over the last two weeks. Both plots are split into national samples and state samples due to their heterogeneous nature.

- All raw data used in the application can be viewed and filtered in a datatable. There is an extra column that gives the source URL of the poll that was conducted for anyone who wants to dig deeper in the data.

If you are using **Windows** you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R
```

Running `updateR()` will detect whether a new R version is available, and if so it will download and install it (etc.). There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package.

*I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.*

- `install.packages()` and related functions now give a more informative warning when an attempt is made to install a base package.
- `summary(x)` now prints with less rounding when `x` contains infinite values. (Request of PR#16620.)
- `provideDimnames()` gets an optional `unique` argument.
- `shQuote()` gains `type = "cmd2"` for quoting in `cmd.exe` in Windows. (Response to PR#16636.)
- The `data.frame` method of `rbind()` gains an optional argument `stringsAsFactors` (instead of only depending on `getOption("stringsAsFactors")`).
- `smooth(x, *)` now also works for long vectors.
- `tools::texi2dvi()` has a workaround for problems with the `texi2dvi` script supplied by texinfo 6.1. It extracts more error messages from the LaTeX logs when in emulation mode.
- `R CMD check` will leave a log file ‘build_vignettes.log’ from the re-building of vignettes in the ‘.Rcheck’ directory if there is a problem, and always if environment variable _R_CHECK_ALWAYS_LOG_VIGNETTE_OUTPUT_ is set to a true value.
- Use of SUPPORT_OPENMP from header ‘Rconfig.h’ is deprecated in favour of the standard OpenMP define _OPENMP. (This has been the recommendation in the manual for a while now.)
- The `make` macro `AWK`, which is long unused by **R** itself but recorded in file ‘etc/Makeconf’, is deprecated and will be removed in **R** 3.3.0.
- The C header file ‘S.h’ is no longer documented: its use should be replaced by ‘R.h’.
- `kmeans(x, centers = <1-row>)` now works. (PR#16623)
- `Vectorize()` now checks for clashes in argument names. (PR#16577)
- `file.copy(overwrite = FALSE)` would signal a successful copy when none had taken place. (PR#16576)
- `ngettext()` now uses the same default domain as `gettext()`. (PR#14605)
- `array(.., dimnames = *)` now warns about non-`list` dimnames and, from **R** 3.3.0, will signal the same error for invalid dimnames as `matrix()` has always done.
- `addmargins()` now adds dimnames for the extended margins in all cases, as always documented.
- `heatmap()` evaluated its `add.expr` argument in the wrong environment. (PR#16583)
- `require()` etc. now give the correct entry of `lib.loc` in the warning about an old version of a package masking a newer required one.
- The internal deparser did not add parentheses when necessary, e.g. before `[]` or `[[]]`. (Reported by Lukas Stadler; additional fixes included as well.)
- `as.data.frame.vector(*, row.names=*)` no longer produces ‘corrupted’ data frames from row names of incorrect length, but rather warns about them. This will become an error.
- `url` connections with `method = "libcurl"` are destroyed properly. (PR#16681)
- `withCallingHandler()` now (again) handles warnings even during S4 generic’s argument evaluation. (PR#16111)
- `deparse(..., control = "quoteExpressions")` incorrectly quoted empty expressions. (PR#16686)
- `format()`ting datetime objects (`"POSIX[cl]?t"`) could segfault or recycle wrongly. (PR#16685)
- `plot.ts(<matrix>, las = 1)` now does use `las`.
- `saveRDS(*, compress = "gzip")` now works as documented. (PR#16653)
- (Windows only) The `Rgui` front end did not always initialize the console properly, and could cause **R** to crash. (PR#16998)
- `dummy.coef.lm()` now works in more cases, thanks to a proposal by Werner Stahel (PR#16665). In addition, it now works for multivariate linear models (`"mlm"`, `manova`) thanks to a proposal by Daniel Wollschlaeger.
- The `as.hclust()` method for `"dendrogram"`s failed often when there were ties in the heights.
- `reorder()` and `midcache.dendrogram()` now are non-recursive and hence applicable to somewhat deeply nested dendrograms, thanks to a proposal by Suharto Anggono in PR#16424.
- `cor.test()` now calculates very small p values more accurately (affecting the result only in extreme, not statistically relevant, cases). (PR#16704)
- `smooth(*, do.ends=TRUE)` did not always work correctly in **R** versions between 3.0.0 and 3.2.3.
- `pretty(D)` for date-time objects `D` now also works well if `range(D)` is (much) smaller than a second. In the case of only one unique value in `D`, the pretty range now is more symmetric around that value than previously. Similarly, `pretty(dt)` no longer returns a length 5 vector with duplicated entries for `Date` objects `dt` which span only a few days.
- The figures in help pages such as `?points` were accidentally damaged, and did not appear in **R** 3.2.3. (PR#16708)
- `available.packages()` sometimes deleted the wrong file when cleaning up temporary files. (PR#16712)
- The `X11()` device sometimes froze on Red Hat Enterprise Linux 6. It now waits for `MapNotify` events instead of `Expose` events, thanks to Siteshwar Vashisht. (PR#16497)
- `[dpqr]nbinom(*, size=Inf, mu=.)` now works as limit case, for ‘dpq’ as the Poisson. (PR#16727) `pnbinom()` no longer loops infinitely in border cases.
- `approxfun(*, method="constant")` and hence `ecdf()`, which calls the former, now correctly “predict” `NaN` values as `NaN`.
- `summary.data.frame()` now displays `NA`s in `Date` columns in all cases. (PR#16709)


The ASA statement about the misuses of the p-value singles it out, but it is just as relevant to the use of most other statistical methods: context matters, no single statistical measure suffices, specific thresholds should be avoided, and reporting should not be done selectively. The latter problem is discussed mainly in relation to omitted inferences. We argue that the selective reporting of inferences is a serious enough problem in our current industrialized science even when no omission takes place. Many R tools are available to address it, but they are mainly used in very large problems and are grossly underused in areas where lack of replicability hits hard.

Source: xkcd

A few days ago the ASA released a statement titled “on p-values: context, process, and purpose”. It was a way for the ASA to address the concerns about the role of Statistics in the Reproducibility and Replicability (R&R) crisis. In the discussions about R&R the p-value has become a scapegoat, being such a widely used statistical method. The ASA statement made an effort to clarify various misinterpretations and to point at misuses of the p-value, but we fear that the result is a statement that might be read by the target readers as expressing a very negative attitude towards the p-value. And indeed, just two days after the release of the ASA statement, a blog post titled “After 150 Years, the ASA Says No to p-values” was published (by Norman Matloff), even though the ASA (as far as we read it) did __not__ say “no to p-values” anywhere in the statement. Thankfully, other online reactions to the ASA statement, such as the article in Nature and other posts in the blogosphere (see [1], [2], [3], [4], [5]), did not use an anti-p-value rhetoric.

In spite of its misinterpretations, the p-value served science well over the 20th century. Why? Because in some sense the p-value offers a first line of defense against being fooled by randomness, separating signal from noise. It requires simpler (or fewer) models than those needed by other statistical tools: in order to be valid, the p-value requires only a statistical model for the behavior of a statistic under the null hypothesis. Even if a model of an alternative hypothesis is used for choosing a “good” statistic (which would be used for constructing a p-value with decent power for an alternative of interest), this alternative model does not have to be correct in order for the p-value to be valid and useful (i.e. to control the type I error at the desired level while offering some power to detect a real effect). In contrast, other (wonderful, useful and complementary) statistical methods such as likelihood ratios, effect size estimation, confidence intervals, or Bayesian methods all need the assumed models to hold over a wider range of situations, not merely under the tested null. In the context of the “replicability crisis” in science, the type I error control of the p-value under the null hypothesis is an important property. And most importantly, the model needed for the calculation of the p-value may be guaranteed to hold under an appropriately designed and executed randomized experiment.

The p-value is a very valuable tool, but it should be complemented – not replaced – by confidence intervals and effect size estimators (where possible in the specific setting). The ends of a 95% confidence interval indicate the range of potential null hypotheses that could be rejected. An estimator of effect size (supported by an assessment of uncertainty) is crucial for interpretation and for assessing the scientific significance of the results.

While useful, all these types of inferences are affected by problems similar to those of the p-value. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold? Does either of them measure the size of the effect? Finally, 95% confidence intervals or credence intervals offer no protection against selection when only those that do not cover 0 are selected into the abstract. The properties each method has on average for a single parameter (level, coverage or unbiasedness) will not necessarily hold, even on average, once a selection is made.
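The effect of selection on coverage is easy to demonstrate with a small simulation (a sketch; the effect size, sample size and replication count are arbitrary choices, not taken from the statement):

```r
# 95% CIs for a small true effect: overall coverage is near nominal, but
# among intervals selected because they exclude 0 ("significant"), the
# conditional coverage of the true effect drops well below 95%.
set.seed(42)
mu   <- 0.2   # true effect
n    <- 16    # observations per experiment
reps <- 4000  # number of simulated experiments

covered <- selected <- logical(reps)
for (i in seq_len(reps)) {
  x  <- rnorm(n, mean = mu)
  ci <- t.test(x)$conf.int
  covered[i]  <- ci[1] <= mu && mu <= ci[2]
  selected[i] <- ci[1] > 0 || ci[2] < 0   # CI excludes 0
}

mean(covered)            # unconditional coverage, close to the nominal 0.95
mean(covered[selected])  # conditional coverage after selection is clearly lower
```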

What, then, went wrong in the last decade or two? The change in the scale of the scientific work, brought about by high throughput experimentation methodologies, availability of large databases and ease of computation, a change that parallels the industrialization that production processes have already gone through. In Genomics, Proteomics, Brain Imaging and such, the number of potential discoveries scanned is enormous so the selection of the interesting ones for highlighting is a must. It has by now been recognized in these fields that merely “full reporting and transparency” (as recommended by ASA) is not enough, and methods should be used to control the effect of the unavoidable selection. Therefore, in those same areas, the p-value bright-line is not set at the traditional 5% level. Methods for adaptively setting it to directly control a variety of false discovery rates or other error rates are commonly used.

Addressing the effect of selection on inference (be it when using p-value, or other methods) has been a very active research area; New strategies and sophisticated selective inference tools for testing, confidence intervals, and effect size estimation, in different setups are being offered. Much of it still remains outside the practitioners’ active toolset, even though many are already available in R, as we describe below. The appendix of this post contains a partial list of R packages that support simultaneous and selective inference.

In summary, when discussing the impact of statistical practices on R&R, the p-value should not be singled out nor its usage discouraged: it’s more likely the fault of selection, and not the p-values’ fault.

Extended support for classical and modern adjustment for Simultaneous and Selective Inference (also known as “multiple comparisons”) is available in R and in various R packages. Traditional concern in these areas has been on properties holding simultaneously for all inferences. More recent concerns are on properties holding on the average over the selected, addressed by varieties of false discovery rates, false coverage rates and conditional approaches. The following is a list of relevant R resources. If you have more, please mention them in the comments.

Every R installation offers functions (from the {stats} package) for dealing with multiple comparisons, such as:

- **p.adjust** – gets a set of p-values as input and returns p-values adjusted using one of several methods: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988), FDR by Benjamini & Hochberg (1995), and Benjamini & Yekutieli (2001).
- **pairwise.t.test, pairwise.wilcox.test, and pairwise.prop.test** – all rely on p.adjust and can calculate pairwise comparisons between group levels with corrections for multiple testing.
- **TukeyHSD** – creates a set of confidence intervals on the differences between the means of the levels of a factor with the specified family-wise probability of coverage. The intervals are based on the Studentized range statistic, Tukey’s ‘Honest Significant Difference’ method.
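For illustration, here is `p.adjust` applied to a made-up vector of p-values (the numbers themselves are arbitrary):

```r
# adjusting one vector of p-values with several of the supported methods
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)

p.adjust(p, method = "bonferroni")  # family-wise error rate, most conservative
p.adjust(p, method = "holm")        # FWER, uniformly more powerful than Bonferroni
p.adjust(p, method = "BH")          # false discovery rate (Benjamini & Hochberg)

# how many "discoveries" survive at an FDR of 0.05
sum(p.adjust(p, method = "BH") <= 0.05)
```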

Once we venture outside of the core R functions, we are introduced to a wealth of R packages and statistical procedures. What follows is a partial list (if you wish to contribute and extend this list, please leave your comment to this post):

- multcomp – Simultaneous tests and confidence intervals for general linear hypotheses in parametric models, including linear, generalized linear, linear mixed effects, and survival models. The package includes demos reproducing the analyses presented in the book “Multiple Comparisons Using R” (Bretz, Hothorn, Westfall, 2010, CRC Press).
- coin (+ RcmdrPlugin.coin) – Conditional inference procedures for the general independence problem including two-sample, K-sample (non-parametric ANOVA), correlation, censored, ordered and multivariate problems.
- SimComp – Simultaneous tests and confidence intervals are provided for one-way experimental designs with one or many normally distributed, primary response variables (endpoints).
- PMCMR – Calculate Pairwise Multiple Comparisons of Mean Rank Sums
- mratios – perform (simultaneous) inferences for ratios of linear combinations of coefficients in the general linear model.
- mutoss (and accompanying mutossGUI) – are designed to ease the application and comparison of multiple hypothesis testing procedures.
- nparcomp – compute nonparametric simultaneous confidence intervals for relative contrast effects in the unbalanced one way layout. Moreover, it computes simultaneous p-values.
- ANOM – The package takes results from multiple comparisons with the grand mean (obtained with ‘multcomp’, ‘SimComp’, ‘nparcomp’, or ‘MCPAN’) or corresponding simultaneous confidence intervals as input and produces ANOM decision charts that illustrate which group means deviate significantly from the grand mean.
- gMCP – Functions and a graphical user interface for graphical described multiple test procedures.
- MCPAN – Multiple contrast tests and simultaneous confidence intervals based on normal approximation.
- mcprofile – Calculation of signed root deviance profiles for linear combinations of parameters in a generalized linear model. Multiple tests and simultaneous confidence intervals are provided.
- factorplot – Calculate, print, summarize and plot pairwise differences from GLMs, GLHT or Multinomial Logit models. Relies on stats::p.adjust
- multcompView – Convert a logical vector or a vector of p-values or a correlation, difference, or distance matrix into a display identifying the pairs for which the differences were not significantly different. Designed for use in conjunction with the output of functions like TukeyHSD, dist{stats}, simint, simtest, csimint, csimtest{multcomp}, friedmanmc, kruskalmc{pgirmess}.
- discreteMTP – Multiple testing procedures for discrete test statistics, that use the known discrete null distribution of the p-values for simultaneous inference.
- someMTP – a collection of functions for Multiplicity Correction and Multiple Testing.
- hdi – Implementation of multiple approaches to perform inference in high-dimensional models
- ERP – Significance Analysis of Event-Related Potentials Data
- TukeyC – Perform the conventional Tukey test from aov and aovlist objects
- qvalue – offers a function which takes a list of p-values resulting from the simultaneous testing of many hypotheses and estimates their q-values and local FDR values. (reading this discussion thread might be helpful)
- fdrtool – Estimates both tail area-based false discovery rates (Fdr) as well as local false discovery rates (fdr) for a variety of null models (p-values, z-scores, correlation coefficients, t-scores).
- cp4p – Functions to check whether a vector of p-values respects the assumptions of FDR (false discovery rate) control procedures and to compute adjusted p-values.
- multtest – Non-parametric bootstrap and permutation resampling-based multiple testing procedures (including empirical Bayes methods) for controlling the family-wise error rate (FWER), generalized family-wise error rate (gFWER), tail probability of the proportion of false positives (TPPFP), and false discovery rate (FDR).
- selectiveInference – New tools for post-selection inference, for use with forward stepwise regression, least angle regression, the lasso, and the many means problem.
- PoSI (site) – Valid Post-Selection Inference for Linear LS Regression
- HWBH – A Shiny app for hierarchical weighted FDR testing of primary and secondary endpoints in medical research. By Benjamini Y & Cohen R, 2013.
- repfdr (@github) – Estimation of Bayes and local Bayes false discovery rates for replicability analysis. Heller R, Yekutieli D, 2014.
- SelectiveCI – An R package for computing confidence intervals for selected parameters, as described in Asaf Weinstein, William Fithian & Yoav Benjamini, 2013 and Yoav Benjamini, Daniel Yekutieli, 2005.
- Rvalue – Software for FDR testing for replicability in primary and follow-up endpoints. Heller R, Bogomolov M, Benjamini Y, 2014, “Deciding whether follow-up studies have replicated findings in a preliminary large-scale ‘omics’ study”, under review and available upon request from the first author. Bogomolov M, Heller R, 2013.

Other than Simultaneous and Selective Inference, one should also mention that there are many R packages for reproducible research, i.e.: the connecting of data, R code, analysis output, and interpretation – so that scholarship can be recreated, better understood and verified. As well as for meta analysis, i.e.: the combining of findings from independent studies in order to make a more general claim.

- Statistics: P values are just the tip of the iceberg
- An estimate of the science-wise false discovery rate and application to the top medical literature
- On the scalability of statistical procedures: why the p-value bashers just don’t get it.

]]>

The paper got quite the attention on Hacker News, Data Science Central, Simply Stats, Xi’an’s blog, on Medium, and probably others. Share your thoughts in the comments.

Here is the abstract and table of content.

More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name “Data Science” for his envisioned field.

A recent and growing phenomenon is the emergence of “Data Science” programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M “Data Science Initiative” that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

**Contents**

**1 Today’s Data Science Moment**

**2 Data Science ‘versus’ Statistics**

2.1 The ‘Big Data’ Meme

2.2 The ‘Skills’ Meme

2.3 The ‘Jobs’ Meme

2.4 What here is real?

2.5 A Better Framework

**3 The Future of Data Analysis, 1962**

**4 The 50 years since FoDA**

4.1 Exhortations

4.2 Reification

**5 Breiman’s ‘Two Cultures’, 2001**

**6 The Predictive Culture’s Secret Sauce**

6.1 The Common Task Framework

6.2 Experience with CTF

6.3 The Secret Sauce

6.4 Required Skills

**7 Teaching of today’s consensus Data Science**

**8 The Full Scope of Data Science**

8.1 The Six Divisions

8.2 Discussion

8.3 Teaching of GDS

8.4 Research in GDS

8.4.1 Quantitative Programming Environments: R

8.4.2 Data Wrangling: Tidy Data

8.4.3 Research Presentation: Knitr

8.5 Discussion

**9 Science about Data Science**

9.1 Science-Wide Meta Analysis

9.2 Cross-Study Analysis

9.3 Cross-Workflow Analysis

9.4 Summary

**10 The Next 50 Years of Data Science**

10.1 Open Science takes over

10.2 Science as data

10.3 Scientific Data Analysis, tested Empirically

10.3.1 DJ Hand (2006)

10.3.2 Donoho and Jin (2008)

10.3.3 Zhao, Parmigiani, Huttenhower and Waldron (2014)

10.4 Data Science in 2065

**11 Conclusion**

Feature extraction tends to be one of the most important steps in machine learning and data science projects, so I decided to republish a related short section from my intermediate book on how to analyze data with R. The 9th chapter is dedicated to traditional dimension reduction methods, such as Principal Component Analysis, Factor Analysis and Multidimensional Scaling; the introductory examples below focus on the latter.

Multidimensional Scaling (MDS) is a multivariate statistical technique first used in geography. The main goal of MDS is to plot multivariate data points in two dimensions, thus revealing the structure of the dataset by visualizing the relative distances of the observations. Multidimensional scaling is used in diverse fields such as attitude study in psychology, sociology, and market research.

Although the `MASS` package provides non-metric methods via the `isoMDS` function, we will now concentrate on the classical, metric MDS, which is available by calling the `cmdscale` function bundled with the `stats` package. Both types of MDS take a distance matrix as the main argument, which can be created from any numeric tabular data by the `dist` function.
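To make that last point concrete, here is a minimal sketch (using a made-up two-row matrix, not an example from the book) of how `dist` turns numeric tabular data into a distance matrix:

```r
# Two points in the plane: (0, 0) and (3, 4)
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
d <- dist(m)  # Euclidean distance by default
as.matrix(d)  # symmetric 2 x 2 matrix; the off-diagonal distance is 5
```

The resulting `dist` object is exactly the kind of input that `cmdscale` and `isoMDS` expect.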

But before tackling such complex examples, let's see what MDS can offer while working with an already existing distance matrix, like the built-in `eurodist` dataset:

```
> as.matrix(eurodist)[1:5, 1:5]
Athens Barcelona Brussels Calais Cherbourg
Athens 0 3313 2963 3175 3339
Barcelona 3313 0 1318 1326 1294
Brussels 2963 1318 0 204 583
Calais 3175 1326 204 0 460
Cherbourg 3339 1294 583 460 0
```

The above subset (the first 5 rows and columns) of the distance matrix represents the travel distances between 21 European cities in kilometers. Running classical MDS on this example returns:

```
> (mds <- cmdscale(eurodist))
[,1] [,2]
Athens 2290.2747 1798.803
Barcelona -825.3828 546.811
Brussels 59.1833 -367.081
Calais -82.8460 -429.915
Cherbourg -352.4994 -290.908
Cologne 293.6896 -405.312
Copenhagen 681.9315 -1108.645
Geneva -9.4234 240.406
Gibraltar -2048.4491 642.459
Hamburg 561.1090 -773.369
Hook of Holland 164.9218 -549.367
Lisbon -1935.0408 49.125
Lyons -226.4232 187.088
Madrid -1423.3537 305.875
Marseilles -299.4987 388.807
Milan 260.8780 416.674
Munich 587.6757 81.182
Paris -156.8363 -211.139
Rome 709.4133 1109.367
Stockholm 839.4459 -1836.791
Vienna 911.2305 205.930
```

These scores are very similar to the first two principal components (discussed in the previous, *Principal Component Analysis* section), such as those produced by running `prcomp(eurodist)$x[, 1:2]`. As a matter of fact, PCA can be considered the most basic MDS solution.
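We can sanity-check this similarity directly (a quick sketch of my own, not from the book; note that the sign of each component is arbitrary, so the coordinates may be flipped between the two methods):

```r
mds <- cmdscale(eurodist)          # classical MDS coordinates
pca <- prcomp(eurodist)$x[, 1:2]   # first two principal components
# If the scores agree up to sign flips, the absolute correlations
# between corresponding columns should be close to 1
abs(cor(mds[, 1], pca[, 1]))
abs(cor(mds[, 2], pca[, 2]))
```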

Anyway, we have just transformed (reduced) the 21-dimensional space into 2 dimensions, which can be plotted very easily, unlike the original distance matrix with 21 rows and 21 columns:

`> plot(mds)`

Does it ring a bell? If not yet, the image below might be more helpful, where the following two lines of code render the city names instead of anonymous points:

```
> plot(mds, type = 'n')
> text(mds[, 1], mds[, 2], labels(eurodist))
```

Although the *y* axis seems to be flipped (which you can fix by multiplying the second argument of `text` by `-1`), we have just rendered a map of some European cities from the distance matrix, without any further geographical data. I hope you find this rather impressive!

Please find more data visualization tricks and methods in the 13th, *Data Around Us* chapter, from which you can learn, for example, how to plot the above results over a satellite map downloaded from online service providers. For now, I will only focus on how to render this plot with the new version of `ggplot2` to avoid overlaps in the city names, and to suppress the not-that-useful *x* and *y* axis labels and ticks:

```
> library(ggplot2)
> ggplot(as.data.frame(mds), aes(V1, -V2, label = rownames(mds))) +
+ geom_text(check_overlap = TRUE) + theme_minimal() + xlab('') + ylab('') +
+ scale_y_continuous(breaks = NULL) + scale_x_continuous(breaks = NULL)
```

But let's get back to the original topic and see how to apply MDS to non-geographic data that was not prepared as a distance matrix. We will use the `mtcars` dataset in the following example, resulting in a plot with no axis elements:

```
> mds <- cmdscale(dist(mtcars))
> plot(mds, type = 'n', axes = FALSE, xlab = '', ylab = '')
> text(mds[, 1], mds[, 2], rownames(mds))
```

The above plot shows the 32 cars of the original dataset scattered in a two-dimensional space. The distances between the elements were computed from all 11 original numeric variables, and the MDS coordinates make it very easy to identify the similar and very different car types. We will cover these topics in more detail in the next chapter, which is dedicated to *Classification and Clustering*.
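One caveat worth adding (my own aside, not part of the original chapter): the columns of the built-in `mtcars` dataset are measured on very different scales, so the raw Euclidean distances are dominated by the variables with large values, such as `disp` and `hp`. Standardizing the columns before computing the distances often gives a more balanced picture:

```r
# Assumes the unmodified built-in mtcars (all 11 columns numeric).
# Scale each column to zero mean and unit variance before dist()
mds_scaled <- cmdscale(dist(scale(mtcars)))
plot(mds_scaled, type = 'n', axes = FALSE, xlab = '', ylab = '')
text(mds_scaled[, 1], mds_scaled[, 2], rownames(mtcars))
```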

*This article first appeared in the “Mastering Data Analysis with R” book, and is now published with the permission of Packt Publishing.*

As highlighted by David Smith, this release (R 3.2.3) makes a few small improvements and bug fixes to R, including:

- Improved support for users of the **Windows OS** in time zones, OS version identification, FTP connections, and printing (in the GUI).
- Performance improvements and more support for long vectors in some functions, including `which.max`.
- Improved accuracy for the Chi-Square distribution functions in some extreme cases.

If you are using **Windows**, you can easily upgrade to the latest version of R using the *installr* package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R
```

Running `updateR()` will detect if there is a new R version available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package.
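Before upgrading, you can check the version of the running session from within R itself (a small sketch using base functions only):

```r
R.version.string        # human-readable version of the running session
getRversion() < "3.2.3" # TRUE if this session predates the new release
```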

**NEW FEATURES**

- Some recently-added Windows time zone names have been added to the conversion table used to convert these to Olson names. (Including those relating to changes for Russia in Oct 2014, as in PR#16503.)
- (Windows) Compatibility information has been added to the manifests for ‘Rgui.exe’, ‘Rterm.exe’ and ‘Rscript.exe’. This should allow `win.version()` and `Sys.info()` to report the actual Windows version up to Windows 10.
- Windows `"wininet"` FTP first tries EPSV / PASV mode rather than only using active mode (reported by Dan Tenenbaum).
- `which.min(x)` and `which.max(x)` may be much faster for logical and integer `x` and now also work for long vectors.
- The ‘emulation’ part of `tools::texi2dvi()` has been somewhat enhanced, including support for `quiet = TRUE`. It can be selected by `texi2dvi = "emulation"`. (Windows) MiKTeX removed its `texi2dvi.exe` command in Sept 2015: `tools::texi2dvi()` tries `texify.exe` if it is not found.
- (Windows only) Shortcuts for printing and saving have been added to menus in `Rgui.exe`. (Request of PR#16572.)
- `loess(..., iterTrace=TRUE)` now provides diagnostics for robustness iterations, and the `print()` method for `summary(<loess>)` shows slightly more.
- The included version of PCRE has been updated to 8.38, a bug-fix release.
- `View()` now displays nested data frames in a more friendly way. (Request with patch in PR#15915.)

**BUG FIXES**

- `regexpr(pat, x, perl = TRUE)` with Python-style named capture did not work correctly when `x` contained `NA` strings. (PR#16484)
- The description of dataset `ToothGrowth` has been improved/corrected. (PR#15953)
- `model.tables(type = "means")` and hence `TukeyHSD()` now support `"aov"` fits without an intercept term. (PR#16437)
- `close()` now reports the status of a `pipe()` connection opened with an explicit `open` argument. (PR#16481)
- Coercing a list without names to a data frame is faster if the elements are very long. (PR#16467)
- (Unix-only) Under some rare circumstances piping the output from `Rscript` or `R -f` could result in attempting to close the input file twice, possibly crashing the process. (PR#16500)
- (Windows) `Sys.info()` was out of step with `win.version()` and did not report Windows 8.
- `topenv(baseenv())` returns `baseenv()` again as in **R** 3.1.0 and earlier. This also fixes `compilerJIT(3)` when used in ‘.Rprofile’.
- `detach()`ing the methods package keeps `.isMethodsDispatchOn()` true, as long as the methods namespace is not unloaded.
- Removed some spurious warnings from `configure` about the preprocessor not finding header files. (PR#15989)
- `rchisq(*, df=0, ncp=0)` now returns `0` instead of `NaN`, and `dchisq(*, df=0, ncp=*)` also no longer returns `NaN` in limit cases (where the limit is unique). (PR#16521)
- `pchisq(*, df=0, ncp > 0, log.p=TRUE)` no longer underflows (for ncp > ~60).
- `nchar(x, "w")` returned -1 for characters it did not know about (e.g. zero-width spaces): it now assumes 1. It now knows about most zero-width characters and a few more double-width characters.
- Help for `which.min()` is now more precise about behavior with logical arguments. (PR#16532)
- The print width of character strings marked as `"latin1"` or `"bytes"` was in some cases computed incorrectly.
- `abbreviate()` did not give names to the return value if `minlength` was zero, unlike when it was positive.
- (Windows only) `dir.create()` did not always warn when it failed to create a directory. (PR#16537)
- When operating in a non-UTF-8 multibyte locale (e.g. an East Asian locale on Windows), `grep()` and related functions did not handle UTF-8 strings properly. (PR#16264)
- `read.dcf()` sometimes misread lines longer than 8191 characters. (Reported by Hervé Pagès with a patch.)
- `within(df, ..)` no longer drops columns whose names start with a `"."`.
- The built-in `HTTP` server converted the entire `Content-Type` to lowercase, including parameters, which can cause issues for multi-part form boundaries. (PR#16541)
- Modifying slots of S4 objects could fail when the methods package was not attached. (PR#16545)
- `splineDesign(*, outer.ok=TRUE)` (splines) is better now (PR#16549), and `interpSpline()` now allows `sparse=TRUE` for speedup with non-small sizes.
- If the expression in the traceback was too long, `traceback()` did not report the source line number. (Patch by Kirill Müller.)
- The browser did not truncate the display of the function when exiting with `options("deparse.max.lines")` set. (PR#16581)
- When `bs(*, Boundary.knots=)` had boundary knots inside the data range, extrapolation was somewhat off. (Patch by Trevor Hastie.)
- `var()` and hence `sd()` warn about `factor` arguments, which are deprecated now. (PR#16564)
- `loess(*, weights = *)` stored wrong weights and hence gave slightly wrong predictions for `newdata`. (PR#16587)
- `aperm(a, *)` now preserves `names(dim(a))`.
- `poly(x, ..)` now works when either `raw=TRUE` or `coef` is specified. (PR#16597)
- `data(package=*)` is more careful in determining the path.
- `prettyNum(*, decimal.mark, big.mark)`: fixed a bug introduced when fixing PR#16411.

**INSTALLATION AND INCLUDED SOFTWARE**

- The included configuration code for `libintl` has been updated to that from `gettext` version 0.19.5.1; this should only affect how an external library is detected (and the only known instance is under OpenBSD). (Wish of PR#16464.)
- `configure` has a new argument `--disable-java` to disable the checks for Java.
- The `configure` default for `MAIN_LDFLAGS` has been changed for the FreeBSD, NetBSD and Hurd OSes to one more likely to work with compilers other than `gcc` (FreeBSD 10 defaults to `clang`).
- `configure` now supports the OpenMP flags `-fopenmp=libomp` (clang) and `-qopenmp` (Intel C).
- Various macros can be set to override the default behaviour of `configure` when detecting OpenMP: see file ‘config.site’.
- Source installation on Windows has been modified to allow for MiKTeX installations without `texi2dvi.exe`. See file ‘MkRules.dist’.