50 years of Data Science – by David Donoho

David Donoho published a fascinating paper based on a presentation at the Tukey Centennial workshop, Princeton NJ Sept 18 2015. You can download the full paper from here. 

The paper got quite the attention on Hacker News, Data Science Central, Simply Stats, Xi’an’s blog, srown ion medium, and probably others. Share your thoughts in the comments.

Here is the abstract and table of content.


More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name “Data Science” for his envisioned field.

A recent and growing phenomenon is the emergence of “Data Science” programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M “Data Science Initiative” that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.


1 Today’s Data Science Moment

2 Data Science ‘versus’ Statistics

2.1 The ‘Big Data’ Meme

2.2 The ‘Skills’ Meme

2.3 The ‘Jobs’ Meme

2.4 What here is real?

2.5 A Better Framework

3 The Future of Data Analysis, 1962

4 The 50 years since FoDA

4.1 Exhortations

4.2 Reification

5 Breiman’s ‘Two Cultures’, 2001

6 The Predictive Culture’s Secret Sauce

6.1 The Common Task Framework

6.2 Experience with CTF

6.3 The Secret Sauce

6.4 Required Skills

7 Teaching of today’s consensus Data Science

8 The Full Scope of Data Science

8.1 The Six Divisions

8.2 Discussion

8.3 Teaching of GDS

8.4 Research in GDS

8.4.1 Quantitative Programming Environments: R

8.4.2 Data Wrangling: Tidy Data

8.4.3 Research Presentation: Knitr

8.5 Discussion

9 Science about Data Science

9.1 Science-Wide Meta Analysis

9.2 Cross-Study Analysis

9.3 Cross-Workflow Analysis

9.4 Summary

10 The Next 50 Years of Data Science

10.1 Open Science takes over

10.2 Science as data

10.3 Scientific Data Analysis, tested Empirically

10.3.1 DJ Hand (2006)

10.3.2 Donoho and Jin (2008)

10.3.3 Zhao, Parmigiani, Huttenhower and Waldron (2014)

10.4 Data Science in 2065

11 Conclusion

You can download the full paper from here. 

Multidimensional Scaling with R (from “Mastering Data Analysis with R”)

Guest post by Gergely Daróczi. If you like this content, you can buy the full 396 paged e-book for 5 USD until January 8, 2016 as part of Packt’s “$5 Skill Up Campaign” at https://bit.ly/mastering-R

Feature extraction tends to be one of the most important steps in machine learning and data science projects, so I decided to republish a related short section from my intermediate book on how to analyze data with R. The 9th chapter is dedicated to traditional dimension reduction methods, such as Principal Component Analysis, Factor Analysis and Multidimensional Scaling — from which the below introductory examples will focus on that latter.

Multidimensional Scaling (MDS) is a multivariate statistical technique first used in geography. The main goal of MDS it is to plot multivariate data points in two dimensions, thus revealing the structure of the dataset by visualizing the relative distance of the observations. Multidimensional scaling is used in diverse fields such as attitude study in psychology, sociology or market research.

Although the MASS package provides non-metric methods via the isoMDS function, we will now concentrate on the classical, metric MDS, which is available by calling the cmdscale function bundled with the stats package. Both types of MDS take a distance matrix as the main argument, which can be created from any numeric tabular data by the dist function.

But before such more complex examples, let’s see what MDS can offer for us while working with an already existing distance matrix, like the built-in eurodist dataset:

> as.matrix(eurodist)[1:5, 1:5]
          Athens Barcelona Brussels Calais Cherbourg
Athens         0      3313     2963   3175      3339
Barcelona   3313         0     1318   1326      1294
Brussels    2963      1318        0    204       583
Calais      3175      1326      204      0       460
Cherbourg   3339      1294      583    460         0

The above subset (first 5-5 values) of the distance matrix represents the travel distance between 21 European cities in kilometers. Running classical MDS on this example returns:

> (mds <- cmdscale(eurodist))
                      [,1]      [,2]
Athens           2290.2747  1798.803
Barcelona        -825.3828   546.811
Brussels           59.1833  -367.081
Calais            -82.8460  -429.915
Cherbourg        -352.4994  -290.908
Cologne           293.6896  -405.312
Copenhagen        681.9315 -1108.645
Geneva             -9.4234   240.406
Gibraltar       -2048.4491   642.459
Hamburg           561.1090  -773.369
Hook of Holland   164.9218  -549.367
Lisbon          -1935.0408    49.125
Lyons            -226.4232   187.088
Madrid          -1423.3537   305.875
Marseilles       -299.4987   388.807
Milan             260.8780   416.674
Munich            587.6757    81.182
Paris            -156.8363  -211.139
Rome              709.4133  1109.367
Stockholm         839.4459 -1836.791
Vienna            911.2305   205.930

These scores are very similar to two principal components (discussed in the previous, Principal Component Analysis section), such as running prcomp(eurodist)$x[, 1:2]. As a matter of fact, PCA can be considered as the most basic MDS solution.

Anyway, we have just transformed (reduced) the 21-dimensional space into 2 dimensions, which can be plotted very easily — unlike the original distance matrix with 21 rows and 21 columns:

> plot(mds)


Does it ring a bell? If not yet, the below image might be more helpful, where the following two lines of code also renders the city names instead of showing anonymous points:

> plot(mds, type = 'n')
> text(mds[, 1], mds[, 2], labels(eurodist))

Continue reading “Multidimensional Scaling with R (from “Mastering Data Analysis with R”)”

R 3.2.3 is released (with improvements for Windows users, and general bug fixes)

R 3.2.3 (codename “Wooden Christmas Tree”) was released several days ago. You can get the latest binaries version from here. (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.

Major changes in R 3.2.3

As highlighted by David Smith, this release makes a few small improvements and bug fixes to R, including:

  • Improved support for users of the Windows OS in time zones, OS version identification, FTP connections, and printing (in the GUI).
  • Performance improvements and more support for long vectors in some functions including which.max
  • Improved accuracy for the Chi-Square distribution functions in some extreme cases

Upgrading to R 3.2.3 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install 
installr::updateR() # updating R.

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package.

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.


  • Some recently-added Windows time zone names have been added to the conversion table used to convert these to Olson names. (Including those relating to changes for Russia in Oct 2014, as in PR#16503.)
  • (Windows) Compatibility information has been added to the manifests for ‘Rgui.exe’, ‘Rterm.exe’ and ‘Rscript.exe’. This should allow win.version() andSys.info() to report the actual Windows version up to Windows 10.
  • Windows "wininet" FTP first tries EPSV / PASV mode rather than only using active mode (reported by Dan Tenenbaum).
  • which.min(x) and which.max(x) may be much faster for logical and integer x and now also work for long vectors.
  • The ‘emulation’ part of tools::texi2dvi() has been somewhat enhanced, including supporting quiet = TRUE. It can be selected by texi2dvi = "emulation".(Windows) MiKTeX removed its texi2dvi.exe command in Sept 2015: tools::texi2dvi() tries texify.exe if it is not found.
  • (Windows only) Shortcuts for printing and saving have been added to menus in Rgui.exe. (Request of PR#16572.)
  • loess(..., iterTrace=TRUE) now provides diagnostics for robustness iterations, and the print() method for summary(<loess>) shows slightly more.
  • The included version of PCRE has been updated to 8.38, a bug-fix release.
  • View() now displays nested data frames in a more friendly way. (Request with patch in PR#15915.)


  • regexpr(pat, x, perl = TRUE) with Python-style named capture did not work correctly when x contained NA strings. (PR#16484)
  • The description of dataset ToothGrowth has been improved/corrected. (PR#15953)
  • model.tables(type = "means") and hence TukeyHSD() now support "aov" fits without an intercept term. (PR#16437)
  • close() now reports the status of a pipe() connection opened with an explicit open argument. (PR#16481)
  • Coercing a list without names to a data frame is faster if the elements are very long. (PR#16467)
  • (Unix-only) Under some rare circumstances piping the output from Rscript or R -f could result in attempting to close the input file twice, possibly crashing the process. (PR#16500)
  • (Windows) Sys.info() was out of step with win.version() and did not report Windows 8.
  • topenv(baseenv()) returns baseenv() again as in R 3.1.0 and earlier. This also fixes compilerJIT(3) when used in ‘.Rprofile’.
  • detach()ing the methods package keeps .isMethodsDispatchOn() true, as long as the methods namespace is not unloaded.
  • Removed some spurious warnings from configure about the preprocessor not finding header files. (PR#15989)
  • rchisq(*, df=0, ncp=0) now returns 0 instead of NaN, and dchisq(*, df=0, ncp=*) also no longer returns NaN in limit cases (where the limit is unique). (PR#16521)
  • pchisq(*, df=0, ncp > 0, log.p=TRUE) no longer underflows (for ncp > ~60).
  • nchar(x, "w") returned -1 for characters it did not know about (e.g. zero-width spaces): it now assumes 1. It now knows about most zero-width characters and a few more double-width characters.
  • Help for which.min() is now more precise about behavior with logical arguments. (PR#16532)
  • The print width of character strings marked as "latin1" or "bytes" was in some cases computed incorrectly.
  • abbreviate() did not give names to the return value if minlength was zero, unlike when it was positive.
  • (Windows only) dir.create() did not always warn when it failed to create a directory. (PR#16537)
  • When operating in a non-UTF-8 multibyte locale (e.g. an East Asian locale on Windows), grep() and related functions did not handle UTF-8 strings properly. (PR#16264)
  • read.dcf() sometimes misread lines longer than 8191 characters. (Reported by Hervé Pagès with a patch.)
  • within(df, ..) no longer drops columns whose name start with a ".".
  • The built-in HTTP server converted entire Content-Type to lowercase including parameters which can cause issues for multi-part form boundaries (PR#16541).
  • Modifying slots of S4 objects could fail when the methods package was not attached. (PR#16545)
  • splineDesign(*, outer.ok=TRUE) (splines) is better now (PR#16549), and interpSpline() now allows sparse=TRUE for speedup with non-small sizes.
  • If the expression in the traceback was too long, traceback() did not report the source line number. (Patch by Kirill Müller.)
  • The browser did not truncate the display of the function when exiting with options("deparse.max.lines") set. (PR#16581)
  • When bs(*, Boundary.knots=) had boundary knots inside the data range, extrapolation was somewhat off. (Patch by Trevor Hastie.)
  • var() and hence sd() warn about factor arguments which are deprecated now. (PR#16564)
  • loess(*, weights = *) stored wrong weights and hence gave slightly wrong predictions for newdata. (PR#16587)
  • aperm(a, *) now preserves names(dim(a)).
  • poly(x, ..) now works when either raw=TRUE or coef is specified. (PR#16597)
  • data(package=*) is more careful in determining the path.
  • prettyNum(*, decimal.mark, big.mark): fixed bug introduced when fixing PR#16411.


  • The included configuration code for libintl has been updated to that from gettext version — this should only affect how an external library is detected (and the only known instance is under OpenBSD). (Wish of PR#16464.)
  • configure has a new argument –disable-java to disable the checks for Java.
  • The configure default for MAIN_LDFLAGS has been changed for the FreeBSD, NetBSD and Hurd OSes to one more likely to work with compilers other than gcc(FreeBSD 10 defaults to clang).
  • configure now supports the OpenMP flags -fopenmp=libomp (clang) and -qopenmp (Intel C).
  • Various macros can be set to override the default behaviour of configure when detecting OpenMP: see file ‘config.site’.
  • Source installation on Windows has been modified to allow for MiKTeX installations without texi2dvi.exe. See file ‘MkRules.dist’.

“Why do people contribute to the R?” – concolusions from a new PNAS article

tl;dr: People contribute to R for various reasons, which evolves with time. The main reasons appear to be: “fun coding”, personal commitment to the community, interaction with like-minded and/or important people  – leading to higher self-esteem, future job opportunities, a chance to express oneself and enjoyable social inclusion.

From the abstract

One of the cornerstones of the R system for statistical computing is the multitude of packages contributed by numerous package authors. This amount of packages makes an extremely broad range of statistical techniques and other quantitative methods freely available. Thus far, no empirical study has investigated psychological factors that drive authors to participate in the R project. This article presents a study of R package authors, collecting data on different types of participation (number of packages, participation in mailing lists, participation in conferences), three psychological scales (types of motivation, psychological values, and work design characteristics), and various socio-demographic factors. The data are analyzed using item response models and subsequent generalized linear models, showing that the most important determinants for participation are a hybrid form of motivation and the social characteristics of the work design. Other factors are found to have less impact or influence only specific aspects of participation.

Summary of results

R developers, statisticians, and psychologists from Harvard University, University of Vienna, WU Vienna University of Economics, and University of Innsbruck empirically studied psychosocial drivers of participation of R package authors. Through an online survey they collected data from 1,448 package authors. The questionnaire included psychometric scales (types of motivation, psychological values, work design), sociodemografic variables related to the work on R, and three participation measures (number of packages, participation in mailing lists, participation in conferences).


The data were analyzed using item response models and subsequently generalized linear models (logistic regressions, negative-binomial regression) with SIMEX corrected parameters.

The analysis reveals that the most important determinants for participation are a hybrid form of motivation and the social characteristics of the work design. Hybrid motivation acknowledges that motivation is a complex continuum of intrinsic, extrinsic, and internalized extrinsic motives.
Motives evolve over time, as task characteristics shift from need-driven problem solving to mundane maintenance tasks within the R community.
For instance, motivation can evolve from pure “fun coding” towards a personal commitment with associated higher responsibilities within the community. The community itself provides a social work environment with high degrees of interaction, two facets of which are strong motivators. First, interaction with persons perceived as important increases one’s own reputation (self-esteem, future job opportunities, etc.) Second, interaction with alike minded persons (i.e., interested in solving statistical problems) creates opportunities to express oneself and enjoy social inclusion.

The findings do not substantiate the commonly held perception that people develop packages out of purely altruistic motives. It is also notable that in most cases package development is undertaken as part of an individual’s research, which is paid by an (academic) institution, rather than uncompensated developments that cut into leisure time.

Full paper (behind PNAS’s paywall for now) is available here:

Mair, P., Hofmann, E., Gruber, K., Hatzinger, R., Zeileis, A., and Hornik, K. (2015). Motivation, values, and work design as drivers of participation in the R
open source project for statistical computing. Proceedings of the National Academy of Sciences of the United States of America, 112(48), 14788-14792


labels.dendrogram in R 3.2.2 can be ~70 times faster (for trees with 1000 labels)

The recent release of R 3.2.2 came with a small (but highly valuable) improvement to the stats:::labels.dendrogram function. When working with dendrograms with (say) 1000 labels, the new function offers a 70 times speed improvement over the version of the function from R 3.2.1. This speedup is even better than the Rcpp version of labels.dendrogram from the dendextendRcpp package.

Here is some R code to demonstrate this speed improvement:

# IF you are missing an of these - they should be installed:
# Getting labels from dendextendRcpp
labelsRcpp% dist %&gt;% hclust %&gt;% as.dendrogram

And here are the results:

&gt; microbenchmark(labels_3.2.1(dend), labels_3.2.2(dend), labelsRcpp(dend))
Unit: milliseconds
               expr        min         lq     median         uq       max neval
 labels_3.2.1(dend) 186.522968 189.395378 195.684164 208.328365 321.98368   100
 labels_3.2.2(dend)   2.604766   2.826776   2.891728   3.006792  21.24127   100
   labelsRcpp(dend)   3.825401   3.946904   3.999817   4.179552  11.22088   100
&gt; microbenchmark(labels_3.2.2(dend), order.dendrogram(dend))
Unit: microseconds
                   expr      min        lq   median        uq      max neval
     labels_3.2.2(dend) 2520.218 2596.0880 2678.677 2885.2890 9572.460   100
 order.dendrogram(dend)  665.191  712.2235  954.951  996.1055 2268.812   100

As we can see, the new labels function (in R 3.2.2) is about 70 times faster than the older version (from R 3.2.1). When only wanting something like the number of labels, using length on order.dendrogram will still be (about 3 times) faster than using labels.

This improvement is expected to speedup various functions in the dendextend R package (a package for visualizing, adjusting, and comparing dendrograms, which heavily relies on labels.dendrogram). We expect to get even better speedup improvements for larger trees.


R 3.2.2 is released

R 3.2.2 (codename “Fire Safety”) was released last weekend. You can get the latest binaries version from here. (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.


I personally found two things particularly interesting in this release:

  1. setInternet2(TRUE) is now the default for windows (which will save people from getting “Error in file(con, “r”)” when using the installr package)
  2. The dendrogram method of labels() is much more efficient for large dendrograms since it now uses rapply(). This is expected to speedup various functions in the dendextend R package (a package for visualizing, adjusting, and comparing dendrograms, which heavily relies on labels.dendrogram).

Also, David Smith (from Revolution/Microsoft) highlighted in his post several of the updates in R 3.2.2 he found interesting – mentioning how the new default for accessing the web with R will rely on the HTTPS protocol, and of improving the accuracy in the extreme tails of the t and hypergeometric distributions.

Upgrading to R 3.2.2 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install 
installr::updateR() # updating R.

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package.

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.



  • It is now easier to use secure downloads from https:// URLs on builds which support them: no longer do non-default options need to be selected to do so. In particular, packages can be installed from repositories which offer https:// URLs, and those listed by setRepositories()now do so (for some of their mirrors).Support for https:// URLs is available on Windows, and on other platforms if support forlibcurl was compiled in and if that supports the https protocol (system installations can be expected to do). So https:// support can be expected except on rather old OSes (an example being OS X ‘Snow Leopard’, where a non-system version of libcurl can be used).(Windows only) The default method for accessing URLs via download.file() and url() has been changed to be "wininet" using Windows API calls. This changes the way proxies need to be set and security settings made: there have been some reports of sites being inaccessible under the new default method (but the previous methods remain available).


Continue reading “R 3.2.2 is released”

Slides from my JSM 2015 talk on dendextend

If you happen to be at the JSM 2015 conference this week, then this Monday, at 2pm, I will give a talk on the dendextend R package  (in the session “Advances in Graphical Frameworks and Methods Part 1“) – feel free to drop by and say hi.

Here are my slides for the intended talk:


p.s.: Yes – this presentation is very similar, although not identical, to the one I gave at useR2015. For example, I mention the new bioinformatics paper on dendextend.

dendextend: a package for visualizing, adjusting, and comparing dendrograms (based on a paper from “bioinformatics”)

This post on the dendextend package is based on my recent paper from the journal bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released as CC-BY, I am permitted (and delighted) to republish it here in full:


Summary: dendextend is an R package for creating and comparing visually appealing tree diagrams. dendextend provides utility functions for manipulating dendrogram objects (their color, shape, and content) as well as several advanced methods for comparing trees to one another (both statistically and visually). As such, dendextend offers a flexible framework for enhancing R’s rich ecosystem of packages for performing hierarchical clustering of items.

Availability: The dendextend R package (including detailed introductory vignettes) is available under the GPL-2 Open Source license and is freely available to download from CRAN at: (https://cran.r-project.org/package=dendextend)

Contact: [email protected]

Continue reading “dendextend: a package for visualizing, adjusting, and comparing dendrograms (based on a paper from “bioinformatics”)”

dendextend version 1.0.1 + useR!2015 presentation

When using the dendextend package in your work, please cite it using:

Tal Galili (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. doi:10.1093/bioinformatics/btv428

My R package dendextend (version 1.0.1) is now on CRAN!

The dendextend package Offers a set of functions for extending dendrogram objects in R, letting you visualize and compare trees of hierarchical clusterings. With it you can (1) Adjust a tree’s graphical parameters – the color, size, type, etc of its branches, nodes and labels. (2) Visually and statistically compare different dendrograms to one another.

The previous release of dendextend (0.18.3) was half a year ago, and this version includes many new features and functions.

To help you discover how dendextend can solve your dendrogram/hierarchical-clustering issues, you may consult one of the following vignettes:

Here is an example figure from the first vignette (analyzing the Iris dataset)



This week, at useR!2015, I will give a talk on the package. This will offer a quick example, and a step-by-step example of some of the most basic/useful functions of the package. Here are the slides:


Lastly, I would like to mention the new d3heatmap package for interactive heat maps. This package is by Joe Cheng from Rstudio, and integrates well with dendrograms in general and dendextend in particular (thanks to some lovely github-commit-discussion between Joe and I). You are invited to see lively examples of the package in the post at the RStudio blog. Here is just one quick example:

d3heatmap(nba_players, colors = “Blues”, scale = “col”, dendrogram = “row”, k_row = 3)


Setting Rstudio server using Amazon Web Services (AWS) – a step by step (screenshots) tutorial

(this is a guest post by Liad Shekel)

Amazon Web Services (AWS) include many different computational tools, ranging from storage systems and virtual servers to databases and analytical tools. For us R-programmers, being familiar and experienced with these tools can be extremely beneficial in terms of efficiency, style, money-saving and more.

In this post we present a step-by-step screenshot tutorial that will get you to know Amazon EC2 service. We will set up an EC2 instance (Amazon virtual server), install an Rstudio server on it and use our beloved Rstudio via browser (all for free!). The slides below will also include an introduction to linux commands (basic), instructions for connecting to a remote server via ssh and more. No previous knowledge is required.

Useful links:

  1. Set up an AWS account (do not worry about the credit card details, you will not be charged for any of  our actions) – the steps are presented in the slides below.
  2. Windows users: download MobaXterm (or any other ssh client software).
    Mac users: make sure you are familiar with the terminal (cause I’m not).