Presidential Election Predictions 2016 (an ASA competition)

Guest post by Jo professor of mathematics, Pomona College.

ASA’s Prediction Competition

In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:

http://thisisstatistics.org/electionprediction2016/

To get you started, I’ve written an analysis of data scraped from fivethirtyeight.com. The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model – either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!
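As a rough illustration of the weighted-mean idea, here is a minimal sketch in base R. The poll numbers are made up, the weights (proportional to sample size) are only one possible choice, and the SE uses a Kish effective-sample-size approximation, which is an assumption and not necessarily the formula used in the linked analysis.

```r
# Weighted mean with an approximate standard error.
# ASSUMPTION: SE = sqrt(weighted variance / effective sample size),
# where the effective sample size is 1 / sum(w^2) for normalized weights.
weighted_mean_se <- function(x, w) {
  w <- w / sum(w)                      # normalize weights to sum to 1
  m <- sum(w * x)                      # weighted mean
  v <- sum(w * (x - m)^2)              # weighted variance around m
  n_eff <- 1 / sum(w^2)                # Kish effective sample size
  c(estimate = m, se = sqrt(v / n_eff))
}

# Hypothetical polls: candidate share (%) and sample size
polls <- data.frame(
  pct = c(45, 47, 44, 46),
  n   = c(800, 1200, 600, 1000)
)

# Hypothetical weight function: weight each poll by its sample size
weighted_mean_se(polls$pct, polls$n)
```

A different weight function (for example, one that down-weights older polls) can be swapped in by changing the second argument.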

The reproducibility crisis in science and prospects for R

Guest post by Gregorio Santori

The results of a recent Nature survey confirm that, for many researchers, we are living in an age of weak reproducibility (Baker M. Is there a reproducibility crisis? Nature 2016;533:453-454). Although the definition of reproducibility can vary widely between disciplines, the survey adopted the version in which “another scientist using the same methods gets similar results and can draw the same conclusions” (Reality check on reproducibility. Nature 2016;533:437). As early as 2009, Roger Peng formulated a very attractive definition of reproducibility: “In many fields of study there are examples of scientific investigations that cannot be fully replicated because of a lack of time or resources. In such a situation there is a need for a minimum standard that can fill the void between full replication and nothing. One candidate for this minimum standard is «reproducible research», which requires that data sets and computer code be made available to others for verifying published results and conducting alternative analyses” (Peng R. Reproducible research and Biostatistics. Biostatistics. 2009;10:405-408). For many readers of R-bloggers, Peng’s formulation probably means, first and foremost, a combination of R, LaTeX, Sweave, knitr, R Markdown, RStudio, and GitHub. From the broader perspective of scholarly journals, it mainly means Web repositories for experimental protocols, raw data, and source code.

Although researchers and funders can contribute in many ways to reproducibility, scholarly journals seem to be in a position to make a decisive contribution to more reproducible research. The opening of the “Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals”, developed by the International Committee of Medical Journal Editors (ICMJE), makes an explicit reference to reproducibility. Moreover, the same ICMJE Recommendations state that “the Methods section should aim to be sufficiently detailed such that others with access to the data would be able to reproduce the results”, while “[the Statistics section] describe[s] statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results”.

In December 2010, Nature Publishing Group launched Protocol Exchange, “[…] an Open Repository for the deposition and sharing of protocols for scientific research”, where “protocols […] are presented subject to a Creative Commons Attribution-NonCommercial licence”.

In December 2014, PLOS journals announced a new policy for data sharing, which resulted in the Data Availability Statement for submitted manuscripts.

In June 2014, at the American Association for the Advancement of Science headquarters, the US National Institutes of Health held a joint workshop on reproducibility, with the participation of the Nature Publishing Group, Science, and editors representing over 30 basic/preclinical science journals. The workshop resulted in the release of the “Principles and Guidelines for Reporting Preclinical Research”, which emphasized rigorous statistical analysis and data/material sharing.

In this scenario, I have recently suggested a global “statement for reproducibility” (Research papers: Journals should drive data reproducibility. Nature 2016;535:355). One of the strong points of this proposed statement is a ban on “point-and-click” statistical software. For papers with a “Statistical analysis” section, only original studies carried out using source code-based statistical environments should be admitted to peer review. In any case, the current policies adopted by scholarly journals seem to be moving towards stringent criteria to ensure more reproducible research. In the near future, the space for “point-and-click” statistical software will progressively shrink, and a cross-platform, open source language and environment such as R is destined to play a key role.

Using 2D Contour Plots within {ggplot2} to Visualize Relationships between Three Variables

Guest post by John Bellettiere, Vincent Berardi, Santiago Estrada

The Goal

To visually explore relations between two related variables and an outcome using contour plots. We use the contour function in base R to produce contour plots that are well suited for initial investigations into three-dimensional data. We then develop visualizations using ggplot2 to gain more control over the graphical output. We also describe several data transformations needed to accomplish this visual exploration.
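To give a flavor of the base R step, here is a minimal sketch using made-up data. The variable names and the outcome function are hypothetical; the point is the transformation from long-format data (one row per x/y pair) to the matrix that `contour()` expects.

```r
# Hypothetical data: an outcome z observed on a regular grid of x and y
x <- seq(-2, 2, length.out = 25)
y <- seq(-2, 2, length.out = 25)
long <- expand.grid(x = x, y = y)          # x varies fastest
long$z <- with(long, exp(-(x^2 + y^2)) + 0.5 * x * y)

# contour() needs z as a matrix with rows indexed by x and columns by y.
# Because expand.grid() varies x fastest, a column-major fill lines up.
z <- matrix(long$z, nrow = length(x), ncol = length(y))

contour(x, y, z, nlevels = 10,
        xlab = "x (first variable)", ylab = "y (second variable)",
        main = "Outcome as a function of two variables")
```

Data that do not sit on a regular grid would first need to be binned or interpolated onto one, which is the kind of transformation the post refers to.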

R 3.3.1 is released

R 3.3.1 (codename “Bug in Your Hair”) was released yesterday. You can get the latest binaries version from here (or the .tar.gz source code from here). The full list of bug fixes is provided below (this release does not introduce new features).

Upgrading to R 3.3.1 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE) # only for R versions older than 3.3.0
installr::updateR() # updating R.
```

Running “updateR()” will detect if there is a new R version available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

heatmaply: interactive heat maps (with R)

I am pleased to announce heatmaply, my new R package for generating interactive heat maps, based on the plotly R package.

tl;dr

By running the following 3 lines of code:

```
install.packages("heatmaply")
library(heatmaply)
heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))
```

You will get this output in your browser (or RStudio console):

R 3.3.0 is released!

R 3.3.0 (codename “Supposedly Educational”) was released today. You can get the latest binaries version from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.

Upgrading to R 3.3.0 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R.
```

Running “updateR()” will detect if there is a new R version available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

CHANGES IN R 3.3.0

SIGNIFICANT USER-VISIBLE CHANGES

• `nchar(x, *)`’s argument `keepNA`, governing how the result for `NA`s in `x` is determined, gets a new default `keepNA = NA` which returns `NA` where `x` is `NA`, except for `type = "width"` which still returns `2`, the formatting / printing width of `NA`.
• All builds have support for https: URLs in the default methods for `download.file()`, `url()` and code making use of them. Unfortunately that cannot guarantee that any particular https: URL can be accessed. For example, server and client have to successfully negotiate a cryptographic protocol (TLS/SSL, …) and the server’s identity has to be verifiable via the available certificates. Different access methods may allow different protocols or use private certificate bundles: we encountered a https: CRAN mirror which could be accessed by one browser but not by another nor by `download.file()` on the same Linux machine.
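The `nchar()` change in the first item above can be seen with a short sketch (the example vector is made up; the behavior shown assumes R 3.3.0 or later):

```r
# A character vector with a missing value
x <- c("apple", NA, "kiwi")

# New default (keepNA = NA): NA stays NA when counting characters
n_chars <- nchar(x)

# type = "width" still reports 2, the printing width of NA
n_width <- nchar(x, type = "width")

# The previous behavior can be requested explicitly
n_old <- nchar(x, keepNA = FALSE)

print(n_chars)
print(n_width)
print(n_old)
```

Code that relied on `nchar()` never returning `NA` may want to set `keepNA = FALSE` explicitly or handle missing values before the call.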

Election tRends: An interactive US election tracker (using Shiny and Plotly)

Guest post by Jonathan Sidi

Introduction

The US primaries are coming on fast with almost 120 days left until the conventions. After building a Shiny app for the Israeli elections, I decided to update features in the app and try out plotly in the Shiny framework.

As a casual voter, trying to gauge the true temperature of the political landscape from the overwhelming abundance of polling is a heavy task. Polling data is continuously published during the state primaries, and the variety of pollsters makes it hard to keep track of what is going on. The app updates itself using data published publicly by realclearpolitics.com.

The app keeps track of polling trends and the delegate count daily for you. You can create a personal analysis, from the granular-level data all the way to distributions, using interactive ggplot2 and plotly graphs, and check out the general election polling to peek into the near future.

The app can be accessed in a couple of places. I set up an AWS instance to host the app for real-time use, and the GitHub repository is the maintained home of the app, meant for members of the R community who can host Shiny locally.

Running the App through Github

(github repo: yonicd/Elections)

```
# changing locale to run on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_TIME", "C")

# check to see if libraries need to be installed
libs <- c("shiny", "shinyAce", "plotly", "ggplot2", "rvest", "reshape2",
          "zoo", "stringr", "scales", "plyr", "dplyr")
x <- sapply(libs, function(x) if (!require(x, character.only = TRUE)) install.packages(x))
rm(x, libs)

# run App
shiny::runGitHub("yonicd/Elections", subdir = "USA2016/shiny")

# reset to original locale on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_ALL")
```

Application Layout:

(see next section for details)

1. Current Polling
2. Election Analysis
3. General Elections
4. Polling Database

Usage Instructions:

Current Polling

• The top row depicts the current accumulation of delegates by party and candidate in a step plot, with a horizontal reference line for the threshold each party requires to receive the nomination. The accumulation does not include superdelegates, since it is uncertain which way they will vote. Currently this dataset is updated offline, because it is fairly static and the way the data is posted online forces the use of Selenium drivers. An action button will be added so that users can refresh the data as needed.
• The bottom row is a 7-day moving average of all polling results published at the state and national level. The ribbon around the moving average is the moving standard deviation over the same window. This helps pick up changes in uncertainty about how the voting public is perceiving the candidates. Candidates with lower polling averages and increased variance tend to trend up, while the opposite is true for the leading candidates, for whom voter uncertainty is a bad thing.
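The moving average and ribbon described above can be sketched in base R as follows. This is an illustration on simulated data, not the app's own code: the `rolling()` helper and the `polls` data frame are hypothetical, and the sketch assumes exactly one polling value per day.

```r
# Trailing-window statistic: apply f to the last k values at each position.
# Positions with fewer than k observations are left as NA.
rolling <- function(x, k, f) {
  vapply(seq_along(x), function(i) {
    if (i < k) NA_real_ else f(x[(i - k + 1):i])
  }, numeric(1))
}

# Simulated daily polling share for one hypothetical candidate
set.seed(1)
polls <- data.frame(day = 1:30, pct = 40 + cumsum(rnorm(30, sd = 0.5)))

polls$ma7   <- rolling(polls$pct, 7, mean)   # 7-day moving average
polls$sd7   <- rolling(polls$pct, 7, sd)     # 7-day moving standard deviation
polls$lower <- polls$ma7 - polls$sd7         # lower edge of the ribbon
polls$upper <- polls$ma7 + polls$sd7         # upper edge of the ribbon

tail(polls)
```

With real polling data, several polls can land on the same day and some days have none, so the values would need to be aggregated by date before a fixed 7-row window corresponds to 7 days.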

R 3.2.4 is released

R 3.2.4 (codename “Very Secure Dishes”) was released today. You can get the latest binaries version from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.

Upgrading to R 3.2.4 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R.
```

Running “updateR()” will detect if there is a new R version available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package.

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

NEW FEATURES

• `install.packages()` and related functions now give a more informative warning when an attempt is made to install a base package.
• `summary(x)` now prints with less rounding when `x` contains infinite values. (Request of PR#16620.)
• `provideDimnames()` gets an optional `unique` argument.
• `shQuote()` gains `type = "cmd2"` for quoting in `cmd.exe` in Windows. (Response to PR#16636.)
• The `data.frame` method of `rbind()` gains an optional argument `stringsAsFactors` (instead of only depending on `getOption("stringsAsFactors")`).
• `smooth(x, *)` now also works for long vectors.
• `tools::texi2dvi()` has a workaround for problems with the `texi2dvi` script supplied by texinfo 6.1. It extracts more error messages from the LaTeX logs when in emulation mode.

UTILITIES

• `R CMD check` will leave a log file ‘build_vignettes.log’ from the re-building of vignettes in the ‘.Rcheck’ directory if there is a problem, and always if environment variable `_R_CHECK_ALWAYS_LOG_VIGNETTE_OUTPUT_` is set to a true value.

DEPRECATED AND DEFUNCT

• Use of `SUPPORT_OPENMP` from header ‘Rconfig.h’ is deprecated in favour of the standard OpenMP define `_OPENMP`. (This has been the recommendation in the manual for a while now.)

• The `make` macro `AWK` which is long unused by R itself but recorded in file ‘etc/Makeconf’ is deprecated and will be removed in R 3.3.0.
• The C header file ‘S.h’ is no longer documented: its use should be replaced by ‘R.h’.

BUG FIXES

• `kmeans(x, centers = <1-row>)` now works. (PR#16623)
• `Vectorize()` now checks for clashes in argument names. (PR#16577)
• `file.copy(overwrite = FALSE)` would signal a successful copy when none had taken place. (PR#16576)
• `ngettext()` now uses the same default domain as `gettext()`. (PR#14605)
• `array(.., dimnames = *)` now warns about non-`list` dimnames and, from R 3.3.0, will signal the same error for invalid dimnames as `matrix()` has always done.
• `addmargins()` now adds dimnames for the extended margins in all cases, as always documented.
• `heatmap()` evaluated its `add.expr` argument in the wrong environment. (PR#16583)
• `require()` etc now give the correct entry of `lib.loc` in the warning about an old version of a package masking a newer required one.
• The internal deparser did not add parentheses when necessary, e.g. before `[]` or `[[]]`. (Reported by Lukas Stadler; additional fixes included as well).
• `as.data.frame.vector(*, row.names=*)` no longer produces ‘corrupted’ data frames from row names of incorrect length, but rather warns about them. This will become an error.
• `url` connections with `method = "libcurl"` are destroyed properly. (PR#16681)
• `withCallingHandler()` now (again) handles warnings even during S4 generic’s argument evaluation. (PR#16111)
• `deparse(..., control = "quoteExpressions")` incorrectly quoted empty expressions. (PR#16686)
• `format()`ting datetime objects (`"POSIX[cl]?t"`) could segfault or recycle wrongly. (PR#16685)
• `plot.ts(<matrix>, las = 1)` now does use `las`.
• `saveRDS(*, compress = "gzip")` now works as documented. (PR#16653)
• (Windows only) The `Rgui` front end did not always initialize the console properly, and could cause R to crash. (PR#16998)
• `dummy.coef.lm()` now works in more cases, thanks to a proposal by Werner Stahel (PR#16665). In addition, it now works for multivariate linear models (`"mlm"`, `manova`) thanks to a proposal by Daniel Wollschlaeger.
• The `as.hclust()` method for `"dendrogram"`s failed often when there were ties in the heights.
• `reorder()` and `midcache.dendrogram()` now are non-recursive and hence applicable to somewhat deeply nested dendrograms, thanks to a proposal by Suharto Anggono in PR#16424.
• `cor.test()` now calculates very small p values more accurately (affecting the result only in extreme not statistically relevant cases). (PR#16704)
• `smooth(*, do.ends=TRUE)` did not always work correctly in R versions between 3.0.0 and 3.2.3.
• `pretty(D)` for date-time objects `D` now also works well if `range(D)` is (much) smaller than a second. In the case of only one unique value in `D`, the pretty range now is more symmetric around that value than previously.
Similarly, `pretty(dt)` no longer returns a length 5 vector with duplicated entries for `Date` objects `dt` which span only a few days.
• The figures in help pages such as `?points` were accidentally damaged, and did not appear in R 3.2.3. (PR#16708)
• `available.packages()` sometimes deleted the wrong file when cleaning up temporary files. (PR#16712)
• The `X11()` device sometimes froze on Red Hat Enterprise Linux 6. It now waits for `MapNotify` events instead of `Expose` events, thanks to Siteshwar Vashisht. (PR#16497)
• `[dpqr]nbinom(*, size=Inf, mu=.)` now works as limit case, for ‘dpq’ as the Poisson. (PR#16727)
`pnbinom()` no longer loops infinitely in border cases.
• `approxfun(*, method="constant")` and hence `ecdf()` which calls the former now correctly “predict” `NaN` values as `NaN`.
• `summary.data.frame()` now displays `NA`s in `Date` columns in all cases. (PR#16709)
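As one concrete check of the fixes above, the `size = Inf` limit of the negative binomial can be compared against the Poisson distribution. This is a small sketch that assumes R 3.2.4 or later; the values of `k` and `mu` are arbitrary.

```r
# With size = Inf, the negative binomial (mean parametrization)
# is treated as its Poisson limit.
k  <- 0:5
mu <- 2

limit_pmf <- dnbinom(k, size = Inf, mu = mu)  # density in the limit case
pois_pmf  <- dpois(k, lambda = mu)            # Poisson density

limit_cdf <- pnbinom(k, size = Inf, mu = mu)  # distribution function
pois_cdf  <- ppois(k, lambda = mu)

all.equal(limit_pmf, pois_pmf)
all.equal(limit_cdf, pois_cdf)
```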

It’s not the p-values’ fault – reflections on the recent ASA statement (+relevant R resources)

Joint post by Yoav Benjamini and Tal Galili. The post highlights points raised by Yoav in his official response to the ASA statement (available on page 4 in the ASA supplemental tab), and offers a list of relevant R resources.

Summary

The ASA statement about the misuses of the p-value singles it out. It is just as relevant to the use of most other statistical methods: context matters, no single statistical measure suffices, specific thresholds should be avoided, and reporting should not be done selectively. The latter problem is discussed mainly in relation to omitted inferences. We argue that the selective reporting of inferences is a serious enough problem in our current industrialized science even when no omission takes place. Many R tools are available to address it, but they are mainly used in very large problems and are grossly underused in areas where lack of replicability hits hard.
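As one small example of such a tool, base R's `p.adjust()` adjusts a family of p-values for multiplicity. The sketch below uses made-up p-values and is only meant to show the mechanics; it is not the specific list of resources given in the full post.

```r
# Hypothetical p-values from five inferences reported together
p <- c(0.001, 0.012, 0.02, 0.03, 0.3)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_bh <- p.adjust(p, method = "BH")

# Bonferroni adjustment (controls the family-wise error rate)
p_bonf <- p.adjust(p, method = "bonferroni")

data.frame(raw = p, BH = p_bh, bonferroni = p_bonf)

# Number of inferences that remain below 0.05 under each adjustment
n_bh   <- sum(p_bh < 0.05)
n_bonf <- sum(p_bonf < 0.05)
```

The adjusted values depend on how many inferences are in the family, which is why reporting all of them, not just the selected ones, matters.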

Source: xkcd

50 years of Data Science – by David Donoho

David Donoho published a fascinating paper based on a presentation at the Tukey Centennial workshop, Princeton NJ Sept 18 2015. You can download the full paper from here.

The paper got quite a bit of attention on Hacker News, Data Science Central, Simply Stats, Xi’an’s blog, srown on Medium, and probably others. Share your thoughts in the comments.

Here is the abstract and table of contents.

Abstract

More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name “Data Science” for his envisioned field.

A recent and growing phenomenon is the emergence of “Data Science” programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M “Data Science Initiative” that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

Contents

1 Today’s Data Science Moment

2 Data Science ‘versus’ Statistics

2.1 The ‘Big Data’ Meme

2.2 The ‘Skills’ Meme

2.3 The ‘Jobs’ Meme

2.4 What here is real?

2.5 A Better Framework

3 The Future of Data Analysis, 1962

4 The 50 years since FoDA

4.1 Exhortations

4.2 Reification

5 Breiman’s ‘Two Cultures’, 2001

6 The Predictive Culture’s Secret Sauce

6.2 Experience with CTF

6.3 The Secret Sauce

6.4 Required Skills

7 Teaching of today’s consensus Data Science

8 The Full Scope of Data Science

8.1 The Six Divisions

8.2 Discussion

8.3 Teaching of GDS

8.4 Research in GDS

8.4.1 Quantitative Programming Environments: R

8.4.2 Data Wrangling: Tidy Data

8.4.3 Research Presentation: Knitr

8.5 Discussion

9.1 Science-Wide Meta Analysis

9.2 Cross-Study Analysis

9.3 Cross-Workflow Analysis

9.4 Summary

10 The Next 50 Years of Data Science

10.1 Open Science takes over

10.2 Science as data

10.3 Scientific Data Analysis, tested Empirically

10.3.1 DJ Hand (2006)

10.3.2 Donoho and Jin (2008)

10.3.3 Zhao, Parmigiani, Huttenhower and Waldron (2014)

10.4 Data Science in 2065

11 Conclusion