The US primaries are coming on fast, with almost 120 days left until the conventions. After building a Shiny app for the Israeli elections, I decided to update the app's features and try out plotly within the Shiny framework.

As a casual voter, trying to gauge the true temperature of the political landscape from the overwhelming abundance of polling is a heavy task. Polling data is published continuously during the state primaries, and the variety of pollsters makes it hard to keep track of what is going on. The app updates itself using data published publicly by realclearpolitics.com.

The app tracks polling trends and the delegate count daily for you. You can create a personal analysis, from the granular-level data all the way to distributions, using interactive ggplot2 and plotly graphs, and check out the general election polling to peek into the near future.

The app can be accessed in a couple of places. I set up an AWS instance to host the app for real-time use, and the GitHub repository is the maintained home of the app, meant for the R community, which can host Shiny locally.

(GitHub repo: yonicd/Elections)

```
# change locale to run on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_TIME", "C")

# check to see if libraries need to be installed
libs <- c("shiny", "shinyAce", "plotly", "ggplot2", "rvest", "reshape2",
          "zoo", "stringr", "scales", "plyr", "dplyr")
x <- sapply(libs, function(x) if (!require(x, character.only = TRUE)) install.packages(x))
rm(x, libs)

# run the app
shiny::runGitHub("yonicd/Elections", subdir = "USA2016/shiny")

# reset to the original locale on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_ALL")
```

(see next section for details)

- Current Polling
- Election Analysis
- General Elections
- Polling Database

- The top row depicts the current accumulation of delegates by party and candidate in a step plot, with a horizontal reference line for the threshold needed per party to receive the nomination. The accumulation does not include superdelegates, since it is uncertain which way they will vote. Currently this dataset is updated offline due to its somewhat static nature, and because the way the data is posted online forces the use of Selenium drivers. An action button will be added so users can refresh the data as needed.
- The bottom row is a 7-day moving average of all polling results published at the state and national level. The ribbon around the moving average is the moving standard deviation over the same window. This is helpful for picking up changes in uncertainty about how the voting public perceives the candidates. Candidates with lower polling averages and increased variance tend to trend up, while the opposite is true of the leading candidates, for whom voter uncertainty is a bad thing.
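
The moving statistics described above can be sketched with `zoo::rollapply`; the polling series below is simulated for illustration only (the app's real data comes from RealClearPolitics):

```r
library(zoo)
library(ggplot2)

# Hypothetical daily polling series (illustrative numbers, not real polls)
set.seed(1)
polls <- data.frame(
  date    = seq(as.Date("2016-01-01"), by = "day", length.out = 60),
  results = 45 + cumsum(rnorm(60, sd = 0.5))
)

# 7-day moving average and moving standard deviation over the same window
polls$ma <- rollapply(polls$results, width = 7, FUN = mean, fill = NA, align = "right")
polls$sd <- rollapply(polls$results, width = 7, FUN = sd,   fill = NA, align = "right")

# Ribbon of +/- one moving SD around the moving average
ggplot(na.omit(polls), aes(date, ma)) +
  geom_ribbon(aes(ymin = ma - sd, ymax = ma + sd), alpha = 0.3) +
  geom_line()
```

A widening ribbon flags growing disagreement among pollsters, which is exactly the uncertainty signal discussed above.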

- An interactive polling analysis layout where the user can filter elections, parties, pollsters, and dates, and create different types of plots using any variable as the x and y axes.
- The default layer is the long-term trend (estimated with a loess smoother) of published polling results, by party and candidate.

The user can filter the plots by state, party, candidate, and pollster. Next, there is a slider to choose how many days before the conventions to view in the plot. This was used instead of a calendar to create a uniform timeline that is cleaner than arbitrary dates. Since there are a lot of states left, and it is hard to keep track of which ones they are, an extra filter was added to keep just the states with open primaries.

The newly added feature is the option to go fully interactive and try out plotly! Its integration with ggplot2 is great, and new features are being added to the package all the time.

The base graphics are ggplot2, so the options above the graph give the user control over nearly everything needed to build a plot. The user can choose from the following variables: **Date, Days Left to Convention, Month, Weekday, Week in Month, Party, Candidate, State, Pollster, Results, Final Primary Result, Pollster Error, Sample Type (Registered/Likely Voter), Sample Size**. There is an extra column in the Polling Database tab that gives the source URL of the poll for anyone who wants to dig deeper into the data.

To define the following plot attributes:

| Plot Type | Axes | Grouping | Plot Facets |
|---|---|---|---|
| Point | X axis variable | Split Y by colors using a different variable | Row Facet |
| Bar | Discrete/Continuous | | Column Facet |
| Line | Rotation of X tick labels | | |
| Step | Y axis variable | | |
| Boxplot | | | |
| Density | | | |

- Create facets to display subsets of the data in different panels (two more variables to cut the data). There are two types of facets to choose from:
- Wrap: Wrap 1d ribbon of panels into 2d
- Grid: Layout panels in a grid (matrix)
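
As a sketch of how such a GUI-driven plot builder can work (the `build_plot` helper and its arguments are hypothetical, not the app's actual code; variable names arrive as strings, so `aes_string()` maps them):

```r
library(ggplot2)

# Hypothetical helper mimicking the GUI: the user picks variable names and a
# plot type, and facets are given as a formula string
build_plot <- function(data, x, y, colour = NULL,
                       type = c("point", "line", "boxplot"),
                       facet = NULL, facet_type = c("wrap", "grid")) {
  type <- match.arg(type)
  facet_type <- match.arg(facet_type)
  p <- ggplot(data, aes_string(x = x, y = y, colour = colour))
  p <- p + switch(type,
                  point   = geom_point(),
                  line    = geom_line(),
                  boxplot = geom_boxplot())
  if (!is.null(facet)) {
    f <- as.formula(facet)
    p <- p + if (facet_type == "wrap") facet_wrap(f) else facet_grid(f)
  }
  p
}

# e.g. a faceted scatter on the built-in iris data
p <- build_plot(iris, x = "Sepal.Length", y = "Sepal.Width",
                colour = "Species", type = "point",
                facet = "~ Species", facet_type = "wrap")
```

The "Wrap" vs "Grid" choice above maps directly to ggplot2's `facet_wrap()` (1d ribbon wrapped into 2d) and `facet_grid()` (panel matrix).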

An example of the distribution of polling results in the open primaries over the last two months:

Zooming in on this trend, we can see the state-level polling.

An analysis showing the convergence of polling errors for Sanders and Clinton over the primary season. Initially Sanders was underestimated by the pollsters, but over time public sentiment has shifted. Currently the pollsters have captured the public sentiment toward the primary outcomes. This can be seen as a ceiling for the Sanders campaign:

- If you are an R user and know ggplot2 syntax, there is an additional editor console, below the plot, where you can create advanced plots freehand. Just add to the final object from the GUI, which is called `p`; the data.frame is `poll.shiny`, e.g. `p + geom_point()`. Note that all aesthetics must be given explicitly, since they are not defined in the original `ggplot()` definition. It is also possible to use any library you want; just add it to the top of the code. The end object must be a ggplot. This also works great with plotly, so do not worry if you are in interactive mode.

```
#new layer
p+geom_smooth(aes(x=DaysLeft,y=Results,fill=Candidate))+
scale_x_reverse()+scale_fill_discrete(name="Candidate")
```

- You can also remove the original layer using the function `remove_geom(ggplot_object, geom_layer)`; e.g. `p <- remove_geom(p, "point")` will remove the geom_point layer from the original graph.

```
#new layer
p=p+geom_smooth(aes(x=DaysLeft,y=Results,fill=Candidate))+
scale_x_reverse()+scale_fill_discrete(name="Candidate")
remove_geom(p,"point") #leaving only the trend on the plot
```
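
The `remove_geom()` helper is custom to the app; a minimal sketch of how such a function could work (assumption: layers are matched by their geom class name, which may differ from the app's implementation):

```r
library(ggplot2)

# Drop every layer whose geom class matches the given name, e.g.
# "point" matches GeomPoint; a sketch, not the app's shipped version
remove_geom <- function(p, geom) {
  keep <- !vapply(p$layers,
                  function(l) grepl(geom, class(l$geom)[1], ignore.case = TRUE),
                  logical(1))
  p$layers <- p$layers[keep]
  p
}

p  <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method = "lm")
p2 <- remove_geom(p, "point")  # only the smooth layer remains
```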

- Finally the plots can be downloaded to your local computer using the download button.

- A peek into public sentiment on cross-party polling: Democratic candidate vs Republican candidate. The plots are set up to show the Republican spread (Republican candidate minus Democratic candidate) on the y-axis.
- The top plot is a long-term overview of the spread distributions with boxplots, while the bottom plot shows a daily account of the spread per candidate over the last two weeks. Both plots are split into national samples and state samples due to their heterogeneous nature.

- All raw data used in the application can be viewed and filtered in a datatable. There is an extra column that gives the source URL of the poll that was conducted for anyone who wants to dig deeper in the data.

If you are using **Windows**, you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R
```

Running `updateR()` will detect whether a new R version is available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package.

*I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.*

- `install.packages()` and related functions now give a more informative warning when an attempt is made to install a base package.
- `summary(x)` now prints with less rounding when `x` contains infinite values. (Request of PR#16620.)
- `provideDimnames()` gets an optional `unique` argument.
- `shQuote()` gains `type = "cmd2"` for quoting in `cmd.exe` in Windows. (Response to PR#16636.)
- The `data.frame` method of `rbind()` gains an optional argument `stringsAsFactors` (instead of only depending on `getOption("stringsAsFactors")`).
- `smooth(x, *)` now also works for long vectors.
- `tools::texi2dvi()` has a workaround for problems with the `texi2dvi` script supplied by texinfo 6.1. It extracts more error messages from the LaTeX logs when in emulation mode.
- `R CMD check` will leave a log file 'build_vignettes.log' from the re-building of vignettes in the '.Rcheck' directory if there is a problem, and always if environment variable `_R_CHECK_ALWAYS_LOG_VIGNETTE_OUTPUT_` is set to a true value.
- Use of SUPPORT_OPENMP from header 'Rconfig.h' is deprecated in favour of the standard OpenMP define `_OPENMP`. (This has been the recommendation in the manual for a while now.)
- The `make` macro `AWK`, which is long unused by **R** itself but recorded in file 'etc/Makeconf', is deprecated and will be removed in **R** 3.3.0.
- The C header file 'S.h' is no longer documented: its use should be replaced by 'R.h'.
- `kmeans(x, centers = <1-row>)` now works. (PR#16623)
- `Vectorize()` now checks for clashes in argument names. (PR#16577)
- `file.copy(overwrite = FALSE)` would signal a successful copy when none had taken place. (PR#16576)
- `ngettext()` now uses the same default domain as `gettext()`. (PR#14605)
- `array(.., dimnames = *)` now warns about non-`list` dimnames and, from **R** 3.3.0, will signal the same error for invalid dimnames as `matrix()` has always done.
- `addmargins()` now adds dimnames for the extended margins in all cases, as always documented.
- `heatmap()` evaluated its `add.expr` argument in the wrong environment. (PR#16583)
- `require()` etc. now give the correct entry of `lib.loc` in the warning about an old version of a package masking a newer required one.
- The internal deparser did not add parentheses when necessary, e.g. before `[]` or `[[]]`. (Reported by Lukas Stadler; additional fixes included as well.)
- `as.data.frame.vector(*, row.names=*)` no longer produces 'corrupted' data frames from row names of incorrect length, but rather warns about them. This will become an error.
- `url` connections with `method = "libcurl"` are destroyed properly. (PR#16681)
- `withCallingHandler()` now (again) handles warnings even during an S4 generic's argument evaluation. (PR#16111)
- `deparse(..., control = "quoteExpressions")` incorrectly quoted empty expressions. (PR#16686)
- `format()`ting datetime objects (`"POSIX[cl]?t"`) could segfault or recycle wrongly. (PR#16685)
- `plot.ts(<matrix>, las = 1)` now does use `las`.
- `saveRDS(*, compress = "gzip")` now works as documented. (PR#16653)
- (Windows only) The `Rgui` front end did not always initialize the console properly, and could cause **R** to crash. (PR#16998)
- `dummy.coef.lm()` now works in more cases, thanks to a proposal by Werner Stahel (PR#16665). In addition, it now works for multivariate linear models (`"mlm"`, `manova`) thanks to a proposal by Daniel Wollschlaeger.
- The `as.hclust()` method for `"dendrogram"`s failed often when there were ties in the heights.
- `reorder()` and `midcache.dendrogram()` now are non-recursive and hence applicable to somewhat deeply nested dendrograms, thanks to a proposal by Suharto Anggono in PR#16424.
- `cor.test()` now calculates very small p values more accurately (affecting the result only in extreme, not statistically relevant, cases). (PR#16704)
- `smooth(*, do.ends=TRUE)` did not always work correctly in **R** versions between 3.0.0 and 3.2.3.
- `pretty(D)` for date-time objects `D` now also works well if `range(D)` is (much) smaller than a second. In the case of only one unique value in `D`, the pretty range now is more symmetric around that value than previously. Similarly, `pretty(dt)` no longer returns a length 5 vector with duplicated entries for `Date` objects `dt` which span only a few days.
- The figures in help pages such as `?points` were accidentally damaged, and did not appear in **R** 3.2.3. (PR#16708)
- `available.packages()` sometimes deleted the wrong file when cleaning up temporary files. (PR#16712)
- The `X11()` device sometimes froze on Red Hat Enterprise Linux 6. It now waits for `MapNotify` events instead of `Expose` events, thanks to Siteshwar Vashisht. (PR#16497)
- `[dpqr]nbinom(*, size=Inf, mu=.)` now works as a limit case, for 'dpq' as the Poisson. (PR#16727) `pnbinom()` no longer loops infinitely in border cases.
- `approxfun(*, method="constant")` and hence `ecdf()`, which calls the former, now correctly "predict" `NaN` values as `NaN`.
- `summary.data.frame()` now displays `NA`s in `Date` columns in all cases. (PR#16709)


The ASA statement about the misuses of the p-value singles it out, but it is just as relevant to the use of most other statistical methods: context matters, no single statistical measure suffices, specific thresholds should be avoided, and reporting should not be done selectively. The latter problem is discussed mainly in relation to omitted inferences. We argue that the selective reporting of inferences is a serious enough problem in our current industrialized science even when no omission takes place. Many R tools are available to address it, but they are mainly used in very large problems and are grossly underused in areas where lack of replicability hits hard.

Source: xkcd

A few days ago the ASA released a statement titled “on p-values: context, process, and purpose”. It was a way for the ASA to address the concerns about the role of Statistics in the Reproducibility and Replicability (R&R) crisis. In the discussions about R&R the p-value has become a scapegoat, being such a widely used statistical method. The ASA statement made an effort to clarify various misinterpretations and to point at misuses of the p-value, but we fear that the result is a statement that might be read by the target readers as expressing very negative attitude towards the p-value. And indeed, just two days after the release of the ASA statement, a blog post titled “After 150 Years, the ASA Says No to p-values” was published (by Norman Matloff), even though the ASA (as far as we read it) did __not__ say “no to P-values” anywhere in the statement. Thankfully, other online reactions to the ASA statements, such as the article in Nature, and other posts in the blogosphere (see [1], [2], [3], [4], [5]), did not use an anti-p-value rhetoric.

In spite of its misinterpretations, the p-value served science well over the 20th century. Why? Because in some sense the p-value offers a first defense line against being fooled by randomness, separating signal from noise. It requires simpler (or fewer) models than those needed by other statistical tools. The p-value requires (in order to be valid) only a statistical model for the behavior of a statistic under the null hypothesis to hold. Even if a model of an alternative hypothesis is used for choosing a "good" statistic (which would be used for constructing a p-value with decent power for an alternative of interest), this alternative model does not have to be correct in order for the p-value to be valid and useful (i.e.: control type I error at the desired level while offering some power to detect a real effect). In contrast, other (wonderful, useful and complementary) statistical methods such as likelihood ratios, effect size estimation, confidence intervals, or Bayesian methods all need the assumed models to hold over a wider range of situations, not merely under the tested null. In the context of the "replicability crisis" in science, the type I error control of the p-value under the null hypothesis is an important property. And most importantly, the model needed for the calculation of the p-value may be guaranteed to hold under an appropriately designed and executed randomized experiment.

The p-value is a very valuable tool, but it should be complemented, not replaced, by confidence intervals and effect size estimators (where possible in the specific setting). The ends of a 95% confidence interval indicate the range of potential null hypotheses that could be rejected. An estimator of effect size (supported by an assessment of uncertainty) is crucial for interpretation and for assessing the scientific significance of the results.

While useful, all these types of inferences are affected by problems similar to those affecting p-values. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Should scientific discoveries be based on whether posterior odds pass a specific threshold? Does either of them measure the size of the effect? Finally, 95% confidence intervals or credible intervals offer no protection against selection when only those that do not cover 0 are selected into the abstract. The properties each method has on average for a single parameter (level, coverage or unbiasedness) will not necessarily hold, even on average, when a selection is made.

What, then, went wrong in the last decade or two? The change in the scale of the scientific work, brought about by high throughput experimentation methodologies, availability of large databases and ease of computation, a change that parallels the industrialization that production processes have already gone through. In Genomics, Proteomics, Brain Imaging and such, the number of potential discoveries scanned is enormous so the selection of the interesting ones for highlighting is a must. It has by now been recognized in these fields that merely “full reporting and transparency” (as recommended by ASA) is not enough, and methods should be used to control the effect of the unavoidable selection. Therefore, in those same areas, the p-value bright-line is not set at the traditional 5% level. Methods for adaptively setting it to directly control a variety of false discovery rates or other error rates are commonly used.
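
A small simulation illustrates why a bare 5% threshold fails at this scale and what an FDR-adjusted selection does instead (all numbers here are synthetic, for illustration only):

```r
# Screen many hypotheses, most of them null, and compare naive 5% selection
# with a Benjamini-Hochberg adjusted selection
set.seed(42)
p_null <- runif(10000)                                     # uniform under the null
p_real <- pnorm(rnorm(100, mean = 3), lower.tail = FALSE)  # 100 true effects
p <- c(p_null, p_real)

naive <- sum(p < 0.05)                           # hundreds selected, mostly false
bh    <- sum(p.adjust(p, method = "BH") < 0.05)  # far fewer, with FDR control
```

The naive count is dominated by false selections from the 10,000 nulls, while the BH-adjusted count keeps the expected proportion of false discoveries near 5%.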

Addressing the effect of selection on inference (be it when using p-value, or other methods) has been a very active research area; New strategies and sophisticated selective inference tools for testing, confidence intervals, and effect size estimation, in different setups are being offered. Much of it still remains outside the practitioners’ active toolset, even though many are already available in R, as we describe below. The appendix of this post contains a partial list of R packages that support simultaneous and selective inference.

In summary, when discussing the impact of statistical practices on R&R, the p-value should not be singled out nor its usage discouraged: it’s more likely the fault of selection, and not the p-values’ fault.

Extended support for classical and modern adjustment for Simultaneous and Selective Inference (also known as “multiple comparisons”) is available in R and in various R packages. Traditional concern in these areas has been on properties holding simultaneously for all inferences. More recent concerns are on properties holding on the average over the selected, addressed by varieties of false discovery rates, false coverage rates and conditional approaches. The following is a list of relevant R resources. If you have more, please mention them in the comments.

Every R installation offers functions (from the {stats} package) for dealing with multiple comparisons, such as:

- **p.adjust** – gets a set of p-values as input and returns p-values adjusted using one of several methods: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988), FDR by Benjamini & Hochberg (1995), and Benjamini & Yekutieli (2001).
- **pairwise.t.test**, **pairwise.wilcox.test**, and **pairwise.prop.test** – all rely on p.adjust and can calculate pairwise comparisons between group levels with corrections for multiple testing.
- **TukeyHSD** – creates a set of confidence intervals on the differences between the means of the levels of a factor with the specified family-wise probability of coverage. The intervals are based on the Studentized range statistic, Tukey's 'Honest Significant Difference' method.
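
For example, with a small family of p-values and the built-in PlantGrowth data:

```r
# p.adjust on a small family of p-values
p <- c(0.001, 0.01, 0.02, 0.04, 0.2)
p.adjust(p, method = "bonferroni")  # multiply by 5, capped at 1
p.adjust(p, method = "BH")          # step-up FDR adjustment

# TukeyHSD: simultaneous confidence intervals for all pairwise mean
# differences among the three groups of the built-in PlantGrowth data
fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit, conf.level = 0.95)
```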

Once we venture outside of the core R functions, we are introduced to a wealth of R packages and statistical procedures. What follows is a partial list (if you wish to contribute and extend this list, please leave your comment to this post):

- multcomp – Simultaneous tests and confidence intervals for general linear hypotheses in parametric models, including linear, generalized linear, linear mixed effects, and survival models. The package includes demos reproducing analyses presented in the book "Multiple Comparisons Using R" (Bretz, Hothorn, Westfall, 2010, CRC Press).
- coin (+RcmdrPlugin.coin)- Conditional inference procedures for the general independence problem including two-sample, K-sample (non-parametric ANOVA), correlation, censored, ordered and multivariate problems.
- SimComp – Simultaneous tests and confidence intervals are provided for one-way experimental designs with one or many normally distributed, primary response variables (endpoints).
- PMCMR – Calculate Pairwise Multiple Comparisons of Mean Rank Sums
- mratios – perform (simultaneous) inferences for ratios of linear combinations of coefficients in the general linear model.
- mutoss (and accompanying mutossGUI) – are designed to ease the application and comparison of multiple hypothesis testing procedures.
- nparcomp – compute nonparametric simultaneous confidence intervals for relative contrast effects in the unbalanced one way layout. Moreover, it computes simultaneous p-values.
- ANOM – The package takes results from multiple comparisons with the grand mean (obtained with ‘multcomp’, ‘SimComp’, ‘nparcomp’, or ‘MCPAN’) or corresponding simultaneous confidence intervals as input and produces ANOM decision charts that illustrate which group means deviate significantly from the grand mean.
- gMCP – Functions and a graphical user interface for graphical described multiple test procedures.
- MCPAN – Multiple contrast tests and simultaneous confidence intervals based on normal approximation.
- mcprofile – Calculation of signed root deviance profiles for linear combinations of parameters in a generalized linear model. Multiple tests and simultaneous confidence intervals are provided.
- factorplot – Calculate, print, summarize and plot pairwise differences from GLMs, GLHT or Multinomial Logit models. Relies on stats::p.adjust
- multcompView – Convert a logical vector or a vector of p-values or a correlation, difference, or distance matrix into a display identifying the pairs for which the differences were not significantly different. Designed for use in conjunction with the output of functions like TukeyHSD, dist{stats}, simint, simtest, csimint, csimtest{multcomp}, friedmanmc, kruskalmc{pgirmess}.
- discreteMTP – Multiple testing procedures for discrete test statistics, that use the known discrete null distribution of the p-values for simultaneous inference.
- someMTP – a collection of functions for Multiplicity Correction and Multiple Testing.
- hdi – Implementation of multiple approaches to perform inference in high-dimensional models
- ERP – Significance Analysis of Event-Related Potentials Data
- TukeyC – Perform the conventional Tukey test from aov and aovlist objects
- qvalue – offers a function which takes a list of p-values resulting from the simultaneous testing of many hypotheses and estimates their q-values and local FDR values. (reading this discussion thread might be helpful)
- fdrtool – Estimates both tail area-based false discovery rates (Fdr) as well as local false discovery rates (fdr) for a variety of null models (p-values, z-scores, correlation coefficients, t-scores).
- cp4p – Functions to check whether a vector of p-values respects the assumptions of FDR (false discovery rate) control procedures and to compute adjusted p-values.
- multtest – Non-parametric bootstrap and permutation resampling-based multiple testing procedures (including empirical Bayes methods) for controlling the family-wise error rate (FWER), generalized family-wise error rate (gFWER), tail probability of the proportion of false positives (TPPFP), and false discovery rate (FDR).
- selectiveInference – New tools for post-selection inference, for use with forward stepwise regression, least angle regression, the lasso, and the many means problem.
- PoSI (site) – Valid Post-Selection Inference for Linear LS Regression
- HWBH – a Shiny app for hierarchical weighted FDR testing of primary and secondary endpoints in medical research. By Benjamini Y & Cohen R, 2013.
- repfdr (@github) – estimation of Bayes and local Bayes false discovery rates for replicability analysis. Heller R, Yekutieli D, 2014.
- SelectiveCI – an R package for computing confidence intervals for selected parameters, as described in Asaf Weinstein, William Fithian & Yoav Benjamini, 2013 and in Yoav Benjamini, Daniel Yekutieli, 2005.
- Rvalue – software for FDR testing for replicability in primary and follow-up endpoints. Heller R, Bogomolov M, Benjamini Y, 2014, "Deciding whether follow-up studies have replicated findings in a preliminary large-scale 'omics' study", under review and available upon request from the first author. Bogomolov M, Heller R, 2013.

Other than simultaneous and selective inference, one should also mention that there are many R packages for reproducible research, i.e. the connecting of data, R code, analysis output, and interpretation, so that scholarship can be recreated, better understood, and verified; as well as packages for meta-analysis, i.e. the combining of findings from independent studies in order to make a more general claim.

- Statistics: P values are just the tip of the iceberg
- An estimate of the science-wise false discovery rate and application to the top medical literature
- On the scalability of statistical procedures: why the p-value bashers just don’t get it.


The paper got quite a lot of attention on Hacker News, Data Science Central, Simply Stats, Xi'an's blog, srown on Medium, and probably others. Share your thoughts in the comments.

Here is the abstract and table of content.

More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name “Data Science” for his envisioned field.

A recent and growing phenomenon is the emergence of “Data Science” programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M “Data Science Initiative” that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

**Contents**

**1 Today’s Data Science Moment**

**2 Data Science ‘versus’ Statistics**

2.1 The ‘Big Data’ Meme

2.2 The ‘Skills’ Meme

2.3 The ‘Jobs’ Meme

2.4 What here is real?

2.5 A Better Framework

**3 The Future of Data Analysis, 1962**

**4 The 50 years since FoDA**

4.1 Exhortations

4.2 Reification

**5 Breiman’s ‘Two Cultures’, 2001**

**6 The Predictive Culture’s Secret Sauce**

6.1 The Common Task Framework

6.2 Experience with CTF

6.3 The Secret Sauce

6.4 Required Skills

**7 Teaching of today’s consensus Data Science**

**8 The Full Scope of Data Science**

8.1 The Six Divisions

8.2 Discussion

8.3 Teaching of GDS

8.4 Research in GDS

8.4.1 Quantitative Programming Environments: R

8.4.2 Data Wrangling: Tidy Data

8.4.3 Research Presentation: Knitr

8.5 Discussion

**9 Science about Data Science**

9.1 Science-Wide Meta Analysis

9.2 Cross-Study Analysis

9.3 Cross-Workflow Analysis

9.4 Summary

**10 The Next 50 Years of Data Science**

10.1 Open Science takes over

10.2 Science as data

10.3 Scientific Data Analysis, tested Empirically

10.3.1 DJ Hand (2006)

10.3.2 Donoho and Jin (2008)

10.3.3 Zhao, Parmigiani, Huttenhower and Waldron (2014)

10.4 Data Science in 2065

**11 Conclusion**

Feature extraction tends to be one of the most important steps in machine learning and data science projects, so I decided to republish a short, related section from my intermediate book on how to analyze data with R. The 9th chapter is dedicated to traditional dimension reduction methods, such as Principal Component Analysis, Factor Analysis and Multidimensional Scaling; the introductory examples below focus on the latter.

Multidimensional Scaling (MDS) is a multivariate statistical technique first used in geography. The main goal of MDS is to plot multivariate data points in two dimensions, thus revealing the structure of the dataset by visualizing the relative distances of the observations. Multidimensional scaling is used in diverse fields such as attitude studies in psychology, sociology, and market research.

Although the `MASS` package provides non-metric methods via the `isoMDS` function, we will now concentrate on the classical, metric MDS, which is available by calling the `cmdscale` function bundled with the `stats` package. Both types of MDS take a distance matrix as the main argument, which can be created from any numeric tabular data by the `dist` function.
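
For example, a quick sketch on the built-in mtcars data (any numeric table works the same way):

```r
# Any numeric table can feed MDS: scale the columns, build a distance
# matrix with dist(), then project into two dimensions with cmdscale()
d   <- dist(scale(mtcars))  # Euclidean distances between the 32 cars
mds <- cmdscale(d, k = 2)   # a 32 x 2 coordinate matrix
head(mds)
```

Scaling first keeps variables with large units (like displacement) from dominating the distances.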

But before such more complex examples, let's see what MDS can offer us while working with an already existing distance matrix, like the built-in `eurodist` dataset:

```
> as.matrix(eurodist)[1:5, 1:5]
Athens Barcelona Brussels Calais Cherbourg
Athens 0 3313 2963 3175 3339
Barcelona 3313 0 1318 1326 1294
Brussels 2963 1318 0 204 583
Calais 3175 1326 204 0 460
Cherbourg 3339 1294 583 460 0
```

The above subset (the first 5 rows and columns) of the distance matrix represents the travel distances between 21 European cities, in kilometers. Running classical MDS on this example returns:

```
> (mds <- cmdscale(eurodist))
                      [,1]      [,2]
Athens           2290.2747  1798.803
Barcelona        -825.3828   546.811
Brussels           59.1833  -367.081
Calais            -82.8460  -429.915
Cherbourg        -352.4994  -290.908
Cologne           293.6896  -405.312
Copenhagen        681.9315 -1108.645
Geneva             -9.4234   240.406
Gibraltar       -2048.4491   642.459
Hamburg           561.1090  -773.369
Hook of Holland   164.9218  -549.367
Lisbon          -1935.0408    49.125
Lyons            -226.4232   187.088
Madrid          -1423.3537   305.875
Marseilles       -299.4987   388.807
Milan             260.8780   416.674
Munich            587.6757    81.182
Paris            -156.8363  -211.139
Rome              709.4133  1109.367
Stockholm         839.4459 -1836.791
Vienna            911.2305   205.930
```

These scores are very similar to two principal components (discussed in the previous, *Principal Component Analysis* section), such as running `prcomp(eurodist)$x[, 1:2]`. As a matter of fact, PCA can be considered the most basic MDS solution.
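To see why, here is a minimal sketch of the classical (Torgerson) scaling that `cmdscale` performs: square the distances, double-center the matrix, and scale the leading eigenvectors by the square roots of their eigenvalues (the variable names below are mine):

```r
# Classical MDS by hand: double-center the squared distances and
# take the leading eigenpairs of the resulting Gram matrix.
D2 <- as.matrix(eurodist)^2
n  <- nrow(D2)
J  <- diag(n) - matrix(1 / n, n, n)   # centering matrix
B  <- -0.5 * J %*% D2 %*% J           # double-centered Gram matrix
e  <- eigen(B)
mds_manual <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
# Agrees with cmdscale(eurodist) up to the sign of each axis:
max(abs(abs(mds_manual) - abs(cmdscale(eurodist))))
```

The per-axis sign ambiguity is exactly why the map below may come out flipped.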

Anyway, we have just transformed (reduced) the 21-dimensional space into 2 dimensions, which can be plotted very easily — unlike the original distance matrix with 21 rows and 21 columns:

`> plot(mds)`

Does it ring a bell? If not yet, the image below might be more helpful, where the following two lines of code render the city names instead of showing anonymous points:

```
> plot(mds, type = 'n')
> text(mds[, 1], mds[, 2], labels(eurodist))
```

Although the *y* axis seems to be flipped (which you can fix by multiplying the second argument of `text` by `-1`), we have just rendered a map of some European cities from the distance matrix, without any further geographical data. I hope you find this rather impressive! Please find more data visualization tricks and methods in the 13th, *Data Around Us* chapter, from which you can learn, for example, how to plot the above results over a satellite map downloaded from online service providers. For now, I will only focus on how to render this plot with the new version of `ggplot2` to avoid overlaps in the city names, and on suppressing the not-that-useful *x* and *y* axis labels and ticks:

```
> library(ggplot2)
> ggplot(as.data.frame(mds), aes(V1, -V2, label = rownames(mds))) +
+ geom_text(check_overlap = TRUE) + theme_minimal() + xlab('') + ylab('') +
+ scale_y_continuous(breaks = NULL) + scale_x_continuous(breaks = NULL)
```

But let’s get back to the original topic and see how to apply MDS to non-geographic data that was not prepared to be a distance matrix. We will use the `mtcars` dataset in the following example, resulting in a plot with no axis elements:

```
> mds <- cmdscale(dist(mtcars))
> plot(mds, type = 'n', axes = FALSE, xlab = '', ylab = '')
> text(mds[, 1], mds[, 2], rownames(mds))
```

The above plot shows the 32 cars of the original dataset scattered in a two-dimensional space. The distances between the elements were computed by MDS, which took into account all 11 original numeric variables, and this makes it very easy to identify the similar and very different car types. We will cover these topics in more detail in the next chapter, which is dedicated to *Classification and Clustering*.
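One caveat worth adding (my note, not from the book): `dist(mtcars)` is dominated by the columns measured on the largest scales, such as `disp` and `hp`. Standardizing the columns first weights all 11 variables equally and often yields a quite different map:

```r
# Same MDS map, but on standardized variables so that no single
# large-scale column (e.g. disp, hp) dominates the distances.
mds_scaled <- cmdscale(dist(scale(mtcars)))
plot(mds_scaled, type = 'n', axes = FALSE, xlab = '', ylab = '')
text(mds_scaled[, 1], mds_scaled[, 2], rownames(mtcars))
```

Whether to scale depends on whether the original measurement units are meaningful for your notion of similarity.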

*This article first appeared in the “Mastering Data Analysis with R” book, and is now published with the permission of Packt Publishing.*

As highlighted by David Smith, this release makes a few small improvements and bug fixes to R, including:

- Improved support for users of the **Windows OS** in time zones, OS version identification, FTP connections, and printing (in the GUI).
- Performance improvements and more support for long vectors in some functions, including `which.max`.
- Improved accuracy for the Chi-Square distribution functions in some extreme cases.

If you are using **Windows** you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R
```

Running `updateR()` will detect if there is a new R version available and, if so, will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package.

*I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.*

- Some recently-added Windows time zone names have been added to the conversion table used to convert these to Olson names. (Including those relating to changes for Russia in Oct 2014, as in PR#16503.)
- (Windows) Compatibility information has been added to the manifests for ‘Rgui.exe’, ‘Rterm.exe’ and ‘Rscript.exe’. This should allow `win.version()` and `Sys.info()` to report the actual Windows version up to Windows 10.
- Windows `"wininet"` FTP first tries EPSV / PASV mode rather than only using active mode (reported by Dan Tenenbaum).
- `which.min(x)` and `which.max(x)` may be much faster for logical and integer `x` and now also work for long vectors.
- The ‘emulation’ part of `tools::texi2dvi()` has been somewhat enhanced, including supporting `quiet = TRUE`. It can be selected by `texi2dvi = "emulation"`. (Windows) MiKTeX removed its `texi2dvi.exe` command in Sept 2015: `tools::texi2dvi()` tries `texify.exe` if it is not found.
- (Windows only) Shortcuts for printing and saving have been added to menus in `Rgui.exe`. (Request of PR#16572.)
- `loess(..., iterTrace = TRUE)` now provides diagnostics for robustness iterations, and the `print()` method for `summary(<loess>)` shows slightly more.
- The included version of PCRE has been updated to 8.38, a bug-fix release.
- `View()` now displays nested data frames in a more friendly way. (Request with patch in PR#15915.)
- `regexpr(pat, x, perl = TRUE)` with Python-style named capture did not work correctly when `x` contained `NA` strings. (PR#16484)
- The description of dataset `ToothGrowth` has been improved/corrected. (PR#15953)
- `model.tables(type = "means")` and hence `TukeyHSD()` now support `"aov"` fits without an intercept term. (PR#16437)
- `close()` now reports the status of a `pipe()` connection opened with an explicit `open` argument. (PR#16481)
- Coercing a list without names to a data frame is faster if the elements are very long. (PR#16467)
- (Unix-only) Under some rare circumstances piping the output from `Rscript` or `R -f` could result in attempting to close the input file twice, possibly crashing the process. (PR#16500)
- (Windows) `Sys.info()` was out of step with `win.version()` and did not report Windows 8.
- `topenv(baseenv())` returns `baseenv()` again as in **R** 3.1.0 and earlier. This also fixes `compilerJIT(3)` when used in ‘.Rprofile’.
- `detach()`ing the methods package keeps `.isMethodsDispatchOn()` true, as long as the methods namespace is not unloaded.
- Removed some spurious warnings from `configure` about the preprocessor not finding header files. (PR#15989)
- `rchisq(*, df = 0, ncp = 0)` now returns `0` instead of `NaN`, and `dchisq(*, df = 0, ncp = *)` also no longer returns `NaN` in limit cases (where the limit is unique). (PR#16521)
- `pchisq(*, df = 0, ncp > 0, log.p = TRUE)` no longer underflows (for ncp > ~60).
- `nchar(x, "w")` returned -1 for characters it did not know about (e.g. zero-width spaces): it now assumes 1. It now knows about most zero-width characters and a few more double-width characters.
- Help for `which.min()` is now more precise about behavior with logical arguments. (PR#16532)
- The print width of character strings marked as `"latin1"` or `"bytes"` was in some cases computed incorrectly.
- `abbreviate()` did not give names to the return value if `minlength` was zero, unlike when it was positive.
- (Windows only) `dir.create()` did not always warn when it failed to create a directory. (PR#16537)
- When operating in a non-UTF-8 multibyte locale (e.g. an East Asian locale on Windows), `grep()` and related functions did not handle UTF-8 strings properly. (PR#16264)
- `read.dcf()` sometimes misread lines longer than 8191 characters. (Reported by Hervé Pagès with a patch.)
- `within(df, ..)` no longer drops columns whose names start with a `"."`.
- The built-in `HTTP` server converted the entire `Content-Type` to lowercase, including parameters, which can cause issues for multi-part form boundaries. (PR#16541)
- Modifying slots of S4 objects could fail when the methods package was not attached. (PR#16545)
- `splineDesign(*, outer.ok = TRUE)` (splines) is better now (PR#16549), and `interpSpline()` now allows `sparse = TRUE` for speedup with non-small sizes.
- If the expression in the traceback was too long, `traceback()` did not report the source line number. (Patch by Kirill Müller.)
- The browser did not truncate the display of the function when exiting with `options("deparse.max.lines")` set. (PR#16581)
- When `bs(*, Boundary.knots = )` had boundary knots inside the data range, extrapolation was somewhat off. (Patch by Trevor Hastie.)
- `var()` and hence `sd()` warn about `factor` arguments, which are deprecated now. (PR#16564)
- `loess(*, weights = *)` stored wrong weights and hence gave slightly wrong predictions for `newdata`. (PR#16587)
- `aperm(a, *)` now preserves `names(dim(a))`.
- `poly(x, ..)` now works when either `raw = TRUE` or `coef` is specified. (PR#16597)
- `data(package = *)` is more careful in determining the path.
- `prettyNum(*, decimal.mark, big.mark)`: fixed a bug introduced when fixing PR#16411.
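As a quick illustration of one of the items above, `aperm()` in R 3.2.3 and later carries the names of the `dim` attribute through the permutation (a minimal sketch):

```r
# names(dim(a)) survive the permutation in R >= 3.2.3
a <- structure(1:24, dim = c(x = 2, y = 3, z = 4))
names(dim(aperm(a, c(3, 1, 2))))  # "z" "x" "y"
```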

- The included configuration code for `libintl` has been updated to that from `gettext` version 0.19.5.1; this should only affect how an external library is detected (and the only known instance is under OpenBSD). (Wish of PR#16464.)
- `configure` has a new argument `--disable-java` to disable the checks for Java.
- The `configure` default for `MAIN_LDFLAGS` has been changed for the FreeBSD, NetBSD and Hurd OSes to one more likely to work with compilers other than `gcc` (FreeBSD 10 defaults to `clang`).
- `configure` now supports the OpenMP flags `-fopenmp=libomp` (clang) and `-qopenmp` (Intel C).
- Various macros can be set to override the default behaviour of `configure` when detecting OpenMP: see file ‘config.site’.
- Source installation on Windows has been modified to allow for MiKTeX installations without `texi2dvi.exe`. See file ‘MkRules.dist’.

One of the cornerstones of the R system for statistical computing is the multitude of packages contributed by numerous package authors. This amount of packages makes an extremely broad range of statistical techniques and other quantitative methods freely available. Thus far, no empirical study has investigated psychological factors that drive authors to participate in the R project. This article presents a study of R package authors, collecting data on different types of participation (number of packages, participation in mailing lists, participation in conferences), three psychological scales (types of motivation, psychological values, and work design characteristics), and various socio-demographic factors. The data are analyzed using item response models and subsequent generalized linear models, showing that the most important determinants for participation are a hybrid form of motivation and the social characteristics of the work design. Other factors are found to have less impact or influence only specific aspects of participation.

R developers, statisticians, and psychologists from Harvard University, the University of Vienna, WU Vienna University of Economics, and the University of Innsbruck empirically studied psychosocial drivers of participation of R package authors. Through an online survey, they collected data from 1,448 package authors. The questionnaire included psychometric scales (types of motivation, psychological values, work design), sociodemographic variables related to the work on R, and three participation measures (number of packages, participation in mailing lists, participation in conferences).

The data were analyzed using item response models and subsequent generalized linear models (logistic regressions, negative-binomial regression) with SIMEX-corrected parameters.
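For readers unfamiliar with that toolkit, here is a hedged sketch of a negative-binomial regression of the kind mentioned, fit on simulated data (the variable names and effect sizes are invented for illustration, not taken from the study):

```r
# Hypothetical sketch: negative-binomial regression of a count outcome
# (number of packages) on a motivation score, using simulated data.
library(MASS)  # glm.nb()
set.seed(42)
d <- data.frame(motivation = rnorm(300))
d$packages <- rnbinom(300, size = 2, mu = exp(0.5 + 0.8 * d$motivation))
fit <- glm.nb(packages ~ motivation, data = d)
coef(fit)  # estimates should land near the simulated 0.5 and 0.8
```

The actual study additionally corrects the covariates for measurement error via SIMEX, which this sketch omits.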

The analysis reveals that the most important determinants for participation are a hybrid form of motivation and the social characteristics of the work design. Hybrid motivation acknowledges that motivation is a complex continuum of intrinsic, extrinsic, and internalized extrinsic motives.

Motives evolve over time, as task characteristics shift from need-driven problem solving to mundane maintenance tasks within the R community.

For instance, motivation can evolve from pure “fun coding” towards a personal commitment with associated higher responsibilities within the community. The community itself provides a social work environment with high degrees of interaction, two facets of which are strong motivators. First, interaction with persons perceived as important increases one’s own reputation (self-esteem, future job opportunities, etc.). Second, interaction with like-minded persons (i.e., those interested in solving statistical problems) creates opportunities to express oneself and enjoy social inclusion.

The findings do not substantiate the commonly held perception that people develop packages out of purely altruistic motives. It is also notable that in most cases package development is undertaken as part of an individual’s research, which is paid by an (academic) institution, rather than uncompensated developments that cut into leisure time.

Full paper (behind PNAS’s paywall for now) is available here:

Mair, P., Hofmann, E., Gruber, K., Hatzinger, R., Zeileis, A., and Hornik, K. (2015). Motivation, values, and work design as drivers of participation in the R open source project for statistical computing. Proceedings of the National Academy of Sciences of the United States of America, 112(48), 14788-14792.


Here is some R code to demonstrate this speed improvement:

```
# If you are missing any of these - they should be installed:
install.packages("dendextend")
install.packages("dendextendRcpp")
install.packages("microbenchmark")

# Getting labels from dendextendRcpp
labelsRcpp% dist %>% hclust %>% as.dendrogram
labels(dend)
```

And here are the results:

```
> microbenchmark(labels_3.2.1(dend), labels_3.2.2(dend), labelsRcpp(dend))
Unit: milliseconds
               expr        min         lq     median         uq       max neval
 labels_3.2.1(dend) 186.522968 189.395378 195.684164 208.328365 321.98368   100
 labels_3.2.2(dend)   2.604766   2.826776   2.891728   3.006792  21.24127   100
   labelsRcpp(dend)   3.825401   3.946904   3.999817   4.179552  11.22088   100

> microbenchmark(labels_3.2.2(dend), order.dendrogram(dend))
Unit: microseconds
                   expr      min        lq   median        uq      max neval
     labels_3.2.2(dend) 2520.218 2596.0880 2678.677 2885.2890 9572.460   100
 order.dendrogram(dend)  665.191  712.2235  954.951  996.1055 2268.812   100
```

As we can see, the new labels function (in R 3.2.2) is about 70 times faster than the older version (from R 3.2.1). When you only want something like the number of labels, using `length` on `order.dendrogram` will still be about 3 times faster than using `labels`.

This improvement is expected to speed up various functions in the dendextend R package (a package for visualizing, adjusting, and comparing dendrograms, which heavily relies on labels.dendrogram). We expect even better speedups for larger trees.
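When only a leaf count is needed, that `order.dendrogram` shortcut looks like this (a small sketch using a built-in dataset):

```r
# order.dendrogram() returns the integer permutation of the leaves,
# so its length equals the number of labels but is cheaper to obtain.
dend <- as.dendrogram(hclust(dist(USArrests[1:10, ])))
length(labels(dend))            # 10
length(order.dendrogram(dend))  # 10
```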

I personally found two things particularly interesting in this release:

- `setInternet2(TRUE)` is now the default for Windows (which will save people from getting “Error in file(con, “r”)” when using the installr package)
- The dendrogram method of `labels()` is much more efficient for large dendrograms, since it now uses `rapply()`. This is expected to speed up various functions in the dendextend R package (a package for visualizing, adjusting, and comparing dendrograms, which heavily relies on labels.dendrogram).

Also, David Smith (from Revolution/Microsoft) highlighted in his post several of the updates in R 3.2.2 he found interesting, mentioning how the new default for accessing the web with R will rely on the HTTPS protocol, and the improved accuracy in the extreme tails of the t and hypergeometric distributions.

If you are using **Windows** you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

```
install.packages("installr") # install
setInternet2(TRUE)
installr::updateR() # updating R
```

Running `updateR()` will detect if there is a new R version available and, if so, will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package.

*I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.*

- It is now easier to use secure downloads from https:// URLs on builds which support them: no longer do non-default options need to be selected to do so. In particular, packages can be installed from repositories which offer https:// URLs, and those listed by `setRepositories()` now do so (for some of their mirrors).
- Support for https:// URLs is available on Windows, and on other platforms if support for `libcurl` was compiled in and if that supports the `https` protocol (system installations can be expected to do so). So https:// support can be expected except on rather old OSes (an example being OS X ‘Snow Leopard’, where a non-system version of `libcurl` can be used).
- (Windows only) The default method for accessing URLs *via* `download.file()` and `url()` has been changed to be `"wininet"` using Windows API calls. This changes the way proxies need to be set and security settings made: there have been some reports of sites being inaccessible under the new default method (but the previous methods remain available).

- `cmdscale()` gets new option `list.` for increased flexibility when a list should be returned.
- `configure` now supports `texinfo` version 6.0, which (unlike the change from 4.x to 5.0) is a minor update. (Wish of PR#16456.)
- (Non-Windows only) `download.file()` with default `method = "auto"` now chooses `"libcurl"` if that is available and a https:// or ftps:// URL is used.
- (Windows only) `setInternet2(TRUE)` is now the default. The command-line option `--internet2` and environment variable R_WIN_INTERNET2 are now ignored. Thus by default the `"internal"` method for `download.file()` and `url()` uses the `"wininet"` method: to revert to the previous default use `setInternet2(FALSE)`. This means that https:// URLs can be read by default by `download.file()` (they have been readable by `file()` and `url()` since **R** 3.2.0). There are implications for how proxies need to be set (see `?download.file`): also, `cacheOK = FALSE` is not supported.
- `chooseCRANmirror()` and `chooseBioCmirror()` now offer HTTPS mirrors in preference to HTTP mirrors. This changes the interpretation of their `ind` arguments: see their help pages.
- `capture.output()` gets optional arguments `type` and `split` to pass to `sink()`, and hence can be used to capture messages.
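The last item can be sketched in one line (assuming R >= 3.2.2):

```r
# type = "message" redirects the message stream into the captured output
capture.output(message("hello"), type = "message")  # "hello"
```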

- Header ‘Rconfig.h’ now defines `HAVE_ALLOCA_H` if the platform has the ‘alloca.h’ header (it is needed to define `alloca` on Solaris and AIX, at least: see ‘Writing R Extensions’ for how to use it).

- The `libtool` script generated by `configure` has been modified to support FreeBSD >= 10 (PR#16410).

- The HTML help page links to demo code failed due to a change in **R** 3.2.0. (PR#16432)
- If the `na.action` argument was used in `model.frame()`, the original data could be modified. (PR#16436)
- `getGraphicsEvent()` could cause a crash if a graphics window was closed while it was in use. (PR#16438)
- `matrix(x, nr, nc, byrow = TRUE)` failed if `x` was an object of type `"expression"`.
- `strptime()` could overflow the allocated storage on the C stack when the timezone had a non-standard format much longer than the standard formats. (Part of PR#16328.)
- `options(OutDec = s)` now signals a warning (which will become an error in the future) when `s` is not a string with exactly one character, as that has been a documented requirement.
- `prettyNum()` gains a new option `input.d.mark` which, together with other changes, e.g., the default for `decimal.mark`, fixes some `format()`ting variants with non-default `getOption("OutDec")` such as in PR#16411.
- `download.packages()` failed for `type` equal to either `"both"` or `"binary"`. (Reported by Dan Tenenbaum.)
- The `dendrogram` method of `labels()` is much more efficient for large dendrograms, now using `rapply()`. (Comment #15 of PR#15215)
- The `"port"` algorithm of `nls()` could give spurious errors. (Reported by Radford Neal.)
- Reference classes that inherited from reference classes in another package could invalidate methods of the inherited class. Fixing this requires adding the ability for methods to be “external”, with the object supplied explicitly as the first argument, named `.self`. See “Inter-Package Superclasses” in the documentation.
- `readBin()` could fail on the SPARC architecture due to alignment issues. (Reported by Radford Neal.)
- `qt(*, df = Inf, ncp = .)` now uses the natural `qnorm()` limit instead of returning `NaN`. (PR#16475)
- Auto-printing of S3 and S4 values now searches for `print()` in the base namespace and `show()` in the methods namespace instead of searching the global environment.
- `polym()` gains a `coefs = NULL` argument and returns class `"poly"` just like `poly()`, which gets a new `simple = FALSE` option. They now lead to correct `predict()`ions, e.g., on subsets of the original data.
- `rhyper(nn, <large>)` now works correctly. (PR#16489)
- `ttkimage()` did not (and could not) work so was removed. Ditto for `tkimage.cget()` and `tkimage.configure()`. Added two Ttk widgets and missing subcommands for Tk’s `image` command: `ttkscale()`, `ttkspinbox()`, `tkimage.delete()`, `tkimage.height()`, `tkimage.inuse()`, `tkimage.type()`, `tkimage.types()`, `tkimage.width()`. (PR#15372, PR#16450)
- `getClass("foo")` now also returns a class definition when it is found in the cache more than once.

Here are my slides for the intended talk:

p.s.: Yes – this presentation is very similar, although not identical, to the one I gave at useR2015. For example, I mention the new bioinformatics paper on dendextend.
