Data Driven Cheatsheets

Guest post by Jonathan Sidi

Cheatsheets are currently built and used exclusively as teaching tools. We want to try to change this and produce a cheatsheet that not only gives a roadmap for building a known product, but is also built as a function, so users can feed their own data into it and make the cheatsheet more personalized. This keeps the versatility of a consistent format that people can share with each other, while adding the value of conveying a message through data-driven visual changes.

Example

ggplot2 themes

The ggplot2 theme object is an amazing object: with it you can specify nearly any part of the plot that is not conditional on the data. What sets the theme object apart is that its structure is consistent while the values in it change. In addition, changing a theme goes through a single function, which also has a consistent call. The recurring challenge for users is to remember all the options that can be used in the theme call (there are approximately 220 unique options to calibrate at last count), or to bookmark the help page for the theme and remember how they deciphered it last time.
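To see what this consistency looks like, here is a short snippet (my illustration, not from the original post) that peeks at the structure of the current theme:

library(ggplot2)
th <- theme_get()   # the active theme: a named list of elements
length(th)          # number of top-level theme elements (each with further options)
str(th$axis.title)  # one element: an element_text() with its own settings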

This becomes a problem when you want to pass all the information in your theme to someone who does not know which values it sets, along with instructions that let them recreate it without needing to open any web pages.

In writing the library ggedit we tried to make it easy to edit your theme, so you don't have to know too much about ggplots to make a large number of changes at once (for a quick clip see here). We had to make it easy to track those changes for people who are not versed in R, and plot.theme() was the outcome. In short, think of the theme as a lot of small images that are combined to create a single portrait.
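As a minimal sketch (assuming ggedit is installed and loaded), the S3 method means a theme object can be plotted directly:

library(ggplot2)
library(ggedit)
plot(theme_bw())  # dispatches to ggedit's plot.theme(), drawing the theme's elements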

Continue reading “Data Driven Cheatsheets”

ggedit 0.0.2: a GUI for advanced editing of ggplot2 objects

Guest post by Jonathan Sidi, Metrum Research Group

Last week the updated version of ggedit was presented at RStudio::conf2017. First, a BIG thank you to the whole RStudio team for a great conference and for being so awesome in answering the insane amount of questions I had (sorry!). For a quick intro to the package see the previous post.

To install the package:

devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")

Highlights of the updated version:

  • verbose script handling during updating in the gadget (see video below)
  • verbose script output for updated layers and theme to parse and evaluate in console or editor
  • colourpicker control for both single colours/fills and palettes
  • output for scale objects, e.g. scale_*_gradient, scale_*_gradientn and scale_*_manual
  • verbose script output for those scales, to parse and evaluate in the console or editor (a sketch follows this list)
  • input plot objects can have the data in the base object, in the layer object, or split between the two:
    • ggplot(data=iris,aes(x=Sepal.Width,y=Sepal.Length,colour=Species))+geom_point()
    • ggplot(data=iris,aes(x=Sepal.Width,y=Sepal.Length))+geom_point(aes(colour=Species))
    • ggplot()+geom_point(data=iris,aes(x=Sepal.Width,y=Sepal.Length,colour=Species))
  • plot.theme(): S3 method for class ‘theme’
    • visualizing theme objects in single output
    • visual comparison of two themes objects in single output
    • will be expanded upon in upcoming post
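As a hedged sketch of the workflow (object names here are mine, and the gadget's actual verbose output will differ):

library(ggplot2)
library(ggedit)

p <- ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length, colour = Species)) +
  geom_point()

p2 <- ggedit(p)  # opens the gadget; returns updated objects and their verbose scripts

# the kind of call a verbose scale script can be parsed and evaluated into:
p + scale_colour_manual(values = c("steelblue", "tomato", "goldenrod"))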

Continue reading “ggedit 0.0.2: a GUI for advanced editing of ggplot2 objects”

ggedit – interactive ggplot aesthetic and theme editor

Guest post by Jonathan Sidi, Metrum Research Group

ggplot2 has become the standard of plotting in R for many users. New users, however, may find the learning curve steep at first, and more experienced users may find it challenging to keep track of all the options (especially in the theme!).

ggedit is a package that helps users bridge the gap between making a plot and getting all of those pesky plot aesthetics just right, all while keeping everything portable for further research and collaboration.

ggedit is powered by a Shiny gadget where the user inputs a ggplot plot object or a list of ggplot objects. You can run ggedit directly from the console or from the Addins menu within RStudio.
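For instance (a minimal sketch, not from the original post), passing a list of plots edits them all in one gadget session:

library(ggplot2)
library(ggedit)
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
ggedit(list(p1, p2))  # edit both plots in a single session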

Continue reading “ggedit – interactive ggplot aesthetic and theme editor”

Using 2D Contour Plots within {ggplot2} to Visualize Relationships between Three Variables

Guest post by John Bellettiere, Vincent Berardi, Santiago Estrada

The Goal

To visually explore relationships between two related variables and an outcome using contour plots. We use the contour function in base R to produce contour plots that are well suited for initial investigations into three-dimensional data. We then develop visualizations using ggplot2 to gain more control over the graphical output. We also describe several data transformations needed to accomplish this visual exploration.
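A hedged sketch of that pipeline (toy data of my own, not the authors' code): estimate a density surface for two related variables, draw it with contour in base R, then with ggplot2:

library(MASS)     # kde2d(): 2D kernel density estimate
library(ggplot2)

set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)                   # two related variables
dens <- kde2d(x, y, n = 50)            # density surface evaluated on a 50 x 50 grid
contour(dens, xlab = "x", ylab = "y")  # base R contour plot

# the same idea in ggplot2, with more control over the output
ggplot(data.frame(x, y), aes(x, y)) + geom_density_2d()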

Continue reading “Using 2D Contour Plots within {ggplot2} to Visualize Relationships between Three Variables”

heatmaply: interactive heat maps (with R)

I am pleased to announce heatmaply, my new R package for generating interactive heat maps, based on the plotly R package.

tl;dr

By running the following 3 lines of code:

install.packages("heatmaply")
library(heatmaply)  # heatmaply loads plotly, which provides layout()
heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))  # colour 2 column / 3 row clusters; widen margins for long labels

You will get this output in your browser (or RStudio console):

Continue reading “heatmaply: interactive heat maps (with R)”

Multidimensional Scaling with R (from “Mastering Data Analysis with R”)

Guest post by Gergely Daróczi. If you like this content, you can buy the full 396-page e-book for 5 USD until January 8, 2016 as part of Packt's "$5 Skill Up Campaign" at https://bit.ly/mastering-R

Feature extraction tends to be one of the most important steps in machine learning and data science projects, so I decided to republish a related short section from my intermediate book on how to analyze data with R. The 9th chapter is dedicated to traditional dimension-reduction methods, such as Principal Component Analysis, Factor Analysis and Multidimensional Scaling; the introductory examples below focus on the latter.

Multidimensional Scaling (MDS) is a multivariate statistical technique first used in geography. The main goal of MDS is to plot multivariate data points in two dimensions, thus revealing the structure of the dataset by visualizing the relative distances of the observations. Multidimensional scaling is used in diverse fields such as attitude studies in psychology, sociology, and market research.

Although the MASS package provides non-metric methods via the isoMDS function, we will now concentrate on the classical, metric MDS, which is available by calling the cmdscale function bundled with the stats package. Both types of MDS take a distance matrix as the main argument, which can be created from any numeric tabular data by the dist function.
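For orientation, a minimal sketch of that pipeline on a generic numeric dataset (my illustration, not from the book):

d <- dist(scale(mtcars))    # Euclidean distance matrix from standardized columns
mds2 <- cmdscale(d)         # classical (metric) MDS; 2 dimensions by default
library(MASS)
nmds <- isoMDS(d)           # non-metric MDS; returns $points and $stress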

But before such more complex examples, let’s see what MDS can offer for us while working with an already existing distance matrix, like the built-in eurodist dataset:

> as.matrix(eurodist)[1:5, 1:5]
          Athens Barcelona Brussels Calais Cherbourg
Athens         0      3313     2963   3175      3339
Barcelona   3313         0     1318   1326      1294
Brussels    2963      1318        0    204       583
Calais      3175      1326      204      0       460
Cherbourg   3339      1294      583    460         0

The above subset (the first five rows and columns) of the distance matrix represents the travel distances between 21 European cities, in kilometers. Running classical MDS on this example returns:

> (mds <- cmdscale(eurodist))
                      [,1]      [,2]
Athens           2290.2747  1798.803
Barcelona        -825.3828   546.811
Brussels           59.1833  -367.081
Calais            -82.8460  -429.915
Cherbourg        -352.4994  -290.908
Cologne           293.6896  -405.312
Copenhagen        681.9315 -1108.645
Geneva             -9.4234   240.406
Gibraltar       -2048.4491   642.459
Hamburg           561.1090  -773.369
Hook of Holland   164.9218  -549.367
Lisbon          -1935.0408    49.125
Lyons            -226.4232   187.088
Madrid          -1423.3537   305.875
Marseilles       -299.4987   388.807
Milan             260.8780   416.674
Munich            587.6757    81.182
Paris            -156.8363  -211.139
Rome              709.4133  1109.367
Stockholm         839.4459 -1836.791
Vienna            911.2305   205.930

These scores are very similar to the first two principal components (discussed in the previous, Principal Component Analysis section), such as those returned by prcomp(eurodist)$x[, 1:2]. As a matter of fact, PCA can be considered the most basic MDS solution.

Anyway, we have just transformed (reduced) the 21-dimensional space into 2 dimensions, which can be plotted very easily — unlike the original distance matrix with 21 rows and 21 columns:

> plot(mds)

[Figure: the 21 cities plotted on the two MDS dimensions]

Does it ring a bell? If not yet, the image below might be more helpful, as the following two lines of code render the city names instead of showing anonymous points:

> plot(mds, type = 'n')
> text(mds[, 1], mds[, 2], labels(eurodist))
[Figure: the same MDS plot with city names as labels]

Continue reading “Multidimensional Scaling with R (from “Mastering Data Analysis with R”)”

Slides from my JSM 2015 talk on dendextend

If you happen to be at the JSM 2015 conference this week, then this Monday, at 2pm, I will give a talk on the dendextend R package (in the session "Advances in Graphical Frameworks and Methods Part 1") – feel free to drop by and say hi.

Here are my slides for the intended talk:

[Slides embedded in the original post]

p.s.: Yes – this presentation is very similar, although not identical, to the one I gave at useR!2015. For example, I mention the new Bioinformatics paper on dendextend.

dendextend: a package for visualizing, adjusting, and comparing dendrograms (based on a paper from “bioinformatics”)

This post on the dendextend package is based on my recent paper in the journal Bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released as CC-BY, I am permitted (and delighted) to republish it here in full:

abstract

Summary: dendextend is an R package for creating and comparing visually appealing tree diagrams. dendextend provides utility functions for manipulating dendrogram objects (their color, shape, and content) as well as several advanced methods for comparing trees to one another (both statistically and visually). As such, dendextend offers a flexible framework for enhancing R’s rich ecosystem of packages for performing hierarchical clustering of items.

Availability: The dendextend R package (including detailed introductory vignettes) is available under the GPL-2 Open Source license and is freely available to download from CRAN at https://cran.r-project.org/package=dendextend

Contact: [email protected]

Continue reading “dendextend: a package for visualizing, adjusting, and comparing dendrograms (based on a paper from “bioinformatics”)”

dendextend version 1.0.1 + useR!2015 presentation

When using the dendextend package in your work, please cite it using:

Tal Galili (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. doi:10.1093/bioinformatics/btv428

My R package dendextend (version 1.0.1) is now on CRAN!

The dendextend package offers a set of functions for extending dendrogram objects in R, letting you visualize and compare trees of hierarchical clustering. With it you can (1) adjust a tree's graphical parameters (the color, size, type, etc. of its branches, nodes and labels) and (2) visually and statistically compare different dendrograms to one another.
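A minimal sketch of those two capabilities (my illustration, using functions from dendextend's documented API):

library(dendextend)
hc1 <- hclust(dist(USArrests[1:10, ]))                     # complete linkage (default)
hc2 <- hclust(dist(USArrests[1:10, ]), method = "single")  # single linkage
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)

# (1) adjust a tree's graphical parameters
dend1 <- dend1 %>% set("branches_k_color", k = 3) %>% set("labels_cex", 0.8)

# (2) compare two dendrograms, visually and statistically
tanglegram(dend1, dend2)
cor_cophenetic(dend1, dend2)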

The previous release of dendextend (0.18.3) was half a year ago, and this version includes many new features and functions.

To help you discover how dendextend can solve your dendrogram/hierarchical-clustering issues, you may consult the package's introductory vignettes.

Here is an example figure from the first vignette (analyzing the Iris dataset):

[Figure: heatmap of the Iris dataset with dendrograms]


This week, at useR!2015, I will give a talk on the package. This will offer a quick example, and a step-by-step example of some of the most basic/useful functions of the package. Here are the slides:

[Slides embedded in the original post]

Lastly, I would like to mention the new d3heatmap package for interactive heat maps. This package is by Joe Cheng from RStudio, and integrates well with dendrograms in general and dendextend in particular (thanks to some lovely GitHub-commit discussions between Joe and me). You are invited to see lively examples of the package in the post at the RStudio blog. Here is just one quick example:

d3heatmap(nba_players, colors = "Blues", scale = "col", dendrogram = "row", k_row = 3)

[Figure: interactive d3heatmap of NBA players]

Israel’s 2015 election polls’ analysis with Shiny + ggplot2

(This is a guest post by my friend Yoni Sidi, a PhD candidate in statistics at the Hebrew University)

Background

The Israeli elections are coming up this Tuesday, 17/3/2015 (i.e. tomorrow!). They are a bit more complicated than your average US presidential race. The elections in Israel are based on nationwide proportional representation: the electoral threshold is 3.25%, and the number of seats (or mandates) a party wins, out of a total of 120, is proportional to the number of votes it receives, so the threshold roughly translates to at least four mandates. The Israeli system is a multi-party system based on coalition governments. Multi-party is putting it mildly: there are 11 parties that have a chance (and are expected) to pass the mandate threshold.
There are two major parties, Hamachane Hazioni (left wing) and the Likud (right wing), each hoping to garner between 16% and 25% of the votes, i.e. 20-30 mandates. The main winners, though, are the medium-sized parties that recommend to the President who they think has the best chance to construct the next government, so yes, there is a good possibility that the winner of the general election will not be the one constructing the coalition. This makes the actual winners the parties that create the biggest coalition, one which exceeds 60 mandates.
An abundance of polling has been continually published during the run-up, and the variety of pollsters and publishers is hard to keep track of as a casual voter trying to gauge the temperature of the political landscape. I came across a great real-time database by Project 61 on Google Docs of past and present polling results, and decided that it was a great opportunity to learn the Shiny library from RStudio and create an app where I can check current and past results. So after I figured out how to connect Google Docs to R, I created a self-updating app that became a nice place to keep track of polling every day, check trends and distributions using interactive ggplot2 graphs, and simulate coalition outcomes.
Please note that as of Friday (March 13th) and until election day (March 17th) it is forbidden to perform new polls in Israel, hence the data presented here cannot allow for up-to-date inference about the expected results of the election. This post is for educational purposes.

Running the election polls Shiny app on your computer

The github repo is available here.

#changing locale to run on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_ALL", "Hebrew_Israel.1255")

#check to see if libraries need to be installed
libs <- c("shiny", "shinyAce", "httr", "XML", "stringr", "ggplot2", "scales", "plyr", "reshape2", "dplyr")
x <- sapply(libs, function(x) if (!require(x, character.only = TRUE)) install.packages(x))
rm(x, libs)

#run App
shiny::runGitHub("Elections", "yonicd", subdir = "shiny")

#reset to the system default locale on Windows
if (Sys.info()[1] == "Windows") Sys.setlocale("LC_ALL", "")


Usage Instructions:

  1. Current Polling
  2. Election Analysis
  3. Mandate Simulator and Coalition Whiteboard
  4. Polling Database

Current Polling

  • The latest polling-day results published in the media, and the prediction made using the Project 61 weighting schemes. The parties are stacked into blocks to see which block has the best chance to create a coalition.

[Figure: last polling day results, parties stacked into blocks]

The Project 61 prediction is based on past pollster error, deriving weights from the 2003, 2006, 2009 and 2013 elections, depending on days to election and on the party. On their site there is an extensive analysis of pollster bias towards certain parties and party blocks.

Election Analysis

  • An interactive polling analysis layout where the user can filter elections, parties, publishers, pollsters and dates, and create different types of plots using any variable as the x and y axes.
  • The default layer is the 60-day trend (estimated with a loess smoother) of mandates published by each pollster, by party.

[Figure: the Election Analysis screen]

The user can choose which elections to include in the plots (2003, 2006, 2009, 2013, 2015), and the subsequent filters are populated with the parties, pollsters and publishers relevant to the chosen elections. Next there is a slider to choose how many days before the election you want to view in the plot. This was used instead of a calendar to create a uniform timeline when comparing across elections.

In addition, the plot itself is a ggplot, so the options above the graph give the user control over nearly everything needed to build a plot. The user can choose from the following variables:

Time      Party                            Results           Poll
Election  Party                            Mandates          Publisher
DaysLeft  Ideology (5 Party Blocks)        Mandate.Group     Pollster
Date      Ideology.Group (2 Party Blocks)  Results
year      Attribute (Party History)        (Pollster) Error
month
week

To define the following plot attributes:

Plot Type  Axes                       Grouping                                       Plot Facets
Point      X axis variable            Split Y by colors using a different variable   Row Facet
Bar        Discrete/Continuous                                                       Column Facet
Line       Rotation of X tick labels
Step       Y axis variable
Boxplot
Density
  • Create facets to display subsets of the data in different panels (two more variables to cut the data); there are two types of facets to choose from (sketched in ggplot2 terms after this list):
    • Wrap: wrap a 1d ribbon of panels into 2d
    • Grid: lay out panels in a grid (matrix)
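For reference, the two facet types in plain ggplot2 terms (an illustrative sketch reusing the app's variable names, e.g. Party.En and Publisher, with p a mandates plot):

p + facet_wrap(~ Party.En)            # Wrap: 1d ribbon of panels folded into 2d
p + facet_grid(Publisher ~ Party.En)  # Grid: panels laid out in a matrix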

An example of filtering pollsters to compare different tendencies for each party in the 2015 elections:

[Figure: pollster trend comparison per party, 2015 election]

An example of comparing the distribution of mandates per party over the last two months of polling:

[Figure: boxplots of mandates per party, by month]

An example of comparing the distribution of pollster errors across elections (up to 10 days prior to the end of polling), splitting the parties into five groups relative to the previous election: old party, new party, combined (a combination of two or more old parties), new.split (a new party created from a split of a party from the last election), and old.split (an old party that was left after the split).

[Figure: pollster error distributions across elections, by party group]

 

As we can see, the pollsters do not get a good indication of new, new.split or combined parties, which could be a problem this election since there are three combined parties and two new splits.

[Figure: comparison of party attributes across elections]

  • If you are an R user and know ggplot, there is an additional editor console, below the plot, where you can create advanced plots freehand: just add to the final object from the GUI, called p (the underlying data.frame is x), e.g. p + geom_point(). Just notice that all aesthetics must be given explicitly, since they are not defined in the original ggplot() call. It is also possible to use any library you want; just add it to the top of the code. The end object must be a ggplot.

[Figure: the freehand ggplot editor console below the plot]


#new layer
p + geom_smooth(aes(x = DaysLeft, y = Mandates, fill = Party.En)) +
  scale_x_reverse() + scale_fill_discrete(name = "Party")
  • You can also remove the original layer if you want, using the function remove_geom(ggplot_object, geom_layer); e.g. p = remove_geom(p, "point") will remove the geom_point layer from the original graph

[Figure: the editor console after removing the point layer]

p <- remove_geom(p, "point") #blank ggplot with facets in place
#new layer
p + geom_smooth(aes(x = DaysLeft, y = Mandates, fill = Party.En)) +
  scale_x_reverse() + scale_fill_discrete(name = "Party")
  • Finally, the plots can be viewed in English or Hebrew, and can be downloaded to your local computer using the download button.

Mandate Simulator and Coalition Whiteboard

  • A bootstrap simulation is run on polling results from up to 10 of the latest polls, using the sampling error as the uncertainty of each published mandate count. The simulated final tally of mandates is calculated taking into account mandate surplus agreements, using the Hagenbach-Bischoff quota method (a toy sketch of the quota follows this list), and the mandate threshold limit (in this election it is 4 mandates). The distributions are plotted per party, along with the location of the median published results in the media.
  • The user can choose how many polls to take into account (up to the last 10 polls) and how big a simulation to run: 50, 100, 500 or 1000 random polling results per party and poll.
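As a toy sketch of the Hagenbach-Bischoff quota itself (my illustration only; the app's actual allocation, with surplus agreements and the threshold, is more involved):

# quota = total votes / (seats + 1); each party first receives floor(votes / quota)
votes <- c(A = 300000, B = 220000, C = 90000)  # hypothetical vote counts
seats <- 120
quota <- sum(votes) / (seats + 1)
floor(votes / quota)  # initial seat allocation, before leftover seats are assigned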

[Figure: simulated mandate distributions per party]

  • Once the simulation is complete you can create coalitions, based on either the simulated distributions or the actual published polls, and see which can pass 60 mandates. Choose the coalition parties and the opposition parties from dropdown lists. (Yes, the ones chosen here are nonsensical on purpose…)

[Figure: the coalition whiteboard]

Polling Database

  • All raw data used in the application can be viewed and filtered in a datatable.