Currently I am doing my master thesis on multi-state models. Survival analysis was my favourite course in the masters program, partly because of the great survival package which is maintained by Terry Therneau. The only thing I am not so keen on are the default plots created by this package, by using plot.survfit. Although the plots are very easy to produce, they are not that attractive (as are most R default plots) and legends has to be added manually. I come across them all the time in the literature and wondered whether there was a better way to display survival. Since I was getting the grips of ggplot2 recently I decided to write my own function, with the same functionality as plot.survfitbut with a result that is much better looking. I stuck to the defaults of plot.survfit as much as possible, for instance by default plotting confidence intervals for single-stratum survival curves, but not for multi-stratum curves. Below you’ll find the code of the ggsurv function. Just as plot.survfit it only requires a fitted survival object to produce a default plot. We’ll use the lung data set from the survival package for illustration. First we load in the function to the console (see at the end of this post).

Once the function is loaded, we can get going, we use the lung data set from the survival package for illustration.

What are the top 100 (most downloaded) R packages in 2013? Thanks to the recent release of RStudio of their “0-cloud” CRAN log files (but without including downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for the months of Jan till May)!

By relying on the nice code that Felix Schonbrodt recently wrote for tracking packages downloads, I have updated my installr R package with functions that enables the user to easily download and visualize the popularity of R packages over time. In this post I will share some nice plots and quick insights that can be made from this great data. The code for this analysis is given at the end of this post.

Top 8 most downloaded R packages – downloads over time

Let’s first have a look at the number of downloads per day for these 5 months, of the top 8 most downloaded packages (click the image for a larger version):

We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having much fewer downloads than other days. This is not surprising since we know that the countries which uses R the most have these days as rest days (see James Cheshire’s world map of R users). It is also interesting to note how some packages had exceptional peaks on some dates. For example, I wonder what happened on January 23rd 2013 that the digest package suddenly got so many downloads, or that colorspace started getting more downloads from April 15th 2013.

“Family tree” of the top 100 most downloaded R packages

We can extract from this data the top 100 most downloaded R packages. Moreover, we can create a matrix showing for each package which of our unique ids (censored IP addresses), has downloaded which package. Using this indicator matrix, we can thing of the “similarity” (or distance) between each two packages, and based on that we can create a hierarchical clustering of the packages – showing which packages “goes along” with one another.

With this analysis, you can locate package on the list which you often use, and then see which other packages are “related” to that package. If you don’t know that package – consider having a look at it – since other R users are clearly finding the two packages to be “of use”.

Such analysis can (and should!) be extended. For example, we can imagine creating a “suggest a package” feature based on this data, utilizing the package which you use, the OS that you use, and other parameters. But such coding is beyond the scope of this post.

Here is the “family tree” (dendrogram) of related packages:

To make it easier to navigate, here is a table with links to the top 100 R packages, and their links:

This is a guest article by Nina Zumel and John Mount, authors of the new book Practical Data Science with R. For readers of this blog, there is a 50% discount offthe “Practical Data Science with R” book, simply by using the code pdswrblo when reaching checkout (until the 30th this month). Here is the post:

Normalizing data by mean and standard deviation is most meaningful when the data distribution is roughly symmetric. In this article, based on chapter 4 of Practical Data Science with R, the authors show you a transformation that can make some distributions more symmetric.

The need for data transformation can depend on the modeling method that you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between input variables and output variables is approximately linear, that the input variables are approximately normal in distribution, and that the output variable is constant variance (that is, the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.

In this article, we will look at some log transformations and when to use them.

Monetary amounts—incomes, customer value, account or purchase sizes—are some of the most commonly encountered sources of skewed distributions in data science applications. In fact, as we discuss in Appendix B: Important Statistical Concepts, monetary amounts are often lognormally distributed—that is, the log of the data is normally distributed. This leads us to the idea that taking the log of the data can restore symmetry to it. We demonstrate this in figure 1.

For the purposes of modeling, which logarithm you use—natural logarithm, log base 10 or log base 2—is generally not critical. In regression, for example, the choice of logarithm affects the magnitude of the coefficient that corresponds to the logged variable, but it doesn’t affect the value of the outcome. I like to use log base 10 for monetary amounts, because orders of ten seem natural for money: $100, $1000, $10,000, and so on. The transformed data is easy to read.

An aside on graphing

The difference between using the ggplot layer scale_x_log10 on a densityplot of income and plotting a densityplot of log10(income) is primarily axis labeling. Using scale_x_log10 will label the x-axis in dollars amounts, rather than in logs.

It’s also generally a good idea to log transform data with values that range over several orders of magnitude. First, because modeling techniques often have a difficult time with very wide data ranges, and second, because such data often comes from multiplicative processes, so log units are in some sense more natural.

For example, when you are studying weight loss, the natural unit is often pounds or kilograms. If I weigh 150 pounds, and my friend weighs 200, we are both equally active, and we both go on the exact same restricted-calorie diet, then we will probably both lose about the same number of pounds—in other words, how much weight we lose doesn’t (to first order) depend on how much we weighed in the first place, only on calorie intake. This is an additive process.

On the other hand, if management gives everyone in the department a raise, it probably isn’t by giving everyone $5000 extra. Instead, everyone gets a 2 percent raise: how much extra money ends up in my paycheck depends on my initial salary. This is a multiplicative process, and the natural unit of measurement is percentage, not absolute dollars. Other examples of multiplicative processes: a change to an online retail site increases conversion (purchases) for each item by 2 percent (not by exactly two purchases); a change to a restaurant menu increases patronage every night by 5 percent (not by exactly five customers every night). When the process is multiplicative, log-transforming the process data can make modeling easier.

Of course, taking the logarithm only works if the data is non-negative. There are other transforms, such as arcsinh, that you can use to decrease data range if you have zero or negative values. I don’t like to use arcsinh, because I don’t find the values of the transformed data to be meaningful. In applications where the skewed data is monetary (like account balances or customer value), I instead use what I call a “signed logarithm”. A signed logarithm takes the logarithm of the absolute value of the variable and multiplies by the appropriate sign. Values with absolute value less than one are mapped to zero. The difference between log and signed log are shown in figure 2.

Clearly this isn’t useful if values below unit magnitude are important. But with many monetary variables (in US currency), values less than a dollar aren’t much different from zero (or one), for all practical purposes. So, for example, mapping account balances that are less than a dollar to $1 (the equivalent every account always having a minimum balance of one dollar) is probably okay.

Once you’ve got the data suitably cleaned and transformed, you are almost ready to start the modeling stage.

Summary

At some point, you will have data that is as good quality as you can make it. You’ve fixed problems with missing data, and performed any needed transformations. You are ready to go on the modeling stage. Remember, though, that data science is an iterative process. You may discover during the modeling process that you have to do additional data cleaning or transformation.

R 3.0.1 (codename “Good Sport”) was released last week. As mentioned earlier by David, this version improves serialization performance with big objects, improves reliability for parallel programming and fixes a few minor bugs.

Upgrading to R 3.0.1

You can download the latest version from here. Or, if you are using windows, you can upgrade to the latest version using the installr package (also available on CRAN and github). Simply run the following code:

# installing/loading the package:if(!require(installr)){install.packages("installr");require(installr)}#load / install+load installr
updateR(to_checkMD5sums = FALSE)# the use of to_checkMD5sums is because of a slight bug in the MD5 file on R 3.0.1. Soon this should get resolved and you could go back to using updateR(), install.R() or the menu upgrade system.

I try to keep the installr package updated and useful. If you have any suggestions or remarks on the package, you’re invited to leave a comment below.

If you use the global library system (as I do), you can run the following in the new version of R:

A few hours ago Peter Dalgaard (of R Core Team) announced the release of R 3.0.0! Bellow you can read the changes in this release.

One of the features worth noticing is the introduction of long vectors to R 3.0.0. As David Smith recently wrote:

Although many people won’t notice the difference, the introduction of long vectors to R is in fact a significant upgrade, and required a lot of work behind-the-scenes to implement in the core R engine. It will allow data frames to exceed their current 2 billion row limit, and in general allow R to make better use of memory in systems with large amounts of RAM. Many thanks go to the R core team for making this improvement.

or wait for it to be mirrored at a CRAN site nearer to you. Binaries for various platforms will appear in due course (which often means it will be within the next 2-48 hours).

If you are running R on Ubuntu, you may wish to consult this post.

If you are running R on Windows, you can use the following code to quickly download and install the latest R version using the installr package:

# installing/loading the package:if(!require(installr)){install.packages("installr");require(installr)}#load / install+load installr
install.R(to_checkMD5sums = FALSE)# the use of to_checkMD5sums is because of a slight bug in the MD5 file on R 3.0.0. Soon this should get resolved and you could go back to using updateR()

Either way, all users should note that this new release requires that packages will need to be re-installed, which means that after you install the new R, you should run the following command in it:

update.packages(checkBuilt=TRUE)

(thank to Prof. Ripley for the above clarification, and the FAQ pointer)

The problem: producing a Word (.docx) file of a statistical report created in R, with as little overhead as possible. The solution: combining R+knitr+rmarkdown+pander+pandoc (it is easier than it is spelled).

If you get what this post is about, just jump to the “Solution: the workflow” section.

Preface: why is this a problem (/still)

Before turning to the solution, let’s address two preliminary questions:

Q: Why is it important to be able to create report in Word from R?

A: Because many researchers we may work with are used to working with Word for editing their text, tracking changes and merging edits between different authors, and copy-pasting text/tables/images from various sources. This means that a report produced as a PDF file is less useful for collaborating with less-tech-savvy researchers (copying text or tables from PDF is not fun). Even exchanging HTML files may appear somewhat awkward to fellow researchers. Continue reading →

Upgrading R on Windows is not easy. While the R FAQ offer guidelines, some users may prefer to simply run a command in order to upgrade their R to the latest version. That is what the new {installr} package is all about.

The {installr} package offers a set of R functions for the installation and updating of software (currently, only on Windows OS), with a special focus on R itself. To update R, you can simply run the following code:

# installing/loading the package:if(!require(installr)){install.packages("installr");require(installr)}#load / install+load installr# using the package:
updateR()# this will start the updating process of your R installation. It will check for newer versions, and if one is available, will guide you through the decisions you'd need to make.

Running this function will perform the following steps:

Check what is the latest R version. If the current installed R version is up-to-date, the function ends (and returns FALSE)

If a newer version of R is available, you will be asked if to review the NEWS of the latest R version – in order to decide if to install the newest R or not.

If you wish it – the function will download and install the latest R version. (you will need to press the "next" buttons on your own)

Once the installation is done, you should press "any-key", and the function will proceed with copying all of your packages from your old (well, current) R installation, into your newer R installation.

You can then erase all of the packages in your old R installation.

After your packages are moved (and the old ones possibly erased), you will get the option to update all of your packages in the new version of R.

Lastely – you can open the new Rgui and close the current session of your old R. (This is a bit buggy in version 0.8, but has been fixed in version 0.8.1)

If you know you wish to upgrade R, and you want the packages moved (not copied, MOVED), you can simply run:

# installing/loading the package:if(!require(installr)){install.packages("installr");require(installr)}#load / install+load installr
updateR(F, T, T, F, T, F, T)# install, move, update.package, quit R.

Since the various steps are broken into individual functions, you can also pick and choose what to run using the relevant function:

# installing/loading the package:if(!require(installr)){install.packages("installr");require(installr)}#load / install+load installr# step by step functions:
check.for.updates.R()# tells you if there is a new version of R or not.
install.R()# download and run the latest R installer
copy.packages.between.libraries()# copy your packages to the newest R installation from the one version before it (if ask=T, it will ask you between which two versions to perform the copying)

# installing/loading the package:if(!require(installr)){install.packages("installr");require(installr)}#load / install+load installr
updateR(F, T, F, F, F, F, T)# only install R (if there is a newer version), and quits it.

And then run the following in the new version of R:

The {installr} package also offers functions for installing various other software on Windows. These functions include: install.pandoc (which was mentioned on this blog recently), install.git, install.Rtools, install.MikTeX, install.RStudio, and a general install.URL and install.packages.zip functions. You can see these further explained in the package’s Reference manual.

Feature requests, bug reports – and your help in improving the package

You can see the latest version of installr on github, where you can also submit bug reports (you may also just leave a comment in this post). Since this is my first R package, I might have (e.g: probably have) missed something here or there. So any comment on how to improve my code/documentation/R-fu, will be most welcomed (here or on github).

If this type of coding is fun/easy for you, you can help me improve this package on github. Cool new features I think may be added (by me or others) are:

Add an uninstall.R function – to remove the old R version.

Add more support for upgrading R for people who uses a global library for their packages.

Add support for Linux and Mac! This one I am less likely to do on my own – and would love to see someone else extend my code to other operation systems.

GUI – add a menu based option for running updateR. Something like help->”check for updates” would be great. (p.s: this idea came from Yihui Xie)

add even more install.software functions. If you have functions for which you’d like to be able to easily install them – just let me know and it could be included in future releases.

Thanks

Final note, I would like to thank the many people who have developed WONDERFUL tools for making R package development possible (and even somewhat fast), on Windows. These include Prof. Brian Ripley and Duncan Murdoch for Rtools, also Uwe Ligges for his work on CRAN, Hadley Wickham for devtools (in general, and for its documentation), Yihui Xie for roxygen2, JJ and others in the RStudio team for RStudio, the people behind git and github, and more. There are probably more things I can thank these people for, and many more people I should thank, but I can’t figure who you are probably (feel free to e-mail me, I appreciate you work even if it is not clear to me your are behind it).

The R blogger Rolf Fredheim has recently wrote a great piece called “Reproducible research with R, Knitr, Pandoc and Word“, where he advocates for Pandoc as an essential part of reproducible research workflow in R, in helping to turn documents which are knitted in R into high quality Word for exchanging with our colleagues. It is a great post, with many useful bits of code, and I wanted to supplement it with one missing function: “install.pandoc“.