Log Transformations for Skewed and Wide Distributions

This is a guest article by Nina Zumel and John Mount, authors of the new book Practical Data Science with RFor readers of this blog, there is a 50% discount off the “Practical Data Science with R” book, simply by using the code pdswrblo when reaching checkout (until the 30th this month). Here is the post:

Normalizing data by mean and standard deviation is most meaningful when the data distribution is roughly symmetric. In this article, based on chapter 4 of Practical Data Science with R, the authors show you a transformation that can make some distributions more symmetric.

The need for data transformation can depend on the modeling method that you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between input variables and output variables is approximately linear, that the input variables are approximately normal in distribution, and that the output variable is constant variance (that is, the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.

In this article, we will look at some log transformations and when to use them.

Monetary amounts—incomes, customer value, account or purchase sizes—are some of the most commonly encountered sources of skewed distributions in data science applications. In fact, as we discuss in Appendix B: Important Statistical Concepts, monetary amounts are often lognormally distributed—that is, the log of the data is normally distributed. This leads us to the idea that taking the log of the data can restore symmetry to it. We demonstrate this in figure 1.

 

zumel_fig_1
Figure 1 A nearly lognormal distribution, and its log

 For the purposes of modeling, which logarithm you use—natural logarithm, log base 10 or log base 2—is generally not critical. In regression, for example, the choice of logarithm affects the magnitude of the coefficient that corresponds to the logged variable, but it doesn’t affect the value of the outcome. I like to use log base 10 for monetary amounts, because orders of ten seem natural for money: $100, $1000, $10,000, and so on. The transformed data is easy to read.

An aside on graphing

The difference between using the ggplot layer scale_x_log10 on a densityplot of income and plotting a densityplot of log10(income) is primarily axis labeling. Using scale_x_log10 will label the x-axis in dollars amounts, rather than in logs.

It’s also generally a good idea to log transform data with values that range over several orders of magnitude. First, because modeling techniques often have a difficult time with very wide data ranges, and second, because such data often comes from multiplicative processes, so log units are in some sense more natural.

For example, when you are studying weight loss, the natural unit is often pounds or kilograms. If I weigh 150 pounds, and my friend weighs 200, we are both equally active, and we both go on the exact same restricted-calorie diet, then we will probably both lose about the same number of pounds—in other words, how much weight we lose doesn’t (to first order) depend on how much we weighed in the first place, only on calorie intake. This is an additive process.

On the other hand, if management gives everyone in the department a raise, it probably isn’t by giving everyone $5000 extra. Instead, everyone gets a 2 percent raise: how much extra money ends up in my paycheck depends on my initial salary. This is a multiplicative process, and the natural unit of measurement is percentage, not absolute dollars. Other examples of multiplicative processes: a change to an online retail site increases conversion (purchases) for each item by 2 percent (not by exactly two purchases); a change to a restaurant menu increases patronage every night by 5 percent (not by exactly five customers every night). When the process is multiplicative, log-transforming the process data can make modeling easier.

Of course, taking the logarithm only works if the data is non-negative. There are other transforms, such as arcsinh, that you can use to decrease data range if you have zero or negative values. I don’t like to use arcsinh, because I don’t find the values of the transformed data to be meaningful. In applications where the skewed data is monetary (like account balances or customer value), I instead use what I call a “signed logarithm”. A signed logarithm takes the logarithm of the absolute value of the variable and multiplies by the appropriate sign. Values with absolute value less than one are mapped to zero. The difference between log and signed log are shown in figure 2.

Figure 2 Signed log lets you visualize non-positive data on a logarithmic scale
Figure 2 Signed log lets you visualize non-positive data on a logarithmic scale

Here’s how to calculate signed log base 10, in R:

signedlog10 = function(x) {
ifelse(abs(x) <= 1, 0, sign(x)*log10(abs(x)))
}

Clearly this isn’t useful if values below unit magnitude are important. But with many monetary variables (in US currency), values less than a dollar aren’t much different from zero (or one), for all practical purposes. So, for example, mapping account balances that are less than a dollar to $1 (the equivalent every account always having a minimum balance of one dollar) is probably okay.

Once you’ve got the data suitably cleaned and transformed, you are almost ready to start the modeling stage.

Summary

At some point, you will have data that is as good quality as you can make it. You’ve fixed problems with missing data, and performed any needed transformations. You are ready to go on the modeling stage. Remember, though, that data science is an iterative process. You may discover during the modeling process that you have to do additional data cleaning or transformation.

For source code, sample chapters, the Online Author Forum, and other resources, go to
http://www.manning.com/zumel/

R 3.0.1 is released

R 3.0.1 (codename “Good Sport”) was released last week. As mentioned earlier by David, this version improves serialization performance with big objects, improves reliability for parallel programming and fixes a few minor bugs.

Upgrading to R 3.0.1

You can download the latest version from here. Or, if you are using windows, you can upgrade to the latest version using the installr package (also available on CRAN and github). Simply run the following code:

# installing/loading the package:
if(!require(installr)) { 
install.packages("installr"); require(installr)} #load / install+load installr
 
updateR(to_checkMD5sums = FALSE) # the use of to_checkMD5sums is because of a slight bug in the MD5 file on R 3.0.1. Soon this should get resolved and you could go back to using updateR(), install.R() or the menu upgrade system.

I try to keep the installr package updated and useful. If you have any suggestions or remarks on the package, you’re invited to leave a comment below.

If you use the global library system (as I do), you can run the following in the new version of R:

source("http://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
New.R.RunMe()

Continue reading

R 3.0.0 is released! (what’s new, and how to upgrade)

A few hours ago Peter Dalgaard (of R Core Team) announced the release of R 3.0.0!  Bellow you can read the changes in this release.

One of the features worth noticing is the introduction of long vectors to R 3.0.0. As David Smith recently wrote:

Although many people won’t notice the difference, the introduction of long vectors to R is in fact a significant upgrade, and required a lot of work behind-the-scenes to implement in the core R engine. It will allow data frames to exceed their current 2 billion row limit, and in general allow R to make better use of memory in systems with large amounts of RAM. Many thanks go to the R core team for making this improvement.

You can get the source code from:  http://cran.r-project.org/src/base/R-3/R-3.0.0.tar.gz

or wait for it to be mirrored at a CRAN site nearer to you. Binaries for various platforms will appear in due course (which often means it will be within the next 2-48 hours).

If you are running R on Ubuntu, you may wish to consult this post.

If you are running R on Windows, you can use the following code to quickly download and install the latest R version using the installr package:

# installing/loading the package:
if(!require(installr)) { 
install.packages("installr"); require(installr)} #load / install+load installr
install.R(to_checkMD5sums = FALSE) # the use of to_checkMD5sums is because of a slight bug in the MD5 file on R 3.0.0. Soon this should get resolved and you could go back to using updateR()

Either way, all users should note that this new release requires that packages will need to be re-installed, which means that after you install the new R, you should run the following command in it:

update.packages(checkBuilt=TRUE)

(thank to Prof. Ripley for the above clarification, and the FAQ pointer)

R 3.0.0 NEWS:

SIGNIFICANT USER-VISIBLE CHANGES

Continue reading

installr_menubar_updateR

Updating R (on Windows) through a menu-bar: installr 0.9 released on CRAN

In preparation for the upcoming release of R 3.0.0, a new release 0.9 of installr is now on CRAN.

The package can be installed and loaded using:

# installing/loading the package:
if(!require(installr)) { 
install.packages("installr"); require(installr)} #load / install+load installr

The new version includes various bug fixes (as can be seen in the NEWS file) and new functions and features. The most user visible feature is that from now on, whenever loading installr in the Rgui, it will add a new menu-bar for updating your R version (the menu is removed when the package is detached).

installr_menubar_updateR

When choosing to update R, a new GUI based system will guide you step by step through the updating process. It will first check if a newer version of R is available, if so, it will offer to show the latest NEWS of that release, download and install the new version, and copy/move your packages from the previous library folder, to the one in the new installation. If you have a global library folder, you can simply stop the updating once your new R is installed, and continue as you would otherwise (in the future, I intend to update the package to also allow it to deal with people using a global library folder).

installr_updateR_noupdate_needed

(for using {installr} to update R through R terminal, see my previous post: Updating R from R (on Windows) – using the {installr} package)

Another new feature is the “installr()” function (which can also be run through the menubar), running it will open a window with a list of software you can download and install using the installr package (From Rtools and RStudio to pandoc and MikTeX).

installr_installr_function

I hope you’ll enjoy this new release, and as always – please let me know in the comments (or via e-mail) if you come across any bugs or have suggestions for new features.

Writing a MS-Word document using R (with as little overhead as possible)

The problem: producing a Word (.docx) file of a statistical report created in R, with as little overhead as possible.
The solution: combining R+knitr+rmarkdown+pander+pandoc (it is easier than it is spelled).

If you get what this post is about, just jump to the “Solution: the workflow” section.

rmd_to_docx

Preface: why is this a problem (/still)

Before turning to the solution, let’s address two preliminary questions:

Q: Why is it important to be able to create report in Word from R?

A: Because many researchers we may work with are used to working with Word for editing their text, tracking changes and merging edits between different authors, and copy-pasting text/tables/images from various sources.
This means that a report produced as a PDF file is less useful for collaborating with less-tech-savvy researchers (copying text or tables from PDF is not fun). Even exchanging HTML files may appear somewhat awkward to fellow researchers.
Continue reading

Updating R from R (on Windows) – using the {installr} package

Upgrading R on Windows is not easy. While the R FAQ offer guidelines, some users may prefer to simply run a command in order to upgrade their R to the latest version. That is what the new {installr} package is all about.

The {installr} package offers a set of R functions for the installation and updating of software (currently, only on Windows OS), with a special focus on R itself. To update R, you can simply run the following code:

# installing/loading the package:
if(!require(installr)) { 
install.packages("installr"); require(installr)} #load / install+load installr
 
# using the package:
updateR() # this will start the updating process of your R installation.  It will check for newer versions, and if one is available, will guide you through the decisions you'd need to make.

Running this function will perform the following steps:

  • Check what is the latest R version. If the current installed R version is up-to-date, the function ends (and returns FALSE)

  • If a newer version of R is available, you will be asked if to review the NEWS of the latest R version – in order to decide if to install the
    newest R or not.

  • If you wish it – the function will download and install the latest R version. (you will need to press the "next" buttons on your own)

  • Once the installation is done, you should press "any-key", and the function will proceed with copying all of your packages from your old (well, current) R installation, into your newer R installation.

  • You can then erase all of the packages in your old R installation.

  • After your packages are moved (and the old ones possibly erased), you will get the option to update all of your packages in the new version of R.

  • Lastely – you can open the new Rgui and close the current session of your old R. (This is a bit buggy in version 0.8, but has been fixed in version 0.8.1)

If you know you wish to upgrade R, and you want the packages moved (not copied, MOVED), you can simply run:

# installing/loading the package:
if(!require(installr)) { install.packages("installr"); require(installr)} #load / install+load installr
 
updateR(F, T, T, F, T, F, T) # install, move, update.package, quit R.

Since the various steps are broken into individual functions, you can also pick and choose what to run using the relevant function:

# installing/loading the package:
if(!require(installr)) { install.packages("installr"); require(installr)} #load / install+load installr
 
# step by step functions:
check.for.updates.R() # tells you if there is a new version of R or not.
install.R() # download and run the latest R installer
copy.packages.between.libraries() # copy your packages to the newest R installation from the one version before it (if ask=T, it will ask you between which two versions to perform the copying)

If you like using the global library system, you can run the following in the old R:

# installing/loading the package:
if(!require(installr)) { install.packages("installr"); require(installr)} #load / install+load installr
 
updateR(F, T, F, F, F, F, T) # only install R (if there is a newer version), and quits it.

And then run the following in the new version of R:

source("http://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
New.R.RunMe()

The {installr} package also offers functions for installing various other software on Windows. These functions include: install.pandoc (which was mentioned on this blog recently), install.git, install.Rtools, install.MikTeX, install.RStudio, and a general install.URL and install.packages.zip functions. You can see these further explained in the package’s Reference manual.

Feature requests, bug reports – and your help in improving the package

You can see the latest version of installr on github, where you can also submit bug reports (you may also just leave a comment in this post). Since this is my first R package, I might have (e.g: probably have) missed something here or there. So any comment on how to improve my code/documentation/R-fu, will be most welcomed (here or on github).

If this type of coding is fun/easy for you, you can help me improve this package on github. Cool new features I think may be added (by me or others) are:

  • Add an uninstall.R function – to remove the old R version.
  • Add more support for upgrading R for people who uses a global library for their packages.
  • Add support for Linux and Mac! This one I am less likely to do on my own – and would love to see someone else extend my code to other operation systems.
  • GUI – add a menu based option for running updateR. Something like help->”check for updates” would be great. (p.s: this idea came from Yihui Xie)
  • add even more install.software functions. If you have functions for which you’d like to be able to easily install them – just let me know and it could be included in future releases.

Thanks

Final note, I would like to thank the many people who have developed WONDERFUL tools for making R package development possible (and even somewhat fast), on Windows. These include Prof. Brian Ripley and Duncan Murdoch for Rtools, also Uwe Ligges for his work on CRAN, Hadley Wickham for devtools (in general, and for its documentation), Yihui Xie for roxygen2, JJ and others in the RStudio team for RStudio, the people behind git and github, and more. There are probably more things I can thank these people for, and many more people I should thank, but I can’t figure who you are probably (feel free to e-mail me, I appreciate you work even if it is not clear to me your are behind it).

Installing Pandoc from R (on Windows) – using the {installr} package

The R blogger Rolf Fredheim has recently wrote a great piece called “Reproducible research with R, Knitr, Pandoc and Word“, where he advocates for Pandoc as an essential part of reproducible research workflow in R, in helping to turn documents which are knitted in R into high quality Word for exchanging with our colleagues. It is a great post, with many useful bits of code, and I wanted to supplement it with one missing function: “install.pandoc“.

Update: the install.pandoc function is now part of the {installr} package.

Continue reading

{stargazer} package for beautiful LaTeX tables from R statistical models output

stargazer is a new R package that creates LaTeX code for well-formatted regression tables, with multiple models side-by-side, as well as for summary statistics tables. It can also output the content of data frames directly into LaTeX. Compared to available alternatives, stargazer excels in three regards:  its ease of use, the large number of models it supports, and its beautiful aesthetics.

Ease of use

stargazer was designed with the user’s comfort in mind. The learning curve is very mild and all arguments are very intuitive, so that even a beginning user of R or LaTeX can quickly become familiar with the package’s many capabilities. The package is intelligent, and tries to minimize the amount of effort the user has to put into adjusting argument values. If stargazer is given a set of regression model objects, for instance, the package will create a side-by-side regression table. By contrast, if the user feeds it a data frame, stargazer will know that the user is most likely looking for a summary statistics table or – if the summary argument is set to false – wants to output the content of the data frame.

A quick reproducible example shows just how easy stargazer is to use. You can install stargazer from CRAN in the usual way:

install.packages("stargazer")
library(stargazer)

Continue reading

Statistics with R, and open source stuff (software, data, community)