This is a guest article by Nina Zumel and John Mount, authors of the new book

Practical Data Science with R. For readers of this blog, there is a 50% discount off the “Practical Data Science with R” book: simply use the code pdswrblo when reaching checkout (until the 30th of this month). Here is the post:

*Normalizing data by mean and standard deviation is most meaningful when the data distribution is roughly symmetric. In this article, based on chapter 4 of Practical Data Science with R, the authors show you a transformation that can make some distributions more symmetric.*

The need for data transformation can depend on the modeling method that you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between input variables and output variables is approximately linear, that the input variables are approximately normal in distribution, and that the output variable is constant variance (that is, the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.

In this article, we will look at some log transformations and when to use them.

Monetary amounts—incomes, customer value, account or purchase sizes—are some of the most commonly encountered sources of skewed distributions in data science applications. In fact, as we discuss in Appendix B: Important Statistical Concepts, monetary amounts are often lognormally distributed—that is, the log of the data is normally distributed. This leads us to the idea that taking the log of the data can restore symmetry to it. We demonstrate this in figure 1.
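As a quick sketch of this idea (using synthetic data, not the income data behind the book's figure 1): a lognormally distributed variable is right-skewed, so its mean sits well above its median, while its log base 10 is roughly symmetric.

```r
# Synthetic "incomes": log10(income) is drawn from N(4, 0.5),
# so income itself is lognormally distributed.
set.seed(42)
income <- 10^rnorm(10000, mean = 4, sd = 0.5)

# For a right-skewed variable the mean sits well above the median;
# after the log transform the two nearly coincide.
mean(income) > median(income)                      # TRUE: raw data is skewed
abs(mean(log10(income)) - median(log10(income)))   # close to 0: roughly symmetric
```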

For the purposes of modeling, *which* logarithm you use (natural logarithm, log base 10, or log base 2) is generally not critical. In regression, for example, the choice of logarithm affects the magnitude of the coefficient that corresponds to the logged variable, but it doesn’t affect the fitted value of the outcome. I like to use log base 10 for monetary amounts, because orders of ten seem natural for money: $100, $1,000, $10,000, and so on. The transformed data is easy to read.
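This can be checked directly. The sketch below (synthetic data; the variable names are made up for illustration) regresses the same outcome on log10(x) and on log(x): the coefficients differ by the change-of-base factor log(10), but the fitted values are identical.

```r
set.seed(1)
x <- runif(100, min = 1, max = 1000)
y <- 3 * log10(x) + rnorm(100, sd = 0.1)

fit10 <- lm(y ~ log10(x))   # slope in log base 10 units
fitE  <- lm(y ~ log(x))     # slope in natural log units

# The slopes differ only by the change-of-base factor...
coef(fit10)[2] / coef(fitE)[2]            # approximately log(10), about 2.303

# ...while the predictions are the same in either base.
max(abs(fitted(fit10) - fitted(fitE)))    # essentially zero
```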

**An aside on graphing**

The difference between using the ggplot2 layer scale_x_log10 on a density plot of income and plotting a density plot of log10(income) is primarily axis labeling. Using scale_x_log10 will label the x-axis in dollar amounts, rather than in logs.
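As a sketch of the two approaches (assuming ggplot2 is installed; the income data here is hypothetical, generated just for illustration):

```r
library(ggplot2)

# Hypothetical income data for illustration.
d <- data.frame(income = 10^rnorm(1000, mean = 4, sd = 0.5))

# Log-scaled axis, labeled in dollar amounts ($100, $1,000, ...).
p1 <- ggplot(d, aes(x = income)) + geom_density() + scale_x_log10()

# Same curve shape, but the x-axis is labeled in log units (2, 3, 4, ...).
p2 <- ggplot(d, aes(x = log10(income))) + geom_density()
```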

It’s also generally a good idea to log transform data with values that range over several orders of magnitude. First, because modeling techniques often have a difficult time with very wide data ranges, and second, because such data often comes from multiplicative processes, so log units are in some sense more natural.

For example, when you are studying weight loss, the natural unit is often pounds or kilograms. If I weigh 150 pounds and my friend weighs 200, and we are both equally active and go on the exact same restricted-calorie diet, then we will probably both lose about the same number of pounds. In other words, how much weight we lose doesn’t (to first order) depend on how much we weighed in the first place, only on calorie intake. This is an *additive* process.

On the other hand, if management gives everyone in the department a raise, it probably isn’t by giving everyone $5000 extra. Instead, everyone gets a 2 percent raise: how much extra money ends up in my paycheck depends on my initial salary. This is a *multiplicative* process, and the natural unit of measurement is percentage, not absolute dollars. Other examples of multiplicative processes: a change to an online retail site increases conversion (purchases) for each item by 2 percent (not by exactly two purchases); a change to a restaurant menu increases patronage every night by 5 percent (not by exactly five customers every night). When the process is multiplicative, log-transforming the process data can make modeling easier.
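A short sketch (with made-up salaries) makes the point concrete: after a 2 percent raise, the dollar increase depends on the starting salary, but the increase in log units is the same for everyone.

```r
salaries <- c(30000, 60000, 120000)   # hypothetical salaries
raised   <- salaries * 1.02           # everyone gets a 2% raise

# On the original scale, the increase depends on the starting salary...
raised - salaries                     # 600 1200 2400

# ...but on the log scale, everyone moves by the same amount: log10(1.02).
log10(raised) - log10(salaries)
```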

Of course, taking the logarithm only works if the data is non-negative. There are other transforms, such as arcsinh, that you can use to decrease data range if you have zero or negative values. I don’t like to use arcsinh, because I don’t find the values of the transformed data to be meaningful. In applications where the skewed data is monetary (like account balances or customer value), I instead use what I call a “signed logarithm”. A signed logarithm takes the logarithm of the absolute value of the variable and multiplies by the appropriate sign. Values with absolute value less than one are mapped to zero. The difference between log and signed log are shown in figure 2.

Here’s how to calculate signed log base 10, in R:

signedlog10 <- function(x) { ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x))) }

Clearly this isn’t useful if values below unit magnitude are important. But with many monetary variables (in US currency), values less than a dollar aren’t much different from zero (or one) for all practical purposes. So, for example, mapping account balances that are less than a dollar to $1 (the equivalent of every account having a minimum balance of one dollar) is probably okay.
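A quick check of signedlog10 on positive, negative, and sub-unit values (the function is repeated here so the snippet is self-contained):

```r
signedlog10 <- function(x) { ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x))) }

# Values with |x| <= 1 map to 0; everything else keeps its sign.
signedlog10(c(-100, -0.5, 0, 0.5, 100))   # -2 0 0 0 2
```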

Once you’ve got the data suitably cleaned and transformed, you are almost ready to start the modeling stage.

**Summary**

At some point, you will have data that is as good quality as you can make it. You’ve fixed problems with missing data, and performed any needed transformations. You are ready to go on to the modeling stage. Remember, though, that data science is an iterative process. You may discover during the modeling process that you have to do additional data cleaning or transformation.

For source code, sample chapters, the Online Author Forum, and other resources, go to

http://www.manning.com/zumel/

I like the practical approach of this blog. Very useful. Only I wonder: with this ‘signed log’ function you appear to get a bimodal distribution, which is probably much harder to model. How to deal with this?

Hi Lydia,

It depends on your analysis. But it could well be that this is the correct distribution to work with.

Or, alternatively, that one can use non-parametric methods in order to mitigate the complex distribution.

If you or others have more to add – I’d be happy to read.

Tal

Why would anyone calculate the spread on log values and not on the real measured data?

Hi Soren,

It can be relevant if you are interested (for example) in comparing the means of the two populations (after performing a log transform) by using something like a t-test.

Since the transformation preserves the location of statistics such as the median, the t-test may even be interpreted in the original scale.

And of course, in various situations, the scaled data may be a relevant quantity of interest.

I’d be curious to read more thoughts on the matter.

Best,

Tal

Nice post, but here are a few suggestions for improvement. The comment “the input variables need to be approximately normal in distribution” is not entirely precise. What is required is that only and solely the residuals are normally distributed. This becomes clear when we think of experimental data where the input variables are chosen numbers like 1, 2, 3, 4… In this case the input variables are certainly uniformly distributed and not normally distributed. What counts is that ONLY the residuals need to be normally distributed for getting the standard errors right. However, even without this assumption the regression is still valid. And if the residuals are not normally distributed in the sample, the relevant statistics become approximately normally distributed in large samples due to the Central Limit Theorem (CLT). That is indeed the most beautiful gift of the universe to the statistician, as it says: “don’t worry if the residuals are not normally distributed; if you have a large sample, normality will hold approximately.”

Furthermore, we also do not require that “input variables and output variables are approximately linear” – this is what we are testing and not an assumption. The coefficient is the only component that needs to enter linearly. Therefore, if you think that the input variable is quadratic, it is perfectly alright to regress y = a + (x^2)*b + e.

Thirdly, the statement that “the output variable needs a constant variance” is also not correct in this context. We only require the conditional distribution of the output variable given the input variables to have equal variance for homoskedasticity to hold, and thus not to need heteroskedasticity-robust standard errors (White). This is one reason why we take the logarithm of the variables: there are indications that taking logs decreases the heteroskedasticity of the residuals and thus makes the estimation of the standard errors more efficient.

Under a generative model you *might* assume strong conditions, such as the variables being normally distributed. See Gelman for some comments for and against this position: http://andrewgelman.com/2013/08/04/19470/ . For y = a + (x^2)*b + e, you have that y is linear in x^2 (x^2 being a treated column, an effect, or some other conversion, be it implicit or explicit). There are a lot of different ways to frame regression.

Hi, I wish to know whether, after transformation, the data may pass normality tests.