In this post I present a function that helps to label outlier observations when plotting a boxplot using R.

An outlier is an observation that is numerically distant from the rest of the data. In a boxplot, an outlier is a data point located outside the fences (“whiskers”) of the boxplot (e.g., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
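To make the 1.5 × IQR rule concrete, here is a minimal sketch that computes the fences by hand and compares the result with what R reports. It assumes the default `coef = 1.5` and uses Tukey's hinges from `fivenum()`, which is what `boxplot.stats()` uses for the quartiles; the variable names are only for illustration.

```r
set.seed(482)
y <- rnorm(100)

# boxplot() uses Tukey's hinges (see ?fivenum) as the lower and upper quartiles
hinges <- fivenum(y)[c(2, 4)]
iqr <- diff(hinges)

# the fences sit 1.5 * IQR beyond each hinge
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

# observations outside the fences are the points drawn as outliers
manual_out <- y[y < lower_fence | y > upper_fence]

# compare with the outliers that boxplot.stats() reports
identical(manual_out, boxplot.stats(y)$out)
```

Note that `quantile()` with its default settings can give slightly different quartiles than the hinges, so the two approaches may disagree on borderline points.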

Identifying these points in R is very simple when dealing with only one boxplot and a few outliers: it can easily be done using the “identify” function. For example, running the code below will plot a boxplot of a hundred observations sampled from a normal distribution, and will then let you click on an outlier point to have its label (in this case, its index number) plotted beside the point:

    set.seed(482)
    y <- rnorm(100)
    boxplot(y)
    identify(rep(1, length(y)), y, labels = seq_along(y))

However, this solution is **not** scalable when dealing with:

- Many outliers
- Overlapping data-points, and
- Multiple boxplots in the same graphic window

For such cases I recently wrote the function “boxplot.with.outlier.label” (which you can **download from here**). This function operates in a similar way to “boxplot” (with a formula), with the added option of defining “label_name”. When outliers are present, the function will then mark all of them using the label_name variable. This function can handle interaction terms and will also try to space the labels so that they won’t overlap (my thanks go to Greg Snow for his function “spread.labs” from the {TeachingDemos} package, and for helpful comments on the R-help mailing list).
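For readers who want to see the basic idea without sourcing the file, here is a minimal base-R sketch of labeling outliers per group. It is not the code of boxplot.with.outlier.label, only an illustration under simple assumptions: a single grouping factor, no missing values, and no attempt to keep labels from overlapping. The names `g`, `lab` and `is_out` are made up for this example.

```r
set.seed(492)
y   <- rnorm(200)
g   <- factor(sample(c("a", "b"), 200, replace = TRUE))  # grouping variable
lab <- sample(letters, 200, replace = TRUE)              # one label per observation

# flag outliers within each group, using the same rule boxplot() applies;
# ave() returns 0/1 here, so compare with 1 to get a logical vector
is_out <- ave(y, g, FUN = function(v) v %in% boxplot.stats(v)$out) == 1

boxplot(y ~ g)

# text() fails on zero-length labels, so only call it when there are outliers
if (any(is_out)) {
  text(x = as.integer(g)[is_out] + 0.15,  # box positions are 1, 2, ... by factor level
       y = y[is_out],
       labels = lab[is_out],
       cex = 0.8)
}
```

The guard around `text()` matters: a group (or a whole plot) with no outliers would otherwise stop with a zero-length 'labels' error.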

Here is some example code you can try out for yourself:

    source("https://raw.githubusercontent.com/talgalili/R-code-snippets/master/boxplot.with.outlier.label.r") # Load the function

    # sample some points and labels for us:
    set.seed(492)
    y <- rnorm(2000)
    x1 <- sample(letters[1:2], 2000, TRUE)
    x2 <- sample(letters[1:2], 2000, TRUE)
    lab_y <- sample(letters[1:4], 2000, TRUE)

    # plot a boxplot with interactions:
    boxplot.with.outlier.label(y~x2*x1, lab_y)

Here is the resulting graph:

You can also try running the following code to see how it handles simpler cases:

    # plot a boxplot without interactions:
    boxplot.with.outlier.label(y~x1, lab_y, ylim = c(-5,5))

    # plot a boxplot of y only
    boxplot.with.outlier.label(y, lab_y, ylim = c(-5,5))
    boxplot.with.outlier.label(y, lab_y, spread_text = FALSE) # here the labels will overlap (because I turned spread_text off)

Here is the output of the last example, showing how the plot looks when we allow for the text to overlap (we would often prefer to NOT allow it).
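To give a feel for what spreading the labels involves, here is a deliberately simplified sketch. It is not the algorithm used by “spread.labs”; `spread_simple` is a hypothetical helper that only pushes crowded positions upward until neighbours are at least `mindiff` apart.

```r
# Hypothetical helper (not part of any package): returns label positions in
# which sorted neighbours are at least `mindiff` apart.
spread_simple <- function(pos, mindiff) {
  ord <- order(pos)
  s <- pos[ord]
  # walk up the sorted positions, pushing each one above its lower neighbour
  for (i in seq_along(s)[-1]) {
    if (s[i] - s[i - 1] < mindiff) s[i] <- s[i - 1] + mindiff
  }
  # return the adjusted positions in the original order
  out <- numeric(length(pos))
  out[ord] <- s
  out
}

# three crowded points and one isolated point
spread_simple(c(2.00, 2.01, 2.02, 3.50), mindiff = 0.1)
```

The adjusted values would be used as the y positions of the text, while the points themselves stay where they are. Because this version only moves labels upward, a dense cluster drifts away from its points; a more careful method moves labels in both directions.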

Regarding package dependencies: note that this function requires you to first install the packages {TeachingDemos} (by Greg Snow) and {plyr} (by Hadley Wickham).

**Updates:**

19.04.2011 – I’ve added support for the boxplot “names” and “at” parameters.

You are very much invited to leave your comments if you find a **bug**, think of ways to **improve** the function, or simply **enjoyed** it and would like to share it with me.

Could be a bug. Getting boxplots but no labels on Mac OS X 10.6.6 with R 2.11.1.

Hi Albert, what code are you running and do you get any errors?

The exact sample code. Re-running caused me to find the bug, which was silent.

“require(plyr)” needs to be before the “is.formula” call.

Fixing that, I get the labels now.

Thanks Albert, Good catch.

I thought is.formula was part of R. I fixed it now.

I’ve done something similar with slight difference.

Tukey advocated different plotting symbols for outliers and extreme outliers, so I only label extreme outliers (roughly 3.0 * IQR instead of 1.5 * IQR).

Hi Kevin,

That’s a good idea. Let me know if you got any code I might look at to see how you implemented it.

Cheers,

Tal

Hi

I have a code for boxplot with outliers and extreme outliers.

I wrote this code quickly, to teach this type of boxplot in the classroom.

The code is this:

    # -- draw extreme values on a boxplot
    datos = iris[[2]]^5          # build a variable with extreme values
    boxplot(datos)               # draw the boxplot
    dc = boxplot(datos, plot=F)  # store the boxplot in dc, without drawing it again
    attach(dc)                   # expose the elements of dc by name, for convenience
    if (length(out) > 0)
    {
      for (i in 1:length(out))   # loop that does the same for each outlying value
      {
        # extreme outlier: more than 3 * IQR beyond the hinges
        if (out[i] > 4*stats[4,group[i]] - 3*stats[2,group[i]] | out[i] < 4*stats[2,group[i]] - 3*stats[4,group[i]])
        {
          points(group[i], out[i], col="white")  # erase the previous point
          points(group[i], out[i], pch=4)        # draw the new point
        }
      }
      rm(i)
    } # end of if
    detach(dc)  # undo the attach of dc
    rm(dc)      # delete dc
    # finished drawing the extreme values

I apologise for not writing better English.

X.M.

Thanks X.M.,

Maybe I should add some notation for extreme outliers.

P.S.: I updated the code to allow changing the “range” parameter (e.g., to control the length of the fences).

Cheers,

Tal

Another bug. For some seeds, I get an error, and the labels are not all drawn. For example, set the seed to 42.

> set.seed(42)

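> y <- rnorm(2000)

> x1 <- sample(letters[1:2], 2000,T)

> x2 <- sample(letters[1:2], 2000,T)

> lab_y <- sample(letters[1:4], 2000,T)

> # plot a boxplot with interactions: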

> boxplot.with.outlier.label(y~x2*x1, lab_y)

Error in text.default(temp_x + 0.19, temp_y_new, current_label, col = label.col) :

zero length ‘labels’

Albert – thanks for the second catch.

I found the bug (the function didn’t know what to do when a subgroup had no outliers).

It is now fixed and the updated code is uploaded to the site.

Looks very nice! Only wish it was in ggplot2, which is the way to display graphs I use all the time. But very handy nonetheless!

After the last line of the second code block, I get this error:

> boxplot.with.outlier.label(y~x2*x1, lab_y)

Error in model.frame.default(y) : object is not a matrix

Checking str(y) it is a num[1:2000]…

Thanks Jon,

I found the bug and fixed it. It was introduced by the major extension that deals with cases of identical y values.

Best,

Tal

Thanks, I’ll be using this already

Good function, it can be very helpful.

Unfortunately, it seems it won’t work when the groups have different numbers of observations because of missing values.

Maybe an idea for further improvement?

Thanks for the code.

I have some trouble using it. When I use the function as follows:

    for(i in c(4,5,7:34,36:43))
    {
      mini=min(ForeMeans15[,i],HindMeans15[,i])
      maxi=max(ForeMeans15[,i],HindMeans15[,i])
      boxplot.with.outlier.label(ForeMeans15[,i]~ForeMeans15$genotype*ForeMeans15$sex, ForeMeans15$mouseID, border=3, cex.axis=0.6, names=c("forenctrl.f","forentg+.f","forenctrl.m","forentg+.m"), xlab="All groups at speed=15", ylab=colnames(ForeMeans15)[i], col=colors()[c(641,640,28,121)], main=colnames(ForeMeans15)[i], at=c(1,3,5,7), xlim=c(1,10), ylim=c(mini-((abs(mini)*20)/100), maxi+((abs(maxi)*20)/100)))
      stripchart(ForeMeans15[,i]~ForeMeans15$genotype*ForeMeans15$sex, vertical=T, cex=0.8, pch=16, col="black", bg="black", add=T, at=c(1,3,5,7))
      savePlot(paste("15cmsPlotAll",colnames(ForeMeans15)[i]), type="png")
    }

I get the following error:

Fehler in text.default(temp_x + move_text_right, temp_y_new, current_label, :

‘labels’ mit Länge 0

or like in English

Error in text.default(temp_x + move_text_right, temp_y_new, current_label, :

‘labels’ with length 0

I also get the error if I use it for just one vector!

I hope you can help me. Am I maybe using the wrong syntax for the function?

Best regards

Lil

I have many NAs showing in the outlier_df output. Is there a way to get rid of the NAs and only show the true outliers? I have tried na.rm=TRUE, but failed. Thank you! (Btw. it’s a cool function!)

Hi Gerit, sorry for the late response.

Can you give a simple example showing your problem? (using the dput function may help)

Cheers,

Tal

Hi Tal,

I am trying to use your script but am getting an error. The call I am using is: boxplot.with.outlier.label(mynewdata, mydata$Name, push_text_right = 1.5, range = 3.0)

where mynewdata holds 5 columns of data with 170 rows and mydata$Name also has 170 rows.

The error is: Error in `[.data.frame`(xx, , y_name) : undefined columns selected

In all your examples you use a formula and I don’t know if this is my problem or not.

The boxplot is created but without any labels.

Thanks for any help you can offer!

Hi Sheri,

I can’t seem to reproduce the example.

Could you use dput, and post a SHORT reproducible example of your error?

Hi Tal,

I wish I could post the output from dput but I get an error when I try to dput or dump (object not found). The script successfully creates a boxplot with labels when I choose a single column such as

boxplot.with.outlier.label(mynewdata$Max, mydata$Name, push_text_right = 1.5, range = 3.0)

and dput produces output for this call.

I can use the script by single columns as it provides me with the names of the outliers which is what I need anyway! Thanks very much for making your work available.

Thank you very much, you helped me a lot!

Hi, I can’t seem to download the sources; WordPress redirects (HTTP 301) the source-URL to http://www.r-statistics.com/all-articles/ . Could you share it once again, please? It looks really useful

Hi Alexander,

You’re right – it seems the file is no longer available.

In the meantime, you can get it from here:

https://www.dropbox.com/s/8jlp7hjfvwwzoh3/boxplot.with.outlier.label.r?dl=0

I’ll fix the post soon.

Best,

Tal

o.k., I fixed it. You can now get it from github:

    source("https://raw.githubusercontent.com/talgalili/R-code-snippets/master/boxplot.with.outlier.label.r")