
How to label all the outliers in a boxplot

In this post I present a function that helps to label outlier observations when plotting a boxplot using R.

An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point that is located outside the fences (“whiskers”) of the boxplot (i.e., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
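
As a rough sketch of that rule (this is not code taken from the function presented in this post), the fences can be computed by hand and compared with what R itself reports through “boxplot.stats”. The variable names are only illustrative. Note that R's boxplot works with Tukey's hinges from “fivenum”, which can differ slightly from the quartiles returned by “quantile”.

```r
set.seed(482)
y <- rnorm(100)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(y)
lower_hinge <- five[2]
upper_hinge <- five[4]
hinge_spread <- upper_hinge - lower_hinge

# Fences at 1.5 times the hinge spread (the default "range" of boxplot)
lower_fence <- lower_hinge - 1.5 * hinge_spread
upper_fence <- upper_hinge + 1.5 * hinge_spread

# Observations outside the fences are the points drawn as outliers
is_outlier <- y < lower_fence | y > upper_fence
which(is_outlier)  # positions of the outliers in y
y[is_outlier]      # their values

# The outlier values as reported by R itself
boxplot.stats(y)$out
```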

Identifying these points in R is very simple when dealing with only one boxplot and a few outliers. That can easily be done using the “identify” function in R. For example, running the code below will plot a boxplot of a hundred observations sampled from a normal distribution, and will then let you click on an outlier point to have its label (in this case, its index number) plotted beside the point:

set.seed(482)
y <- rnorm(100)
boxplot(y)
identify(rep(1, length(y)), y, labels = seq_along(y))

However, this solution is not scalable when dealing with:

  • Many outliers
  • Overlapping data-points, and
  • Multiple boxplots in the same graphic window

For such cases I recently wrote the function “boxplot.with.outlier.label” (which you can download from here). This function operates in a similar way to “boxplot” (with a formula), with the added option of defining “label_name”. When outliers are present, the function will mark all of them using the label_name variable. This function can handle interaction terms and will also try to space the labels so that they won’t overlap (my thanks go to Greg Snow for his function “spread.labs” from the {TeachingDemos} package, and for helpful comments on the R-help mailing list).
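
To give a feel for the general idea, here is a minimal base-R sketch of labelling the outliers of a grouped boxplot. It is only an illustration under simple assumptions (one grouping variable, no missing values, no label spreading) and is not the implementation of “boxplot.with.outlier.label”; all variable names are made up for the example.

```r
set.seed(492)
y     <- rnorm(200)
x1    <- sample(letters[1:2], 200, replace = TRUE)
lab_y <- sample(letters[1:4], 200, replace = TRUE)

# Draw the boxplot and keep the statistics it returns
bp <- boxplot(y ~ x1)

# bp$out holds the outlier values and bp$group the box each one belongs to,
# but the original row numbers are not returned, so recover them per group.
groups <- factor(x1)
labelled_rows <- integer(0)
for (g in seq_along(levels(groups))) {
  in_group <- which(groups == levels(groups)[g])
  out_vals <- bp$out[bp$group == g]
  out_rows <- in_group[y[in_group] %in% out_vals]
  if (length(out_rows) > 0) {  # a group may have no outliers at all
    text(x = g, y = y[out_rows], labels = lab_y[out_rows], pos = 4)
    labelled_rows <- c(labelled_rows, out_rows)
  }
}
```

The check for an empty group matters because “text” stops with an error when it is given zero-length labels.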

Here is some example code you can try out for yourself:

source("https://raw.githubusercontent.com/talgalili/R-code-snippets/master/boxplot.with.outlier.label.r") # Load the function
# sample some points and labels for us:
set.seed(492)
y <- rnorm(2000)
x1 <- sample(letters[1:2], 2000, replace = TRUE)
x2 <- sample(letters[1:2], 2000, replace = TRUE)
lab_y <- sample(letters[1:4], 2000, replace = TRUE)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x2*x1, lab_y)

Here is the resulting graph:

You can also run the following code to see how it handles simpler cases:

# plot a boxplot without interactions:
boxplot.with.outlier.label(y~x1, lab_y, ylim = c(-5,5))
# plot a boxplot of y only
boxplot.with.outlier.label(y, lab_y, ylim = c(-5,5))
boxplot.with.outlier.label(y, lab_y, spread_text = FALSE) # here the labels will overlap (because I turned spread_text off)

Here is the output of the last example, showing how the plot looks when we allow for the text to overlap (we would often prefer to NOT allow it).

Figure: boxplot with one group and identified outliers (allowing label overlap)

Regarding package dependencies: note that this function requires you to first install the packages {TeachingDemos} (by Greg Snow) and {plyr} (by Hadley Wickham).
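
A small check along these lines can tell you whether both packages are already available before you source the function. This is just a convenience sketch; the variable names are illustrative.

```r
needed <- c("TeachingDemos", "plyr")

# requireNamespace() returns FALSE (instead of failing) when a package is absent
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          "; install them with install.packages() before sourcing the function.")
} else {
  message("Both packages are available.")
}
```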

Updates:
19.04.2011 – I’ve added support for the boxplot “names” and “at” parameters.

You are very much invited to leave your comments if you find a bug, think of ways to improve the function, or simply enjoyed it and would like to share it with me.

25 thoughts on “How to label all the outliers in a boxplot”

  1. The exact sample code. Re-running caused me to find the bug, which was silent.

    “require(plyr)” needs to be before the “is.formula” call.

    Fixing that, I get the labels now.

  2. I’ve done something similar, with a slight difference.

    Tukey advocated different plotting symbols for outliers and extreme outliers, so I only label extreme outliers (roughly 3.0 * IQR instead of 1.5 * IQR).

      1. Hi

        I have some code for a boxplot with outliers and extreme outliers.

        I wrote this code quickly, to teach this type of boxplot in the classroom.

        The code is this:

        # -- draw extreme values in a boxplot

        datos=iris[[2]]^5 # build a variable with extreme values
        boxplot(datos) # draw the boxplot

        dc=boxplot(datos,plot=F) # store the boxplot in dc, without drawing it again
        attach(dc)
        if (length(out)>0)
        { # separate the different elements, for convenience
          for (i in 1:length(out)) # start a loop that does the same for each outlying value
          # what it does goes between braces
          {
            if (out[i]>4*stats[4,group[i]]-3*stats[2,group[i]] | out[i]<4*stats[2,group[i]]-3*stats[4,group[i]])
            # a condition (more than 3 hinge spreads beyond a hinge); if it holds, run what is between braces
            {
              points(group[i],out[i],col="white") # erase the previous point
              points(group[i],out[i],pch=4) # draw the new point
            }
          }
          rm(i)
        } # end of the if
        detach(dc) # undo the separation of the elements of dc
        rm(dc) # delete dc
        # finished drawing the extreme values

        I apologise for not writing better English.

        X.M.

        1. Thanks X.M.,
          Maybe I should add some notation for extreme outliers.

          p.s.: I updated the code to allow changing the “range” parameter (i.e., controlling the length of the fences)

          Cheers,
          Tal

  3. Another bug. For some seeds, I get an error, and the labels are not all drawn. For example, set the seed to 42.

    > set.seed(42)
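    > y <- rnorm(2000)
    > x1 <- sample(letters[1:2], 2000,T)
    > x2 <- sample(letters[1:2], 2000,T)
    > lab_y <- sample(letters[1:4], 2000,T)
    > # plot a boxplot with interactions: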
    > boxplot.with.outlier.label(y~x2*x1, lab_y)
    Error in text.default(temp_x + 0.19, temp_y_new, current_label, col = label.col) :
    zero length ‘labels’

    1. Albert – thanks for the second catch. :)

      I found the bug (the function didn’t know what to do when a subgroup had no outliers).
      It is now fixed and the updated code is uploaded to the site.

  4. Looks very nice! Only wish it was in ggplot2, which is the way to display graphs I use all the time. But very handy nonetheless!

  5. After the last line of the second code block, I get this error:

    > boxplot.with.outlier.label(y~x2*x1, lab_y)
    Error in model.frame.default(y) : object is not a matrix

    Checking str(y) it is a num[1:2000]…

    1. Thanks Jon,
      I found the bug and fixed it (it was introduced by the major extension that deals with cases of identical y values)

      Best,
      Tal

  6. Good function, it can be very helpful.

    Unfortunately it seems it won’t work when you have different number of data in your groups because of missing values.

    Maybe an idea for further improvement ?

  7. Thanks for the code.
    I have some trouble using it. When I use the function as follows:

    for(i in c(4,5,7:34,36:43))
    {
    mini=min(ForeMeans15[,i],HindMeans15[,i] )
    maxi=max(ForeMeans15[,i],HindMeans15[,i])

    boxplot.with.outlier.label(ForeMeans15[,i]~ForeMeans15$genotype*ForeMeans15$sex, ForeMeans15$mouseID, border=3, cex.axis=0.6,names=c("forenctrl.f","forentg+.f", "forenctrl.m","forentg+.m"), xlab="All groups at speed=15", ylab=colnames(ForeMeans15)[i], col=colors()[c(641,640,28,121)], main= colnames(ForeMeans15)[i], at=c(1,3,5,7), xlim=c(1,10), ylim=c(mini-((abs(mini)*20)/100), maxi+((abs(maxi)*20)/100)))
    stripchart(ForeMeans15[,i]~ForeMeans15$genotype*ForeMeans15$sex,vertical =T, cex=0.8, pch=16, col="black", bg="black", add=T, at=c(1,3,5,7))

    savePlot(paste("15cmsPlotAll",colnames(ForeMeans15)[i]), type="png")
    }

    I get the following error:
    Fehler in text.default(temp_x + move_text_right, temp_y_new, current_label, :
    ‘labels’ mit Länge 0
    or like in English
    Error in text.default(temp_x + move_text_right, temp_y_new, current_label, :
    ‘labels’ with length 0
    I also get the error if I use it for just one vector!

    I hope you can help me. Am I maybe using the wrong syntax for the function?

    Best regards

    Lil

  8. I have many NAs showing in the outlier_df output. Is there a way to get rid of the NAs and only show the true outliers? I have tried na.rm=TRUE, but failed. Thank you! (Btw. it’s a cool function!)

  9. Hi Tal,

    I am trying to use your script but am getting an error. The call I am using is: boxplot.with.outlier.label(mynewdata, mydata$Name, push_text_right = 1.5, range = 3.0)

    where mynewdata holds 5 columns of data with 170 rows and mydata$Name also has 170 rows.

    The error is: Error in `[.data.frame`(xx, , y_name) : undefined columns selected

    In all your examples you use a formula and I don’t know if this is my problem or not.

    The boxplot is created but without any labels.

    Thanks for any help you can offer!

      1. Hi Tal,
        I wish I could post the output from dput but I get an error when I try to dput or dump (object not found). The script successfully creates a boxplot with labels when I choose a single column such as

        boxplot.with.outlier.label(mynewdata$Max, mydata$Name, push_text_right = 1.5, range = 3.0)

        and dput produces output for this call.

        I can use the script by single columns as it provides me with the names of the outliers which is what I need anyway! Thanks very much for making your work available.

    1. o.k., I fixed it. You can now get it from github:

      source("https://raw.githubusercontent.com/talgalili/R-code-snippets/master/boxplot.with.outlier.label.r")
