How to label all the outliers in a boxplot

In this post I present a function that helps to label outlier observations when plotting a boxplot using R.

An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point located outside the fences (“whiskers”) of the boxplot (e.g., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
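The 1.5 × IQR rule described above can be checked by hand. The following is a minimal sketch, assuming the default settings of boxplot(), which measures the box with Tukey's hinges (as returned by fivenum()) rather than with quantile(); the variable names are only illustrative.

```r
# Sketch: reproduce the 1.5 * IQR rule that boxplot() applies by default.
set.seed(482)
y <- rnorm(100)

coef <- 1.5                # plays the role of boxplot()'s `range` argument
five <- fivenum(y)         # min, lower hinge, median, upper hinge, max
iqr  <- five[4] - five[2]  # distance between the two hinges

lower_fence <- five[2] - coef * iqr
upper_fence <- five[4] + coef * iqr

# An observation is an outlier if it falls outside either fence
is_outlier <- y < lower_fence | y > upper_fence

which(is_outlier)          # positions of the outlying observations
y[is_outlier]              # their values, the same points as boxplot.stats(y)$out
```

Setting coef to 3 instead of 1.5 would flag only the more extreme points.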

Identifying these points in R is very simple when dealing with only one boxplot and a few outliers; it can easily be done with R’s “identify” function. For example, running the code below plots a boxplot of a hundred observations sampled from a normal distribution and then lets you click on an outlier to have its label (in this case, its index number) plotted beside the point:

set.seed(482)
y <- rnorm(100)
boxplot(y)
identify(rep(1, length(y)), y, labels = seq_along(y))

However, this solution is not scalable when dealing with:

  • Many outliers
  • Overlapping data-points, and
  • Multiple boxplots in the same graphic window

For such cases I recently wrote the function “boxplot.with.outlier.label” (which you can download from here). This function operates in a similar way to “boxplot” (with a formula), with the added option of defining “label_name”. When outliers are present, the function goes on to mark all of them using the label_name variable. The function can handle interaction terms and will also try to space the labels so that they won’t overlap (my thanks go to Greg Snow for his function “spread.labs” from the {TeachingDemos} package, and for his helpful comments on the R-help mailing list).

Here is some example code you can try out for yourself:

source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt") # Load the function
# sample some points and labels for us:
set.seed(492)
y <- rnorm(2000)
x1 <- sample(letters[1:2], 2000, replace = TRUE)
x2 <- sample(letters[1:2], 2000, replace = TRUE)
lab_y <- sample(letters[1:4], 2000, replace = TRUE)
# plot a boxplot with interactions:
boxplot.with.outlier.label(y~x2*x1, lab_y)

Here is the resulting graph:

You can also run the following code to see how it handles simpler cases:

# plot a boxplot without interactions:
boxplot.with.outlier.label(y~x1, lab_y, ylim = c(-5,5))
# plot a boxplot of y only
boxplot.with.outlier.label(y, lab_y, ylim = c(-5,5))
boxplot.with.outlier.label(y, lab_y, spread_text = FALSE) # here the labels will overlap (because I turned spread_text off)

Here is the output of the last example, showing how the plot looks when we allow for the text to overlap (we would often prefer to NOT allow it).

Boxplot with one group and identified outliers (allowing label overlap)
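To give a feel for what avoiding overlap involves, here is a naive sketch of label spreading. It is not the algorithm used by “spread.labs” and not what the function above does internally: spread_min_gap is a hypothetical helper that only ever pushes labels upward until neighbouring labels are at least min_gap apart.

```r
# Naive sketch (hypothetical helper, NOT spread.labs): push label positions
# upward until consecutive labels are at least `min_gap` apart.
spread_min_gap <- function(y, min_gap) {
  o <- order(y)                 # process positions from lowest to highest
  s <- y[o]
  for (i in seq_along(s)[-1]) {
    # keep the position if it is far enough from the label below,
    # otherwise move it up to the minimum allowed distance
    s[i] <- max(s[i], s[i - 1] + min_gap)
  }
  res <- numeric(length(y))
  res[o] <- s                   # return positions in the original order
  res
}

crowded <- c(1, 1.05, 3, 1.02)          # three labels nearly on top of each other
spaced  <- spread_min_gap(crowded, 0.2) # y positions at which to draw the text
```

The points themselves stay where they are; only the y coordinates passed to text() would change.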

Regarding package dependencies: notice that this function requires you to first install the packages {TeachingDemos} (by Greg Snow) and {plyr} (by Hadley Wickham).
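If you cannot install these packages, the core idea can be approximated in base R alone. The sketch below is not the code of boxplot.with.outlier.label: label_outliers is a hypothetical helper, it handles a single grouping variable only, and it makes no attempt to keep labels from overlapping.

```r
# Base-R sketch (hypothetical helper, not the function from this post):
# draw a grouped boxplot and write a label next to every outlier.
label_outliers <- function(y, g, labels, coef = 1.5, ...) {
  g <- factor(g)
  boxplot(y ~ g, range = coef, ...)

  # Flag outliers within each group with the same rule boxplot() uses
  is_out <- unsplit(
    lapply(split(y, g), function(v) v %in% boxplot.stats(v, coef = coef)$out),
    g
  )

  # text() fails on zero-length labels, so only call it when there are outliers
  if (any(is_out)) {
    text(as.integer(g)[is_out], y[is_out], labels[is_out], pos = 4, cex = 0.7)
  }

  # Return the labelled points invisibly for further inspection
  invisible(data.frame(group = g[is_out], y = y[is_out], label = labels[is_out]))
}

set.seed(492)
y     <- rnorm(2000)
x1    <- sample(letters[1:2], 2000, replace = TRUE)
lab_y <- sample(letters[1:4], 2000, replace = TRUE)

out_df <- label_outliers(y, x1, lab_y, ylim = c(-5, 5))
head(out_df)
```

The boxes are drawn at x positions 1, 2, ... in the order of the factor levels, which is why as.integer(g) can serve as the x coordinate of each label.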

Updates:
19.04.2011 – I’ve added support to the boxplot “names” and “at” parameters.

You are very much invited to leave your comments if you find a bug, think of ways to improve the function, or simply enjoyed it and would like to share it with me.

  • Albert

    Could be a bug. Getting boxplots but no labels on Mac OS X 10.6.6 with R 2.11.1.

  • http://www.talgalili.com Tal Galili

    Hi Albert, what code are you running and do you get any errors?

  • Albert

    The exact sample code. Re-running caused me to find the bug, which was silent.

    “require(plyr)” needs to be before the “is.formula” call.

    Fixing that, I get the labels now.

    • http://www.talgalili.com Tal Galili

      Thanks Albert, Good catch.

      I thought is.formula was part of R. I fixed it now.

  • Kevin Wright

    I’ve done something similar, with a slight difference.

    Tukey advocated different plotting symbols for outliers and extreme outliers, so I only label extreme outliers (roughly 3.0 * IQR instead of 1.5 * IQR).

    • http://www.talgalili.com Tal Galili

      Hi Kevin,

      That’s a good idea. Let me know if you have any code I could look at to see how you implemented it.

      Cheers,
      Tal

      • Xose M.

        Hi

        I have some code for a boxplot with outliers and extreme outliers.

        I wrote this code quickly, to teach this type of boxplot in the classroom.

        The code is this:

        # -- draw extreme outliers in a boxplot

        datos <- iris[[2]]^5  # build a variable with extreme values
        boxplot(datos)        # draw the boxplot

        dc <- boxplot(datos, plot = FALSE)  # store the boxplot statistics in dc without redrawing
        if (length(dc$out) > 0)
        {
          for (i in seq_along(dc$out))  # loop that does the same for each outlier
          {
            q1 <- dc$stats[2, dc$group[i]]  # lower hinge of this outlier's box
            q3 <- dc$stats[4, dc$group[i]]  # upper hinge of this outlier's box
            # extreme outlier: more than 3 * IQR beyond a hinge
            if (dc$out[i] > 4*q3 - 3*q1 || dc$out[i] < 4*q1 - 3*q3)
            {
              points(dc$group[i], dc$out[i], col = "white")  # erase the original point
              points(dc$group[i], dc$out[i], pch = 4)        # draw the new point (a cross)
            }
          }
          rm(i, q1, q3)
        } # end of if
        rm(dc)  # remove dc
        # finished drawing the extreme outliers

        I apologise for not writing better English.

        X.M.

        • http://www.talgalili.com Tal Galili

          Thanks X.M.,
          Maybe I should add some notation for extreme outliers.

          p.s.: I updated the code to allow changing the “range” parameter (i.e., controlling the length of the fences)

          Cheers,
          Tal

  • Albert

    Another bug. For some seeds, I get an error, and the labels are not all drawn. For example, set the seed to 42.

    > set.seed(42)
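    > y <- rnorm(2000)
    > x1 <- sample(letters[1:2], 2000,T)
    > x2 <- sample(letters[1:2], 2000,T)
    > lab_y <- sample(letters[1:4], 2000,T)
    > # plot a boxplot with interactions: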
    > boxplot.with.outlier.label(y~x2*x1, lab_y)
    Error in text.default(temp_x + 0.19, temp_y_new, current_label, col = label.col) :
    zero length ‘labels’

    • http://www.talgalili.com Tal Galili

      Albert – thanks for the second catch. :)

      I found the bug (the function didn’t know what to do when a subgroup had no outliers).
      It is now fixed and the updated code is uploaded to the site.

  • EMil BB

    Looks very nice! Only wish it was in ggplot2, which is the way to display graphs I use all the time. But very handy nonetheless!

  • jon w

    After the last line of the second code block, I get this error:

    > boxplot.with.outlier.label(y~x2*x1, lab_y)
    Error in model.frame.default(y) : object is not a matrix

    Checking str(y) it is a num[1:2000]…

    • http://www.talgalili.com Tal Galili

      Thanks Jon,
      I found the bug and fixed it (the bug was introduced after the major extension introduced to deal with cases of identical y values – it is now fixed)

      Best,
      Tal

      • jon w

        Thanks, I’ll be using this already :)

  • antoine

    Good function, it can be very helpful.

    Unfortunately it seems it won’t work when the groups have different numbers of observations because of missing values.

    Maybe an idea for further improvement ?

  • Lili

    Thanks for the code.
    I have some trouble using it. When I use the function as follows:

    for(i in c(4,5,7:34,36:43))
    {
    mini=min(ForeMeans15[,i],HindMeans15[,i] )
    maxi=max(ForeMeans15[,i],HindMeans15[,i])

    boxplot.with.outlier.label(ForeMeans15[,i]~ForeMeans15$genotype*ForeMeans15$sex, ForeMeans15$mouseID, border=3, cex.axis=0.6,names=c("fore\nctrl.f","fore\ntg+.f", "fore\nctrl.m","fore\ntg+.m"), xlab="All groups at speed=15", ylab=colnames(ForeMeans15)[i], col=colors()[c(641,640,28,121)], main= colnames(ForeMeans15)[i], at=c(1,3,5,7), xlim=c(1,10), ylim=c(mini-((abs(mini)*20)/100), maxi+((abs(maxi)*20)/100)))
    stripchart(ForeMeans15[,i]~ForeMeans15$genotype*ForeMeans15$sex,vertical =T, cex=0.8, pch=16, col="black", bg="black", add=T, at=c(1,3,5,7))

    savePlot(paste("15cmsPlotAll",colnames(ForeMeans15)[i]), type="png")
    }

    I get the following error:
    Fehler in text.default(temp_x + move_text_right, temp_y_new, current_label, :
    ‘labels’ mit Länge 0
    or like in English
    Error in text.default(temp_x + move_text_right, temp_y_new, current_label, :
    ‘labels’ with length 0
    I also get the error if I use it for just one vector!

    I hope you can help me. Am I maybe using the wrong syntax for the function?

    Best regards

    Lil

  • Gerit

    I have many NAs showing in the outlier_df output. Is there a way to get rid of the NAs and only show the true outliers? I have tried na.rm=TRUE, but failed. Thank you! (Btw. it’s a cool function!)

    • http://www.r-statistics.com/ Tal Galili

      Hi Gerit, sorry for the late response.

      Can you give a simple example showing your problem? (using the dput function may help)

      Cheers,
      Tal

  • Sheri

    Hi Tal,

    I am trying to use your script but am getting an error. The call I am using is: boxplot.with.outlier.label(mynewdata, mydata$Name, push_text_right = 1.5, range = 3.0)

    where mynewdata holds 5 columns of data with 170 rows and mydata$Name is also 170rows.

    The error is: Error in `[.data.frame`(xx, , y_name) : undefined columns selected

    In all your examples you use a formula and I don’t know if this is my problem or not.

    The boxplot is created but without any labels.

    Thanks for any help you can offer!

    • http://www.r-statistics.com/ Tal Galili

      Hi Sheri,
      I can’t seem to reproduce the example.
      Could you use dput, and post a SHORT reproducible example of your error?

      • Sheri

        Hi Tal,
        I wish I could post the output from dput but I get an error when I try to dput or dump (object not found). The script successfully creates a boxplot with labels when I choose a single column such as

        boxplot.with.outlier.label(mynewdata$Max, mydata$Name, push_text_right = 1.5, range = 3.0)

        and dput produces output for this call.

        I can use the script by single columns as it provides me with the names of the outliers which is what I need anyway! Thanks very much for making your work available.

  • cam

    Thank you very much, you helped me a lot!!!