Quantile LOESS – Combining a moving quantile window with LOESS (R function)

In this post I will provide R code that implement’s the combination of repeated running quantile with the LOESS smoother to create a type of “quantile LOESS” (e.g: “Local Quantile Regression”).

This method is useful when the need arise to fit robust and resistant (Need to be verified) a smoothed line for a quantile (an example for such a case is provided at the end of this post).

If you wish to use the function in your own code, simply run inside your R console the following line:

source("http://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")

Background

I came a cross this idea in an article titled “High throughput data analysis in behavioral genetics” by Anat Sakov, Ilan Golani, Dina Lipkind and my advisor Yoav Benjamini. From the abstract:

In recent years, a growing need has arisen in different fields, for the development of computational systems for automated analysis of large amounts of data (high-throughput). Dealing with non-standard noise structure and outliers, that could have been detected and corrected in manual analysis, must now be built into the system with the aid of robust methods. […] we use a non-standard mix of robust and resistant methods: LOWESS and repeated running median.

The motivation for this technique came from “Path data” (of mice) which is

prone to suffer from noise and outliers. During progression a tracking system might lose track of the animal, inserting (occasionally very large) outliers into the data. During lingering, and even more so during arrests, outliers are rare, but the recording noise is large relative to the actual size of the movement. The statistical implications are that the two types of behavior require different degrees of smoothing and resistance. An additional complication is that the two interchange many times throughout a session. As a result, the statistical solution adopted needs not only to smooth the data, but also to recognize, adaptively, when there are arrests. To the best of our knowledge, no single existing smoothing technique has yet been able to fulfill this dual task. We elaborate on the sources of noise, and propose a mix of LOWESS (Cleveland, 1977) and the repeated running median (RRM; Tukey, 1977) to cope with these challenges

If all we wanted to do was to perform moving average (running average) on the data, using R, we could simply use the rollmean function from the zoo package.
But since we wanted also to allow quantile smoothing, we turned to use the rollapply function.

R function for performing Quantile LOESS

Here is the R function that implements the LOESS smoothed repeated running quantile (with implementation for using this with a simple implementation for using average instead of quantile):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
# This code relies on the rollapply function from the "zoo" package.  My thanks goes to Achim Zeileis and Gabor Grothendieck for their work on the package.
Quantile.loess<- function(Y, X = NULL, 
							number.of.splits = NULL,
							window.size = 20,
							percent.of.overlap.between.two.windows = NULL,
							the.distance.between.each.window = NULL,
							the.quant = .95,
							window.alignment = c("center"), 
							window.function = function(x) {quantile(x, the.quant)},
							# If you wish to use this with a running average instead of a running quantile, you could simply use:
							# window.function = mean,
							...)
{
	# input: Y and X, and smothing parameters
	# output: new y and x
 
	# Extra parameter "..." goes to the loess	
 
	# window.size ==  the number of observation in the window (not the window length!)
 
	# "number.of.splits" will override "window.size"
	# let's compute the window.size:	
	if(!is.null(number.of.splits)) {window.size <- ceiling(length(Y)/number.of.splits)}
 
	# If the.distance.between.each.window is not specified, let's make the distances fully distinct
	if(is.null(the.distance.between.each.window)) {the.distance.between.each.window <- window.size}
 
	# If percent.of.overlap.between.windows is not null, it will override the.distance.between.each.window 
	if(!is.null(percent.of.overlap.between.two.windows)) 
		{
			the.distance.between.each.window <- window.size * (1-percent.of.overlap.between.two.windows)
		}
 
 
 
	# loading zoo
	if(!require(zoo)) 	
	{
		print("zoo is not installed - please install it.")
		install.packages("zoo")
	}
 
 
	if(is.null(X)) {X <- index(Y)} # if we don't have any X, then Y must be ordered, in which case, we can use the indexes of Y as X.
 
	# creating our new X and Y
	zoo.Y <- zoo(x = Y, order.by = X)
	#zoo.X <- attributes(zoo.Y)$index
 
	new.Y <- rollapply(zoo.Y, width = window.size, 
								FUN = window.function,
								by = the.distance.between.each.window,
								align = window.alignment)
	new.X <- attributes(new.Y)$index	
	new.Y.loess <- loess(new.Y~new.X, family = "sym",...)$fitted 
 
	return(list(y = new.Y, x = new.X, y.loess = new.Y.loess))
}

More on the math of the algorithm can be found in the original article.

Example: Predicting “worst case scenario” Ozone levels using temperature

The following example uses the “airquality” dataset which gives us “Daily air quality measurements in New York, May to September 1973.” With several variables, we will only look at Ozone level and Temperature.
Since high Ozone levels reduces the air quality we breath, I would like to give a prediction of the predicted “worst case” Ozone level (e.g: 95% Ozone level) using to the temperature of the same day.

How would you try to do something like that?

The first solution would be to use the “rq” function from the Quantile Regression R package, but if we where to look at the data, we would see that fitting a straight line is not suitable for our data (since we have a sharp change in slope around the temperature of 80 degrees).
This is a situation where Quantile LOESS (of 95%) might prove to be useful. Here is the code to produce the above plot.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
 
data(airquality)
attach(airquality)
 
no.na <- !(is.na(Ozone) | is.na(Temp))
Ozone.2 <- Ozone[no.na]
Temp.2 <- jitter(Temp[no.na])
 
plot(Ozone ~ Temp, main = "Predicting the 95% Ozone level according to Temperature")
 
# fitting the Quantile regression
require(quantreg)
abline(rq(Ozone ~ Temp, tau = .95), col = "red")
 
# fitting the Quantile LOESS
source("http://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
QL <- Quantile.loess(Y = Ozone.2, X = Temp.2, 
							the.quant = .95,
							window.size = 10,
							window.alignment = c("center"))
points(QL$y.loess ~ QL$x, type = "l", col = "green")
 
legend("topleft",legend = c("95% Quantile regression", "95% Quantile LOESS"), fill = c("red","green"))

Update: I changed in the article’s name from LOWESS to LOESS

After A considerate e-mail from Dirk Eddelbuettel I corrected myself from using LOWESS to LOESS throughout the article. Here’s an explanation to why I did it and also why I corrected it –

Dirk wrote to me:

You have a post entitled ‘quantile lowess’ but you then (correctly) use loess. Do you understand that there are two functions lowess() and loess()?
The former is sort-of a predecessor but nobody but really old books still talks about it. Google for (maybe) ‘Brian Ripley lowess loess’ as he drove
that point home a few times on r-help.

My answer was:

Thanks Dirk, […]
Regarding the loess != lowess, I noticed that this is indeed the case when I first wrote the post but I was in a predicament:
On the one hand, LOESS is the more modern approach (and what I used in the script). But on the other hand, LOWESS is what the original article’s authors where using. I ended up deciding I would call it the way I did, but after reading what you wrote, I realized I made a mistake.
I went through the article and corrected the lowess to loess, while also adding a paragraph for explain my reasoning.

Update: regarding the method being robust

After Nicholas’s comment I went checking and came across a R-help thread by
Martin Maechler explaining how to update my code from above so that the system will be robust. Martin wrote (My notes are added in []):
One gotcha [when comparing lowess to loess is]– particularly if you were used to the fact that lowess() by default is resistant to outliers {well, in many cases at least} :

  • lowess() per default has “iter = 3″ which means it uses 3 “robustifying” (also called “huberizing” for Huber (1960)) iterations .
  • loess() on the other hand has an argument `family’ with possible values “gaussian” and “symmetric” (can be abbreviated) where the *first* one is the default (unfortunately, in my opinion).

I.e., loess() by default is not resistant/robust where as lowess() is. […] I would however recommend using loess(….., family = “sym”) routinely.

* * *

If you find this code useful, please let me know about it in the comments.

  • Nicholas

    The is not a robust procedure. For robust quantile smoothing look at the cobs package on CRAN. Also look at Roger Koencker’s package, quantreg, for more quantile smoothing.

    • http://www.talgalili.com Tal Galili

      Hi Nicholas,
      Thank you for the pointers for the packages.

      May I ask why do you think this procedure is not robust ? (I am not saying you are wrong, it is just that I am not sure how exactly to define robustness in this case – and thus how to say the method doesn’t hold to it)

      Best,
      Tal

      • Nicholas

        Hi,
        Well lowess is not robust, there is a not too pathological counter example in the lowess examples I believe, and a thread from a couple years ago by Martin Maechler. But also intuitively the procedure you are proposing seems a little dodgy. You take the full data, and throw away most of it to do a running smoother on the quantiles? Generally one would want to do some procedure that uses all the data, but weights things accordingly to get a smooth quantile. Look at some of Paul Eilers papers on asymetric least squares approaches. For a robust approach, one can use L1 instead of L2 loss.

        Hope that helps (although I will not vouch for the accuracy of everything I said, ie comes with no warranty),

        Nicholas

        (This comment was copied from an e-mail, with the approval of the author)

        • http://www.talgalili.com Tal Galili

          Thanks Nicholas,

          Regarding robustness, since this is a mix of both LOWESS and quantile – I assume some measure of robustness will be held because of the use of quantiles.

          Regarding the fact that there might be a way to use more data points – that is a good question – I would like to see what others have done with it (but I don’t think I’ll go into this on my own right now).

          Regarding the use of quantreg – I searched the package documentation and wasn’t able to find something that gives a quantile smoother (I found reference there for quantile spline smoothers, but didn’t find the function to use it – maybe I missed something).

          In any case, thank you very much for your comment and input!

          Tal

  • Lu Hu

    Very useful code. Thank you for the contribution. But is there a way to make this moving quantile based on time serial, instead of a fixed window size?

    • http://www.talgalili.com Tal Galili

      Hello Lu Hu,
      Thank you for the compliment.

      What do you mean with having the window like ” time serial” ?
      Do you have an example of that? (an article in English?)

  • Lu Hu

    Like the Figure 1 and 2 in this article (
    http://www.cnr.berkeley.edu/~ahg/pubs/Season95.pdf )

    for example, running quantile of the data in 30 days intervals

    Thanks.

    • http://www.talgalili.com Tal Galili

      Hi Li hu,
      I think I understood you now.
      Actually, what the function does is not use (what you call) a “fixed window size”, but a changing window size.
      That is, when you use the “window.size” parameter in the function, what you are changing is the number of data points that each window has.
      So, for example, if you set the function to use “window.size=30″, you will get a window that will always have 30 days.
      A harder question would be “assuming I don’t always have 30 days, but I want the window to move in 30 days each time. So that the number of data points per window will change – can that be done?” – The answer to this is no. The function (currently) can’t do this.

      • Lu Hu

        I see. Thank you very much.