Correlation scatter-plot matrix for ordered-categorical data

When analyzing a questionnaire, one often wants to view the correlation between two or more Likert questionnaire items (for example, two ordered-categorical vectors ranging from 1 to 5).

When dealing with several such Likert variables, a clear presentation of all the pairwise relations between our variables can be achieved by inspecting the (Spearman) correlation matrix (easily achieved in R by using the “cor” command with method = "spearman" on a matrix of variables; “cor.test” handles one pair of variables at a time).
Yet, a challenge appears once we wish to plot this correlation matrix. The classic presentation for a correlation matrix is a scatter plot matrix, but scatter plots don’t (usually) work well for ordered categorical vectors, since the dots on the scatter plot often overlap each other.
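
As a minimal sketch of the correlation-matrix step, the snippet below builds a small made-up data set (the item names q1 to q4 and the way they are generated are assumptions for illustration only) and computes the Spearman matrix with base R:

```r
# Hypothetical Likert data: four items scored 1 to 5 by 100 respondents.
set.seed(1)
q1 <- sample(1:5, 100, replace = TRUE)
q2 <- sample(1:5, 100, replace = TRUE)
q3 <- pmin(5, pmax(1, q1 + sample(-1:1, 100, replace = TRUE)))  # tracks q1
q4 <- 6 - q3                                                     # reverse-coded q3
likert <- data.frame(q1, q2, q3, q4)

# cor() accepts a whole matrix or data frame and returns every pairwise rho.
rho <- cor(likert, method = "spearman")
print(round(rho, 2))

# cor.test() works on one pair at a time; exact = FALSE asks for the
# asymptotic p-value, since tied ranks rule out the exact one.
one.pair <- cor.test(likert$q1, likert$q3, method = "spearman", exact = FALSE)
print(one.pair$p.value)
```

The matrix gives the numbers, but not the shape of each relationship, which is why a plot is still worth having.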

There are four solutions for the point-overlap problem that I know of:

  1. Jitter the data a bit to give a sense of the “density” of the points
  2. Use a color spectrum to show when a point actually represents “many points”
  3. Use different point sizes to show when there are “many points” at the location of that point
  4. Add a LOWESS (or LOESS) line to the scatter plot – to show the trend of the data
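
To make the overlap problem concrete, here is a rough sketch on made-up data (the vectors x and y are assumptions, not the post's example) that counts how few distinct locations the points occupy, then tries solutions 1 and 4 with base R only:

```r
# Hypothetical data: two 5-point items from 100 respondents.
set.seed(1)
x <- sample(1:5, 100, replace = TRUE)
y <- sample(1:5, 100, replace = TRUE)

# 100 observations can land on at most 5 x 5 = 25 distinct locations.
n.locations <- nrow(unique(data.frame(x, y)))
print(n.locations)

# The cell counts are what solutions 2 and 3 map to color or point size.
counts <- table(x, y)
print(counts)

# Solution 1: jitter() adds small uniform noise so stacked points separate.
# Solution 4: lowess() returns the smoothed trend as a list with $x and $y.
plot(jitter(x), jitter(y), xlab = "x (jittered)", ylab = "y (jittered)")
lines(lowess(x, y), col = "red")
```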

In this post I will offer the code for a solution that combines solutions 3 and 4 (and possibly 2; please read the comments on this post). Here is the output:

And here is the code to produce this plot:

R code for producing a Correlation scatter-plot matrix – for ordered-categorical data

Note that this code will also work fine for continuous data points (although I would suggest enlarging the “point.size.rescale” parameter in the “panel.smooth.ordered.categorical” function to something bigger than 1.5).

# -----------------
# Functions
# -----------------
 
# Upper panel: prints Spearman's rho (text size scaled by |rho|) plus significance stars.
# The "..." absorbs extra arguments that pairs() forwards to every panel.
panel.cor.ordered.categorical <- function(x, y, digits=2, prefix="", cex.cor, ...) 
{
 
    usr <- par("usr"); on.exit(par(usr = usr)) # restore the panel coordinates on exit
    par(usr = c(0, 1, 0, 1)) 
 
    r <- abs(cor(x, y, method = "spearman")) # notice we use Spearman (non-parametric) correlation here
    r.no.abs <- cor(x, y, method = "spearman")
 
 
    txt <- format(c(r.no.abs , 0.123456789), digits=digits)[1] 
    txt <- paste(prefix, txt, sep="") 
    # use cex.cor when it is supplied; otherwise fit the text to the panel width
    cex <- if(missing(cex.cor)) 0.8/strwidth(txt) else cex.cor
 
    test <- cor.test(x,y, method = "spearman") 
    # borrowed from printCoefmat
    Signif <- symnum(test$p.value, corr = FALSE, na = FALSE, 
                  cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                  symbols = c("***", "**", "*", ".", " ")) 
 
    text(0.5, 0.5, txt, cex = cex * r) 
    text(.8, .8, Signif, cex=cex, col=2) 
}
 
 
 
 
# Lower panel: circles sized by the number of observations at each (x, y), plus a LOWESS line
panel.smooth.ordered.categorical <- function (x, y, col = par("col"), bg = NA, pch = par("pch"), 
                                              cex = 1, col.smooth = "red", span = 2/3, iter = 3, 
                                              point.size.rescale = 1.5, ...) 
{
    #require(colorspace)
    # z[i] = number of observations that share the location of point i.
    # Counting by key keeps z in the same order as x and y; merge() does not
    # preserve row order, so merged counts can end up attached to the wrong points.
    key <- paste(x, y)
    z <- as.numeric(table(key)[key])
    #the.col <- heat_hcl(length(x))[z]
    z <- point.size.rescale*z/ (length(x)) # notice how we rescale the dots according to the maximum z could have gotten
 
    symbols( x, y,  circles = z,
            inches = FALSE, bg = "grey", #the.col ,
            fg = bg, add = TRUE)
 
    # points(x, y, pch = pch, col = col, bg = bg, cex = cex)
    ok <- is.finite(x) & is.finite(y)
    if (any(ok)) 
        lines(stats::lowess(x[ok], y[ok], f = span, iter = iter), 
            col = col.smooth, ...)
}
 
 
# Diagonal panel: histogram of each variable, scaled to a maximum height of 1
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE, breaks = 20)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col="orange", ...)
}
 
 
pairs.ordered.categorical <- function(xx, ...)
{
    pairs(xx, 
          diag.panel = panel.hist,
          lower.panel = panel.smooth.ordered.categorical,
          upper.panel = panel.cor.ordered.categorical,
          cex.labels = 1.5, ...) 
}
 
 
 
 
# -----------------
# Example
# -----------------
 
set.seed(666)
a1 <- sample(1:5, 100, replace = TRUE)
a2 <- sample(1:5, 100, replace = TRUE)
a3 <- round(jitter(a2, 7) )
a3[a3 < 1 | a3 > 5] <- 3  # pull out-of-range values back to the scale midpoint
a4 <- 6-round(jitter(a1, 7) )
a4[a4 < 1 | a4 > 5] <- 3
 
aa <- data.frame(a1,a2,a3, a4)
 
# plotting :)
pairs.ordered.categorical(aa)

Credits:

  • The original R code for the correlation matrix plot was taken from R Graph Gallery. The differences are: 1) the use of Spearman correlation; 2) the addition of the histogram panel; and 3) the changing of point sizes.
  • The idea to use symbols for changing the point sizes was offered by Doug Y’barbo.
    Thanks also to Dirk Eddelbuettel for suggesting the use of cex (although I ended up not using that).

If you have ideas on how to improve this code (or on reproducing it with ggplot2 or lattice), please share them in the comments (or on your own blog, but be sure to let me know :-) )

  • http://www.programmingwithdata.com/ Ian Fiske

    Thanks! That’s a great plot for assessing correlation of ordinal variables.

    Small typo: the graphic uses solutions 3-4, not 2-4 as the dots have no color gradient.

    • http://www.talgalili.com Tal Galili

      Thanks Ian for the positive feedback :)

      Also thanks for pointing out that the graphic doesn’t use solution 2 (I’ll correct that in the article).

      The reason I wrote it is that the code gives a hidden option to use solution 2 (I just didn’t feel it added enough in this case to include it).

      For people who want to implement solution 2, please remove the “#” comment sign from:
      #require(colorspace)
      and
      #the.col <- heat_hcl(length(x))[z]
      and then use the.col instead of "grey" as the bg value in the symbols() call,
      in the:
      panel.smooth.ordered.categorical function.

      I am sure it could be improved more. My biggest problem (with both the colors and the sizes) is how to make sure they scale “well” for various situations in the data.
      If you have any thoughts, you are welcome to share them!

      Cheers :)
      Tal
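
For readers who want to try solution 2 without installing colorspace, the following is a rough, stand-alone sketch. It swaps in base R's heat.colors() for the heat_hcl() call mentioned above, and the data and the name palette.by.count are assumptions for illustration:

```r
# Hypothetical data: two 5-point items from 100 respondents.
set.seed(1)
x <- sample(1:5, 100, replace = TRUE)
y <- sample(1:5, 100, replace = TRUE)

# Number of observations that share each point's (x, y) location.
key <- paste(x, y)
z <- as.numeric(table(key)[key])

# One color per possible count, running from pale (few points) to red (many).
# heat.colors() ships with base R and stands in for heat_hcl() here.
palette.by.count <- rev(heat.colors(max(z)))
the.col <- palette.by.count[z]

# Equal-sized circles, so that color alone carries the count.
plot(x, y, type = "n")
symbols(x, y, circles = rep(0.15, length(x)), inches = FALSE,
        bg = the.col, fg = NA, add = TRUE)
```

Sizing the palette by max(z) rather than by length(x) is one possible answer to the scaling question: the full color range is always used, whatever the largest cell count happens to be.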

  • Pingback: EcoArte » El “arte”del análisis de datos: De las hojas de cálculo a R Juan Freire Universidade da Coruña

  • Mahtab Gh

    Hi,

    First of all, thanks for putting together the code for Likert-scale correlation matrix; it has helped me a lot.

    My question is about the use of Spearman’s rho: as you know, Spearman’s formula cannot calculate p-values with good precision, when there are ties in the ranks. Such ties usually exist in survey data. Do you suggest using Pearson’s corr. coeff. instead?

    I’d really appreciate if you could comment on this.

    • http://www.talgalili.com Tal Galili

      Hi Mahtab,
      Glad to have helped :)

      In such a case, I would suggest:
      1) Using bootstrap/permutation methods in order to calculate the P values.
      2) Checking what methods exist for this case in the literature and seeing if they’ve got an R implementation.

      Cheers,
      Tal
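
A minimal sketch of the permutation idea in suggestion 1, using only base R. The function name perm.spearman.test and its n.perm argument are hypothetical names chosen for this illustration, and the data are made up:

```r
# Permutation test for Spearman's rho; x and y are numeric vectors of equal length.
# perm.spearman.test and n.perm are hypothetical names for this sketch.
perm.spearman.test <- function(x, y, n.perm = 9999) {
    ok <- is.finite(x) & is.finite(y)
    x <- x[ok]; y <- y[ok]
    rho.obs <- cor(x, y, method = "spearman")
    # Shuffling y breaks any association while keeping the ties in x and y intact.
    rho.perm <- replicate(n.perm, cor(x, sample(y), method = "spearman"))
    # Two-sided p-value; the +1 counts the observed arrangement itself.
    p.value <- (sum(abs(rho.perm) >= abs(rho.obs)) + 1) / (n.perm + 1)
    list(estimate = rho.obs, p.value = p.value)
}

set.seed(1)
a <- sample(1:5, 100, replace = TRUE)
b <- pmin(5, pmax(1, a + sample(-1:1, 100, replace = TRUE)))  # related to a
noise <- sample(1:5, 100, replace = TRUE)                      # unrelated to a

related <- perm.spearman.test(a, b, n.perm = 1999)
unrelated <- perm.spearman.test(a, noise, n.perm = 1999)
print(related$p.value)
print(unrelated$p.value)
```

Because the reference distribution is built from the data's own tied ranks, the p-value does not rely on the no-ties assumption behind the exact Spearman test. The smallest p-value this can report is 1/(n.perm + 1).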

  • John

    Your function renders text size of negative r values incorrectly, because size is based on string width, which is longer for negative numbers (5 characters) than for positive numbers (4 characters).
    You can see this effect in your screenshot above, where -0.73 is in slightly smaller text than is 0.70.
    The original function over at R Graph Gallery didn’t run into this problem because it displayed abs(r).

    • http://www.talgalili.com Tal Galili

      Thank you John, I won’t get to fixing it before I come back from useR 2010.

      If any reader wishes to offer a fix – I’d be glad to incorporate it.
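
One possible fix, offered as a sketch and not as the author's solution: compute the base text size from the width of the absolute value, so that the minus sign no longer shrinks negative correlations. The name panel.cor.fixed is hypothetical, and the significance stars are left out for brevity:

```r
# Sketch of a revised upper panel (panel.cor.fixed is a hypothetical name).
panel.cor.fixed <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(0, 1, 0, 1))

    r <- cor(x, y, method = "spearman")
    txt <- paste(prefix, format(c(r, 0.123456789), digits = digits)[1], sep = "")
    # Size the text from |r|, so "-0.73" and "0.73" get the same base size.
    txt.abs <- paste(prefix, format(c(abs(r), 0.123456789), digits = digits)[1], sep = "")
    cex <- if (missing(cex.cor)) 0.8 / strwidth(txt.abs) else cex.cor

    text(0.5, 0.5, txt, cex = cex * abs(r))
    invisible(cex * abs(r))  # returned only so the effect can be checked
}

# Made-up data: v is the mirror image of u, w is a copy of u.
set.seed(1)
u <- sample(1:5, 100, replace = TRUE)
v <- 6 - u
pairs(data.frame(u, v, w = u), upper.panel = panel.cor.fixed)
```

With this change, correlations of equal magnitude are drawn at equal size regardless of sign.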

  • Benjamin

    Hey Tal,

    the graph looks great! I’m new to R and I have some problems getting the syntax to work. Could you help me? I’ve got a data frame for which I need all correlations in a scatter plot (23 variables, I know it’s a lot). The data is ordinal with 4 scale steps (1 to 4). Can I read in the whole dataset or should I list all the variable names? I deleted all other variables in the dataset, so there are only the 23 variables with no missing values.
    Thanks a lot!