Tag Archives: code

How to upgrade R on windows 7

Background – time to upgrade to R 2.13.0

The news of the new release of R 2.13.0 is out, and the R blogosphere is buzzing. Bloggers posting excitedly about the new R compiler package that brings with it the hope to speed up our R code with up to 4 times improvement and even a JIT compiler for R. So it is time to upgrade, and bloggers are here to help. Some wrote how to upgrade R on Linux and mac OSX (based on posts by Paolo). And it is now my turn, with suggestions on how to upgrade R on windows 7.

Upgrading R on windows – the two strategies

The classic description of how to upgrade R can be found in the R project FAQ page (and also the FAQ on how to install R on windows)

There are basically two strategies for R upgrading on windows. The first is to install a new R version and copy paste all the packages to the new R installation folder. The second is to have a global R package folder, each time synced to the most current R installation (thus saving us the time of copying the package library each we upgrade R).

I described the second strategy in detail in a post I wrote a year ago titled: “How to upgrade R on windows XP – another strategy” which explains how to upgrade R using the simple two-liner code:

source("http://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
New.R.RunMe()

p.s: If this is the first time you are upgrading R using this method, then first run the following two lines on your old R installation (before running the above code in the new R intallation):

source("http://www.r-statistics.com/wp-content/uploads/2010/04/upgrading-R-on-windows.r.txt")
Old.R.RunMe()

The above code should be enough.  However, there are some common pitfalls you might encounter when upgrading R on windows 7, bellow I outline the ones I know about, and how they can be solved.

Continue reading

Dumping functions from the global environment into an R script file

Looking at a project you didn’t touch for years poses many challenges. The less documentation and organization you had in your files, the more time you’ll have to spend tracing back what you did back when the code was written.

I just opened up such a project, that was before I ever knew to split my .r files to “data.r”, “functions.r”, “do.r”. All I have are several versions of an old .RData file and many .r files with a mix of functions and commands (oh the shame!)

One idea I had for the tracing back was to take the latest version of .RData I had, and see what functions I had in it’s environment. simply typing ls() wouldn’t work. Also, I wanted to have a list of all the functions that where defined in my .RData environment. Thanks to the code recently published by Richie Cotton, I was able to create the “save.functions.from.env”. This function will go through all your defined functions and write them into “d:\\temp.r”.

I hope this might be useful to one of you in the future, here is the code to do it:

save.functions.from.env <- function(file = "d:\\temp.r")
{
	# This function will go through all your defined functions and write them into "d:\\temp.r"
	# let's get all the functions from the envoirnement:
	funs <- Filter(is.function, sapply(ls( ".GlobalEnv"), get))
 
	# Let's 
	for(i in seq_along(funs))
	{
		cat(	# number the function we are about to add
			paste("\n" , "#------ Function number ", i , "-----------------------------------" ,"\n"),
			append = T, file = file
			)
 
		cat(	# print the function into the file
			paste(names(funs)[i] , "<-", paste(capture.output(funs[[i]]), collapse = "\n"), collapse = "\n"),
			append = T, file = file
			)
 
		cat(
			paste("\n" , "#-----------------------------------------" ,"\n"),
			append = T, file = file
			)
	}
 
	cat( # writing at the end of the file how many new functions where added to it
		paste("# A total of ", length(funs), " Functions where written into", file),
		append = T, file = file
		)
	print(paste("A total of ", length(funs), " Functions where written into", file))
}
 
# save.functions.from.env() # this is how you run it

Update: Joshua Ulrich gave on stackoverflow another solution for this challenge:

	newEnv <- new.env()
	load("myFunctions.Rdata", newEnv)
	dump(c(lsf.str(newEnv)), file="normalCodeFile.R", envir=newEnv)

And also suggested to look into ?prompt (which creates documentation files for objects) and / or ?package.skeleton.

How to upgrade R on windows XP – another strategy (and the R code to do it)

Update: This post has a follow-up for how to upgrade R on windows 7 explaining how to deal with permission issues.

Background – how I heard that there is more then one way to upgrade R

If you didn’t hear it by now – R 2.11.0 is out with a bunch of new features.

After Andrew Gelman recently lamented the lack of an easy upgrade process for R, a Stackoverflow thread (by JD Long) invited R users to share their strategies for easily upgrading R.

Upgrading strategy – moving to a global R library

In that thread, Dirk Eddelbuettel suggested another idea for upgrading R. His idea is of using a folder for R’s packages which is outside the standard directory tree of the installation (a different strategy then the one offered on the R FAQ).

The idea of this upgrading strategy is to save us steps in upgrading. So when you wish to upgrade R, instead of doing the following three steps:

  • download new R and install
  • copy the “library” content from the old R to the new R
  • upgrade all of the packages (in the library folder) to the new version of R.

You could instead just have steps 1 and 3, and skip step 2 (thus, saving us time…).

For example, under windows XP, you might have R installed on:
C:\Program Files\R\R-2.11.0\
But (in this alternative model for upgrading) you will have your packages library on a “global library folder” (global in the sense of independent of a specific R version):
C:\Program Files\R\library

So in order to use this strategy, you will need to do the following steps (all of them are performed in an R code provided later in the post)-

  1. In the OLD R installation (in the first time you move to the new system of managing the upgrade):
    1. Create a new global library folder (if it doesn’t exist)
    2. Copy to the new “global library folder” all of your packages from the old R installation
    3. After you move to this system – the steps 1 and 2 would not need to be repeated. (hence the advantage)
  2. In the NEW R installation:
    1. Create a new global library folder (if it doesn’t exist – in case this is your first R installation)
    2. Premenantly point to the Global library folder whenever R starts
    3. (Optional) Delete from the “Global library folder” all the packages that already exist in the local library folder of the new R install (no need to have doubles)
    4. Update all packages. (notice that you picked a mirror where the packages are up-to-date, you sometimes need to choose another mirror)

Thanks to help from Dirk, David Winsemius and Uwe Ligges, I was able to write the following R code to perform all the tasks I described :-)

So first you will need to run the following code:
Continue reading

The difference between “letters[c(1,NA)]” and “letters[c(NA,NA)]“

In David Smith’s latest blog post (which, in a sense, is a continued response to the latest public attack on R), there was a comment by Barry that caught my eye. Barry wrote:

Even I get caught out on R quirks after 20 years of using it. Compare letters[c(12,NA)] and letters[c(NA,NA)] for the most recent thing that made me bang my head against the wall.

So I did, and here’s the output:

> letters[c(12,NA)]
[1] "l" NA 
>  letters[c(NA,NA)] 
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>

Interesting isn’t it?
I had no clue why this had happened but luckily for us, Barry gave a follow-up reply with an explanation. And here is what he wrote:
Continue reading

Correlation scatter-plot matrix for ordered-categorical data

When analyzing a questionnaire, one often wants to view the correlation between two or more Likert questionnaire item’s (for example: two ordered categorical vectors ranging from 1 to 5).

When dealing with several such Likert variable’s, a clear presentation of all the pairwise relation’s between our variable can be achieved by inspecting the (Spearman) correlation matrix (easily achieved in R by using the “cor.test” command on a matrix of variables).
Yet, a challenge appears once we wish to plot this correlation matrix. The challenge stems from the fact that the classic presentation for a correlation matrix is a scatter plot matrix – but scatter plots don’t (usually) work well for ordered categorical vectors since the dots on the scatter plot often overlap each other.

There are four solution for the point-overlap problem that I know of:

  1. Jitter the data a bit to give a sense of the “density” of the points
  2. Use a color spectrum to represent when a point actually represent “many points”
  3. Use different points sizes to represent when there are “many points” in the location of that point
  4. Add a LOWESS (or LOESS) line to the scatter plot – to show the trend of the data

In this post I will offer the code for the  a solution that uses solution 3-4 (and possibly 2, please read this post comments). Here is the output (click to see a larger image):

And here is the code to produce this plot:

Continue reading

Siegel-Tukey: a Non-parametric test for equality in variability (R code)

Daniel Malter just shared on the R mailing list (link to the thread) his code for performing the Siegel-Tukey (Nonparametric) test for equality in variability.
Excited about the find, I contacted Daniel asking if I could republish his code here, and he kindly replied “yes”.
From here on I copy his note at full.

The R function can be downloaded from here
Corrections and remarks can be added in the comments bellow, or on the github code page.

* * * *
Continue reading

Post hoc analysis for Friedman’s Test (R code)

My goal in this post is to give an overview of Friedman’s Test and then offer R code to perform post hoc analysis on Friedman’s Test results. (The R function can be downloaded from here)

Preface: What is Friedman’s Test

Friedman test is a non-parametric randomized block analysis of variance. Which is to say it is a non-parametric version of a one way ANOVA with repeated measures. That means that while a simple ANOVA test requires the assumptions of a normal distribution and equal variances (of the residuals), the Friedman test is free from those restriction. The price of this parametric freedom is the loss of power (of Friedman’s test compared to the parametric ANOVa versions).

The hypotheses for the comparison across repeated measures are:

  • H0: The distributions (whatever they are) are the same across repeated measures
  • H1: The distributions across repeated measures are different

The test statistic for the Friedman’s test is a Chi-square with [(number of repeated measures)-1] degrees of freedom. A detailed explanation of the method for computing the Friedman test is available on Wikipedia.

Performing Friedman’s Test in R is very simple, and is by using the “friedman.test” command.

Post hoc analysis for the Friedman’s Test

Assuming you performed Friedman’s Test and found a significant P value, that means that some of the groups in your data have different distribution from one another, but you don’t (yet) know which. Therefor, our next step will be to try and find out which pairs of our groups are significantly different then each other. But when we have N groups, checking all of their pairs will be to perform [n over 2] comparisons, thus the need to correct for multiple comparisons arise.
The tasks:
Our first task will be to perform a post hoc analysis of our results (using R) – in the hope of finding out which of our groups are responsible that we found that the null hypothesis was rejected. While in the simple case of ANOVA, an R command is readily available (“TukeyHSD”), in the case of friedman’s test (until now) the code to perform the post hoc test was not as easily accessible.
Our second task will be to visualize our results. While in the case of simple ANOVA, a boxplot of each group is sufficient, in the case of a repeated measures – a boxplot approach will be misleading to the viewer. Instead, we will offer two plots: one of parallel coordinates, and the other will be boxplots of the differences between all pairs of groups (in this respect, the post hoc analysis can be thought of as performing paired wilcox.test with correction for multiplicity).

R code for Post hoc analysis for the Friedman’s Test

The analysis will be performed using the function (I wrote) called “friedman.test.with.post.hoc”, based on the packages “coin” and “multcomp”. Just a few words about it’s arguments:

  • formu – is a formula object of the shape: Y ~ X | block (where Y is the ordered (numeric) responce, X is a group indicator (factor), and block is the block (or subject) indicator (factor)
  • data – is a data frame with columns of Y, X and block (the names could be different, of course, as long as the formula given in “formu” represent that)
  • All the other parameters are to allow or suppress plotting of the results.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
friedman.test.with.post.hoc <- function(formu, data, to.print.friedman = T, to.post.hoc.if.signif = T,  to.plot.parallel = T, to.plot.boxplot = T, signif.P = .05, color.blocks.in.cor.plot = T, jitter.Y.in.cor.plot =F)
{
	# formu is a formula of the shape: 	Y ~ X | block
	# data is a long data.frame with three columns:    [[ Y (numeric), X (factor), block (factor) ]]
 
	# Note: This function doesn't handle NA's! In case of NA in Y in one of the blocks, then that entire block should be removed.
 
 
	# Loading needed packages
	if(!require(coin))
	{
		print("You are missing the package 'coin', we will now try to install it...")
		install.packages("coin")		
		library(coin)
	}
 
	if(!require(multcomp))
	{
		print("You are missing the package 'multcomp', we will now try to install it...")
		install.packages("multcomp")
		library(multcomp)
	}
 
	if(!require(colorspace))
	{
		print("You are missing the package 'colorspace', we will now try to install it...")
		install.packages("colorspace")
		library(colorspace)
	}
 
 
	# get the names out of the formula
	formu.names <- all.vars(formu)
	Y.name <- formu.names[1]
	X.name <- formu.names[2]
	block.name <- formu.names[3]
 
	if(dim(data)[2] >3) data <- data[,c(Y.name,X.name,block.name)]	# In case we have a "data" data frame with more then the three columns we need. This code will clean it from them...
 
	# Note: the function doesn't handle NA's. In case of NA in one of the block T outcomes, that entire block should be removed.
 
	# stopping in case there is NA in the Y vector
	if(sum(is.na(data[,Y.name])) > 0) stop("Function stopped: This function doesn't handle NA's. In case of NA in Y in one of the blocks, then that entire block should be removed.")
 
	# make sure that the number of factors goes with the actual values present in the data:
	data[,X.name ] <- factor(data[,X.name ])
	data[,block.name ] <- factor(data[,block.name ])
	number.of.X.levels <- length(levels(data[,X.name ]))
	if(number.of.X.levels == 2) { warning(paste("'",X.name,"'", "has only two levels. Consider using paired wilcox.test instead of friedman test"))}
 
	# making the object that will hold the friedman test and the other.
	the.sym.test <- symmetry_test(formu, data = data,	### all pairwise comparisons	
						   teststat = "max",
						   xtrafo = function(Y.data) { trafo( Y.data, factor_trafo = function(x) { model.matrix(~ x - 1) %*% t(contrMat(table(x), "Tukey")) } ) },
						   ytrafo = function(Y.data){ trafo(Y.data, numeric_trafo = rank, block = data[,block.name] ) }
						)
	# if(to.print.friedman) { print(the.sym.test) }
 
 
	if(to.post.hoc.if.signif)
		{
			if(pvalue(the.sym.test) < signif.P)
			{
				# the post hoc test
				The.post.hoc.P.values <- pvalue(the.sym.test, method = "single-step")	# this is the post hoc of the friedman test
 
 
				# plotting
				if(to.plot.parallel & to.plot.boxplot)	par(mfrow = c(1,2)) # if we are plotting two plots, let's make sure we'll be able to see both
 
				if(to.plot.parallel)
				{
					X.names <- levels(data[, X.name])
					X.for.plot <- seq_along(X.names)
					plot.xlim <- c(.7 , length(X.for.plot)+.3)	# adding some spacing from both sides of the plot
 
					if(color.blocks.in.cor.plot) 
					{
						blocks.col <- rainbow_hcl(length(levels(data[,block.name])))
					} else {
						blocks.col <- 1 # black
					}					
 
					data2 <- data
					if(jitter.Y.in.cor.plot) {
						data2[,Y.name] <- jitter(data2[,Y.name])
						par.cor.plot.text <- "Parallel coordinates plot (with Jitter)"				
					} else {
						par.cor.plot.text <- "Parallel coordinates plot"
					}				
 
					# adding a Parallel coordinates plot
					matplot(as.matrix(reshape(data2,  idvar=X.name, timevar=block.name,
									 direction="wide")[,-1])  , 
							type = "l",  lty = 1, axes = FALSE, ylab = Y.name, 
							xlim = plot.xlim,
							col = blocks.col,
							main = par.cor.plot.text)
					axis(1, at = X.for.plot , labels = X.names) # plot X axis
					axis(2) # plot Y axis
					points(tapply(data[,Y.name], data[,X.name], median) ~ X.for.plot, col = "red",pch = 4, cex = 2, lwd = 5)
				}
 
				if(to.plot.boxplot)
				{
					# first we create a function to create a new Y, by substracting different combinations of X levels from each other.
					subtract.a.from.b <- function(a.b , the.data)
					{
						the.data[,a.b[2]] - the.data[,a.b[1]]
					}
 
					temp.wide <- reshape(data,  idvar=X.name, timevar=block.name,
									 direction="wide") 	#[,-1]
					wide.data <- as.matrix(t(temp.wide[,-1]))
					colnames(wide.data) <- temp.wide[,1]
 
					Y.b.minus.a.combos <- apply(with(data,combn(levels(data[,X.name]), 2)), 2, subtract.a.from.b, the.data =wide.data)
					names.b.minus.a.combos <- apply(with(data,combn(levels(data[,X.name]), 2)), 2, function(a.b) {paste(a.b[2],a.b[1],sep=" - ")})
 
					the.ylim <- range(Y.b.minus.a.combos)
					the.ylim[2] <- the.ylim[2] + max(sd(Y.b.minus.a.combos))	# adding some space for the labels
					is.signif.color <- ifelse(The.post.hoc.P.values < .05 , "green", "grey")
 
					boxplot(Y.b.minus.a.combos,
						names = names.b.minus.a.combos ,
						col = is.signif.color,
						main = "Boxplots (of the differences)",
						ylim = the.ylim
						)
					legend("topright", legend = paste(names.b.minus.a.combos, rep(" ; PostHoc P.value:", number.of.X.levels),round(The.post.hoc.P.values,5)) , fill =  is.signif.color )
					abline(h = 0, col = "blue")
 
				}
 
				list.to.return <- list(Friedman.Test = the.sym.test, PostHoc.Test = The.post.hoc.P.values)
				if(to.print.friedman) {print(list.to.return)}				
				return(list.to.return)
 
			}	else {
					print("The results where not significant, There is no need for a post hoc test")
					return(the.sym.test)
				}					
	}
 
# Original credit (for linking online, to the package that performs the post hoc test) goes to "David Winsemius", see:
# http://tolstoy.newcastle.edu.au/R/e8/help/09/10/1416.html
}

Example

(The code for the example is given at the end of the post)

Let’s make up a little story: let’s say we have three types of wine (A, B and C), and we would like to know which one is the best one (in a scale of 1 to 7). We asked 22 friends to taste each of the three wines (in a blind fold fashion), and then to give a grade of 1 till 7 (for example sake, let’s say we asked them to rate the wines 5 times each, and then averaged their results to give a number for a persons preference for each wine. This number which is now an average of several numbers, will not necessarily be an integer).

After getting the results, we started by performing a simple boxplot of the ratings each wine got. Here it is:

The plot shows us two things: 1) that the assumption of equal variances here might not hold. 2) That if we are to ignore the “within subjects” data that we have, we have no chance of finding any difference between the wines.

So we move to using the function “friedman.test.with.post.hoc” on our data, and we get the following output:

$Friedman.Test
Asymptotic General Independence Test
data:  Taste by
Wine (Wine A, Wine B, Wine C)
stratified by Taster
maxT = 3.2404, p-value = 0.003421
$PostHoc.Test
Wine B – Wine A 0.623935139
Wine C – Wine A 0.003325929
Wine C – Wine B 0.053772757

The conclusion is that once we take into account the within subject variable, we discover that there is a significant difference between our three wines (significant P value of about  0.0034). And the posthoc analysis shows us that the difference is due to the difference in tastes between Wine C and Wine A (P value 0.003). and maybe also with the difference between Wine C and Wine B (the P value is 0.053, which is just borderline significant).

Plotting our analysis will also show us the direction of the results, and the connected answers of each of our friends answers:

Here is the code for the example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
 
source("http://www.r-statistics.com/wp-content/uploads/2010/02/Friedman-Test-with-Post-Hoc.r.txt")  # loading the friedman.test.with.post.hoc function from the internet
 
	### Comparison of three Wine ("Wine A", "Wine B", and
	###  "Wine C") for rounding first base. 
	WineTasting <- data.frame(
		  Taste = c(5.40, 5.50, 5.55,
					5.85, 5.70, 5.75,
					5.20, 5.60, 5.50,
					5.55, 5.50, 5.40,
					5.90, 5.85, 5.70,
					5.45, 5.55, 5.60,
					5.40, 5.40, 5.35,
					5.45, 5.50, 5.35,
					5.25, 5.15, 5.00,
					5.85, 5.80, 5.70,
					5.25, 5.20, 5.10,
					5.65, 5.55, 5.45,
					5.60, 5.35, 5.45,
					5.05, 5.00, 4.95,
					5.50, 5.50, 5.40,
					5.45, 5.55, 5.50,
					5.55, 5.55, 5.35,
					5.45, 5.50, 5.55,
					5.50, 5.45, 5.25,
					5.65, 5.60, 5.40,
					5.70, 5.65, 5.55,
					6.30, 6.30, 6.25),
					Wine = factor(rep(c("Wine A", "Wine B", "Wine C"), 22)),
					Taster = factor(rep(1:22, rep(3, 22))))
 
	with(WineTasting , boxplot( Taste  ~ Wine )) # boxploting 
	friedman.test.with.post.hoc(Taste ~ Wine | Taster ,WineTasting)	# the same with our function. With post hoc, and cool plots

If you find this code useful, please let me know (in the comments) so I will know there is a point in publishing more such code snippets…

Barnards exact test - p-value based on the nuisance parameter

Barnard’s exact test – a powerful alternative for Fisher’s exact test (implemented in R)

(The R code for Barnard’s exact test is at the end of the article, and you could also just download it from here, or from github)

Barnards exact test - p-value based on the nuisance parameter

Barnards exact test - p-value based on the nuisance parameter

About Barnard’s exact test

About half a year ago, I was studying various statistical methods to employ on contingency tables. I came across a promising method for 2×2 contingency tables called “Barnard’s exact test“. Barnard’s test is a non-parametric alternative to Fisher’s exact test which can be more powerful (for 2×2 tables) but is also more time-consuming to compute (References can be found in the Wikipedia article on the subject).

The test was first published by George Alfred Barnard (1945) (link to the original paper in Nature). Mehta and Senchaudhuri (2003) explain why Barnard’s test can be more powerful than Fisher’s under certain conditions:

When comparing Fisher’s and Barnard’s exact tests, the loss of power due to the greater discreteness of the Fisher statistic is somewhat offset by the requirement that Barnard’s exact test must maximize over all possible p-values, by choice of the nuisance parameter, π. For 2 × 2 tables the loss of power due to the discreteness dominates over the loss of power due to the maximization, resulting in greater power for Barnard’s exact test. But as the number of rows and columns of the observed table increase, the maximizing factor will tend to dominate, and Fisher’s exact test will achieve greater power than Barnard’s.

About the R implementation of Barnard’s exact test

After finding about Barnard’s test I was sad to discover that (at the time) there had been no R implementation of it. But last week, I received a surprising e-mail with good news. The sender, Peter Calhoun, currently a graduate student at the University of Florida, had implemented the algorithm in R. Peter had  found my posting on the R mailing list (from almost half a year ago) and was so kind as to share with me (and the rest of the R community) his R code for computing Barnard’s exact test. Here is some of what Peter wrote to me about his code:

On a side note, I believe there are more efficient codes than this one.  For example, I’ve seen codes in Matlab that run faster and display nicer-looking graphs.  However, this code will still provide accurate results and a plot that gives the p-value based on the nuisance parameter.  I did not come up with the idea of this code, I simply translated Matlab code into R, occasionally using different methods to get the same result.  The code was translated from:

Trujillo-Ortiz, A., R. Hernandez-Walls, A. Castro-Perez, L. Rodriguez-Cardozo. Probability Test.  A MATLAB file. URL

http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=6198

My goal was to make this test accessible to everyone.  Although there are many ways to run this test through Matlab, I hadn’t seen any code to implement this test in R.  I hope it is useful for you, and if you have any questions or ways to improve this code, please contact me at calhoun.peter@gmail.com

Continue reading