Siegel-Tukey: a Non-parametric test for equality in variability (R code)

Daniel Malter just shared on the R mailing list (link to the thread) his code for performing the Siegel-Tukey (Nonparametric) test for equality in variability.
Excited about the find, I contacted Daniel asking if I could republish his code here, and he kindly replied “yes”.
From here on I copy his note in full.

The R function can be downloaded from here
Corrections and remarks can be added in the comments below, or on the github code page.

* * * *

Post hoc analysis for Friedman’s Test (R code)

My goal in this post is to give an overview of Friedman’s Test and then offer R code to perform post hoc analysis on Friedman’s Test results. (The R function can be downloaded from here)

Preface: What is Friedman’s Test

The Friedman test is a non-parametric randomized block analysis of variance; that is, it is a non-parametric version of a one-way ANOVA with repeated measures. That means that while a simple ANOVA test requires the assumptions of a normal distribution and equal variances (of the residuals), the Friedman test is free from those restrictions. The price of this freedom is a loss of power (of Friedman’s test compared to the parametric ANOVA versions).

The hypotheses for the comparison across repeated measures are:

  • H0: The distributions (whatever they are) are the same across repeated measures
  • H1: The distributions across repeated measures are different

The test statistic for Friedman’s test is approximately Chi-square distributed with [(number of repeated measures)-1] degrees of freedom. A detailed explanation of the method for computing the Friedman test is available on Wikipedia.

Performing Friedman’s Test in R is very simple: just use the “friedman.test” command.
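As a minimal sketch of the call, here is friedman.test applied to a small made-up matrix (the numbers and the names "scores", "A", "B", "C" are invented for illustration only; rows are blocks, columns are repeated measures). The sketch also recomputes the statistic by hand from the rank sums, which works this simply only because the made-up data has no ties within a row.

```r
# Made-up data: 6 subjects (blocks) measured under 3 conditions.
scores <- matrix(c(5.1, 5.6, 6.0,
                   4.8, 5.0, 5.9,
                   5.5, 5.4, 6.2,
                   4.9, 5.3, 5.8,
                   5.2, 5.7, 5.6,
                   5.0, 5.1, 6.1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("s", 1:6), c("A", "B", "C")))

res <- friedman.test(scores)
print(res)

# The same statistic by hand (no ties within rows in this data):
n <- nrow(scores)                              # number of blocks
k <- ncol(scores)                              # number of repeated measures
rank.sums <- colSums(t(apply(scores, 1, rank)))  # rank within each block, sum per condition
Q <- 12 / (n * k * (k + 1)) * sum(rank.sums^2) - 3 * n * (k + 1)
p <- pchisq(Q, df = k - 1, lower.tail = FALSE)
```

The degrees of freedom reported by friedman.test are k - 1, matching the description above.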

Post hoc analysis for the Friedman’s Test

Assuming you performed Friedman’s Test and found a significant P value, that means that some of the groups in your data have different distributions from one another, but you don’t (yet) know which. Therefore, our next step will be to find out which pairs of groups are significantly different from each other. But when we have N groups, checking all of their pairs means performing N(N-1)/2 (“N choose 2”) comparisons, hence the need to correct for multiple comparisons.
The tasks:
Our first task will be to perform a post hoc analysis of our results (using R) – in the hope of finding out which of our groups are responsible for the rejection of the null hypothesis. While in the simple case of ANOVA an R command is readily available (“TukeyHSD”), in the case of Friedman’s test (until now) the code to perform the post hoc test was not as easily accessible.
Our second task will be to visualize our results. While in the case of simple ANOVA a boxplot of each group is sufficient, in the case of repeated measures a boxplot approach will be misleading to the viewer. Instead, we will offer two plots: one of parallel coordinates, and the other will be boxplots of the differences between all pairs of groups (in this respect, the post hoc analysis can be thought of as performing paired wilcox.test with correction for multiplicity).
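To illustrate the "paired wilcox.test with correction for multiplicity" idea mentioned above, here is a rough sketch using pairwise.wilcox.test from base R. This is not the method used by the function in this post (which relies on "coin" and "multcomp"); the data is simulated, and the Holm correction is just one possible choice.

```r
set.seed(1)
n.blocks <- 10
block.effect <- rnorm(n.blocks)   # simulated subject-to-subject variation

# Simulated long data; rows are ordered by block within each group,
# which is what the paired comparison relies on.
long <- data.frame(
  y     = c(block.effect + rnorm(n.blocks, 0.0, 0.3),
            block.effect + rnorm(n.blocks, 0.2, 0.3),
            block.effect + rnorm(n.blocks, 1.0, 0.3)),
  group = factor(rep(c("A", "B", "C"), each = n.blocks)),
  block = factor(rep(seq_len(n.blocks), times = 3))
)

n.groups <- nlevels(long$group)
n.comparisons <- choose(n.groups, 2)   # N(N-1)/2 pairs to check

res <- pairwise.wilcox.test(long$y, long$group,
                            p.adjust.method = "holm", paired = TRUE)
print(res$p.value)   # lower-triangular matrix of adjusted P values
```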

R code for Post hoc analysis for the Friedman’s Test

The analysis will be performed using the function (I wrote) called “friedman.test.with.post.hoc”, based on the packages “coin” and “multcomp”. Just a few words about its arguments:

  • formu – a formula object of the shape: Y ~ X | block (where Y is the ordered (numeric) response, X is a group indicator (factor), and block is the block (or subject) indicator (factor))
  • data – a data frame with columns of Y, X and block (the names could be different, of course, as long as the formula given in “formu” represents that)
  • All the other parameters are to allow or suppress plotting of the results.
friedman.test.with.post.hoc <- function(formu, data, to.print.friedman = TRUE, to.post.hoc.if.signif = TRUE,  to.plot.parallel = TRUE, to.plot.boxplot = TRUE, signif.P = .05, color.blocks.in.cor.plot = TRUE, jitter.Y.in.cor.plot = FALSE)
{
	# formu is a formula of the shape: 	Y ~ X | block
	# data is a long data.frame with three columns:    [[ Y (numeric), X (factor), block (factor) ]]
 
	# Note: This function doesn't handle NA's! In case of NA in Y in one of the blocks, then that entire block should be removed.
 
 
	# Loading needed packages
	if(!require(coin))
	{
		print("You are missing the package 'coin', we will now try to install it...")
		install.packages("coin")		
		library(coin)
	}
 
	if(!require(multcomp))
	{
		print("You are missing the package 'multcomp', we will now try to install it...")
		install.packages("multcomp")
		library(multcomp)
	}
 
	if(!require(colorspace))
	{
		print("You are missing the package 'colorspace', we will now try to install it...")
		install.packages("colorspace")
		library(colorspace)
	}
 
 
	# get the names out of the formula
	formu.names <- all.vars(formu)
	Y.name <- formu.names[1]
	X.name <- formu.names[2]
	block.name <- formu.names[3]
 
	if(dim(data)[2] >3) data <- data[,c(Y.name,X.name,block.name)]	# In case we have a "data" data frame with more than the three columns we need. This code will clean it from them...
 
	# Note: the function doesn't handle NA's. In case of NA in one of the block T outcomes, that entire block should be removed.
 
	# stopping in case there is NA in the Y vector
	if(sum(is.na(data[,Y.name])) > 0) stop("Function stopped: This function doesn't handle NA's. In case of NA in Y in one of the blocks, then that entire block should be removed.")
 
	# make sure that the number of factors goes with the actual values present in the data:
	data[,X.name ] <- factor(data[,X.name ])
	data[,block.name ] <- factor(data[,block.name ])
	number.of.X.levels <- length(levels(data[,X.name ]))
	if(number.of.X.levels == 2) { warning(paste("'",X.name,"'", "has only two levels. Consider using paired wilcox.test instead of friedman test"))}
 
	# making the object that will hold the friedman test and the other.
	the.sym.test <- symmetry_test(formu, data = data,	### all pairwise comparisons	
						   teststat = "max",
						   xtrafo = function(Y.data) { trafo( Y.data, factor_trafo = function(x) { model.matrix(~ x - 1) %*% t(contrMat(table(x), "Tukey")) } ) },
						   ytrafo = function(Y.data){ trafo(Y.data, numeric_trafo = rank, block = data[,block.name] ) }
						)
	# if(to.print.friedman) { print(the.sym.test) }
 
 
	if(to.post.hoc.if.signif)
		{
			if(pvalue(the.sym.test) < signif.P)
			{
				# the post hoc test
				The.post.hoc.P.values <- pvalue(the.sym.test, method = "single-step")	# this is the post hoc of the friedman test
 
 
				# plotting
				if(to.plot.parallel && to.plot.boxplot)	par(mfrow = c(1,2)) # if we are plotting two plots, let's make sure we'll be able to see both
 
				if(to.plot.parallel)
				{
					X.names <- levels(data[, X.name])
					X.for.plot <- seq_along(X.names)
					plot.xlim <- c(.7 , length(X.for.plot)+.3)	# adding some spacing from both sides of the plot
 
					if(color.blocks.in.cor.plot) 
					{
						blocks.col <- rainbow_hcl(length(levels(data[,block.name])))
					} else {
						blocks.col <- 1 # black
					}					
 
					data2 <- data
					if(jitter.Y.in.cor.plot) {
						data2[,Y.name] <- jitter(data2[,Y.name])
						par.cor.plot.text <- "Parallel coordinates plot (with Jitter)"				
					} else {
						par.cor.plot.text <- "Parallel coordinates plot"
					}				
 
					# adding a Parallel coordinates plot
					matplot(as.matrix(reshape(data2,  idvar=X.name, timevar=block.name,
									 direction="wide")[,-1])  , 
							type = "l",  lty = 1, axes = FALSE, ylab = Y.name, 
							xlim = plot.xlim,
							col = blocks.col,
							main = par.cor.plot.text)
					axis(1, at = X.for.plot , labels = X.names) # plot X axis
					axis(2) # plot Y axis
					points(tapply(data[,Y.name], data[,X.name], median) ~ X.for.plot, col = "red",pch = 4, cex = 2, lwd = 5)
				}
 
				if(to.plot.boxplot)
				{
					# first we create a function to create a new Y, by subtracting different combinations of X levels from each other.
					subtract.a.from.b <- function(a.b , the.data)
					{
						the.data[,a.b[2]] - the.data[,a.b[1]]
					}
 
					temp.wide <- reshape(data,  idvar=X.name, timevar=block.name,
									 direction="wide") 	#[,-1]
					wide.data <- as.matrix(t(temp.wide[,-1]))
					colnames(wide.data) <- temp.wide[,1]
 
					Y.b.minus.a.combos <- apply(with(data,combn(levels(data[,X.name]), 2)), 2, subtract.a.from.b, the.data =wide.data)
					names.b.minus.a.combos <- apply(with(data,combn(levels(data[,X.name]), 2)), 2, function(a.b) {paste(a.b[2],a.b[1],sep=" - ")})
 
					the.ylim <- range(Y.b.minus.a.combos)
					the.ylim[2] <- the.ylim[2] + max(sd(Y.b.minus.a.combos))	# adding some space for the labels
					is.signif.color <- ifelse(The.post.hoc.P.values < signif.P , "green", "grey")	# use the same threshold as the overall test
 
					boxplot(Y.b.minus.a.combos,
						names = names.b.minus.a.combos ,
						col = is.signif.color,
						main = "Boxplots (of the differences)",
						ylim = the.ylim
						)
					legend("topright", legend = paste(names.b.minus.a.combos, rep(" ; PostHoc P.value:", number.of.X.levels),round(The.post.hoc.P.values,5)) , fill =  is.signif.color )
					abline(h = 0, col = "blue")
 
				}
 
				list.to.return <- list(Friedman.Test = the.sym.test, PostHoc.Test = The.post.hoc.P.values)
				if(to.print.friedman) {print(list.to.return)}				
				return(list.to.return)
 
			}	else {
					print("The results were not significant, there is no need for a post hoc test")
					return(the.sym.test)
				}					
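	}
 
	# reached only when the post hoc test was not requested: return the overall test
	return(the.sym.test)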
 
# Original credit (for linking online, to the package that performs the post hoc test) goes to "David Winsemius", see:
# http://tolstoy.newcastle.edu.au/R/e8/help/09/10/1416.html
}

Example

(The code for the example is given at the end of the post)

Let’s make up a little story: let’s say we have three types of wine (A, B and C), and we would like to know which one is the best (on a scale of 1 to 7). We asked 22 friends to taste each of the three wines (blindfolded), and then to give each a grade from 1 to 7. (For the sake of the example, let’s say we asked them to rate each wine 5 times and then averaged the results to get a number for each person’s preference for each wine. This number, being an average of several numbers, will not necessarily be an integer.)

After getting the results, we started by performing a simple boxplot of the ratings each wine got. Here it is:

The plot shows us two things: 1) that the assumption of equal variances here might not hold. 2) That if we are to ignore the “within subjects” data that we have, we have no chance of finding any difference between the wines.

So we move to using the function “friedman.test.with.post.hoc” on our data, and we get the following output:

$Friedman.Test
Asymptotic General Independence Test
data:  Taste by
Wine (Wine A, Wine B, Wine C)
stratified by Taster
maxT = 3.2404, p-value = 0.003421
$PostHoc.Test
Wine B - Wine A 0.623935139
Wine C - Wine A 0.003325929
Wine C - Wine B 0.053772757

The conclusion is that once we take into account the within-subject variable, we discover that there is a significant difference between our three wines (significant P value of about 0.0034). The post hoc analysis shows us that the difference is due to the difference in taste between Wine C and Wine A (P value 0.003), and maybe also to the difference between Wine C and Wine B (the P value is 0.054, which is borderline significant).

Plotting our analysis will also show us the direction of the results, and how each friend’s answers are connected across the wines:

Here is the code for the example:

 
source("http://www.r-statistics.com/wp-content/uploads/2010/02/Friedman-Test-with-Post-Hoc.r.txt")  # loading the friedman.test.with.post.hoc function from the internet
 
	### Comparison of three wines ("Wine A", "Wine B", and
	###  "Wine C"), each rated by the same 22 tasters.
	WineTasting <- data.frame(
		  Taste = c(5.40, 5.50, 5.55,
					5.85, 5.70, 5.75,
					5.20, 5.60, 5.50,
					5.55, 5.50, 5.40,
					5.90, 5.85, 5.70,
					5.45, 5.55, 5.60,
					5.40, 5.40, 5.35,
					5.45, 5.50, 5.35,
					5.25, 5.15, 5.00,
					5.85, 5.80, 5.70,
					5.25, 5.20, 5.10,
					5.65, 5.55, 5.45,
					5.60, 5.35, 5.45,
					5.05, 5.00, 4.95,
					5.50, 5.50, 5.40,
					5.45, 5.55, 5.50,
					5.55, 5.55, 5.35,
					5.45, 5.50, 5.55,
					5.50, 5.45, 5.25,
					5.65, 5.60, 5.40,
					5.70, 5.65, 5.55,
					6.30, 6.30, 6.25),
					Wine = factor(rep(c("Wine A", "Wine B", "Wine C"), 22)),
					Taster = factor(rep(1:22, rep(3, 22))))
 
	with(WineTasting , boxplot( Taste  ~ Wine )) # a simple boxplot of the ratings per wine (ignores the pairing by taster)
	friedman.test.with.post.hoc(Taste ~ Wine | Taster ,WineTasting)	# Friedman's test with our function: post hoc P values and the two plots

If you find this code useful, please let me know (in the comments) so I will know there is a point in publishing more such code snippets…

Is it harder to advertise to the more educated? Correlation in US States data will not be enough to answer us…

“Chitika research” published today a fun small dataset (you can download it from here) in a post titled “The Educated are Harder to Advertise To”.

In this post I have three goals in mind:

  1. Suggesting another plot instead of the one used in the original post.
  2. Emphasizing the “Correlation does not imply causation” rule.
  3. Inviting other R lovers (as myself) to find fun things to do with this (and similar) dataset.

The Data

The data set comprises 51 rows, one for each US state, with two variables (columns):

  • CTR – “Click Through Rate”, taken from Chitika’s database and collected over two random days in January (a total of 31,667,158 impressions) from the full range of Internet users (they don’t have traditional demographic data – every impression is completely anonymous).
  • Percent of the population who graduated college.

Super basic analysis and plot

This data presents a stunning -0.63 correlation between the two measurements, hinting that “The Educated are Harder to Advertise To” (as the original post suggested). The data can be easily visualized using a scatter plot:

Created using just a few lines of R code:

aa <- read.table("http://www.r-statistics.com/wp-content/uploads/2010/02/State_CTR_Date.txt", sep = "\t", header = T)
aa[,2:3] <- aa[,2:3] * 100
plot(aa[,2] ~ aa[,3], sub = paste("Correlation: ", round(cor(aa[,2], aa[,3]), 2)),
	main = "Scatter plot of %CTR VS %College_Grad per State",
	xlab = "%College_Grad per State",
	ylab = "%CTR per State"
	)
abline(lm(aa[,2] ~ aa[,3]), col = "blue")

My conclusion from the analysis

I was asked in the comments (by Eyal) to add my own conclusions to the analysis. Does higher intelligence imply lower chances of clicking ads? My answer (under the present data) is simply “I don’t know”. The only real conclusion I can draw from the data is that there might be a point in checking this effect in a more rigorous way (which I am sure is already being done).

What should we have done in order to know? When doing scientific research, we often ask ourselves how sure we are of our results. The rule of thumb for this type of question is called “the pyramid of evidence“. It is a way to organize various ways of getting “information” about the world in a hierarchy of reliability. Here is a picture of this pyramid:

(Credit: image source)

We can see that the most reliable source is a systematic review of randomized controlled trials. In our case, that would mean having controlled experiments where you take groups of people with different levels of “intelligence” (how would you measure that?), and check their CTR (click through rates) on banner ads. This should be done in various ways, correcting for various confounders, and later the results and conclusions (from several such experiments) should be systematically reviewed by experts on the subject.

All of this should be done in order to make a real assessment of the underlying question – how smarts affect banner clicking.
And the reason we need all of this work is what is said in the title of the next section:

Correlation does not imply causation

As is written in the article on wikipedia:

“Correlation does not imply causation” is a phrase used in science and statistics to emphasize that correlation between two variables does not automatically imply that one causes the other (though it does not remove the fact that correlation can still be a hint, whether powerful or otherwise). The opposite belief, correlation proves causation, is a logical fallacy by which two events that occur together are claimed to have a cause-and-effect relationship.

But a much clearer explanation of it was given by the following XKCD comic strip:
Correlation on XKCD
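The same point can be made with a few lines of R. The following is a toy simulation (nothing here comes from the Chitika data; the variables "x", "y" and the hidden variable "z" are entirely hypothetical): x and y never influence each other, yet they are clearly correlated because both depend on z. Once z is accounted for, the correlation all but disappears.

```r
set.seed(42)
n <- 2000
z <- rnorm(n)        # hypothetical hidden variable (a confounder)
x <- z + rnorm(n)    # x depends on z, not on y
y <- z + rnorm(n)    # y depends on z, not on x

raw.cor <- cor(x, y)

# Correlation between x and y after removing what z explains in each of them
partial.cor <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

round(c(raw = raw.cor, after.adjusting.for.z = partial.cor), 2)
```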

Next step: other resources to play with

The motivation for my post is based on this digg post trying to hint how Religiousness is connected to “negative” things such as crimes, poverty and so on. That post was based on the following links:

  • http://www.gallup.com/poll/114022/state-states-importance-religion.aspx#2
  • http://www.top50states.com/average-iq-score.html
  • http://www.census.gov/cgi-bin/saipe/national.cgi?year=2008&ascii=
  • http://www.census.gov/compendia/statab/cats/law_enforcement_courts_prisons/crimes_and_crime_rates.html
  • http://www.infoplease.com/ipa/a0923080.html
  • http://www.fraserinstitute.org/researchandpublications/publications/7071.aspx
  • http://www.gallup.com/poll/122333/political-ideologt-conservative-label-prevails-south.aspx#2
  • http://www.ahiphiwire.org/wellbeing/display.aspx?doc_code=RWBStateRanks

If someone is motivated, he/she can extract that data and combine it with the current provided data.

In conclusion: this simplistic dataset, combined with other data resources, provides opportunity for various fun demonstrations of pairs correlation plots and of nice spatial plots (of states colored by their matching variable). It is a good opportunity to emphasize (to students, friends and the like) that “Correlation does not imply causation!”.
And finally – If you are an R lover/blogger and feel like playing with this – please let me know :-) .

R Web Application – “Hello World” using RApache (~7min video tutorial)

I just noticed a Google Buzz post from Jeroen Ooms, with a YouTube video titled “RApache Hello World + POST arguments + catching errors”.

In this ~7 min video tutorial, Jeroen shares with us:

  1. How to write “Hello World” in a website using RApache.
  2. How to extract arguments from a form submitted by the website visitor (and then insert them into an “rnorm” call so as to control the output). And finally,
  3. How to catch an error in case of an invalid argument on an R Web Application.
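Steps 2 and 3 can be sketched in plain R, leaving the RApache-specific parts out. This is not the code from Jeroen’s video: the function name "draw.sample" and its argument "n.raw" are hypothetical stand-ins for a value read from a submitted form (form values arrive as character strings), and the validation rule is only an assumption about what counts as a valid input.

```r
# Hypothetical helper: turns a raw form value into an rnorm() sample,
# and reports a readable error instead of failing on invalid input.
draw.sample <- function(n.raw) {
  tryCatch({
    n <- suppressWarnings(as.integer(n.raw))   # "abc" becomes NA
    if (length(n) != 1 || is.na(n) || n < 1) {
      stop("'n' must be a positive whole number")
    }
    list(ok = TRUE, values = rnorm(n))
  }, error = function(e) {
    list(ok = FALSE, message = conditionMessage(e))
  })
}

good <- draw.sample("5")     # a valid form value
bad  <- draw.sample("abc")   # an invalid form value
```

In a web application, the "message" element is what you would show back to the visitor instead of letting the page fail.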

Thank you Jeroen for a very simple, step by step, tutorial:

p.s: For more videos by Jeroen, have a look at

Highlight the R syntax on your (WordPress) blog using the wp-syntax plugin

Update (11.10.10): I found a better solution for R syntax highlighting than the one presented in this post. The plugin is called WP-CodeBox, and I wrote about it in the post – WP-CodeBox: A better R syntax highlighter plugin for WordPress
Download link for WP-Syntax plugin (with GeSHi version 1.0.8.6)

In case you have a self hosted WordPress blog, and you wish to show your R code in it, how would you do it?

The simplest solution would be to just paste the code as plain text, which will look like this:

x <- rnorm(100, mean = 2, sd = 3)
plot(x, xlab = “index”, main = “Example code”)

But suppose you would like to help your readers orient themselves inside your code by giving different colors to different commands in the code (a.k.a. syntax highlighting), so that it would look something like this:

x <- rnorm(100, mean = 2, sd = 3) # Creating a vector
plot(x, xlab = "index", main = "Example code") # Plotting it

How then would you do it?

Plugin Installation

The easiest way to do this inside a self hosted WordPress blog is by installing a plugin called WP-Syntax:

WP-Syntax provides clean syntax highlighting using GeSHi — supporting a wide range of popular languages (including R). It supports highlighting with or without line numbers and maintains formatting while copying snippets of code from the browser.

But there is a problem. The current WP-Syntax version is using an old version of GeSHi, and only the newer version (currently GeSHi version 1.0.8.6) includes support for R syntax. In order to solve this I patched the plugin and I encourage you to download (the fixed version of) WP-Syntax from here, which will allow you to highlight your R code.

Usage

After installing (and activating) the plugin, in order to add R code to your post you will need to:
1) Only work in HTML mode (not the Visual mode). Or else, the code you will paste will be messed up.
2) Put your code between the <pre> tag, like this:

(Note: make sure the attribute values are wrapped in plain straight double quotes; curly quotes will not work.)

<pre lang="rsplus" line="1">
…Your R code here…
</pre>

Final note: R Syntax highlight in other ways

If you wish to have R syntax highlighting inside an HTML file, I encourage you to have a look at the highlight package, by Romain Francois.

If you want to highlight your R syntax inside wordpress.com, here is a blog post by Erik Iverson showing how to do that using Emacs.

p.s: If you have a blog in which you write about R, please let me know about it in the comments (Or just join R-bloggers.com) – I’d love to follow you :-)

Update: Stephen Turner wrote about a syntax highlighting solution for R and blogger using github gist. And also mentioned there another solution for self hosted wordpress blogs, via J.D. Long: a Github Gist plugin for WordPress. Go publish code :)

Barnards exact test - p-value based on the nuisance parameter

Barnard’s exact test – a powerful alternative for Fisher’s exact test (implemented in R)

(The R code for Barnard’s exact test is at the end of the article, and you could also just download it from here, or from github)


About Barnard’s exact test

About half a year ago, I was studying various statistical methods to employ on contingency tables. I came across a promising method for 2×2 contingency tables called “Barnard’s exact test“. Barnard’s test is a non-parametric alternative to Fisher’s exact test which can be more powerful (for 2×2 tables) but is also more time-consuming to compute (References can be found in the Wikipedia article on the subject).

The test was first published by George Alfred Barnard (1945) (link to the original paper in Nature). Mehta and Senchaudhuri (2003) explain why Barnard’s test can be more powerful than Fisher’s under certain conditions:

When comparing Fisher’s and Barnard’s exact tests, the loss of power due to the greater discreteness of the Fisher statistic is somewhat offset by the requirement that Barnard’s exact test must maximize over all possible p-values, by choice of the nuisance parameter, π. For 2 × 2 tables the loss of power due to the discreteness dominates over the loss of power due to the maximization, resulting in greater power for Barnard’s exact test. But as the number of rows and columns of the observed table increase, the maximizing factor will tend to dominate, and Fisher’s exact test will achieve greater power than Barnard’s.
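To make the “maximize over the nuisance parameter” step concrete, here is a minimal sketch in R. It is not the implementation offered for download in this post. It assumes two independent binomial samples (x1 successes out of n1, x2 out of n2), uses a Wald statistic with pooled variance to order the tables, and maximizes over a simple grid of π values; the function name "barnard.sketch" and its arguments are hypothetical.

```r
barnard.sketch <- function(x1, n1, x2, n2,
                           grid = seq(0.001, 0.999, by = 0.001)) {
  # Wald statistic with pooled variance; defined as 0 when the variance is 0
  wald <- function(a, b) {
    p  <- (a + b) / (n1 + n2)
    se <- sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    ifelse(se == 0, 0, (a / n1 - b / n2) / se)
  }

  # All possible tables with the same sample sizes
  a <- rep(0:n1, times = n2 + 1)
  b <- rep(0:n2, each  = n1 + 1)

  # Tables at least as extreme as the observed one (two-sided)
  extreme <- abs(wald(a, b)) >= abs(wald(x1, x2)) - 1e-12

  # Probability of the extreme tables for a given nuisance parameter
  p.at <- function(prob) {
    sum(dbinom(a[extreme], n1, prob) * dbinom(b[extreme], n2, prob))
  }
  p.by.pi <- vapply(grid, p.at, numeric(1))

  list(p.value   = max(p.by.pi),             # the maximum over the grid
       nuisance  = grid[which.max(p.by.pi)], # where the maximum occurs
       p.by.pi   = p.by.pi,
       grid      = grid)
}

res <- barnard.sketch(x1 = 7, n1 = 15, x2 = 12, n2 = 15)   # made-up counts
res$p.value
```

Plotting res$p.by.pi against res$grid gives the kind of “p-value based on the nuisance parameter” curve referred to in this post. A finer grid gives a more accurate maximum at the cost of more computation.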

About the R implementation of Barnard’s exact test

After finding out about Barnard’s test I was sad to discover that (at the time) there was no R implementation of it. But last week, I received a surprising e-mail with good news. The sender, Peter Calhoun, currently a graduate student at the University of Florida, had implemented the algorithm in R. Peter had found my posting on the R mailing list (from almost half a year ago) and was so kind as to share with me (and the rest of the R community) his R code for computing Barnard’s exact test. Here is some of what Peter wrote to me about his code:

On a side note, I believe there are more efficient codes than this one.  For example, I’ve seen codes in Matlab that run faster and display nicer-looking graphs.  However, this code will still provide accurate results and a plot that gives the p-value based on the nuisance parameter.  I did not come up with the idea of this code, I simply translated Matlab code into R, occasionally using different methods to get the same result.  The code was translated from:

Trujillo-Ortiz, A., R. Hernandez-Walls, A. Castro-Perez, L. Rodriguez-Cardozo. Probability Test.  A MATLAB file. URL

http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=6198

My goal was to make this test accessible to everyone.  Although there are many ways to run this test through Matlab, I hadn’t seen any code to implement this test in R.  I hope it is useful for you, and if you have any questions or ways to improve this code, please contact me at calhoun.peter@gmail.com


Web Development with R – an HD video tutorial of Jeroen Ooms talk

Here is an HD version of a video tutorial on web development with R, a lecture that was given by Jeroen Ooms (the guy who made A web application for R’s ggplot2). This talk was given at the Bay Area UseR Group meeting on R-Powered Web Apps.

You can also view the slides for his talk and view (great) examples for: stockplot, lme4, and ggplot2.

Thanks again to Jeroen for sharing his knowledge and experience!

Statistics plugins for WordPress

Today I came across a post named “24 Noble WordPress Plugins To Determine The Performance of your Blog” through Weblog Tools Collection (one of my favorite places to stay updated on WordPress). The post provided a good solid list of statistics plugins for WordPress. Some of them are too old to count (pun intended), others are much more recent and relevant.

As a statistics (and WordPress) lover myself, I was inspired to extend the list of WordPress statistics plugins in the hope of benefiting the community:
Blog Metrics
This plugin is based on ideas in an excellent post by Avinash Kaushik (Whom I consider a Web analytics guru and a brilliant blogger!).

It calculates:

  • Raw Author Contribution:
    • average number of posts per month
    • average number of words per post
  • Conversation Rate:
    • average number of comments per post, without your own comments
    • average number of words used in comments to posts

It calculates these both for all the time you’ve been blogging and for the last month, and then adds the values to a page on your WordPress dashboard.

Blog Metrics for a single author blog
Blog metrics per author

Search Meter

This plugin is a must for any blogger. Period.

If you have a Search box on your blog, Search Meter automatically records what people are searching for — and whether they are finding what they are looking for. Search Meter’s admin interface shows you what people have been searching for in the last couple of days, and in the last week or month. It also shows you which searches have been unsuccessful. If people search your blog and get no results, they’ll probably go elsewhere. With Search Meter, you’ll be able to find out what people are searching for, and give them what they want by creating new posts on those topics.  [...]

Google analytics Dashboard

Google Analytics Dashboard gives you the ability to view your Google Analytics data in your WordPress dashboard. You can also allow other users to see the same dashboard information when they are logged in or embed parts of the data into posts or as part of your theme.

The biggest advantage of this plugin in my view is that it adds sparklines in the “posts -> edit” page in the admin area.

Analytics360
I don’t use this one much. But one feature it has that I find interesting is that it adds information on when you posted something to the trend line of the Google Analytics traffic data. It also mixes in data from MailChimp, which I don’t use.

MailChimp’s Analytics360 plugin allows you to pull Google Analytics and MailChimp data directly into your dashboard, so you can access robust analytics tools without leaving WordPress.

Broken Link Checker
This plugin is also a must.

This plugin will monitor your blog looking for broken links and let you know if any are found.

  • Monitors links in your posts, pages, the blogroll, and custom fields (optional).
  • Detects links that don’t work and missing images.
  • Notifies you on the Dashboard if any are found.
  • Also detects redirected links.
  • Makes broken links display differently in posts (optional).
  • Link checking intervals can be configured.
  • New/modified posts are checked ASAP.
  • You view broken links, redirects, and a complete list of links used on your site, in the Tools -> Broken Links tab.
  • Searching and filtering links by URL, anchor text and so on is also possible.
  • Each link can be edited or unlinked directly via the plugin’s page, without manually editing each post.

Piwik + WP-Piwik

This plugin adds a Piwik stats site to your WordPress dashboard. It’s also able to add the Piwik tracking code to your blog.
Piwik is an open source (GPL licensed) web analytics software program. It provides you with detailed real time reports on your website visitors: the search engines and keywords they used, the language they speak, your popular pages and so on…

You can install Piwik more or less like you install WordPress, and then you are left to integrate it into your blog. The only real downside of it for me (compared to Google Analytics) is the lack of advanced segmentation and pivoting. But in general it is a free, great (and growing!) Web analytics solution.

Woopra Analytics Plugin
I have been using Woopra since their release, thanks to Lorelle. I enjoy the ability to follow the live actions that are happening inside the blog. However, since Woopra went from BETA to GOLD I have lost most of my interest, because the blogs I track have more total traffic than Woopra allows tracking in their free account. But small bloggers could find the service gratifying.

Woopra is the world’s most comprehensive, information rich, easy to use, real-time Web tracking and analysis application.

Features include:

  • Live Tracking and Web Statistics
  • A rich user interface and client monitoring application
  • Real-time Analytics
  • Manage Multiple Blogs and Websites
  • Deep analytic and search capabilities
  • Click-to-chat
  • Visitor and member tagging
  • Real-time notifications
  • Easy Installation and Update Notification

Final notes

If you are into web analytics, I also encourage you to give the following a try: Nuconomy, ClickTale, Crazy Egg. And of course, Google Analytics. Each of them (and also Woopra) strips you and your visitors of a bit more of your privacy. But that is the ultimate price we pay for the strong Web analytics solutions that exist out there.
If you know of any more statistics plugins I missed, feel encouraged to share them with me in the comments :)

Statistics with R, and open source stuff (software, data, community)