Managing a statistical analysis project – guidelines and best practices

In the past two years, a growing community of R users (and statisticians in general) has been participating in two major Question-and-Answer websites:

The R tag page on Stackoverflow, and
Stat over flow (which will soon move to a new domain; no worries, I’ll write about it once it happens)

In that time, several long (and fascinating) discussion threads were started, reflecting on tips and best practices for managing a statistical analysis project. They are:

On the last thread in the list, the user chl started compiling all the tips and suggestions together. With his permission, I am republishing his compilation here. I encourage you to contribute from your own experience (either in the comments or by answering in any of the threads I’ve linked to).

From here on is what “chl” wrote:

These guidelines were compiled from SO (as suggested by @Shane), Biostar (hereafter, BS), and SE. I tried my best to acknowledge ownership of each item and to select the first or a highly upvoted answer. I also added things of my own, and flagged items that are specific to the [R] environment.

Data management

create a project structure for keeping everything in the right place (data, code, figures, etc., giovanni/BS)
never modify raw data files (ideally, they should be read-only); copy/rename to new ones when transforming, cleaning, etc.
check data consistency (whuber /SE)
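
The first two items can be sketched as follows. This is a minimal illustration in Python (chosen only for the example; the same idea carries over to R or a shell script). The directory names and the helper functions create_project and protect_raw_data are assumptions of this sketch, not something prescribed in the threads.

```python
import os
import stat
import tempfile
from pathlib import Path

# Hypothetical layout; adapt the names to your own conventions.
SUBDIRS = ["data/raw", "data/processed", "code", "figures", "reports"]


def create_project(root):
    """Create the project skeleton under `root` and return the root path."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


def protect_raw_data(root):
    """Remove write permission from every file in data/raw."""
    protected = []
    for path in sorted((Path(root) / "data" / "raw").iterdir()):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the write bits for user, group, and others.
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            protected.append(path)
    return protected


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing on disk is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp) / "my_analysis")
        raw_file = project / "data" / "raw" / "survey.csv"
        raw_file.write_text("id,score\n1,3.5\n")
        for path in protect_raw_data(project):
            print("protected:", path.name)
```

Transformed or cleaned versions would then be written to data/processed (or an equivalent folder), leaving the files in data/raw untouched.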

Coding

organize source code in logical units or building blocks (Josh Reich/hadley/ars /SO; giovanni/Khader Shameer /BS)
separate source code from editing stuff, especially for large projects (partly overlapping with the previous item and with reporting)
document everything, with e.g. [R]oxygen (Shane /SO) or consistent self-annotation in the source file
[R] custom functions can be put in a dedicated file (that can be sourced when necessary), in a new environment (so as to avoid populating the top-level namespace, Brendan OConnor /SO), or a package (Dirk Eddelbuettel/Shane /SO)

Analysis

don’t forget to set/record the seed you used when calling RNG or stochastic algorithms (e.g. k-means)
for Monte Carlo studies, it may be interesting to store specs/parameters in a separate file (sumatra may be a good candidate, giovanni /BS)
don’t limit yourself to one plot per variable, use multivariate (Trellis) displays and interactive visualization tools (e.g. GGobi)
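
The first two items above can be sketched together. The example is in Python for illustration only (in R, the seed would be set with set.seed()). The JSON spec format, its field names, and the function run_simulation are assumptions of this sketch; they are not taken from the threads or from sumatra.

```python
import json
import random
import statistics

# Hypothetical spec; in practice this would live in its own file
# (e.g. params.json) kept under version control next to the code.
SPEC_TEXT = '{"seed": 20100930, "n_replicates": 200, "sample_size": 50}'


def run_simulation(spec):
    """Monte Carlo estimate of the standard error of the mean of N(0, 1) samples."""
    rng = random.Random(spec["seed"])  # dedicated, explicitly seeded generator
    means = []
    for _ in range(spec["n_replicates"]):
        sample = [rng.gauss(0.0, 1.0) for _ in range(spec["sample_size"])]
        means.append(statistics.fmean(sample))
    # Return the spec with the result so the seed is recorded alongside it.
    return {"spec": spec, "se_of_mean": statistics.stdev(means)}


if __name__ == "__main__":
    spec = json.loads(SPEC_TEXT)
    first = run_simulation(spec)
    second = run_simulation(spec)
    # Rerunning with the same spec (and therefore the same seed)
    # reproduces the result exactly.
    print("reproducible:", first["se_of_mean"] == second["se_of_mean"])
    print("spec used:", json.dumps(first["spec"]))
```

Keeping the seed inside the spec means that a stored result always carries the information needed to regenerate it.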

Versioning

use some kind of version control system (VCS) for easy tracking/export, e.g. Git (Sharpie/VonC/JD Long /SO); this follows from nice questions asked by @Jeromy and @Tal
back up everything, on a regular basis (Sharpie/JD Long /SO)
keep a log of your ideas, or rely on an issue tracker, like ditz (giovanni /BS) — partly redundant with the previous item since it is available in Git

Editing/Reporting

[R] Sweave (Matt Parker /SO)
[R] brew (Shane /SO)
[R] R2HTML or ascii

23 thoughts on “Managing a statistical analysis project – guidelines and best practices”

jarrod says:
September 30, 2010 at 1:03 pm
for a comprehensive overview of efficient workflow, i’d recommend “The Workflow of Data Analysis” by Scott Long. not sure which came first, the (much needed) interest in workflow or his book, but he makes several great suggestions.
it should be noted that his book uses examples/suggestions specific to Stata. but i’ve never used the program and still found the book extremely helpful.
1. Christian says:
  October 1, 2010 at 3:19 am
  I totally agree – Long’s book is very helpful, even if you use a package other than Stata.
  1. Tal Galili says:
    October 1, 2010 at 12:32 pm
    Good to know, Christian. I’ll leave a note to myself to go over Long’s book. Thanks.
2. Tal Galili says:
  October 1, 2010 at 1:38 pm
  Thanks Jarrod for the suggestion – much appreciated.
  Best,
  Tal
Ken says:
September 30, 2010 at 1:26 pm
How about code testing? Anything better than RUnit these days?
1. Tal Galili says:
  October 1, 2010 at 1:37 pm
  Ken,
  I don’t know anything about code testing or RUnit. Any links/tutorials you can offer?
  p.s: There is a new package I recently heard about, dealing with R code logging called “log4r”, if it is relevant for you, you can read more about it here –
  http://www.johnmyleswhite.com/notebook/2010/09/25/two-new-r-packages-log4r-and-sortablehtmltables/
2. Vinh Nguyen says:
  October 1, 2010 at 1:54 pm
  http://www.johnmyleswhite.com/notebook/2010/08/17/unit-testing-in-r-the-bare-minimum/ talks about RUnit and testthat.
  1. Tal Galili says:
    October 1, 2010 at 2:46 pm
    Interesting tutorial, I wonder how this could be part of the process of R package creation.
    Thank you for the link Vinh .
    1. Vinh Nguyen says:
      October 1, 2010 at 2:49 pm
      Indeed. I’m writing packages for the first time right now. I’m trying to incorporate Rcpp, Roxygen, unit testing, vignette generation, git for version control, and R-forge. I’m currently in the exploratory phase. Once I get a good workflow going, I’m definitely going to document it at my blog.
Vinh Nguyen says:
September 30, 2010 at 1:51 pm
You read my mind! I’m always looking for guidelines and best practices, especially now that I’m looking to create R packages and put my code under version control. Thanks so much for this post.
1. Tal Galili says:
  October 1, 2010 at 12:36 pm
  My pleasure Vinh, I’m happy to know it helped you!
  Best,
  Tal
Erik Iverson says:
September 30, 2010 at 5:46 pm
I would highly recommend combining *all* of these steps to produce a research compendium as described by Gentleman and Temple Lang in this paper:
http://www.bepress.com/bioconductor/paper2/
You can use many methods to accomplish this, including Sweave and org-mode for Emacs. Org-mode is especially nice as the markup is trivial, and there are multiple export targets including PDF and HTML.
1. Tal Galili says:
  October 1, 2010 at 12:33 pm
  Good link Erik, thank you.
Akhil Behl says:
March 10, 2011 at 5:42 am
Dear Tal,
Thanks for consolidating this research. It has helped me a lot. I had been looking for something like this for some time.
I had been thinking lately of writing a set of bash scripts that create a template project with template folders and so on, and your research really helped shape my ideas. (I know there are existing R packages that do this, but I tend to like re-inventing the wheel if it is educational.)
The problem I just came across is one of portability and reproducible research. Even though sh/bash is omnipresent in the *NIX world, it is probably going to stump most MS users. My question is: which tools exist for the parts of the workflow that happen outside the R system and are portable across OSes? I have seen people use scripting languages like bash, python and ruby, as well as makefiles, none of which are entirely portable.
Any ideas on this?
PRG says:
April 26, 2012 at 7:22 am
I’ve just stumbled across knitr, which is an improvement to Sweave. You might consider adding this as another bullet point under Editing/Reporting…
http://yihui.name/knitr/
It’s an [R] package, so I believe that a combination of this and the system() command in R would accomplish what Akhil Behl was attempting to do… all that would be required when moving from one OS to another would be to change the shell commands inside the system() command to reflect the differences.
PolSci Replication says:
May 13, 2014 at 2:22 pm
This is great. I’m doing a reproducibility course on coursera.com, where they give video tutorials on R Markdown and knitr. I quite like it! See http://politicalsciencereplication.wordpress.com/2014/05/08/learning-about-reproducible-research-on-coursera-recap-week-1/ for my experiences with the course.