Managing a statistical analysis project – guidelines and best practices

In the past two years, a growing community of R users (and statisticians in general) has been participating in two major Question-and-Answer websites:

  1. The R tag page on Stackoverflow, and
  2. Stat over flow (which will soon move to a new domain; no worries, I’ll write about it once it happens)

In that time, several long (and fascinating) discussion threads were started, reflecting on tips and best practices for managing a statistical analysis project.  They are:

On the last thread in the list, the user chl has started compiling all the tips and suggestions together.  With his permission, I am now republishing his compilation here.  I encourage you to contribute from your own experience (either in the comments, or by answering any of the threads I’ve linked to).

From here on is what “chl” wrote:

These guidelines were compiled from SO (as suggested by @Shane), Biostar (hereafter, BS), and SE. I tried my best to acknowledge ownership for each item, and to select the first or a highly upvoted answer. I also added things of my own, and flagged items that are specific to the [R] environment.

Data management

  • create a project structure that keeps everything in the right place (data, code, figures, etc.; giovanni /BS)
  • never modify raw data files (ideally, they should be read-only), copy/rename to new ones when making transformations, cleaning, etc.
  • check data consistency (whuber /SE)
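
As a rough illustration of the three points above, here is a minimal [R] sketch. The folder names, the file name `measurements.csv` and the helper functions `create_project()`, `protect_raw()` and `check_consistency()` are all hypothetical; they are assumptions made for the example and do not come from any of the threads. It builds a project skeleton in a temporary directory, marks the raw data as read-only, and writes transformed data to a new file.

```r
# Hypothetical project layout; the folder names are illustrative only.
create_project <- function(root) {
  dirs <- file.path(root, c("data/raw", "data/processed", "code", "figures", "reports"))
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  invisible(dirs)
}

# Mark every file under data/raw as read-only (mode 0444).
# Note: on Windows, Sys.chmod() can only toggle the read-only attribute.
protect_raw <- function(root) {
  raw_files <- list.files(file.path(root, "data", "raw"), full.names = TRUE)
  Sys.chmod(raw_files, mode = "0444", use_umask = FALSE)
  invisible(raw_files)
}

# A few basic consistency checks; returns a named logical vector.
check_consistency <- function(df, id_col) {
  c(
    no_duplicate_ids = anyDuplicated(df[[id_col]]) == 0,
    no_missing_ids   = !anyNA(df[[id_col]]),
    has_rows         = nrow(df) > 0
  )
}

# Build the skeleton in a temporary directory so nothing real is touched.
root <- file.path(tempdir(), "my_analysis")
create_project(root)

# Toy raw data, made up for this example.
raw_path <- file.path(root, "data", "raw", "measurements.csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.0)), raw_path, row.names = FALSE)
protect_raw(root)

# Transformations go to a new file; the raw file is left untouched.
raw <- read.csv(raw_path)
checks <- check_consistency(raw, "id")
cleaned <- transform(raw, value_log = log(value))
write.csv(cleaned, file.path(root, "data", "processed", "measurements_clean.csv"),
          row.names = FALSE)
```

The checks shown here are deliberately basic; real consistency checks depend on the data set (ranges, units, allowed categories, and so on).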

Coding

  • organize source code in logical units or building blocks (Josh Reich/hadley/ars /SO; giovanni/Khader Shameer /BS)
  • separate source code from editing stuff, especially for large projects (this partly overlaps with the previous item and with reporting)
  • document everything, with e.g. [R]oxygen (Shane /SO) or consistent self-annotation in the source file
  • [R] custom functions can be put in a dedicated file (that can be sourced when necessary), in a new environment (so as to avoid populating the top-level namespace, Brendan OConnor /SO), or a package (Dirk Eddelbuettel/Shane /SO)
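
The "new environment" option in the last item can be sketched as follows. This is only one possible way to do it, using base R's `new.env()` and `sys.source()`; the file name `helpers.R` and the functions `standardize()` and `cv()` are hypothetical and exist only to make the example self-contained.

```r
# Write a small helper file so the example is self-contained.
# In a real project this file would already exist, e.g. under code/.
helper_file <- file.path(tempdir(), "helpers.R")
writeLines(c(
  "# Hypothetical helper functions",
  "standardize <- function(x) (x - mean(x)) / sd(x)",
  "cv <- function(x) sd(x) / mean(x)"
), helper_file)

# Load the helpers into their own environment instead of the global one.
helpers <- new.env()
sys.source(helper_file, envir = helpers)

# Call a helper through the environment.
z <- helpers$standardize(c(1, 2, 3))

# Optionally, attach the environment to the search path so the functions
# can be called without the `helpers$` prefix:
# attach(helpers, name = "helpers")
```

With this approach the top-level workspace only gains one object (`helpers`), so `ls()` stays readable and `rm(list = ls())` does not silently wipe out a dozen function definitions without you noticing what they were.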

Analysis

  • don’t forget to set/record the seed you used when calling RNG or stochastic algorithms (e.g. k-means)
  • for Monte Carlo studies, it may be interesting to store specs/parameters in a separate file (Sumatra may be a good candidate, giovanni /BS)
  • don’t limit yourself to one plot per variable, use multivariate (Trellis) displays and interactive visualization tools (e.g. GGobi)
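
To make the first two items concrete, here is a small [R] sketch that records the seed together with the other simulation settings and stores them in a separate file. The parameter values, the function `run_kmeans()` and the file name `simulation_params.rds` are illustrative assumptions, and a plain RDS file stands in for a dedicated tool such as Sumatra.

```r
# Keep the seed alongside the other settings of the run.
# All values here are made up for the example.
params <- list(seed = 20101, n = 150, centers = 3)

run_kmeans <- function(p) {
  set.seed(p$seed)                        # fixes the simulated data AND the random starts
  x <- matrix(rnorm(p$n * 2), ncol = 2)   # toy data: n points in two dimensions
  kmeans(x, centers = p$centers)
}

# Two runs with the same recorded seed give the same clustering.
fit1 <- run_kmeans(params)
fit2 <- run_kmeans(params)
same_clusters <- identical(fit1$cluster, fit2$cluster)

# Store the settings in a separate file so the run can be reproduced later.
params_file <- file.path(tempdir(), "simulation_params.rds")
saveRDS(params, params_file)
```

Without the `set.seed()` call, k-means would pick different starting centers on each run, and the cluster labels (and sometimes the clusters themselves) could change between runs.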

Versioning

  • use some kind of version control system (VCS) for easy tracking/export, e.g. Git (Sharpie/VonC/JD Long /SO); this follows from nice questions asked by @Jeromy and @Tal
  • backup everything, on a regular basis (Sharpie/JD Long /SO)
  • keep a log of your ideas, or rely on an issue tracker, like ditz (giovanni /BS) — partly redundant with the previous item since it is available in Git

Editing/Reporting

  • jarrod

    for a comprehensive overview of efficient workflow, i’d recommend “The Workflow of Data Analysis” by Scott Long. not sure which came first, the (much needed) interest in workflow or his book, but he makes several great suggestions.

    it should be noted that his book uses examples/suggestions specific to Stata. but i’ve never used the program and still found the book extremely helpful.

    • Christian

      I totally agree – Long’s book is very helpful, even if you use a package other than Stata.

      • http://www.talgalili.com Tal Galili

        Good to know Christian, I’ll leave a note to myself to go over Long’s book. Thanks.

    • http://www.talgalili.com Tal Galili

      Thanks Jarrod for the suggestion – much appreciated.
      Best,
      Tal

  • http://www.linkedin.com/in/kenahoo Ken

    How about code testing? Anything better than RUnit these days?

    • http://www.talgalili.com Tal Galili

      Ken,
      I don’t know anything about code testing or RUnit. Any links/tutorials you can offer?

      p.s: There is a new package I recently heard about, dealing with R code logging called “log4r”, if it is relevant for you, you can read more about it here –
      http://www.johnmyleswhite.com/notebook/2010/09/25/two-new-r-packages-log4r-and-sortablehtmltables/

    • http://blog.nguyenvq.com/ Vinh Nguyen
      • http://www.talgalili.com Tal Galili

        Interesting tutorial, I wonder how this could be part of the process of R package creation.
        Thank you for the link Vinh .

        • http://blog.nguyenvq.com/ Vinh Nguyen

          Indeed. I’m writing packages for the first time right now. I’m trying to incorporate Rcpp, Roxygen, unit testing, vignette generation, git for version control, and R-forge. I’m currently in the exploratory phase. Once I get a good workflow going, I’m definitely going to document it at my blog.

  • http://blog.nguyenvq.com/ Vinh Nguyen

    You read my mind! I’m always looking for guidelines and best practices, especially now that I’m looking to create R packages and version controlling my code. Thanks so much for this post.

    • http://www.talgalili.com Tal Galili

      My pleasure Vinh, I’m happy to know it helped you!
      Best,
      Tal

  • Erik Iverson

    I would highly recommend combining *all* of these steps to produce a research compendium as described by Gentleman and Temple Lang in this paper:
    http://www.bepress.com/bioconductor/paper2/

    You can use many methods to accomplish this, including Sweave and org-mode for Emacs. Org-mode is especially nice as the markup is trivial, and there are multiple export targets including PDF and HTML.

    • http://www.talgalili.com Tal Galili

      Good link Erik, thank you.

  • Akhil Behl

    Dear Tal,

    Thanks for consolidating this research. It has helped me a lot. I had been looking for something like this for some time.

    I had been thinking lately of writing a set of bash scripts that shall create a template project with template folders etc. etc. and your research really helped shape my ideas. (I know there are existing R packages which do this, but I tend to like re-inventing the wheel if it is educating.)

    The problem that I just came across is one of portability and reproducible research. Even though sh/bash is omnipresent in the *NIX world, it is probably going to stump most MS users. My question is: which tools make the parts of the workflow that happen outside the R system portable across OSes? I have seen people use scripting languages like bash, python, ruby and makefiles, none of which are entirely portable.

    Any ideas on this?

  • PRG

    I’ve just stumbled across knitr, which is an improvement to Sweave. You might consider adding this as another bullet point under Editing/Reporting…

    http://yihui.name/knitr/

    It’s an [R] package, so I believe that a combination of this and the system() command in R would accomplish what Akhil Behl was attempting to do… all that would be required when moving from one OS to another would be to change the shell commands passed to system() to reflect the differences.