Writing a MS-Word document using R (with as little overhead as possible)

The problem: producing a Word (.docx) file of a statistical report created in R, with as little overhead as possible.
The solution: combining R+knitr+rmarkdown+pander+pandoc (it is easier than it is spelled).

If you get what this post is about, just jump to the “Solution: the workflow” section.


Preface: why is this (still) a problem

Before turning to the solution, let’s address two preliminary questions:

Q: Why is it important to be able to create a report in Word from R?

A: Because many researchers we may work with are used to working with Word for editing their text, tracking changes and merging edits between different authors, and copy-pasting text/tables/images from various sources.
This means that a report produced as a PDF file is less useful for collaborating with less-tech-savvy researchers (copying text or tables from PDF is not fun). Even exchanging HTML files may appear somewhat awkward to fellow researchers.

Q: But wasn’t this problem solved already?

A: Yes and no. There have been many attempts at solving the problem in the past several years, but many of them came with an overhead which made the solutions un-friendly (the developers and heavy users of these technologies are asked to not be offended – this is only my opinion, and you’re welcome to respond and expand my point of view).
Previous solutions include SWord and R2wd, both of which rely on the rcom package (and the statconnDCOM or RDCOMClient servers), as well as using online converters to turn PDF files into Word files.

Q: Any more issues?
A: Yes. Another big issue is formatting the output. If I want my tables to look nice in the output file, I often need to wrap ALL of my output calls with some formatting function (taken from packages such as xtable, rms, quantreg, stargazer, pander, and more).

Sources/links

The solution I propose here combines the following R packages: knitr, rmarkdown, and pander, together with the external tool pandoc (easily installed using the installr package).

Combining these ideas together has been discussed before in various places in the past half year or so, here are just a few:

Solution: the workflow

An overview of the steps:

  1. Write text with R code chunks weaved-together (I do it using RStudio, markdown, knitr – in an .rmd file)
  2. At the beginning of the file – make sure to replace the “print” method with that of the markdown wrapping package (see example below)
  3. Compile the doc into .md using knitr
  4. Turn the .md into .docx using pandoc

Here is an example rmarkdown code for steps 1 and 2:

 
Doc header 1
============
```{r set_knitr_chunk_options}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, results = "asis") # important for making sure the output will be well formatted; the knitr:: prefix keeps this working when knitr is not attached
```
 
```{r load_pander_methods}
require(pander)
replace.print.methods <- function(PKG_name = "pander") {
   PKG_methods <- as.character(methods(PKG_name))
   print_methods <- gsub(PKG_name, "print", PKG_methods)
   for(i in seq_along(PKG_methods)) {
      f <- eval(parse(text=paste(PKG_name,":::", PKG_methods[i], sep = ""))) # the new function to use for print
      assign(print_methods[i], f, ".GlobalEnv")
   }   
}
replace.print.methods()
## The following might work with some tweaks:
## print <- function (x, ...) UseMethod("pander")
```
Some text explaining the analysis we are doing
```{r}
summary(cars) # a summary table
fit <- lm(dist~speed, data = cars)
fit
plot(cars) # a plot
```

The above code can be saved into an .rmd file, for example: example.rmd
This file can now be compiled using knitr:

library(knitr)
knit2html("example.rmd")

This will produce an example.md file (alongside example.html), which can be compiled into a Word file using pandoc.
If you don’t yet have pandoc, and are running a Windows OS, you can quickly install pandoc by running the following code in R:

# installing/loading the package:
if(!require(installr)) { install.packages("installr"); require(installr)} #load / install+load installr 
# Installing pandoc
install.pandoc()

Once pandoc is installed, simply run:

FILE <- "example"
system(paste0("pandoc -o ", FILE, ".docx ", FILE, ".md"))

And your .docx file is ready!
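
If the file name contains spaces, or pandoc is not on the PATH, the `system()` call above can fail in ways that are easy to miss. A slightly more defensive variant, as a sketch using only base R (`pandoc_args` and `md_to_docx` are hypothetical helper names, not functions from any package):

```r
# Build the pandoc arguments for converting <file>.md into <file>.docx.
# shQuote() protects file names that contain spaces.
pandoc_args <- function(file) {
  c("-o", shQuote(paste0(file, ".docx")), shQuote(paste0(file, ".md")))
}

# Hypothetical wrapper: locate pandoc, run it, and fail loudly on errors.
md_to_docx <- function(file) {
  pandoc <- unname(Sys.which("pandoc"))
  if (!nzchar(pandoc)) stop("pandoc was not found on the PATH")
  status <- system2(pandoc, pandoc_args(file))
  if (status != 0) stop("pandoc exited with status ", status)
  invisible(paste0(file, ".docx"))
}

# md_to_docx("example")
```

Calling `md_to_docx("example")` should then write example.docx next to example.md, or stop with an error message if pandoc is missing or reports a failure.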

Possible expansions and caveats

The first caveat of this method is that it relies on markdown and pander, which are (by definition) more limited than something like LaTeX. For that purpose, one can decide to work with LaTeX-based solutions. Here is an example of how to do it with several existing packages (the code below is not very debugged, so it should be used with care; I welcome comments and suggestions):

 
```{r load_pander_methods}
replace.print.methods <- function(PKG_name = "pander") {
   PKG_methods <- as.character(methods(PKG_name))
   print_methods <- gsub(PKG_name, "print", PKG_methods)
   for(i in seq_along(PKG_methods)) {
      f <- eval(parse(text=paste(PKG_name,":::", PKG_methods[i], sep = ""))) # the new function to use for print
      assign(print_methods[i], f, ".GlobalEnv")
   }   
}
require(xtable)
replace.print.methods("xtable")
```

Similar solutions can probably be found for HTML documents as well. (Credit: the above code is based on Ramnath’s answer to my question on SO.)

The second caveat is that the above solution (at least the part that makes sure we can use the R code as is, without wrapping it with things like “pander(summary(cars))”) is basically a dirty hack. It is a hack in the sense that it overrides basic R commands (which is quite ugly, really). This issue has been discussed for over a month now on the knitr GitHub page, and I hope a better solution will come out of it.
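
To make the nature of the hack concrete, here is a toy illustration in base R of what assigning a new print method into the workspace does (the class `toy` and its methods are made up for this example and have nothing to do with pander itself):

```r
# A toy S3 object and an ordinary print method for its class.
x <- structure(list(value = 42), class = "toy")
print.toy <- function(x, ...) cat("toy value: ", x$value, "\n", sep = "")
plain_out <- capture.output(print(x))

# Overwrite the method in the workspace, much like
# replace.print.methods() overwrites print.* with pander's functions.
print.toy <- function(x, ...) cat("**toy value:** ", x$value, "\n", sep = "")
markdown_out <- capture.output(print(x))

# Every later print(x) in the session uses the new method
# until the override is removed again.
rm(print.toy)
```

The same call, `print(x)`, produces different output before and after the assignment, which is exactly why the approach is convenient and also why it is fragile.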

The third issue is that if you print an object whose class has no working pander method, compiling the document may fail (for example, pander still needs a pander.summary.lm method…).
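
One way to soften the second and third caveats is to call the formatter explicitly and fall back to plain printing when it fails. The following is only a sketch: `safe_print` is a hypothetical helper, and it assumes the chunk option results = "asis" is set as in the example above.

```r
# Try a markdown formatter (pander by default); if it throws an error,
# fall back to base print() wrapped in a fenced block, so that the raw
# console output is still shown verbatim under results = "asis".
safe_print <- function(x, formatter = pander::pander) {
  fence <- strrep("`", 3)
  tryCatch(
    formatter(x),
    error = function(e) {
      cat(fence, "\n", sep = "")
      print(x)
      cat(fence, "\n", sep = "")
    }
  )
  invisible(x)
}

# Inside a chunk one would then write, for example:
# safe_print(summary(cars))
# safe_print(lm(dist ~ speed, data = cars))
```

Note that if a formatter emits part of its output before failing, that partial output will appear ahead of the fallback, so this is a convenience rather than a guarantee.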

To conclude: Thanks to the amazing work by Yihui on knitr, by the people at RStudio, by Jeffrey Horner on markdown, Gergely Daróczi for pander, and many others – it is now easier than ever to quickly create a docx report based on analysis performed using R. It seems that 2012 was a great year for reproducible research, I’m looking forward to 2013…

  • Tyler Rinker

    In the first release of the reports package I included a function `tex2docx` for easy conversion when conferring with colleagues. Based on your blog I see it was remiss not to include an `md2docx` in the package. In the devel version of reports I have added this. Thanks for the markdown to docx perspective. It has been a great year for reproducible research. In the purely `Sweave` days I knew how to, but never made, reproducible (integrated) reports. I do so now freely, with so many great advances by the people/packages you’ve mentioned that make doing so easy.

    • http://www.r-statistics.com/ Tal Galili

      Hi Tyler, that sounds great.

      Feel welcome to leave a comment once you get around to updating the function – I’d be happy to update the post with it.

      Cheers,
      Tal

    • Maxim Kovalenko

      I suppose a conversion of raw LaTeX code (even if only for tables) to docx is rather wishful thinking?

      • http://www.r-statistics.com/ Tal Galili

        Hi Maxim,
        Look at some of the other comments – it is possible (both with pandoc, and with latex2rtf).
        However, I suspect no conversion will be possible for many of the options that LaTeX offers (which is part of the reason LaTeX exists :)).

        Cheers,
        Tal

  • http://twitter.com/_inundata Karthik Ram

    Doing all of this in R seems a little clunky. Why not just create a nice make file in your folder and run that?
    You can knit everything using:
    `Rscript -e "library(knitr); knit('file.Rmd')"`

    Then you can run the pandoc call with all the bells and whistles. I do things like make a pdf (for myself), make a Word and move that to a shared Dropbox folder (for some colleagues), and clean up stuff. That way I don’t clutter my R script with system calls.

    • http://www.r-statistics.com/ Tal Galili

      Hi Karthik,

      On Windows 7 (which is what I currently use), using makefile will probably just add another layer of complexity, see:
      http://stackoverflow.com/questions/2532234/how-to-run-a-makefile-in-windows

      Thanks for sharing your workflow, it sounds great :)

      I often keep my entire project on dropbox, so no moving around is needed.

      Cheers,
      Tal

  • dugite

    Well, Word opens reasonably well-formed HTML, and so I’ve been generating simple ‘Word’ reports for quite some time using just R2HTML to spit out (X)HTML. That said, the R+knitr+rmarkdown+pander+pandoc looks much more flexible.

    • http://www.r-statistics.com/ Tal Galili

      Hi Dugite,
      Thanks for your input.

      Notice that the function in the post can also be used for R2HTML (though it would need some playing with).

      Cheers,
      Tal

    • http://www.facebook.com/profile.php?id=596764567 David Scott

      I use a similar approach, but have extended hwriter in my package hwriterPlus (on CRAN). This produces HTML openable in Word. It is a very low technology solution with hardly anything of a learning curve. hwriter is very similar to R2HTML in conception but the implementation is in my view much cleaner. hwriterPlus uses MathJax rather than ASCIIMathML for rendering LaTeX expressions in a browser. MathJax is becoming the standard for this, and in particular works in all modern browsers.

      • http://www.r-statistics.com/ Tal Galili

        Hi David,

        I’ve used hwriterPlus a few months ago in creating this website:
        http://statil.org/

        (Another project I should write a post about one day…)

        So THANK YOU very much for writing the {hwriterPlus} package :)

        Also, thank you for the MathJax recommendation, it is good to know.

        Only to re-post one of my other comments:

        My rule of thumb is that if most of my R code is analysis, and only a few lines are of output, then packages like {hwriterPlus} are good to use (which is what I’ve done in the above project). However, if I have a good deal of text with the R code, then weaving/knitting is (in my view) a better solution.

        I strongly believe both solutions have a place in the R community.

        Yours,
        Tal

  • Maxim Kovalenko

    It would be nice to have some more details explained in regard to the step 2 (the print method substitution) for somewhat less experienced users (such as myself). At least some references that would allow to understand this better, so that it is possible to amend the code when necessary.

    Also, for word output of tables I have found that latex2rtf does a decent job of converting the tables, better than pandoc actually.

    • http://www.r-statistics.com/ Tal Galili

      Hi Maxim,

      Explaining step 2 requires you to know what S3 methods are (which is a bit beyond the scope of this post). The best tutorial I came across thus far is that of Friedrich Leisch (see page 4): http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

      Once you understand what are S3 classes and methods, you would understand that what I’ve done is simply to substitute the “print” methods (for various classes), from that of base R, with those given in the {pander} package.

      Regarding latex2rtf, I came across it before (even installed it) but didn’t delve much into it. Maybe I should add calls to it from the installr package. Thanks.

      • Maxim Kovalenko

        >Once you understand what are S3 classes and methods, you would >understand that what I’ve done is simply to substitute the “print” methods (for >various classes), from that of base R, with those given in the {pander} >package.

        Ah, this is quite sufficient information, thank you.

        • http://www.r-statistics.com/ Tal Galili

          Cool :)

    • http://www.r-statistics.com/ Tal Galili

      Hi Maxim,

      I wanted to let you know that, based on your comment, I’ve added the function “install.LaTeX2RTF()” to the {installr} package (see https://github.com/talgalili/installr), to allow people to more easily install latex2rtf on Windows.

      Cheers,
      Tal


  • http://twitter.com/raffdoc Rafik Margaryan

    Hi Tal
    I agree with @twitter-267256091:disqus that it is a little bit too much for R to manage everything. I have tried your workflow on a Mac and it works fine. My idea was to use the Google Docs platform for collaboration (the majority of my colleagues use it for collaboration). I found RGoogleDocs, which can download and upload csv and doc files, which is fine. But it is not yet able to upload into shared folders in my Gdrive. Do you have some tips for this kind of workflow?
    Great job Tal

    • http://www.r-statistics.com/ Tal Galili

      Hi Rafik,

      Thanks for the input regarding mac :)

      As to Gdrive, since it is now working like dropbox, then you might as well focus on how to move your files into gdrive on your computer (rather than finding how to upload the files from R on-to google).

      I suspect the function “file.copy” might solve most of the issue you raise.

      p.s: personally, I just have my entire project folder within dropbox.

      Cheers,
      Tal

  • Jean Adams

    Another solution that can be used to produce a somewhat less-refined report in rich text format is the rtf package in R. I have used this package to enable my clients to automatically generate reports with headers, paragraphs, figures with captions and tables with captions. The reports can be viewed by and edited in Word and saved as a Word document.

    • http://www.r-statistics.com/ Tal Galili

      Thanks for the pointer Jean.
      I admit this is the first time I’ve heard of the rtf package, so thanks for sharing!

      I see that {rtf} uses a similar strategy as R2HTML or hwriter, which is a legitimate solution.
      My rule of thumb is that if most of my R code is analysis, and only a few lines are of output, then packages like {rtf} are fine. However, if I have a good deal of text with the R code, then weaving/knitting is (in my view) a better solution.

      With regards,
      Tal

  • Kevin

    I know that it is clunky and not very automated, but my current workflow is to write the code within code block in an org-mode file in emacs. I can then keep the code and output directly in one file that can easily be turned in to a pdf, word document or html. Not the perfect solution, but one that works for me.

  • http://rapporter.net/ rapporter.net

    Awesome post, thank you, Tal!

    Just to propagate “pander” even a bit more, I have created a “pander”-only, rather short version of the above solution, so creating a MS Word document from R with the help of Pandoc: https://gist.github.com/daroczig/5238455

    • http://www.r-statistics.com/ Tal Galili

      Nice!
      Thanks Daroczig :)

    • http://twitter.com/stevepowell99 Steve Powell

      Yes, I use pander myself too sometimes, and I was wondering why, Tal, you don’t take the direct route using pander all the way through rather than knitr with pander functions? What do you see as the disadvantages? I find pander’s caching as good as knitr’s in most cases. I guess pander doesn’t work quite as sweet in Rstudio.

  • Jeff Laake

    I use Sweave or knitr with LaTeX but I’m intrigued by your post because most of my colleagues do work with Word. My only question is how you handle their edits. Aren’t you stuck re-doing their edits from Word in your original document if you want to maintain the reproducibility? This is why I like LyX because it allows my collaborators to make changes to the text with Track Changes and it isn’t difficult to use (although some are even reluctant to use LyX). Then I can accept or reject their changes in the original document and then recreate the document as needed without re-doing their changes. I’ve not used SWord or R2wd, but something along that line would be preferable to avoid duplicative work.

    • http://rapporter.net/ rapporter.net

      For cooperation I would also share the markdown file along with the MS Word document, so that the changes can be made there instead. And there is no need to “track changes” in any text editor; you may fire a `diff` at any time, of course.
      So I really think that converting the markdown to a Word file is just a final step, or an aid to let you know what your document would look like in the long run. And of course sharing the images with co-workers in a markdown/tex file is just not possible, whereas a rendered pdf or Word document might be handy.

      • Jeff Laake

        So they would be making changes with a text editor on the markdown file. I guess that would work but really isn’t much different than them using LyX which is really a text editor with add-ons for LaTeX. The disadvantage of using markdown vs LaTeX is that the latter has far greater capabilities. The one strong advantage of what you are proposing is that the many journals will only accept Word. Thanks again for the post. –jeff

        • http://rapporter.net/ rapporter.net

          Yeah, it is definitely a trade-off. It is just rather my personal opinion that I prefer markdown (after a few years of LaTeX usage) because:
          * it can be read a lot easier without any special text editor,
          * non-technical users can also easily tweak the text,
          * it is becoming a standard nowadays
          * and most importantly: you can transform to pdf/docx/odt/HTML with a simple command, which is not a real option with LaTeX.
          – daroczig

          • http://www.r-statistics.com/ Tal Galili

            Daroczig – I agree (and I’m also after well over a year of heavily using LaTeX).

    • http://www.r-statistics.com/ Tal Galili

      Hello Jeff,

      I admit that my current solution would have been to manually include the edits.

      Personally, I look at the work process as one that creates two documents:

      1) The analysis report

      2) The paper

      Where the paper is mostly a copy-paste of statistics from the analysis report, and most of the textual editing is done in “the paper” document.

      With the workflow proposed in this post, it is just easier to make the copy-pasting.

      In an ideal world, I would have loved to have something that takes a Word file (let’s say that the document just went through revisions and I’ve approved them all), traces back the elements in the Word file which are outside the code chunks, and re-introduces them into the original rmarkdown (or LaTeX, or whatever).

      I believe this is possible but, as both of us would probably guess, it is not likely to be developed.

      I admit that with the people I work with, the option of using LyX will be completely out of the question. The same is true for giving them the raw rmarkdown file.

      However, since we are on the subject, another idea might be to upload the markdown file to google-docs, there to edit it with co-workers, and then to re-use the updated version in R.
      I don’t think I will need this solution in the near future, but thanks for the brainstorming any ways :)

      With regards,
      Tal

      • Jeff Laake

        Your view will work for some situations but I really like to avoid cut/paste. The reason I like the Sweave/knitr paradigm is the ability to completely recreate the paper after changing the data or some nuance without all the cut/paste to create the paper. I guess I was never very good at kindergarten because I usually screw up somewhere with cut/paste.

        I work in ecology with monitoring situations. When data are added each year with just a little work, I can create the report for the new year with confidence that I didn’t screw something up. Also, I work with other scientists that provide me the data and there is usually some screw up there that is caught in the late stages of the paper and with the Sweave/LaTeX approach this is not a problem. If I had to cut and paste the document again I would not be pleased. So if anything I would work with the markdown file and copy their edits from Word into it to maintain the integrity of the document. But from my first look, markdown doesn’t yet seem to be anywhere near as capable as LaTeX for formatting/publishing documents. Now if I had to learn all the LaTeX commands I probably wouldn’t go that route either, but LyX removes that issue for me. I still have a LaTeX book but only have to refer to it infrequently.

        The nice thing is we have a multitude of tools to suit everyone’s needs. –jeff

        • http://www.r-statistics.com/ Tal Galili

          I understand and agree with you Jeff.

          I can easily imagine myself working on my own projects (with fewer collaborators), where I would take the same strategy you described.

          With regards,
          Tal

  • Jaroslaw Piskorski

    I am surprised nobody is mentioning odfWeave. Its syntax is very similar to Sweave and it is a very fast and convenient way to produce a doc document with Libre/Open Office.

    • http://www.r-statistics.com/ Tal Galili

      Hi Jaroslaw,
      You make a valid point, my apologies for not mentioning Odfweave.
      However, its issue is the same as SWord.
      The reason I like the idea of working with knitr, is because that is the system which is going through the most heavy development/debugging. So it is most likely to give access to the most modern possible solutions (for example, caching, just to name one).

      • Jaroslaw Piskorski

        Hello Tal,

        I was not criticising you – I am sure you are right, but odfWeave is what I use and I thought it was worth mentioning.

        I would like to take this opportunity to thank you for R-bloggers. This is a terrific idea! I always start my day by reading it over a cup of coffee. You are changing the world!

        regards
        Jarek

        • http://www.r-statistics.com/ Tal Galili

          Thank you very much Jarek :)

          With regards,
          Tal

    • Jeff Laake

      I’ve not been very successful in getting the created document into Word from Open Office. Still usually some fiddling and the only reason I’d move to word is to get it in .docx format for journals.

      • Jaroslaw Piskorski

        That’s fair enough. I never have any problems with odt->doc, so I stick with what I know. But I will definitely give Tal’s solution a try.

        Jarek

    • http://twitter.com/webbedfeet Abhijit Dasgupta

      I’ve found odfWeave to be a bit quirky and not the easiest to set up and run. The Rmarkdown -> knitr path is more attractive to me since (1) I can easily incorporate math if I need to (yes you can do that in odt), and (2) I can use the same base file to create docx, pdf, LaTeX, HTML, HTML5 using pandoc, incorporating bibliographies and formatting using either CSS or templates. Seems to make my life easier, once it’s set up.

  • jjap

    Thanks for the useful post, Tal. Any idea how you can pipe the system call to pandoc through iconv? From the regular command line I would run: iconv -t utf-8 example.md | pandoc -o example.pdf

    • jjap

      Sorry, more specifically: system(paste0("iconv -t utf-8 ", FILE, ".md | pandoc -o ", FILE, ".pdf")) does not work. Pasting the output from the R console into the Windows console does work, however. It appears to be something with the path of iconv when called from R.

      • http://www.r-statistics.com/ Tal Galili

        Hi jjap,

        Interesting.
        I would be surprised if R could not run, using “shell” (or “system”), something that does run properly from the Windows cmd.
        Can this problem be easily reproduced?

        p.s: In one of my recent commits to the {installr} package, I’ve added the “system.PATH” to easily check the paths on your Windows machine from within R. https://github.com/talgalili/installr

        • jjap

          I probably erred in mentioning a problem with the path , in fact the error messages I get when running my code above are:

          C:\STRAWB~1\c\bin\iconv.exe: |: Invalid argument
          C:\STRAWB~1\c\bin\iconv.exe: pandoc: No such file or directory
          C:\STRAWB~1\c\bin\iconv.exe: -o: No such file or directory

          Probably something trivial, but I can’t seem to nail it down…
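
The error messages above show iconv receiving `|`, `pandoc` and `-o` as if they were file names, which is what one would expect when the command is not run through a shell; on Windows, R's `system()` runs the program directly, whereas `shell()` passes the command line to cmd. One way to sidestep the pipe altogether is to do the re-encoding inside R. A minimal sketch, assuming the .md file is in Latin-1 (the `from` value and the helper name `to_utf8_file` are assumptions made for illustration):

```r
# Re-encode a text file to UTF-8 from within R, so no shell pipe is needed.
to_utf8_file <- function(infile, outfile, from = "latin1") {
  lines <- readLines(infile, warn = FALSE)          # bytes are read unchanged
  utf8  <- iconv(lines, from = from, to = "UTF-8")  # convert each line
  writeLines(utf8, outfile, useBytes = TRUE)        # write the UTF-8 bytes as-is
  invisible(outfile)
}

# FILE <- "example"
# to_utf8_file(paste0(FILE, ".md"), paste0(FILE, "_utf8.md"))
# system(paste0("pandoc -o ", FILE, ".pdf ", FILE, "_utf8.md"))
```

The commented lines show how this might slot into the workflow from the post; the `_utf8` suffix is just a naming choice for the intermediate file.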

  • http://twitter.com/webbedfeet Abhijit Dasgupta

    Hi Tal, nice post.

    I’m going to point out a couple of things. First, an alternative path (not literate programming, mind you) to getting R results into Word on Windows is using R2wd. This solution does very nice formatting of tables, and includes captions and the like. It is based on the COM interface, so it is of course limited to Windows.

    This segues into my second point. I have used this workflow in my work (not using “pander” per se, but essentially Rmarkdown + ascii -> markdown -> pandoc -> docx). One great advantage of this is that it is cross-platform, and doesn’t require you to run Windows or even own Microsoft Office. The pandoc -> docx conversion is actually quite good.

    Why still Word? Well, my collaborators, who are doctors and business people, use Word almost exclusively for their report writing. Heck, even my senior mathematical colleague uses Word instead of LaTeX and gets tons of grief from some of us. So this works very well in sending them tables and summary reports of analyses which they can just cut and paste into their documents. If edits are required or analyses re-run, I can just change the appropriate code and re-run, getting nice formatted tables again without any effort.

    • http://www.r-statistics.com/ Tal Galili

      Hello Abhijit,
      It’s great to see you commenting, and thanks for the compliment :)

      As to your remarks:
      1) Regarding R2wd, I agree (it’s also mentioned in the post)
      2) Regarding your workflow – I’d tell you that the first time I’ve heard about markdown was from something you said in a useR we both attended some years ago. And while I didn’t understand what you were talking about back then – later on it stuck in my mind and led me to delve deeper into what is presented in this post.
      3) I totally agree with you regarding the need to have a Word file (as I’ve mentioned at length in the post itself).

      Yours,
      Tal

      • http://twitter.com/webbedfeet Abhijit Dasgupta

        Hi Tal,

        Yes, that useR in Gaithersburg was fun!! I had just started looking at markdown then, but I had discovered reStructured Text, which was the Pythonic way to go. Markdown, and specially pandoc, is now my basic tool, and the RStudio folks have made it VERY EASY to use.

        • http://www.r-statistics.com/ Tal Galili

          :)

  • http://twitter.com/richierocks Richie Cotton

    This is very useful! In the absence of easy tools, I’ve been using knitr to create HTML pages and manually resaving to docx if necessary. Word is mediocre at this task (it mangles a lot of the formatting), so it requires effort and isn’t scalable. I can see myself making good use of this.

    Since there are a lot of different ways of solving this problem (your toolset/R2wd/rapport/reports, etc.) it would be useful to have a state-of-the-union paper on document creation. The Journal of Stats Software would love that. We should write one! One day, when less busy. And after Yihui’s knitr book has come out.

    • http://www.r-statistics.com/ Tal Galili

      Hi Richie, I’m glad you found my post useful.

      Regarding a joint paper, let’s talk about it when it’s relevant.

      Great to have you visit the blog :)

      Cheers,
      Tal

  • arne

    Tal, thank you for bringing it to my attention.

    Your post was the reason I converted to R-Studio today. I have to use word for the sake of my co-authors, and I always was annoyed when I had to re-format tables in word because of changes in the results after new data became available.

    I am trying your suggested workflow as of tomorrow.
    This could be the game changer I have been waiting for.

    • http://www.r-statistics.com/ Tal Galili

      Hi Arne,

      Many thanks for this comment, you brought a smile to my face.

      BTW, I came across this post today:
      http://timelyportfolio.blogspot.com/2013/04/tables-are-like-cockroaches.html

      Which shows how to construct complex HTML tables using R.
      This means that if you’d choose to do your writing using knitr+HTML (instead of knitr+rmarkdown), you can also produce more complexly structured HTML tables (which could then be changed into .docx using pandoc).
      I’m not sure how well it will look, but my guess is that it would work well…

      With regards,
      Tal

      • arne

        Tal, thanks a lot for the link. Just two weeks ago, I started using knitr. Earlier today, a peer and I presented knitr at a lab meeting, and again met the question “but what about Word documents, we need to submit Word documents to journals.” Thanks to you, we could elaborate a little.

        However, I’m still struggling with the construction of more complicated stuff. For example, quite some googling did not bring up any hints on how to change page orientation (for larger tables).

        Also, I could not find any possibility to center column content on the decimal point. It even seems that this is quite a challenge in HTML as well.

        I would like to borrow your code to reblog a multi-part, step-by-step guide for our internal lab blog, and I hope a lot of other folks also reblog and elaborate on it, as happened below in some reactions.