Reproducible research with R, Knitr, Pandoc and Word

Below I briefly outline why Pandoc is an essential part of my research workflow, and demonstrate how to integrate it with a bibliographic system and code written in R to produce high-quality Word or PDF documents. I also include all the functions needed to get this working quickly.

Knitr is great. I'm writing this in it right now. It 'knits' markdown together with R code and outputs some pretty excellent HTML pages. The difficulty is getting these into Word for final editing, emailing to colleagues, or similar. Try copy and paste, for instance, and you'll get the text and the formatting no problem, but any plots or tables will likely be replaced by Word's 'missing image' graphic. The solution: Pandoc.

There is an R package, pander, which works well. But for full functionality you're best off downloading the Pandoc application from here.

Write an R Markdown document (.Rmd) in RStudio, set your working directory to the location of your file, then compile it as follows:

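# Base name of the document, without the file extension
name <- "demo"
library(knitr)
# Knit demo.Rmd into demo.md
knit(paste0(name, ".Rmd"), encoding = "utf-8")
# Call Pandoc to convert demo.md into demo.docx
system(paste0("pandoc -o ", name, ".docx ", name, ".md"))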

The code above works by running Pandoc on the command line from within R, using system().

And like magic the document is created.
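Pasting a command string together works, but it gives little feedback if Pandoc is not installed, and it breaks on file names that contain spaces. Here is a minimal sketch of a slightly more defensive variant. The helper names `pandoc_args` and `run_pandoc` are my own, made up for illustration; `Sys.which`, `shQuote` and `system2` are base R.

```r
# Build the Pandoc arguments for converting <name>.md to <name>.docx.
# (pandoc_args is a hypothetical helper name, not part of any package.)
pandoc_args <- function(name) {
  c("-o", paste0(name, ".docx"), paste0(name, ".md"))
}

# Run Pandoc only if it can be found on the PATH.
run_pandoc <- function(name) {
  if (!nzchar(Sys.which("pandoc"))) {
    stop("Pandoc was not found on the PATH; install it first.")
  }
  # shQuote() protects file names that contain spaces.
  status <- system2("pandoc", shQuote(pandoc_args(name)))
  if (status != 0) stop("Pandoc exited with status ", status)
  invisible(paste0(name, ".docx"))
}

# Inspect the arguments without running anything.
pandoc_args("demo")
```

After knitting, `run_pandoc("demo")` should do the same job as the `system()` call above, but stops with a message when something goes wrong.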

Add references and a style sheet

Now let's make things a bit more interesting. What about adding references? Go to your reference manager of choice, export a BibTeX file with your library, and save it in the same directory as above. For this step I have so far found Mendeley the best, because it will automatically synchronise with the BibTeX file, so there's no need to re-export the library every single time you add a reference.

Now add references as follows:

O'Hara et al. ran numerous tests to illustrate how log-transforming data consistently gave suboptimal results [@OHara2010].

Which results in:

O'Hara et al. ran numerous tests to illustrate how log-transforming data consistently gave suboptimal results (O’Hara and Kotze 2010).
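A mistyped citation key is easy to miss in a long document. As a rough sketch, the base R helpers below compare the keys cited in the markdown against the keys defined in the BibTeX file. The function names `cited_keys` and `bib_keys` are hypothetical, and the pattern matching is deliberately simple: it assumes keys contain only letters, digits, underscores and hyphens, and that each BibTeX entry starts on its own line.

```r
# Pull citation keys such as OHara2010 out of markdown text.
# Simplification: anything of the form @word counts as a key, so
# e-mail addresses in the text would be picked up too.
cited_keys <- function(text) {
  hits <- regmatches(text, gregexpr("@[A-Za-z0-9_-]+", text))
  unique(sub("^@", "", unlist(hits)))
}

# Pull entry keys out of BibTeX lines such as "@article{OHara2010,".
bib_keys <- function(bib_lines) {
  starts <- grep("^\\s*@\\w+\\s*\\{", bib_lines, value = TRUE, perl = TRUE)
  sub("^\\s*@\\w+\\s*\\{\\s*([^,[:space:]]+).*$", "\\1", starts, perl = TRUE)
}

# Toy input; the second key is deliberately not in the bibliography.
md  <- c("Log-transforming counts is suboptimal [@OHara2010].",
         "This key is deliberately wrong [@NotInLibrary].")
bib <- c("@article{OHara2010,", "  year = {2010}", "}")

# Keys that are cited but not defined.
setdiff(cited_keys(md), bib_keys(bib))
```

In real use you would pass in `readLines("demo.Rmd")` and `readLines("library.bib")` instead of the toy vectors.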

You can also add footnotes, which Word will read in the correct order. Just give them all a unique label, indicating where the note should go, and then subsequently specify the footnote text:

A linear model is inappropriate for count data, as it will predict values below 0 [^mynote1].

[^mynote1]: O'Hara et al. ran numerous tests to illustrate how log-transforming data consistently gave suboptimal results [@OHara2010].

Output:

A linear model is inappropriate for count data, as it will predict values below 0.¹

¹ O'Hara et al. ran numerous tests to illustrate how log-transforming data consistently gave suboptimal results (O’Hara and Kotze 2010).

And that's just about the basics covered. Compile this document with the function below:

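knitsDoc <- function(name) {
    library(knitr)
    # Knit <name>.Rmd into <name>.md
    knit(paste0(name, ".Rmd"), encoding = "utf-8")
    # Convert to .docx, taking citations from library.bib and formatting them with the CSL style
    system(paste0("pandoc -o ", name, ".docx ", name, ".md --bibliography library.bib --csl taylor-and-francis-harvard-x.csl"))
}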

As before, the function uses the command line to execute, only now we've added a few extra options: we've specified where our bibliography lies (remember, we saved this as library.bib), and we've also specified the citation style using the option 'csl'.

Any number of style formats can be downloaded from here, to match whatever journal or style you need to use. Or write your own. Just save it in your working directory, and call it by name, as above.
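If you switch between bibliographies or journal styles, it may be handier to pass the file names in as arguments. The sketch below is one way to do that, under a few assumptions of mine: the names `pandoc_cite_args` and `knits_doc` are hypothetical, the `citeproc` switch adds `--citeproc`, which as far as I know newer Pandoc releases need for citation processing (leave it off for older versions), and `envir = new.env()` follows the tip from one of the commenters below.

```r
# Assemble the Pandoc arguments for a .docx with citations.
# citeproc = TRUE appends --citeproc; whether you need it depends on your
# Pandoc version, so treat it as an assumption to check.
pandoc_cite_args <- function(name, bib = "library.bib",
                             csl = "taylor-and-francis-harvard-x.csl",
                             citeproc = FALSE) {
  args <- c("-o", paste0(name, ".docx"), paste0(name, ".md"),
            "--bibliography", bib, "--csl", csl)
  if (citeproc) args <- c(args, "--citeproc")
  args
}

knits_doc <- function(name, bib = "library.bib",
                      csl = "taylor-and-francis-harvard-x.csl",
                      citeproc = FALSE) {
  # Fail early if the bibliography or style sheet is missing.
  missing_files <- Filter(function(f) !file.exists(f), c(bib, csl))
  if (length(missing_files) > 0) {
    stop("Not found in the working directory: ",
         paste(missing_files, collapse = ", "))
  }
  # A fresh environment keeps variables in the .Rmd from clashing with ours.
  knitr::knit(paste0(name, ".Rmd"), encoding = "utf-8", envir = new.env())
  system2("pandoc", shQuote(pandoc_cite_args(name, bib, csl, citeproc)))
}

# Inspect the arguments without running Pandoc.
pandoc_cite_args("demo")
```

Calling `knits_doc("demo", csl = "some-other-style.csl")` would then compile the same document in a different style, provided that style file sits in the working directory.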

The end product:

The simplest way to really see how great Pandoc is would be to try some of this code. Or compare my markup document with its output.

Other approaches:

A number of other good options exist. You could for instance use the package 'pander' together with a very clever set of utilities, knitcitations, to keep the operation within R. It's a bit more fiddly, and in my experience slightly more buggy (importing into R adds an extra step in which references can get corrupted), but it works remarkably well. For this, see the code below:

# For exporting to Word from within R
library(knitr)
library(pander)
name <- "demo"
knit(paste0(name, ".Rmd"), encoding = "utf-8")
Pandoc.brew(file = paste0(name, ".md"), output = paste0(name, "docx"), convert = "docx")

# Importing references into R
library(devtools)
install_github("cboettig/knitcitations")
library(knitcitations)
bib <- read.bibtex("library.bib.part")

If you do go with the latter option, you might find this function useful: it lets you search within your reference library and returns the citation key for any matches. Very handy for that reference where you can't remember the year of publication:

# Return the citation keys of all entries in bib that match the pattern x
ref <- function(x) {
    bib[grep(x, bib, ignore.case = TRUE)]$key
}
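If you would rather not import the library into R at all, a similar search can be run over the raw BibTeX text. This is only a sketch in base R: `find_keys` is a name I made up, the two entries are invented for illustration, and it assumes every entry starts on its own line in the form `@type{key,`.

```r
# Search raw BibTeX lines for a pattern and return the keys of matching entries.
# Assumes every entry starts on its own line with "@type{key,".
find_keys <- function(pattern, bib_lines) {
  is_start <- grepl("^\\s*@\\w+\\s*\\{", bib_lines, perl = TRUE)
  # Number the entries: each start line opens a new group.
  entry_id <- cumsum(is_start)
  keep <- entry_id > 0
  entries <- split(bib_lines[keep], entry_id[keep])
  # Flag the entries in which any line matches the pattern.
  hit <- vapply(entries,
                function(e) any(grepl(pattern, e, ignore.case = TRUE)),
                logical(1))
  first_lines <- vapply(entries[hit], function(e) e[1], character(1))
  # Extract the key from the first line of each matching entry.
  unname(sub("^\\s*@\\w+\\s*\\{\\s*([^,[:space:]]+).*$", "\\1",
             first_lines, perl = TRUE))
}

# Two made-up entries, for illustration only.
bib_lines <- c(
  "@article{Example2010,",
  "  title = {Count data and transformations},",
  "}",
  "@book{Sample2012,",
  "  title = {An unrelated book},",
  "}"
)

find_keys("count data", bib_lines)
```

With a real library you would call it as `find_keys("kotze", readLines("library.bib"))`.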

13 comments:

  1. Although the "pander" package can be useful for converting markdown text to other formats building on Pandoc, it's worth mentioning that the package was built to provide features similar to "knitr", besides "converting" almost any R object to markdown.
    Please see http://rapporter.github.com/pander/#brew-to-pandoc for details.

    1. Awesome, thanks. It looks a really handy package

  2. I recently wrote a blog post about this where I wrap this up into a package called `reports`: http://trinkerrstuff.wordpress.com/2013/02/24/workflow-w-reports-package/ I believe we're working on very similar ideas at the same time. In fact I now look and see our blog posts went to Rbloggers the same day.

    1. Hi Tyler, I haven't had a play with 'reports' just yet, but I have found your qdap package extremely useful. Do you have any ambitions of scaling it up to be more friendly for languages other than English? I keep finding in my work that half the tests I might want to do are dependent on English language dictionaries/lists/syntax. Integration with a translation API would go a long way to solve this! Cheers, R

    2. Hey Rolf, I just saw you commented back (2 months later), sorry about that. Right now qdap is definitely geared toward English. I'd love to see integration, but I don't think that's possible at the moment, because every language functions so differently from another. Many of the functions use an English-specific algorithm to compute descriptives. Integration would require a team of people with heavy-duty skills, language abilities, and logic beyond my own. This would essentially require a complete rewrite of 1/3 of qdap's functions. In my own work I have encountered a need to extend qdap to Korean but lack the knowledge of the language to even understand if my coding is correct. I am very open to a team of people making this a reality but do not think this is likely.

  3. I'm having some issues with getting images into the pdf, latex, docx from pandoc. I can knit up my Rmd files nicely and they put the images into the html as raw data. I can even not do that and verify that the *.md file links to the chunk-*.png files as necessary, but none of the pandoc outputs have images embedded in them. Any suggestions?

    R

    1. Sorry to have ignored your comment for so long. I have found pandoc to do well with R generated graphics, but that for src links they have to be stored in the root folder and linked appropriately. This is what I did the other day when I encountered this problem: copied my images into the figure folder generated by R in the root, and used this notation:

      ![](figure/memoryEvent.png)

      Hope that works for you, best, R

  4. It is easy to get started with Word, but it will be hard (or simply hell) to maintain, unless your colleagues promise not to touch the Word document when they read it. I had the pandoc integration in mind since the very beginning of knitr: https://github.com/yihui/knitr/issues/206 and I'm still thinking what would be the ideal approach. Hopefully there will be something interesting coming out in the near future. We are working on it.

    BTW, if the results are read only, I do not quite understand how Word could be an advantage for reading purposes. Isn't PDF way better? :)

    1. Thanks for the comment, Yihui! I must admit my need for Word integration is not really compatible with proper reproducible research. It's more that so many people express a strong preference for seeing Word files, for everything from article submissions to colleagues making 'track changes' style alterations. I also keep finding that I need multi-language spell checking options, but that's just my research. If there was any way of getting these changes back into the Rmd file that would be great, but at the moment it's manual labour for me on that count!

  5. Thanks for this article! I invested a lot of time into learning LaTeX as part of literate programming principles (in Emacs first, to boot). I have, reluctantly at first, switched to RStudio and markdown (via knitr. Yay, knitr!). I really appreciate the advice on how to integrate pandoc. I need to produce long reports (with internal links) with the option to publish as PDF or html AND Word. If I want my supervisor's blessing to invest time into the literate programming approach, it needs to be really smooth. In my field, clinical epidemiology, manuscripts are submitted to journals as Word files. Also, my colleagues do not work with Linux and are not comfortable with a wide variety of scripting languages. Installing new software on office PCs means going through IT... I want to create a folder that I can archive for others that contains finished documents in whatever format they want and that allows them to recreate documents from the source files as simply as possible. It's pretty important that they have to install as little new software as possible, e.g. RStudio plus the required packages, and do everything from within RStudio. (It's nice for me, too, because I'm not much more advanced in computing.) I feel a little uneasy that my supervisor can change the Word report and it won't be reflected in the source files, but that's how she works and I'm there to help. I think I will sell literate programming better by being accommodating than being purist.

    In any case, I am grateful to you R gurus!

    Tanya

    1. Hi Tanya, I have the same situation with my supervisor, which is why I looked into Pandoc in the first place. I just can't think of a good way to get changes back into the Rmd file. Possibly by saving as a txt file and finding some clever way to substitute edited text chunks. It's not high on my todo list right now, though! And Yihui is sure to come up with something much cleverer soon, anyway =) Best, R

  6. Thanks for the nice post! I have been using knitr for my data analyses, but was hesitant to use it for writing anything more substantial because I did not know how to incorporate references before I found this.

    One piece of advice. The `knit` function has the optional argument `envir`. By default the RMarkdown file is knit in the parent environment. Thus variables defined in the RMarkdown file can clash with variables in the parent environment. For example, if `demo.Rmd` assigns a value to the variable `name`, your next line using pandoc would not work. The solution is to set `envir = new.env()` when calling `knit`. This way no matter what variables your blog readers use in their RMarkdown files, your code would still work.

  7. Thanks for the great post, and an excellent blog in general.

    -Matthew A. Simonson PhD
