Analyzing coverage of R unit tests in packages – the {testCoverage} package

(guest post by Andy Nicholls and the team of Mango Business Solutions)

Introduction

Testing is a crucial component in ensuring that the correct analyses are deployed. However, it is often considered unglamorous: a poor relation in terms of the time and resources allocated to it when developing a package. But with the increasing popularity and commercial application of R, testing is a subject that is gaining significantly in importance.

At the time of writing there are 5987 packages on CRAN. Due to the nature of CRAN and the motivations of contributors, the quality of packages varies greatly. Some are very popular and well maintained; others are essentially inactive, with development having all but ceased. As the number of packages on CRAN continues to grow, determining which packages are fit for purpose in a commercial environment is becoming an increasingly difficult task. There have been numerous articles and blog posts on the subject of CRAN’s growth and the quality of R packages. In particular, Francis Smart’s R-bloggers post entitled Does R have too many packages? highlights five perceived concerns with the growing number of R packages. I would like to expand on one of these themes in particular, namely the “inconsistent quality of individual packages”.

There are many ways in which a package can be assessed for quality. Popularity is clearly one: if lots of people use it then it must be quite good! But popular packages also tend to have authors who actively develop their packages and fix bugs as users identify them. Development activity is therefore another factor, along with the length of time that a package has existed; the package dependency tree and the number of reverse ‘Depends’, ‘Imports’ and ‘Suggests’; the number of authors and their reputation; and finally there is testing. Francis briefly mentions testing in his post, noting that “testing is still largely left up to the authors and users”. In other words, there is no requirement for an author to write tests for their package, and often they don’t!

Testing

It is standard practice to test commercial software at both the unit and system level. In other words, tests are written both for the individual components of the software and for the software as a whole. Through Continuous Integration (CI), any change to the source code results in a rebuild of the package and a re-run of any unit tests. This is essentially what happens when a package is submitted to CRAN. However, there is no requirement for an R package to contain any kind of formal test structure. Below I have written a brief script to count how often R’s three unit testing packages, testthat, RUnit and svUnit, are referenced by other packages via the ‘Depends’, ‘Imports’, ‘LinkingTo’ or ‘Suggests’ fields available when building an R package.

# Current packages on CRAN
download.file("https://cran.R-project.org/web/packages/packages.rds",
              "packages.rds", mode = "wb")
cranPackages <- as.data.frame(readRDS("packages.rds"), stringsAsFactors = FALSE)

# Count how often each testing package is referenced in the
# 'Depends', 'Imports', 'LinkingTo' or 'Suggests' fields
depFields <- do.call(paste, cranPackages[c("Depends", "Imports", "LinkingTo", "Suggests")])
sapply(c("testthat", "RUnit", "svUnit"), function(pkg) sum(grepl(pkg, depFields)))

These numbers equate to around 8% of R packages on CRAN containing any kind of recognised test framework. An author can also implement their own test framework, which is not captured in the previous statistic. During a Q&A session at the inaugural EARL Conference this year, a fellow audience member pointed out that you could consider the examples in the help documentation to be tests, since they must run successfully to pass R CMD check. But I would argue that this only really tests that the code runs, not that it produces some expected output. Overall, the level of testing of R packages is very low. Further, having a test framework does not necessarily mean that every line of source code within a package has actually been tested. This is where the testCoverage package can help.
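To make the distinction concrete, here is a minimal sketch (an illustration of the difference, using testthat; not taken from any particular package). A help-file example only has to run without error, whereas a unit test also asserts an expected result:

# A help-file example only needs to run without error to pass R CMD check
sqrt(2)

# A unit test goes further and asserts an expected result
library(testthat)
expect_equal(sqrt(2)^2, 2)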

testCoverage

The idea of test coverage, i.e. the percentage of the source code that is exercised by the test framework, is not new. It has been around pretty much since the beginning of formal software development. However, until now it has not been implemented for R.

The basic idea is very simple. First consider the following simple function which forms our ‘source code’:

# sourceFile.R
absFun <- function(x){
  if( x < 0 ){
    -x
  } else if( x >= 0 ){
    x
  }
}

Now let’s imagine a very simple unit test for the absFun function that will return TRUE if the test passes and print a useful message otherwise:

# testFile.R
result <- absFun(-5) == 5
if( result ){
  TRUE
} else {
  print("Test failed: absFun(-5) did not return 5")
}

Clearly this standalone unit test does not test what happens when x is zero or positive. In other words, it never hits the ‘else’ section of the code, and so the test does not ‘cover’ 100% of the source code. Most observers would conclude that it covers 50% of it. This coverage concept is the basis of Mango’s testCoverage package.
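For example, a second test case (a hypothetical extension of ‘testFile.R’, not part of the original example) that exercises the other branch would cover the remaining code:

# Together with the first test, this case hits the 'else if' branch
result2 <- absFun(5) == 5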

The testCoverage package makes use of the trace functionality in R. It works as follows. First it reads the source files within a package and replaces symbols within the code with unique identifiers. Each identifier is then injected into a tracing function that reports every time the corresponding symbol is called/hit by your test framework. The first symbol at each level of the expression tree is traced, allowing the coverage of code branches to be checked.
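As a rough illustration of the underlying mechanism, here is a simplified sketch using base R’s trace function. It only counts entries into the function as a whole, whereas testCoverage traces individual symbols in the expression tree, so treat this as an analogy rather than the package’s actual implementation. It assumes the absFun definition from ‘sourceFile.R’ above has been loaded:

# Record each time absFun is entered while the tests run
hitCount <- 0
trace("absFun", quote(hitCount <<- hitCount + 1), print = FALSE)
result <- absFun(-5) == 5   # run the test
untrace("absFun")
hitCount                    # 1: the traced function was hit once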

Consider the earlier absFun example. The symbol x appears 4 times in our source file. It is hit twice by ‘testFile.R’ and therefore the reported coverage is 50% (2/4). The testCoverage package presents this raw coverage statistic along with an interactive HTML report in which it is easy to see exactly which lines of the source code have been hit by the test framework and which have not (see below).

[Screenshot: testCoverage interactive HTML report, highlighting covered and uncovered lines]

To make best use of testCoverage it is important to understand how the coverage statistic is calculated. If a plain ‘else’ had been used in this example (as opposed to the ‘else if’ above), then x would appear only 3 times and would still be hit twice by the test framework. The reported coverage in this case would therefore be 67%, even though the source code is (arguably) functionally almost identical. What should be clear, however, is that if you hit all of your trace points then your package will score 100%, and if you have no tests it will score 0%.
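For illustration, the plain ‘else’ variant described above would look like this:

# Variant of sourceFile.R with a plain 'else' branch
absFun <- function(x){
  if( x < 0 ){
    -x
  } else {
    x
  }
}
# x now appears 3 times in the body; the same one-sided test hits
# 2 of them, so the reported coverage becomes 2/3 (roughly 67%)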

Road Map

So what is next for testCoverage? The package has been released publicly and currently lives on GitHub. It is by no means complete, however, and there is a clear roadmap of features that Mango would like to incorporate into the package. For example, testCoverage does not yet support S4 classes, and this is currently a development priority.

It should also be noted that the current implementation masks core functionality in R, and there is a start-up message which recommends restarting R after using testCoverage. We are investigating ways of running testCoverage that avoid this masking. If it can be avoided then this would be the point at which we would look to release onto CRAN.

One of the key areas that we are looking at is integration with other services such as Continuous Integration (CI) platforms. We would also like to expand the scope beyond R code to include C code called from R. This would involve inserting trace points into the C code and using modified .C or .Call functions to hand profiling data back to R, as traditional tools such as gcov do not integrate with the way R loads compiled C functions.

Conclusion

Testing is a vital aspect of software development which is often overlooked by R package authors. testCoverage provides a tool for assessing the level of test coverage within an R package. As the commercial adoption of R continues to increase, both the importance and the level of testing are set to increase too. Mango’s aim is that testCoverage becomes one of the many tools and metrics that package developers use when building their packages, and one that users look to when assessing the quality of an R package. A testCoverage league table is perhaps not too far away!

8 thoughts on “Analyzing coverage of R unit tests in packages – the {testCoverage} package”

  1. Unfortunately, your testing has failed to find the bugs in your absFun function. If you say absFun(NA), it will produce an error rather than return NA. And if its argument is an object of a class that returns FALSE for both comparisons (e.g., a class that considers comparisons to NA to always be FALSE, as is the case in standard IEEE floating point), then absFun will return NULL, which doesn’t seem to be what’s intended. Stylistically, it’s bad to put what’s intended to be a redundant comparison in the else part, rather than just using the negation of the first comparison.

    1. … and what exactly does that have to do with the objective of the blog post…? Of course they could have presented a 100% error-proof implementation of abs, but this probably would have been three times as long and would not have served any better for explaining how the coverage calculation works…
      Apart from that: Nice work! Just several weeks ago I was quite disappointed to realize that there’s currently no test coverage tool for R. I was really waiting for this! 🙂

      1. Well, it shows that looking at test coverage may not be all that good an indication of whether or not a function has been adequately tested…

        1. What we are trying to achieve with testCoverage (or indeed what the concept of test coverage in general tries to achieve) is a measure of code quality via a simple metric. When deciding whether to use a package, this metric should be taken into consideration along with a number of other factors, such as those outlined in the article’s introduction. As you suggest, a really comprehensive test suite would test the same piece of code multiple times with a variety of plausible input arguments, but even then the code may still have bugs. So a 100% score on test coverage does not mean that the code works for all scenarios, but it does indicate a degree of effort from the author.

  2. Sorry for the slightly off-topic comment, but there are some R packages that do not list any testing package in the DESCRIPTION file (as it has nothing to do with building the package, nor should CRAN servers run the tests when checking the package), but still have tests in “inst/tests” and also e.g. a “.travis.yml” file with instructions for CI. Maybe the distribution of packages with at least one file in “inst/tests” would paint a somewhat more positive picture 🙂

    1. I think it would be an interesting exercise to try to look through inst/tests for packages on CRAN to get an idea of the level of testing on the whole. I would be curious to see how many authors write tests outside of one of the frameworks mentioned in the article.

      With regard to the DESCRIPTION file, however, whilst I agree that some authors may not list any testing package there, I doubt that many get away without doing so when submitting to CRAN. Authors *should* list packages used for testing under the ‘Suggests’ field, as this is one of the intended purposes of the field. The ‘official’ word on this in the ‘Writing R Extensions’ manual is that: “The ‘Suggests’ field uses the same syntax as ‘Depends’ and lists packages that are not necessarily needed. This includes packages used only in examples, tests or vignettes…”. Additionally, if the test suite loads the testing package via ‘require’ or ‘library’ then it will need to be included in one of the ‘Depends’, ‘Imports’ or ‘Suggests’ fields in order to pass R CMD check, which is required for CRAN submission.

      1. Although this makes sense for sure, running the tests is outside the package’s scope for some users, and the `require` or `library` calls needed for the testing packages can be found instead in the `.travis.yml` file, which has nothing to do with “R CMD check” (e.g. in my packages). This is on purpose: I do not run tests on “R CMD check” to save resources on CRAN. Anyway, this is a really minor issue, and your package rocks. Can you please look at this issue I opened recently: https://github.com/MangoTheCat/testCoverage/issues/8 Thanks, Gergely

  3. This is good work. I also wrote function-level test coverage in a handful of extra lines, presented in this pull request to testthat (under consideration by Hadley Wickham):
    https://github.com/jackwasey/testthat (see R/test-package.r ). I figured that at least testing each function would be 90% of the battle. For covering all the code paths, your package is great, other than masking ‘library’, ‘require’ and ‘data’, which is pretty scary, but maybe unavoidable in your approach.

    ‘trace’ is definitely the correct approach since it captures every step, unlike the sampling approach of ‘profile’. Tracing every code path, however, hits performance very hard. There is also the danger of forcing the user into spending too much time over-specifying tests for trivial functions: this could definitely be argued both ways, of course.

    It is also implemented as an independent function in
    https://github.com/jackwasey/jwutil
