What is R? It seems like a simple question, but I fear this is going to be a long post.
After all the positive comments I would like to raise some concern about some of the non-standard R packages. Twice I have found a serious error in an R package (not a bug, an error in the algorithm). The authors of the first one did not reply; the authors of the second one said they know about it but do not have the time to fix it. I wonder how many packages, especially those written by PhD students and postdocs, which I am sure work for their own projects, really work correctly in all situations? Not all authors work on their packages as hard and as well as, e.g., D. Bates with his lme4 (GLMM) package, and even he and his users still discover bugs and flaws. I do not want to criticise R; I am using it and I believe that the core packages are as valid as those from commercial software (or better), but as I said, I have doubts about some of the rarely used ones.
It's a fair point, but needs some clarification for readers not familiar with R. The distinction is one between "official R" and user-contributed code (which is what the commenter above is discussing).
By "official R" I mean the R project, under the control of the R Core Group. This is what you get when you download R from the CRAN website, and what's included in the REvolution R distribution. This includes both the R interpreter (the code that implements the language at the heart of R), and the various statistical functions included in the official R distribution. These components and functions are all managed under a strict software development lifecycle, and have the highest reputation for accuracy and reliability. This is what makes R suitable for all statistical analysis applications where you need the utmost confidence in the result, such as the analysis of clinical trial data.
This is R. Now, R isn't just a closed statistical analysis environment. It's also designed to be a platform for other individuals to create their own methods and applications. Research institutions, academics and, yes, students, use R to implement brand-new statistical methods as part of research projects (or, sometimes, just for fun). They collect these new functions into collections called "packages" and upload them to a section of CRAN dedicated to user contributions. (This is distinct from the area in CRAN where the official R distribution is found.) Some of these user-contributed packages are major bodies of work in their own right, regularly maintained and tested by their respective authors. Some are student projects, long since abandoned. Just as when using a SAS macro downloaded from a website, or installing a third-party Excel add-in, you'll need to rely on the reputation of the author (or the recommendation of trusted peers) when deciding whether to use such third-party code.
If you're in the habit of downloading packages from CRAN, how do you tell if a function you're using is an official R function, and not a user-contributed one? One easy way is to use the function find, which will tell you which package the function comes from. For example, let's check the function nls (nonlinear least squares):
> find("nls")
[1] "package:stats"
This tells me that the nls function comes from the stats package. The official R distribution includes a number of standard packages. (These packages are divided into two groups -- the "base" and "recommended" packages -- but the distinction isn't important here as they all fall under the same software development lifecycle and are all part of "official R" as defined above.) If the function comes from any of the following packages, it's considered official:
Official R packages (Base and Recommended)
base, boot, class, cluster, codetools, datasets, foreign, graphics, grDevices, grid, KernSmooth, lattice, MASS, methods, mgcv, nlme, nnet, rpart, spatial, splines, stats, stats4, survival, tcltk, tools, utils
This list has grown as R has matured, but the list above is applicable to R version 2.7.2 and above, and REvolution R version 1.2.3 and above.
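If you'd rather not compare against the list by hand, the lookup can be scripted. What follows is only a rough sketch: is_official is a hypothetical helper name of my own, not something that ships with R, and it assumes the Priority field reported by installed.packages correctly marks the base and recommended packages in your installation (so it reflects your installed version of R, not necessarily the list above). Like find, it only looks at packages that are currently attached.

```r
# Names of installed packages whose Priority is "base" or "recommended".
# priority = "high" is shorthand for c("base", "recommended").
official_pkgs <- rownames(installed.packages(priority = "high"))

# is_official() is a hypothetical helper, not part of R.
# It returns TRUE only if every place on the search path that defines
# `fname` is a base or recommended package.
is_official <- function(fname) {
  where <- find(fname)                    # e.g. "package:stats"
  pkgs  <- sub("^package:", "", where)    # strip the "package:" prefix
  # Anything defined elsewhere (e.g. ".GlobalEnv") will not match
  # official_pkgs, so a user-defined function of that name gives FALSE.
  length(pkgs) > 0 && all(pkgs %in% official_pkgs)
}

is_official("nls")
```

A name that isn't found anywhere on the search path also gives FALSE, so a FALSE result means "not confirmed as official" rather than "definitely user-contributed".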
So, to sum up: R, drawing on the expertise and control of the R Core Group, has an excellent reputation for accuracy and reliability, on par with or even exceeding that of commercial software packages like SAS or SPSS. It's suitable for any statistical analysis where you must rely on the results. All of this applies to the R distribution on CRAN, and to the REvolution R distribution, both of which comprise the official packages listed above. When it comes to user-contributed packages you download and install yourself, you're no longer using code under the control of the R Core Group, in which case -- as with all third-party code -- you must rely on the reputation of the author of that package.
That's a long answer to a seemingly simple question. But I hope it clears things up.
Hi,
If one goes to the CRAN page for contributed packages, it states that there are over 1715 packages, and that "All packages are tested regularly on machines running Debian GNU/Linux. Packages are also checked under MacOS X and Windows, but only at the day the package appears on CRAN." It makes no mention of any of the points you make, nor does it tell the user that R does not support these packages (as stated in http://www.r-project.org/doc/R-FDA.pdf). Would it not be best to have this information on the same page as that from which these packages can be downloaded?
Posted by: Martin Holt | March 26, 2009 at 12:58
Some are student projects, long since abandoned. Just as when using a SAS macro downloaded from a website, or installing a third-party Excel add-in, you'll need to rely on the reputation of the author (or the recommendation of trusted peers) when deciding whether to use such third-party code.
This is true up to a certain point, but unlike add-ons for other statistical software, an R package needs to meet formal criteria and pass the R package checker (R CMD check) before being admitted to CRAN. This is not at all the same as getting a macro from a web page, as the R package checker assures: presence of documentation, consistency of documentation and code, syntactic correctness, platform-independence etc.
This does not certify the software does what one expects it to do, but at least assures it meets minimum quality standards.
The ability of R packages to include automated software testing inside the package (and let the R package checker run these tests) is another main quality of R packages, but that is a bit off topic as this is merely a tool that can be used with different levels of sophistication or not at all.
Posted by: Tobias Verbeke | March 26, 2009 at 13:38
As with many products, one has to be aware of the potential limitations and familiarize oneself with the system. From time to time I have found problems in R, but I have also found problems in other systems. Besides the classical problems in Excel, around 2002/3 I remember facing problems with SAS GLM, where factors would be significant using one version and not when using the latest version. The problem was subsequently fixed.
The core of R is becoming more polished with each release, but one always needs to be aware of the distinction between core and contributed packages. Good and informative article.
Posted by: Luis | March 26, 2009 at 19:27
A reviewing system for contributed packages could alleviate the problem. This could be as simple as a commenting system coupled to CRAN, where package users can post their feedback on the packages.
This won't be a sure-fire way to guarantee package quality and correctness, but one could at least get an immediate idea about possible issues with a package.
Posted by: wwwald | March 27, 2009 at 01:34
The link to "analysis of clinical trial data" is definitely worth following up, for those who work in a regulatory environment. The links must be read: they promote the enthusiasm whilst honestly presenting some of the cons. For example, FDA guidance currently emphasizes end-user validation of software, and the summary slide of the Novartis link emphasizes the considerable resources required. More background information can be found on the MedStats mailing list, http://groups.google.com/group/MedStats, in particular the posting by Marc Schwartz on 26/03/2009 16:22 (which bears my name because I forwarded it to the list). Note that the above concern about resources applies to SAS, SPSS, etc., not just R. I guess the successful supplier will be the one which best facilitates this end-user validation.
Posted by: Martin Holt | March 29, 2009 at 08:27