R user John Christie points out a handy feature introduced in R 2.10: you can now read directly from a text file compressed using gzip or other file-compression tools. He notes:
R added transparent decompression for certain kinds of compressed files in the latest version (2.10). If you have your files compressed with bzip2, xz, or gzip they can be read into R as if they are plain text files. You should have the proper filename extensions.
The command...
myData <- read.table('myFile.gz')
#gzip compressed files have a "gz" extension
Will work just as if 'myFile.gz' were the raw text file.
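If you would rather be explicit about the compression format, base R also provides the connection functions gzfile(), bzfile() and xzfile(). Here is a minimal sketch, assuming a small made-up data frame and a temporary file (both are illustrative, not part of John's example), showing that the two ways of reading give the same result:

```r
# Illustrative data and file name; neither comes from the original tip.
tmp <- tempfile(fileext = ".gz")
df <- data.frame(id = 1:5, value = c(0.5, 1.5, 2.5, 3.5, 4.5))

# Write the table through a gzip connection so it is compressed on the fly.
con <- gzfile(tmp, open = "wt")
write.table(df, con, sep = ",", row.names = FALSE)
close(con)

# 1. Pass the file name and let R handle the decompression transparently.
a <- read.table(tmp, sep = ",", header = TRUE)

# 2. Open an explicit gzip connection (bzfile() and xzfile() are used the same way).
b <- read.table(gzfile(tmp), sep = ",", header = TRUE)

print(identical(a, b))
```

The explicit form is mostly useful when you want the code to document which format it expects.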
Compressing a large ASCII data file can certainly save disk space: for a file containing mostly numbers, a 50%+ reduction in file size is typical. (John's example of reducing a 100 MB file to 500 KB is surprising to me though -- perhaps it was binary data?) But does this space saving come at a cost in speed when it comes time to read the file in? On the downside, the CPU does have to decompress the file before R can read it in. On the other hand, CPUs are pretty fast these days, and perhaps the time required to decompress the file is less than the additional time it would take to read the uncompressed data from disk. Let's try it out and see.
First, let's create a big object in R and write it to a file. To optimize the potential for compression, we'll use purely numeric data.
X <- matrix(rnorm(1e7), ncol=10)
write.table(X, file="bigdata.txt", sep=",", row.names=FALSE, col.names=FALSE)
Next, let's compress the file with gzip (making a copy first, so we can retain the uncompressed version):
system("cp bigdata.txt bigdata-compressed.txt")
system("rm -f bigdata-compressed.txt.gz")  # remove any stale copy; -f avoids an error if none exists
system("gzip bigdata-compressed.txt")
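The system() calls above assume a Unix-like shell with cp, rm and gzip available. If that is not the case, the same step can be sketched in base R alone by streaming the file through a gzfile() connection. The helper name gzip_copy and its arguments are hypothetical, and the demo uses a small temporary file rather than bigdata.txt:

```r
# Hypothetical helper: copy `src` to a gzip-compressed `dest`, leaving `src` intact.
gzip_copy <- function(src, dest = paste0(src, ".gz"), chunk = 1e6) {
  inn <- file(src, "rb")
  on.exit(close(inn))
  out <- gzfile(dest, "wb")
  on.exit(close(out), add = TRUE)
  repeat {
    bytes <- readBin(inn, what = "raw", n = chunk)  # read up to `chunk` bytes
    if (length(bytes) == 0) break                   # end of file reached
    writeBin(bytes, out)                            # compressed as it is written
  }
  invisible(dest)
}

# Small demonstration file (illustrative contents only).
src <- tempfile(fileext = ".txt")
writeLines(sprintf("%.6f", seq(0, 1, length.out = 5000)), src)

dest <- gzip_copy(src)
print(file.info(c(src, dest))$size)                 # uncompressed vs compressed bytes
print(identical(readLines(dest), readLines(src)))   # contents survive the round trip
```

Reading in chunks keeps memory use flat, which matters for a file the size of the one in this test.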
Compressing the file reduces its uncompressed 182 MB size on disk by about 54%:
> compr <- file.info("bigdata-compressed.txt.gz")$size
> big <- file.info("bigdata.txt")$size
> print(c(big, compr))
[1] 181819246 84528258
> print(1-compr/big)
[1] 0.5350973
So now for the acid test: is it quicker to read the compressed or uncompressed file?
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
170.901 1.996 192.137
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
65.511 0.937 66.198
There you go: reading the compressed file is nearly three times faster, despite the time required by the CPU to decompress it first. Impressive! By the way, you get similar (if less striking) results using the low-level (and faster) function scan instead of read.table:
> system.time(scan("bigdata-compressed.txt.gz", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
19.582 0.310 20.071
> system.time(scan("bigdata.txt", sep=",", what=rep(0,10)))
Read 10000000 items
user system elapsed
30.781 0.270 31.369
For comparison, writing the uncompressed file in the first place took about 30 seconds. All of these timings were done on a fairly powerful dual-core MacBook Pro, so as always your mileage may vary. But it does seem that on modern hardware where the CPU performance exceeds the disk performance, compressing files is the way to go.
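If you want to see how your own machine compares, here is a scaled-down sketch of the same experiment. The matrix size, the temporary file names and the time_read helper are all assumptions made for illustration. Note also that a freshly written file is likely to be sitting in the operating system's disk cache, which can hide much of the disk-read cost, so treat the numbers as rough indications only:

```r
set.seed(42)
X_small <- matrix(rnorm(1e5), ncol = 10)   # far smaller than the 1e7 values used above

plain <- tempfile(fileext = ".txt")
gz    <- tempfile(fileext = ".txt.gz")

# Write the same data once as plain text and once through a gzip connection.
write.table(X_small, file = plain, sep = ",", row.names = FALSE, col.names = FALSE)
con <- gzfile(gz, open = "wt")
write.table(X_small, file = con, sep = ",", row.names = FALSE, col.names = FALSE)
close(con)

# Hypothetical helper: median elapsed seconds over a few repeated reads.
time_read <- function(path, reps = 3) {
  median(replicate(reps, system.time(read.table(path, sep = ","))[["elapsed"]]))
}

timings <- c(plain = time_read(plain), gzip = time_read(gz))
print(timings)
```

Increasing the first argument to rnorm() moves the test closer to the conditions of the timings reported here.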
r-help mailing list: Dec 1 2009 Tip of the Day (John Christie)