By David Smith
If you want to get data out of R and into another application or system, simply copying the data as it resides in memory generally isn’t an option. Instead you have to serialize the data (into a file, usually), which the other application can then deserialize to recreate the original data. R has several options to serialize data frames:
- You can serialize (export) data to comma-separated files (CSVs), which can be imported by just about any application. R has several packages to read and write CSVs, including fwrite and fread from the data.table package. The downside is that as an ASCII format, CSVs are inefficient, particularly for numeric data.
- The base R function saveRDS (and its deserialization counterpart, readRDS) can write any R object to a file. This is a fairly efficient binary representation of the data, but not many applications can read RDS files.
- The feather package provides the functions write_feather and read_feather, which use an efficient binary format based on the open Apache Arrow framework.
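As a rough illustration of the options in the list above, here is a minimal round-trip sketch. The data frame and file paths are made up for the example; the data.table and feather calls run only if those packages happen to be installed, and base R's write.csv/read.csv stand in for the CSV case otherwise.

```r
# Small example data frame (hypothetical data, for illustration only)
df <- data.frame(
  id    = 1:5,
  value = c(0.5, 1.25, 2, 3.75, 4),
  label = letters[1:5],
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV: plain text, readable by almost any application
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# RDS: R's native binary serialization, readable mainly by R
saveRDS(df, rds_path)
df_rds <- readRDS(rds_path)

# Optional: faster CSV reader/writer from data.table, if installed
if (requireNamespace("data.table", quietly = TRUE)) {
  dt_path <- tempfile(fileext = ".csv")
  data.table::fwrite(df, dt_path)
  df_dt <- data.table::fread(dt_path)   # returns a data.table
}

# Optional: feather format, if installed
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  feather::write_feather(df, feather_path)
  df_feather <- feather::read_feather(feather_path)   # returns a tibble
}

# Check that the base R round trips recovered the original data
identical(df, df_rds)
all.equal(df, df_csv)
```

The RDS round trip preserves the object exactly, while the CSV round trip has to re-infer column types from text, which is one reason CSV is the most portable option but not the most faithful one.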
And now there’s a new package to add to the list: the fst package. Like the data.table package (the fast data.frame replacement for R), the primary focus of the fst package is speed. The chart below compares the speed of reading and writing data to/from CSV files (with fwrite/fread), feather, fst, and the native R RDS format. The vertical axis is throughput in megabytes per second; more is better. As you can see, fst outperforms the other options for both reading (orange) and writing (green).
The fst package achieves these impressive benchmarks thanks to the magic of compression. Even in this modern age of fast, solid-state storage, it’s still (usually) faster to spend CPU time compressing the data first than to simply write a larger file to disk. (The same applies to decompression, and because that’s an easier task than compression, there are even more performance gains to be had when reading.) The benefits depend on the data itself, though: data without a lot of repetition (or, in the worst case, truly random numerical data) won’t see performance gains like this.
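To see how much the data itself matters, the sketch below compares compressed and uncompressed file sizes for a highly repetitive column and a random one. It is only an illustration under stated assumptions: it uses base R's saveRDS with gzip compression rather than the compressors inside fst, it measures file size rather than throughput, and the helper size_of is a name invented for this example.

```r
set.seed(42)
n <- 1e5

# One column that repeats a short pattern, one of uniform random numbers
repetitive <- data.frame(x = rep(c(1, 2, 3, 4), length.out = n))
random     <- data.frame(x = runif(n))

# Hypothetical helper: write an object to a temp RDS file and return its size in bytes
size_of <- function(obj, compress) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(obj, path, compress = compress)
  file.size(path)
}

sizes <- data.frame(
  data       = c("repetitive", "random"),
  raw_bytes  = c(size_of(repetitive, FALSE), size_of(random, FALSE)),
  gzip_bytes = c(size_of(repetitive, TRUE),  size_of(random, TRUE))
)

# Compression ratio: how many times smaller the compressed file is
sizes$ratio <- sizes$raw_bytes / sizes$gzip_bytes
print(sizes)
```

The repetitive column should shrink far more than the random one. Fewer bytes to write is where the speed-up comes from, so data that barely compresses gains little.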
Nonetheless, fst looks like it will be a useful package for applications that need to move data from one R session to another as quickly as possible. (The fst format isn’t supported by any systems other than R, as far as I know.) The package is still in its early days, and the authors warn that the file format is likely to change in the future, but it likely has a place in high-performance R applications that rely on data transfer.
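For readers who want to try it, here is a minimal sketch of a round trip through an fst file. It assumes a release of the package that exports write_fst and read_fst (early versions used different function names, and the article notes the format may change), and the data frame is invented for the example. The code is skipped if fst is not installed.

```r
if (requireNamespace("fst", quietly = TRUE)) {
  # Hypothetical example data with plenty of repetition
  df <- data.frame(id = 1:1000, value = rep(c(1.5, 2.5), 500))
  path <- tempfile(fileext = ".fst")

  # Write with a mid-range compression setting (0 = none, 100 = maximum)
  fst::write_fst(df, path, compress = 50)

  # Read the whole file back
  df_back <- fst::read_fst(path)

  # Read only one column and the first ten rows
  first_values <- fst::read_fst(path, columns = "value", from = 1, to = 10)

  print(all.equal(df, df_back))
  print(dim(first_values))
}
```

Reading a subset of columns and rows without loading the whole file is a convenient side benefit when only part of a large data frame is needed in the receiving session.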
fst package: Lightning Fast Serialization of Data Frames for R
Source: R News