Benchmarking an SSD drive in reading and writing files with R

By Marcelo S. Perlin

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

I recently bought a new computer for home and it came with two drives,
one HDD and the other SSD. The latter is used for the OS and the former for
all of my files. Of all the computers I have had, both at home and at work,
this is definitely the fastest. While some of the merit is due to the newer
CPU and RAM, the SSD drive can make all the difference in file
operations.

My research usually deals with large files from financial markets. Being
efficient in reading those files is key to my productivity. Given that,
I was curious to understand how much speed I would gain by reading and
writing files on my SSD drive instead of the HDD. For that, I wrote a
simple function that times a particular operation. The function takes as
input the number of rows in the data (1..Inf), the file format and
function used to save it (rds, csv, fst), and the type of drive (HDD or
SSD). See next.

bench.fct 
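The body of the function above did not survive in this copy of the post. As a rough idea of what such a timing function can look like, here is a minimal sketch using base R only. The name `bench_file`, its arguments and the `folder` parameter (which stands in for the drive: you would point it to a directory on the HDD or on the SSD) are my assumptions for illustration, not the original code.

```r
# Hypothetical sketch, not the original bench.fct.
# Times one write and one read of a random data frame in the given folder.
bench_file <- function(n_row,
                       type_file = c("rds-base", "csv-base"),
                       folder = tempdir()) {
  type_file <- match.arg(type_file)

  # Random data with one numeric and one character column
  df <- data.frame(x = runif(n_row),
                   y = sample(letters, n_row, replace = TRUE))

  f <- tempfile(tmpdir = folder,
                fileext = if (type_file == "rds-base") ".rds" else ".csv")
  on.exit(unlink(f), add = TRUE)  # clean up the file when done

  # Elapsed (wall clock) time of the write operation
  time_write <- system.time(
    switch(type_file,
           "rds-base" = saveRDS(df, f),
           "csv-base" = write.csv(df, f, row.names = FALSE))
  )[["elapsed"]]

  # Elapsed (wall clock) time of the read operation
  time_read <- system.time(
    switch(type_file,
           "rds-base" = readRDS(f),
           "csv-base" = read.csv(f))
  )[["elapsed"]]

  data.frame(n_row = n_row,
             type_file = type_file,
             type_time = c("write", "read"),
             times = c(time_write, time_read))
}

print(bench_file(1e4, "rds-base"))
```

The readr and fst variants would follow the same pattern, with one extra branch per format in each `switch()`.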

Now that we have the function, it's time to use it for all combinations
of number of rows, file format and type of drive:

library(purrr)
df.grid 
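The grid code is also cut off here. The idea of running a function over every combination of inputs can be sketched with `expand.grid()` and base R's `Map()`, which plays the role that purrr plays in the post. Everything in this block is an assumption for illustration: the helper `time_write_rds`, the `rep` column, and the `folders` vector, which uses the temporary directory twice only so that the example runs anywhere. In a real test each entry should be a path on a different drive.

```r
# Hypothetical folders: replace with one path on each physical drive.
folders <- c(HDD = tempdir(), SSD = tempdir())

# One row per combination of inputs, with 3 repetitions of each
grid <- expand.grid(n_row = c(1000, 10000),
                    type.hd = names(folders),
                    rep = 1:3,
                    stringsAsFactors = FALSE)

# Illustrative helper: time a single saveRDS() call on the chosen drive
time_write_rds <- function(n_row, type.hd, rep) {
  f <- tempfile(tmpdir = folders[[type.hd]], fileext = ".rds")
  on.exit(unlink(f), add = TRUE)
  elapsed <- system.time(saveRDS(data.frame(x = runif(n_row)), f))[["elapsed"]]
  data.frame(n_row = n_row, type.hd = type.hd, rep = rep, times = elapsed)
}

# Call the helper once per grid row and stack the results
res <- do.call(rbind, Map(time_write_rds, grid$n_row, grid$type.hd, grid$rep))
print(head(res))
```

Repeating each combination a few times matters for the later comparison of means, since a single timing per cell says little about the noise.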

Let's check the result in a nice plot:

library(ggplot2)

p 

As you can see, the csv-base format is distorting the y axis. Let’s
remove it for better visualization:

library(ggplot2)

p 
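The plotting code is missing from this copy, so here is a minimal sketch of how the csv-base rows can be dropped before plotting. The column names `type.file`, `type.time`, `type.hd` and `times` follow the summary table later in the post; the name `n_row`, the data frame names and the layout of the plot are my assumptions. The timings are random placeholders generated only so that the plot renders, not measurements.

```r
library(ggplot2)

# Placeholder values, NOT real timings
set.seed(42)
df.res <- expand.grid(n_row = c(1e4, 1e5, 1e6),
                      type.file = c("csv-base", "csv-readr", "fst", "rds-base"),
                      type.hd = c("HDD", "SSD"),
                      type.time = c("read", "write"),
                      stringsAsFactors = FALSE)
df.res$times <- runif(nrow(df.res))

# Drop the slow format so it does not stretch the y axis
df.plot <- subset(df.res, type.file != "csv-base")

p <- ggplot(df.plot, aes(x = n_row, y = times, linetype = type.hd)) +
  geom_line() +
  facet_grid(type.file ~ type.time) +
  labs(x = "Number of rows", y = "Elapsed time (seconds)", linetype = "Drive")

print(p)
```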

When it comes to the file format, we learn:

  • By far, the fst format is the best. It takes less time to read
    and write than the others. However, it’s probably unfair to compare
    it to csv and rds as it uses many of the 16 cores of my
    computer.

  • readr is a great package for writing and reading csv files.
    You can see a large difference in time compared to the base
    functions. This is likely due to the use of low-level functions to
    write and read the text files.

  • When using the rds format, the base functions do not differ much
    from the readr functions.
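On the point about fst using many cores: one way to make the comparison closer to single-threaded is to limit the number of threads with `fst::threads_fst()` before timing. This is only a sketch under my own assumptions; the helper `time_fst_read` is hypothetical and was not part of the benchmark above. It returns `NA` if the fst package is not installed.

```r
# Hypothetical helper: time fst::read_fst() with a fixed number of threads
time_fst_read <- function(n_row, n_threads = 1) {
  if (!requireNamespace("fst", quietly = TRUE)) return(NA_real_)

  old_threads <- fst::threads_fst()            # current setting
  on.exit(fst::threads_fst(old_threads), add = TRUE)  # restore it afterwards
  fst::threads_fst(n_threads)

  f <- tempfile(fileext = ".fst")
  on.exit(unlink(f), add = TRUE)
  fst::write_fst(data.frame(x = runif(n_row)), f)

  system.time(fst::read_fst(f))[["elapsed"]]
}

print(time_fst_read(1e5, n_threads = 1))
```

Comparing the result for `n_threads = 1` against a higher value would show how much of the fst advantage comes from parallelism on a given machine.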

As for the effect of using the SSD, it's clear that it DOES NOT affect
the time of reading and writing. The differences between HDD and SSD
look like noise. To provide a more robust analysis, let’s formally test
this hypothesis using a simple t-test for the means:

tab %
  group_by(type.file, type.time) %>%
  summarise(mean.HDD = mean(times[type.hd == 'HDD']),
            mean.SSD = mean(times[type.hd == 'SSD']),
            p.value = t.test(times[type.hd == 'SSD'],
                             times[type.hd == 'HDD'])$p.value)


print(tab)

## # A tibble: 10 x 5
## # Groups:   type.file [?]
##    type.file type.time mean.HDD mean.SSD p.value
##    
##  1 csv-base  read       0.554    0.463    0.605 
##  2 csv-base  write      0.405    0.405    0.997 
##  3 csv-readr read       0.142    0.126    0.687 
##  4 csv-readr write      0.0711   0.0706   0.982 
##  5 fst       read       0.015    0.0084   0.0584
##  6 fst       write      0.00900  0.00910  0.964 
##  7 rds-base  read       0.0321   0.0303   0.848 
##  8 rds-base  write      0.0253   0.025    0.969 
##  9 rds-readr read       0.0323   0.0304   0.845 
## 10 rds-readr write      0.0251   0.0247   0.957

As we can see, we fail to reject the null hypothesis of equal means for
almost all file types and operations at the 10% level. The exception is
the fst format in the read operation (p-value of 0.0584). In other
words, statistically, using the SSD or the HDD makes no difference in
the time to read or write files in the different formats.
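For readers who want to run the same kind of test on their own timings, here is a minimal base R sketch for a single format and operation. The values are simulated placeholders, not the measurements from this post, and the data frame name `sim` is an assumption. Note that `t.test()` runs Welch's unequal-variance version of the test by default.

```r
# Simulated placeholder timings, NOT real measurements
set.seed(1)
sim <- data.frame(
  type.hd = rep(c("HDD", "SSD"), each = 10),
  times   = abs(rnorm(20, mean = 0.03, sd = 0.005))
)

# Two-sample t-test of mean elapsed time, HDD versus SSD
res.test <- t.test(times ~ type.hd, data = sim)

print(res.test$estimate)  # group means
print(res.test$p.value)   # a small p-value would point to different means
```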

I am very surprised by this result. Regardless of the format, I expected
a large difference, as SSD drives are much faster within an OS. Am I
missing something? Is this due to the OS being on the SSD? What do you
guys think?
