Tidyverse and data.table, sitting side by side… and then base R walks in

By Iñaki Úcar

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

Of course, I’m paraphrasing Dirk’s fifteenth post in the rarely rational R rambling series: #15: Tidyverse and data.table, sitting side by side … (Part 1). I very much liked it, because, although I’m a happy tidyverse user, I’m always trying not to be tied into that verse too much by replicating certain tasks with other tools (and languages) as an exercise. In this article, I’m going to repeat Dirk’s exercise in base R.

First of all, I would like to clean up the tidyverse version a little, because the original was distributed in chunks and was a little bit too verbose. We can also avoid using lubridate, because readr already parses the end_date column as a date (and that’s why it is significantly slower, among other reasons). This is how I would do it:

## Getting the polls

library(tidyverse)
library(zoo)

polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls

polls_2016 <- polls_2016 %>%
  filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
  right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")), by="end_date")

## Average the polls

rolling_average <- polls_2016 %>%
  group_by(end_date) %>%
  summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
  mutate(Clinton.Margin = Clinton-Trump,
         Clinton.Avg =  rollapply(Clinton.Margin,width=14,
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right"))

ggplot(rolling_average) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))

which, by the way, has exactly the same number of lines of code as the data.table version:

## Getting the polls

library(data.table)
library(zoo)
library(ggplot2)

pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")

## Wrangling the polls

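pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[data.table(end_date = seq(min(pollsDT[, end_date]), max(pollsDT[, end_date]), by="days")), on="end_date"]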

## Average the polls

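pollsDT <- pollsDT[, .(Clinton = mean(Clinton), Trump = mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton-Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,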
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]

ggplot(pollsDT) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))

Let’s translate this into base R. It is easier to start from the data.table version, mainly because filtering and assigning have a similar look and feel. Unsurprisingly, we have base::merge for the merge operation and stats::aggregate for the aggregation phase. base::as.Date works just fine for these dates, and the only drawback of utils::read.csv is that you have to specify the separator. Without further ado, this is my version in base R:

## Getting the polls

library(zoo)

pollsB <- read.csv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"), sep="\t")

## Wrangling the polls

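pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsB$end_date <- base::as.Date(pollsB$end_date)
endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)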

## Average the polls

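pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,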
                                FUN=function(x){mean(x, na.rm=TRUE)},
                                by=1, partial=TRUE, fill=NA, align="right")

plot(pollsB$end_date, pollsB$Clinton.Margin, pch=16)
lines(pollsB$end_date, pollsB$Clinton.Avg, col="blue", lwd=2)

which is the shortest one! Finally, let’s repeat the benchmark too:

library(microbenchmark)

url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
file <- "/tmp/poll-responses-clean.tsv"
download.file(url, destfile=file, quiet=TRUE)
res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
                      dt=data.table::fread(file, showProgress=FALSE),
                      base=read.csv(file, sep="\t"))
res
## Unit: milliseconds
##  expr       min        lq      mean    median        uq        max neval
##  tidy 13.877036 15.127885 18.549393 15.861311 17.813541 202.389391   100
##    dt  4.084022  4.505943  5.152799  4.845193  5.652579   7.736563   100
##  base 29.029366 30.437742 32.518009 31.449916 33.600937  45.104599   100

Base R is clearly the slowest option for the reading phase. Or, one might say, both readr and data.table have done a great job in improving things! Let’s take a look at the processing part now:

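tvin <- suppressMessages(readr::read_tsv(file))
dtin <- data.table::fread(file, showProgress=FALSE)
bsin <- read.csv(file, sep="\t")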

library(tidyverse)
library(data.table)
library(zoo)

transformTV <- function(polls_2016) {
  polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
    right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")), by="end_date")

  rolling_average <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
    mutate(Clinton.Margin = Clinton-Trump,
           Clinton.Avg =  rollapply(Clinton.Margin,width=14,
                                    FUN=function(x){mean(x, na.rm=TRUE)},
                                    by=1, partial=TRUE, fill=NA, align="right"))
}

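transformDT <- function(dtin) {
  pollsDT <- copy(dtin) ## extra work to protect from reference semantics for benchmark
  pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsDT[, end_date := as.IDate(end_date)]
  pollsDT <- pollsDT[data.table(end_date = seq(min(pollsDT[, end_date]), max(pollsDT[, end_date]), by="days")), on="end_date"]

  pollsDT <- pollsDT[, .(Clinton = mean(Clinton), Trump = mean(Trump)), by=end_date]
  pollsDT[, Clinton.Margin := Clinton-Trump]
  pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,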
                                     FUN=function(x){mean(x, na.rm=TRUE)},
                                     by=1, partial=TRUE, fill=NA, align="right")]
}

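transformBS <- function(pollsB) {
  pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsB$end_date <- base::as.Date(pollsB$end_date)
  endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
  pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)

  pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
  pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
  pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,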
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right")
}

res <- microbenchmark(tidy=transformTV(tvin), dt=transformDT(dtin), base=transformBS(bsin))
res
## Unit: milliseconds
##  expr      min       lq     mean   median       uq       max neval
##  tidy 20.68435 22.58603 26.67459 24.56170 27.85844  84.55077   100
##    dt 17.25547 18.88340 21.43256 20.24450 22.26448  41.65252   100
##  base 28.39796 30.93722 34.94262 32.97987 34.98222 109.14005   100
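
A side note on the "reference semantics" comment in transformDT: data.table's := operator modifies a table in place, so a function that uses it on its argument also changes the caller's object, which would contaminate repeated benchmark runs. The following is only a minimal sketch on made-up data (the table contents and the two helper names are hypothetical, not part of the original script) showing the difference that copy() makes:

```r
library(data.table)

# Made-up input standing in for the table returned by fread()
dtin <- data.table(end_date = c("2016-01-01", "2016-01-02"),
                   Clinton  = c(45, 47))

# Hypothetical helper: works on a private copy, so the caller's table is untouched
convert_on_copy <- function(dt) {
  dt <- copy(dt)
  dt[, end_date := as.IDate(end_date)]
  invisible(dt)
}

# Hypothetical helper: `:=` updates the caller's table by reference
convert_in_place <- function(dt) {
  dt[, end_date := as.IDate(end_date)]
  invisible(dt)
}

convert_on_copy(dtin)
before <- class(dtin$end_date)   # still "character"

convert_in_place(dtin)
after <- class(dtin$end_date)    # now includes "IDate"
```

Without the copy, the first benchmark iteration would already have converted end_date, and later iterations would be timing a slightly different job.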

I don’t see as much difference between the tidyverse and data.table as Dirk showed, perhaps because I’ve simplified the script a bit and removed some redundant parts. Again, base R is the slowest option, but don’t set it aside: it is the shortest one, and it is always there, out of the box!
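
For readers who want to try the base R steps without downloading anything, here is a rough sketch on toy data. The numbers are invented, the window is shortened to 2 days, and trailing_mean is a hypothetical base-only stand-in intended to mimic zoo::rollapply with partial=TRUE and align="right"; it is not the code used above.

```r
# Toy polls: two polls on day 1, none on day 2, one on day 3 (made-up numbers)
polls <- data.frame(
  end_date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-03")),
  Clinton  = c(44, 46, 48),
  Trump    = c(40, 42, 45)
)

# Pad the calendar: all=TRUE keeps days without polls as rows full of NA
days  <- data.frame(end_date = seq.Date(min(polls$end_date),
                                        max(polls$end_date), by = "days"))
polls <- merge(polls, days, by = "end_date", all = TRUE)

# Daily means; na.action=na.pass keeps the empty days instead of dropping them
daily <- aggregate(cbind(Clinton, Trump) ~ end_date, data = polls,
                   FUN = mean, na.action = na.pass)
daily$Clinton.Margin <- daily$Clinton - daily$Trump

# Hypothetical helper: mean over the last `width` values, ignoring NAs,
# using a shorter window at the start of the series
trailing_mean <- function(x, width) {
  sapply(seq_along(x), function(i) {
    mean(x[max(1, i - width + 1):i], na.rm = TRUE)
  })
}
daily$Clinton.Avg <- trailing_mean(daily$Clinton.Margin, width = 2)

daily
```

The day without polls survives as a row with a missing margin, while the trailing average carries the previous day's value over it, which is the behaviour the plots above rely on.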

Article originally published in Enchufa2.es: Tidyverse and data.table, sitting side by side… and then base R walks in.
