Fastest Way to Add New Variables to A Large Data.Frame

By statcompute

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)
pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table")
lapply(pkgs, require, character.only = T)

data(hflights)

benchmark(replications = 10, order = "user.self", relative = "user.self",
  transform = {
    ### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ###
    transform(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay)
  },
  within    = {
    ### EVALUATE THE EXPRESSION WITHIN THE LOCAL ENVIRONMENT ###
    within(hflights, {wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'); delay = ArrDelay + DepDelay})
  },
  mutate   = {
    ### THE SPECIFIC FUNCTION IN DPLYR PACKAGE TO ADD VARIABLES ###
    mutate(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay)
  },
  foreach = {
    ### SPLIT AND THEN COMBINE IN PARALLEL ###
    registerDoParallel(cores = 2)
    v <- c(names(hflights), 'wday', 'delay')
    f <- expression(ifelse(hflights$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'),
                    hflights$ArrDelay + hflights$DepDelay)
    df <- foreach(fn = iter(f), .combine = mutate, .init = hflights) %dopar% {
      eval(fn)
    }
    names(df) <- v
  },
  data.table = {
    ### DATA.TABLE ###
    data.table(hflights)[, c("wday", "delay") := list(ifelse(hflights$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), hflights$ArrDelay + hflights$DepDelay)]
  }
)

#         test replications elapsed relative user.self sys.self user.child
# 4    foreach           10   1.442    1.000     0.240    0.144      0.848
# 2     within           10   0.667    2.783     0.668    0.000      0.000
# 3     mutate           10   0.679    2.833     0.680    0.000      0.000
# 5 data.table           10   0.955    3.983     0.956    0.000      0.000
# 1  transform           10   1.732    7.200     1.728    0.000      0.000

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.