Why I rarely use apply

By Florian Privé

(This article was first published on Florian Privé, and kindly contributed to R-bloggers)

In this short post, I talk about why I’m moving away from using the function apply.

With matrices

It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.

N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
##    user  system elapsed 
##    0.73    0.05    0.78
system.time(res2 <- colMeans(X))
##    user  system elapsed 
##    0.05    0.00    0.05
stopifnot(isTRUE(all.equal(res2, res1)))

“Yeah, there are colSums and colMeans, but what about computing standard deviations?”

There are lots of apply-like functions in package {matrixStats}.

system.time(res3 <- apply(X, 2, sd))
##    user  system elapsed 
##    0.96    0.01    0.97
system.time(res4 <- matrixStats::colSds(X))
##    user  system elapsed 
##     0.2     0.0     0.2
stopifnot(isTRUE(all.equal(res4, res3)))
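
If you would rather not add a dependency, the same idea (replace the per-column loop with whole-matrix operations) can be sketched in base R. The helper name col_sds below is hypothetical, and this is only an illustration on a small matrix; it is not claimed to be as fast as {matrixStats}.

```r
set.seed(1)
X <- matrix(rnorm(200 * 50), 200)

# Hypothetical helper: column standard deviations using only
# vectorized base R functions (sweep, colMeans, colSums)
col_sds <- function(X) {
  centered <- sweep(X, 2, colMeans(X))      # subtract each column's mean
  sqrt(colSums(centered^2) / (nrow(X) - 1)) # sample standard deviation
}

stopifnot(isTRUE(all.equal(col_sds(X), apply(X, 2, sd))))
```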

With data frames

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
apply(head(iris), 2, identity)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"

A DATA FRAME IS NOT A MATRIX (it’s a list).

The first thing apply does is convert the object to a matrix, which consumes memory and, in the previous example, turns all the data into strings (because a matrix can hold only one type).
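
A small sketch of that coercion, using only base R and the built-in iris data. The variable names (m_all, num_cols, m_num, means) are just for illustration.

```r
# as.matrix() is the coercion that apply() performs on a data frame
m_all <- as.matrix(head(iris))
typeof(m_all)   # "character", because of the Species factor column

# Keeping only the numeric columns gives a numeric matrix
num_cols <- vapply(iris, is.numeric, logical(1))
m_num <- as.matrix(iris[num_cols])
typeof(m_num)   # "double"

# Column means computed on the list of columns, with no matrix copy
means <- sapply(iris[num_cols], mean)
stopifnot(isTRUE(all.equal(means, colMeans(m_num))))
```

Selecting the columns you need before any matrix conversion avoids both the type change and part of the copy.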

What can you use as a replacement for apply with a data frame?

  • If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).

    sapply(iris, typeof)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
    ##     "double"     "double"     "double"     "double"    "integer"
  • If you want to operate on all rows, I recommend that you watch this webinar.
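
As one possible row-wise pattern (an assumption for illustration, not necessarily what the webinar recommends), base R's mapply can iterate over several columns in parallel, so each value keeps its own type instead of being coerced to a common one:

```r
df <- head(iris)

# One function call per row; each argument comes from a different column
# and keeps that column's type (numeric for len, character for sp)
row_labels <- mapply(
  function(len, sp) paste0(sp, ": ", len),
  df$Sepal.Length,
  as.character(df$Species)
)

row_labels[1]   # "setosa: 5.1"
```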

With sparse matrices

The memory problem is even worse with sparse matrices: apply first converts them to dense matrices, which makes it very slow for such data.

library(Matrix)

X.sp <- rsparsematrix(N, M, density = 0.01)

## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))
##    user  system elapsed 
##    0.78    0.46    1.25
system.time(res6 <- Matrix::colMeans(X.sp))
##    user  system elapsed 
##    0.01    0.00    0.02
stopifnot(isTRUE(all.equal(res6, res5)))
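
To get a feel for the memory cost of that conversion, here is a rough sketch comparing object sizes on a smaller matrix. The dimensions are chosen only to keep the example light, and the exact ratio will depend on the density and on your version of {Matrix}.

```r
library(Matrix)

set.seed(1)
X.small <- rsparsematrix(1000, 1000, density = 0.01)

# Size of the sparse representation versus the dense copy that a
# conversion to a regular matrix has to allocate
size_sparse <- as.numeric(object.size(X.small))
size_dense  <- as.numeric(object.size(as.matrix(X.small)))

size_dense / size_sparse   # how many times larger the dense copy is
```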

You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with 3 columns (i and j storing the positions of the non-zero elements, and x storing the values of these elements). Then, you could use a group_by-summarize approach.

For instance, for the previous example, you can do this in base R:

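apply2_sp <- function(X, FUN) {
  res <- numeric(ncol(X))
  # Triplet form: @i and @j are 0-based positions, @x the stored values
  # ("TsparseMatrix" also avoids the deprecated coercion to "dgTMatrix")
  X2 <- as(X, "TsparseMatrix")
  # Group the stored values by column index and apply FUN to each group
  tmp <- tapply(X2@x, X2@j, FUN)
  # Columns without any stored value keep the initial 0
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}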

system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
##    user  system elapsed 
##    0.03    0.00    0.03
stopifnot(isTRUE(all.equal(res7, res5)))
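
One caveat with this approach is that FUN only ever sees the stored (non-zero) values. That is fine for sum, where zeros do not change the result, but not for functions where zeros matter. The tiny matrix below is made up for illustration, and apply2_sp is repeated so the sketch runs on its own.

```r
library(Matrix)

apply2_sp <- function(X, FUN) {
  res <- numeric(ncol(X))
  X2 <- as(X, "TsparseMatrix")
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}

# 3 x 3 matrix: column 1 holds -1 and -2, column 2 is empty, column 3 holds 5
X.tiny <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3),
                       x = c(-1, -2, 5), dims = c(3, 3))

# sum ignores zeros anyway, so the result matches the dense computation
stopifnot(isTRUE(all.equal(apply2_sp(X.tiny, sum),
                           colSums(as.matrix(X.tiny)))))

# max does not: the implicit zero in column 1 is never passed to FUN
max_sp    <- apply2_sp(X.tiny, max)           # -1  0  5
max_dense <- apply(as.matrix(X.tiny), 2, max) #  0  0  5
```

For such functions you would need to account for the implicit zeros yourself, for example by comparing the number of stored values in each column with nrow(X).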

Conclusion

Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.
