In this short post, I explain why I’m moving away from using the function apply.
With matrices
It’s okay to use apply with a dense matrix, although there is often a faster equivalent.
N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
## user system elapsed
## 0.73 0.05 0.78
system.time(res2 <- colMeans(X))
## user system elapsed
## 0.05 0.00 0.05
stopifnot(isTRUE(all.equal(res2, res1)))
“Yeah, there are colSums and colMeans, but what about computing standard deviations?”
There are lots of apply-like functions in package {matrixStats}.
system.time(res3 <- apply(X, 2, sd))
## user system elapsed
## 0.96 0.01 0.97
system.time(res4 <- matrixStats::colSds(X))
## user system elapsed
## 0.2 0.0 0.2
stopifnot(isTRUE(all.equal(res4, res3)))
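If {matrixStats} is not available, column standard deviations can also be derived from colSums and colMeans alone. The following is a minimal base R sketch, not the method used by {matrixStats}; the helper name col_sds_base is hypothetical, and the one-pass variance formula it relies on can lose precision when column means are large relative to the spread.

```r
## Hypothetical helper (not from the post): column standard deviations
## using only vectorized base R functions.
col_sds_base <- function(X) {
  n <- nrow(X)
  ## sample variance: (sum(x^2) - n * mean(x)^2) / (n - 1)
  v <- (colSums(X^2) - n * colMeans(X)^2) / (n - 1)
  sqrt(pmax(v, 0))  # guard against tiny negative values from rounding
}

set.seed(1)
X_small <- matrix(rnorm(200 * 50), 200)
stopifnot(isTRUE(all.equal(col_sds_base(X_small), apply(X_small, 2, sd))))
```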
With data frames
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
apply(head(iris), 2, identity)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 "5.1" "3.5" "1.4" "0.2" "setosa"
## 2 "4.9" "3.0" "1.4" "0.2" "setosa"
## 3 "4.7" "3.2" "1.3" "0.2" "setosa"
## 4 "4.6" "3.1" "1.5" "0.2" "setosa"
## 5 "5.0" "3.6" "1.4" "0.2" "setosa"
## 6 "5.4" "3.9" "1.7" "0.4" "setosa"
A DATA FRAME IS NOT A MATRIX (it’s a list).
The first thing that apply does is convert the object to a matrix, which consumes memory and, in the previous example, turns all the data into strings (because a matrix can hold only one type).
What can you use as a replacement for apply with a data frame?
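The coercion can be checked directly. This small sketch uses only base R and the iris data shown above; selecting the numeric columns first is one possible workaround, not something the post prescribes.

```r
## A data frame is a list of columns, each with its own type.
stopifnot(is.list(iris))

## Coercing mixed columns to a matrix forces a single common type.
m_all <- as.matrix(head(iris))
typeof(m_all)  # "character"

## Keeping only the numeric columns avoids the conversion to strings.
is_num <- vapply(iris, is.numeric, logical(1))
m_num <- as.matrix(head(iris[is_num]))
typeof(m_num)  # "double"
```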

If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).
sapply(iris, typeof)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "double" "double" "double" "double" "integer"
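Two base R relatives of sapply may also be worth knowing here. The sketch below is an illustration rather than part of the original post: vapply makes the expected result type explicit, and lapply combined with `[]<-` keeps the result as a data frame. The object names (num_cols, col_means, scaled) are made up for the example.

```r
## vapply is like sapply but checks the type and length of each result.
num_cols <- iris[vapply(iris, is.numeric, logical(1))]
col_means <- vapply(num_cols, mean, numeric(1))

## lapply returns a list, so assigning into scaled[] keeps a data frame.
scaled <- num_cols
scaled[] <- lapply(num_cols, function(col) col / max(col))
str(scaled)
```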

If you want to operate on all rows, I recommend watching this webinar.
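As a rough illustration of row-wise work without apply (a sketch of common base R options, not a summary of the webinar), you can often rely on vectorized column arithmetic, on rowSums for numeric columns, or on mapply to walk several columns in parallel. The variable names below are made up for the example.

```r
df <- head(iris)

## Vectorized arithmetic already works row by row.
ratio <- df$Sepal.Length / df$Sepal.Width

## rowSums accepts a data frame as long as all its columns are numeric.
row_totals <- rowSums(df[1:4])

## mapply passes the i-th element of each column to the function,
## so every column keeps its own type (no conversion to strings).
labels <- mapply(function(len, sp) paste0(sp, ":", len),
                 df$Sepal.Length, as.character(df$Species))
```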
With sparse matrices
The memory problem is even more serious when using apply with sparse matrices: the matrix is first converted to a dense one, which makes apply very slow for such data.
library(Matrix)
X.sp <- rsparsematrix(N, M, density = 0.01)
## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))
## user system elapsed
## 0.78 0.46 1.25
system.time(res6 <- Matrix::colMeans(X.sp))
## user system elapsed
## 0.01 0.00 0.02
stopifnot(isTRUE(all.equal(res6, res5)))
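To get a feel for the memory cost of that conversion, you can compare the size of a sparse matrix with its dense counterpart. This is a small sketch on a 1000 x 1000 matrix (smaller than the one above so that it runs quickly); the names S and D are made up for the example, and exact sizes will vary with the R and {Matrix} versions.

```r
library(Matrix)

set.seed(1)
S <- rsparsematrix(1000, 1000, density = 0.01)  # about 1% stored entries
D <- as.matrix(S)                               # what apply works on

## The dense copy stores every zero explicitly as a double.
print(object.size(S), units = "Kb")
print(object.size(D), units = "Kb")
```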
You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with 3 columns (i and j storing the positions of the nonzero elements, and x storing their values). Then, you could use a group_by/summarize approach.
For instance, for the previous example, you can do this in base R:
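apply2_sp <- function(X, FUN) {
  res <- numeric(ncol(X))
  ## triplet form: X2@j holds 0-based column indices, X2@x the stored values
  X2 <- as(X, "dgTMatrix")
  ## FUN only sees the stored (nonzero) values of each column
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}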
system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
## user system elapsed
## 0.03 0.00 0.03
stopifnot(isTRUE(all.equal(res7, res5)))
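The "data frame with 3 columns" view can also be made explicit on a tiny matrix. The sketch below is an illustration, not the author's code: it builds the triplet table by hand and groups the stored values by column with split and vapply. The names X_demo, trip, groups and col_sums are made up, and the coercion goes through the virtual class "TsparseMatrix" on the assumption that it is accepted by the installed {Matrix} version.

```r
library(Matrix)

## Small example with an empty second column.
X_demo <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(2, 4, 5),
                       dims = c(3, 3))

## Triplet view: one row per stored entry, with 1-based positions.
X_t <- as(X_demo, "TsparseMatrix")
trip <- data.frame(i = X_t@i + 1L, j = X_t@j + 1L, x = X_t@x)

## Group the stored values by column; empty columns give numeric(0),
## and sum(numeric(0)) is 0, so no special case is needed for sums.
groups <- split(trip$x, factor(trip$j, levels = seq_len(ncol(X_demo))))
col_sums <- vapply(groups, sum, numeric(1))

stopifnot(isTRUE(all.equal(unname(col_sums), Matrix::colSums(X_demo))))
```

Keep in mind that the grouped function only sees the stored values. That is harmless for sums, but for functions where zeros matter (mean, min, max, sd) the implicit zeros have to be accounted for separately, as the division by nrow(X.sp) does for the means above.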
Conclusion
Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.