The Bayesian approach to ridge regression

By tony.fischetti@gmail.com

[Figure: leave-one-out cross-validated MSE for the mtcars model, across prior widths]

(This article was first published on R – On the lambda, and kindly contributed to R-bloggers)

In a previous post, we demonstrated that ridge regression (a form of regularized linear regression that attempts to shrink the beta coefficients toward zero) can be super-effective at combating overfitting and can lead to a much more generalizable model. That approach to regularization used penalized maximum likelihood estimation (for which we used the amazing glmnet package). There is, however, another approach… an equivalent approach… but one that allows us greater flexibility in model construction and lends itself more easily to an intuitive interpretation of the uncertainty of our beta coefficient estimates. I’m speaking, of course, of the bayesian approach.

As it turns out, careful selection of the type and shape of our prior distributions with respect to the coefficients can mimic different types of frequentist linear model regularization. For ridge regression, we use normal priors of varying width.

Though it can be shown analytically that placing normal priors of varying width on the beta coefficients is equivalent to L2 penalized maximum likelihood estimation, the math is scary and hard to follow. In this post, we are going to be taking a computational approach to demonstrating the equivalence of the bayesian approach and ridge regression.
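For the curious, the punchline of that scary math is short to state (the notation below is mine, not from the original post): with a normal likelihood whose noise variance is sigma^2 and independent normal priors on the coefficients with variance tau^2, the MAP estimate is exactly a ridge solution,

\hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \frac{\sigma^2}{\tau^2}\, \lVert \beta \rVert_2^2

so the ridge penalty is lambda = sigma^2 / tau^2, and a wider prior (larger tau) corresponds to a weaker penalty.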

This post is going to be part of a multi-post series investigating other bayesian approaches to linear model regularization, including lasso regression facsimiles and hybrid approaches.

mtcars

We are going to be using the venerable mtcars dataset for this demonstration because (a) its multicollinearity and high number of potential predictors relative to its sample size lend themselves fairly well to ridge regression, and (b) we used it in the elastic net blog post.

Before you lose interest… here! have a figure! An explanation will follow.

After scaling the predictor variables to be 0-centered and have a standard deviation of 1, I described a model predicting mpg using all available predictors and placed normal priors on the beta coefficients, varying the prior standard deviation from 0.05 to 5 (in increments of 0.025). To fit the models, instead of MCMC estimation via JAGS or Stan, I used quadratic approximation performed by the awesome rethinking package that Richard McElreath wrote for his excellent book, Statistical Rethinking. Quadratic approximation uses an optimization algorithm to find the maximum a posteriori (MAP) point of the posterior distribution and approximates the rest of the posterior with a normal distribution about the MAP estimate. I used this method chiefly because, as long as these simulations took with quadratic approximation, they would have taken many orders of magnitude longer with MCMC. Various spot checks confirmed that the quadratic approximation was comparable to the posterior as told by Stan.
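For concreteness, here is a minimal sketch (mine, not code from the author’s repository) of what one of these fits might look like at a single prior width; the 0.5 below is just an illustrative value, and the real analysis loops over the whole grid of widths and computes the LOOCV MSE at each one:

library(rethinking)

# scale the predictors to mean 0 / sd 1; leave the response alone
d <- as.data.frame(scale(mtcars[, -1]))
d$mpg <- mtcars$mpg

# quadratic approximation of the posterior at one prior width (sd = 0.5);
# the priors on the intercept and sigma are illustrative choices
fit <- map(
  alist(
    mpg ~ dnorm(mu, sigma),
    mu <- a + b_cyl*cyl + b_disp*disp + b_hp*hp + b_drat*drat + b_wt*wt +
          b_qsec*qsec + b_vs*vs + b_am*am + b_gear*gear + b_carb*carb,
    a ~ dnorm(20, 100),
    c(b_cyl, b_disp, b_hp, b_drat, b_wt, b_qsec,
      b_vs, b_am, b_gear, b_carb) ~ dnorm(0, 0.5),
    sigma ~ dunif(0, 20)
  ),
  data = d
)

precis(fit)  # MAP estimates and quadratic-approximation intervals

(In newer versions of rethinking the same function is called quap.)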

As you can see from the figure, as the prior on the coefficients gets tighter, the model performance (as measured by the leave-one-out cross-validated mean squared error) improves—at least until the priors become too strong to be influenced sufficiently by the evidence. The ribbon about the MSE is the 95% credible interval (using a normal likelihood). I know, I know… it’s pretty damn wide.

The dashed vertical line is at the prior width that minimizes the LOOCV MSE. The minimum MSE is, for all practical purposes, identical to that of the highest performing ridge regression model using glmnet. This is good.

Another really fun thing to do with the results is to visualize how the beta coefficient estimates move under different penalties (prior widths). The figure below depicts this. Again, the dashed vertical line marks the highest-performing prior width.

[Figure: mtcars coefficient shrinkage across prior widths]

One last thing: we’ve heretofore only demonstrated that the bayesian approach can perform as well as the L2 penalized MLE… but it’s conceivable that it achieves this by finding a completely different coefficient vector. The figure below shows the same plot as above, but with the coefficient estimates (for each predictor) of the top-performing glmnet model overlaid. These are shown as the dashed colored horizontal lines.

[Figure: mtcars coefficient shrinkage with the glmnet estimates overlaid]

These results are pretty exciting! (if you’re the type to not get invited to parties). Notice that, at the highest performing prior width, the coefficients of the bayesian approach and the glmnet approach are virtually identical.

Sooooo, not only did the bayesian variety produce an equivalently generalizable model (as evinced by equivalent cross-validated MSEs) but it also yielded a vector of beta coefficient estimates nearly identical to those estimated by glmnet. This suggests that both the bayesian approach and glmnet's approach, using different methods, regularize the model via the same underlying mechanism.

A drawback of the bayesian approach is that its solution takes many orders of magnitude more time to arrive at. Two advantages of the Bayesian approach are (a) the ability to study the posterior distributions of the coefficient estimates and the ease of interpretation that they allow, and (b) the enhanced flexibility in model design and the ease with which you can, for example, swap out likelihood functions or construct more complicated hierarchical models.

If you are even the least bit interested in this, I urge you to look at the code (in this git repository) because (a) I worked really hard on it, and (b) it demonstrates cool use of meta-programming, parallelization, and progress bars… if I do say so myself.


To leave a comment for the author, please follow the link and comment on their blog: R – On the lambda.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Regular Expressions Exercises – Part 1

By Avi Blinder


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A common task performed during data preparation or data analysis is the manipulation of strings.

Regular expressions are meant to assist in these and similar tasks.

A regular expression is a pattern that describes a set of strings.

Regular expressions can range from simple patterns (such as finding a single number) through complex ones (such as identifying UK postcodes).

R implements a set of “regular expression rules” that is largely shared with other programming languages, and even allows for some variations, such as Perl-like regular expressions.

Also, some specific patterns may or may not be found, depending on the system locale.

Those patterns can be applied through several base R functions, such as:

  • grep
  • grepl
  • regexpr
  • gregexpr
  • sub
  • gsub
  • strsplit

Since this topic includes both learning a set of rules and several different R functions, I’ll split this subject into a three-part series.

Answers to the exercises are available here.

With regex you can often get correct results in more than one way, so if you have different solutions, feel free to post them.

Character class

A character class is a list of characters enclosed between square brackets ([ and ]), which matches any *single* character in that list.
For example, [0359abC] means “find a pattern containing one of the characters 0, 3, 5, 9, a, b or C”.
There are some “shortcuts” that allow us to find specific ranges of digits or characters:

  • [0-9] means any digit
  • [A-Z] means any upper case character
  • [a-z] means any lower case character
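For example (a quick illustration, separate from the exercises):

grepl("[0-9]", c("abc", "abc123"))   # FALSE  TRUE
grepl("[A-Z]", c("abc", "aBc"))      # FALSE  TRUE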

Let’s create a variable called text1 and populate it with the value “The current year is 2016”
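In code, that is simply:

text1 <- "The current year is 2016"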

Exercise 1
Create a variable called my_pattern and implement the required pattern for finding any digit in the variable text1.
Use function grepl to verify if there is a digit in the string variable

Exercise 2
Use function gregexpr to find all the positions in text1 where there is a digit.
Place the results in a variable called string_position

Predefined classes of characters

In many cases, we will look for specific types of characters (for example, any digit, any letter, any whitespace, etc).
For this purpose, there are several predefined classes of characters that save us a lot of typing.

Note: The interpretation of some predefined classes depends on the locale. The “standard” interpretation is that of the POSIX locale.

Below are some “popular” predefined classes and their meaning:
1. [:alnum:]
Alphanumeric characters: [:alpha:] and [:digit:].

2. [:alpha:]
Alphabetic characters: [:lower:] and [:upper:] can also be used.

3. [:digit:]
Digits: 0 1 2 3 4 5 6 7 8 9.

4. [:blank:]
Blank characters: space and tab, and possibly other locale-dependent characters
such as non-breaking space.
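Note that in R these classes are themselves used inside a character class, so they are written with double brackets, e.g. [[:digit:]]. A quick illustration (not part of the exercises):

grepl("[[:digit:]]", "R2D2")             # TRUE: the string contains a digit
gsub("[[:alpha:]]", "", "R2D2")          # "22": strip the alphabetic characters
grepl("[[:blank:]]", "no_blanks_here")   # FALSE: no space or tab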

Exercise 3
Create a variable called my_pattern and implement the required pattern for finding one digit and one uppercase alphanumeric character, in variable text1.
This time, combine predefined classes in the regex pattern.
Use function grepl to verify whether the searched pattern exists in the string.

Exercise 4
Use function regexpr to find the position of the first space in text1.
Place the results in a variable called first_space.

Special single character

The period (“.”) matches any single character.
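For instance:

grepl("a.c", c("abc", "a9c", "ac"))   # TRUE  TRUE  FALSE: the dot must match exactly one character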
Exercise 5
Create a pattern that checks in text1 if there is a lowercase character, followed by any character and then by a digit.

Exercise 6
Find the starting position of the above string. Place the results in a variable called string_pos2

Special symbols

There are several “special symbols” that assist in the definition of specific patterns.
Pay attention that in R, you should add an extra backslash when using those special symbols (so \d is written "\\d"):
The symbol \w matches a ‘word’ character and \W is its negation.
The symbols \d, \s, \D and \S denote the digit and space classes and their negations.
As you may have noticed, some special symbols have parallel “predefined classes”.
(For example, \d is equivalent to [0-9] and to [[:digit:]].)
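For instance:

grepl("\\d", "year 2016")      # TRUE: the string contains a digit
gsub("\\D", "", "year 2016")   # "2016": remove everything that is not a digit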

Exercise 7
Find the following pattern: one space followed by two lowercase letters and one more space.
Use a function that returns the starting point of the found string and place its result in string_pos3.

Metacharacters

There are several metacharacters in the “regex syntax”. Here I’ll introduce two popular ones:
The caret ("^") anchors the pattern to the beginning of the string.
The dollar sign ("$") anchors the pattern to the end of the string.
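For instance:

grepl("^The", "The cat sat")   # TRUE: starts with "The"
grepl("sat$", "The cat sat")   # TRUE: ends with "sat"
grepl("^sat", "The cat sat")   # FALSE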

Exercise 8
Using the sub function, replace the pattern found in the previous exercise with the string ” is not ”.
Place the resulting string in the text2 variable.

Repetition Characters

There are several ways of dealing with the repetition of characters in the “regex syntax”. Here I’ll introduce the “Curly brackets” syntax:
{n} The preceding item is matched exactly n times.

{n,} The preceding item is matched n or more times.

{n,m} The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used.
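For instance:

grepl("o{2}", c("book", "bok"))           # TRUE  FALSE
regexpr("[0-9]{2,4}", "postcode 90210")   # match starts at position 10 with length 4 (greedy)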

Exercise 9
Find in text2 the following pattern: four digits at the end of the string.
Use a function that returns the starting point of the found string and place its result in string_pos4.

Exercise 10
Using the substr function, and according to the position of the string found in the previous exercise, extract the first two digits found at the end of text2.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

RProtoBuf 0.4.7: Mostly harmless

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

CRAN requested a release updating any URLs for Omegahat to the (actually working) omegahat.net URL. The RProtoBuf package had this in one code comment (errr…) and one bibfile entry. Oh well — so that caused this 0.4.7 release which arrived on CRAN today. It contains the requested change, and pretty much nothing else.

RProtoBuf provides R bindings for the Google Protocol Buffers (“Protobuf”) data encoding and serialization library used and released by Google, and deployed as a language and operating-system agnostic protocol by numerous projects.

The NEWS file summarises the release as follows:

Changes in RProtoBuf version 0.4.7 (2016-10-27)

  • At the request of CRAN, two documentation instances referring to the Omegahat repository were updated to http://omegahat.net

CRANberries also provides a diff to the previous release. The RProtoBuf page has an older package vignette, a ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Source:: R News

drat 0.1.2: Mostly harmless

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

CRAN requested a release updating any URLs for Omegahat to the (actually working) omegahat.net URL. So that caused this 0.1.2 release which arrived on CRAN yesterday. It contains the requested change along with one or two other mostly minor changes which accumulated since the last release.

drat stands for drat R Archive Template, and helps with easy-to-create and easy-to-use repositories for R packages. Since its inception in early 2015 it has found reasonably widespread adoption among R users because repositories are what we use. In other words, friends don’t let friends use install_github(). Just kidding. Maybe. Or not.

The NEWS file summarises the release as follows:

Changes in drat version 0.1.2 (2016-10-28)

  • Changes in drat documentation

    • The FAQ vignette added a new question Why use drat

    • URLs were made canonical, omegahat.net was updated from .org

    • Several files (README.md, Description, help pages) were edited

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Source:: R News

A quick exploration of the ReporteRs package

By Abhijit

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion, unlike some earlier packages. It also allows the editing of documents by using bookmarks within the documents.

This quick note is just to remind me that the structure of ReporteRs works beautifully with the piping conventions of magrittr. For example, writing a report today, I could maintain my flow while writing the R code that creates it.

library(ReporteRs)
library(magrittr)

# cormat, corpval and plt are assumed to have been created earlier:
# two matrices (estimates and p-values) and a ggplot object
mydoc <- docx() %>%
  addParagraph(value = 'Correlation matrix', style = 'Titre2') %>%
  addParagraph(value = 'Estimates') %>%
  addFlexTable(FlexTable(cormat)) %>%
  addParagraph(value = 'P-values') %>%
  addFlexTable(FlexTable(corpval)) %>%
  addParagraph(value = 'Boxplots', style = 'Titre2') %>%
  addPlot(fun = print, x = plt, height = 3, width = 5) %>%
  writeDoc(file = 'Report.docx')

Note that plt is a ggplot object and so we actually have to print it rather than just put the object in the addPlot command.

This was my first experience in a while using ReporteRs, and it seemed pretty good to me.

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.


Source:: R News

Join Hadley Wickham’s Master R Workshop in Melbourne, Australia December 12 & 13

By Roger Oberg

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

It’s nearly summeRtime in Australia! Join RStudio Chief Data Scientist Hadley Wickham for his popular Master R workshop in Melbourne.

Register here: https://www.eventbrite.com/e/master-r-developer-workshop-melbourne-tickets-22546200292

Melbourne will be Hadley’s first and only scheduled Master R workshop in Australia. Whether you live or work nearby or you just need one more good reason to visit Melbourne in the Southern Hemisphere spring, consider joining him at the Cliftons Melbourne on December 12th and 13th. It’s a rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

Hadley’s workshops usually sell out. This is his final Master R in 2016 and he has no plans to offer another in the area in 2017. If you’re an active R user and have been meaning to take this class, now is the perfect time to do it!

We look forward to seeing you in Melbourne!

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Source:: R News

Online R courses at Udemy for only $10 (until November 1st)

By Tal Galili


Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (offering a 50%-90% discount) applies to hundreds of their courses, including many R-Programming, data science, and machine learning courses.

Click here to browse ALL (R and non-R) courses

Advanced R courses:

Introductory R courses:

Non R courses for data science:

Source:: R News

Clickstream Data, R Graph Gallery, and a package that will instantly increase your karma points at work!

By Onno Dijt


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

To stay on top of R in the news, we’re sharing some stories related to R published last week.

The R Graph Gallery Has Made a Comeback (Yan Holtz)

I must admit, the first edition of this gallery must have been before my time! It was announced here, and I was immediately most intrigued by what obscure visualizations were hidden under ‘Miscellaneous’.

A Paper on Analyzing Clickstream Data in R (Michael Scholz)

A super interesting read for anyone who has a website, builds software, has a marketing goal or wants to do market research, basically the world population?

A ReporteRs Package (David Gohel)

This package created some buzz this week on the internet, although I believe it’s been around for a while, as there are pre-releases on GitHub here that date back to January 2014. My personal experience with rmarkdown and business reports in Word is not optimal, so this looks like a valid alternative.

Calculating Moving Averages and Historical Flow Quantiles (Laura DeCicco)

Trend analysis is one of the subjects I enjoy most, and this post is particularly awesome in its detail and the amount of R code it provides. Now all we need is some forecasting!

Other than the news, we added some new functionality to the website, as you can see in this image:

These new buttons (circled in red) allow you to download an exercise set in PDF format, or print it straight away! We hope this will make it easier for you to work with our exercise sets.

Alright that was it for this week, we will keep uploading sets this week and come back with another update as scheduled. To stay on top of the news during the week, visit our R News section.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

Free e-book: Data Science with SQL Server 2016

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

There’s a new e-book available to download free from Microsoft Academy: Data Science with Microsoft SQL Server 2016.

This 90-page e-book is aimed at data scientists who already have some experience with R, but want to learn how to use R with SQL Server. The book was written by some of my most experienced colleagues on Microsoft’s data science team: Buck Woody, Danielle Dean, Debraj GuhaThakurta, Gagan Bansal, Matt Conners, and Wee-Hyong Tok, and begins with an introduction by Joseph Sirosh. It includes everything you need to know to use R and SQL Server together:

  • How to install and configure your data science tool-set: SQL Server 2016, Microsoft R Client, RStudio/RTVS, etc.
  • How to download data from SQL Server into a local R client
  • How to create and update tables in SQL Server from R
  • How to use a remote SQL Server instance as an R compute engine, driven from your local R client
  • How to write SQL Server stored procedures that run R code on the server, and share them with others

(If you’re new to R or data science, there are also links to learning resources in Chapter 1.) The book also includes several fully-worked examples following the data science process, with links to data and code so you can try them out yourself:

  • Creating a model to predict whether a tip is given for a taxi ride in New York City
  • Building a customer churn model to find customers likely to switch to a competing service provider
  • Analyzing “Internet of Things” data as part of a predictive maintenance program

Data Science with Microsoft SQL Server 2016 is available now as a download from Microsoft Academy. (Desktop and mobile PDF formats available. Free registration required.)

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Comparing Symmetric Eigenvalue Performance

By wrathematics

(This article was first published on wrathematics, and kindly contributed to R-bloggers)

Lazy reader’s guide: skip to the pretty pictures, skim the conclusions section, ignore the rest.

Background

I think a lot about eigenvalue and singular value decompositions. I won’t get into it right now, but I have been quoted in the past as saying that singular values are eigenvalues for grownups. But today we need to go back to gradeschool to talk about singular values for babies: eigenvalues.

In R, most of your linear algebra is handled by the BLAS and LAPACK libraries. The BLAS handles really basic vector and matrix operations, while LAPACK has all the fun stuff like factorizations, including eigenvalues. This bit is important but has been written about ad infinitum, so I’ll keep it brief: use good BLAS. R ships with the Reference BLAS, which are slow. If you don’t know what any of this means, the easiest way forward is to just download Microsoft R Open (formerly from Revolution Analytics). It’s basically R compiled with Intel compilers and shipped with MKL (good BLAS, and more actually).
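If you want a quick-and-dirty way to see whether your R is linked against a fast BLAS (a rough sketch of my own, not part of the original post): time a medium-sized matrix multiply. With an optimized, multithreaded BLAS this should take a fraction of a second; with the reference BLAS it will typically take several seconds.

# crude BLAS sanity check: dense matrix multiply
n <- 2000
x <- matrix(rnorm(n * n), n, n)
system.time(x %*% x)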

There are about a million ways to compute eigenvalues in LAPACK, but if your data looks like the way we store matrices in R, there are two significant ones: relatively robust representations (RRR) and divide and conquer (D&C). If you search around, you will find people generally say that

  1. RRR is faster than D&C
  2. RRR uses less RAM than D&C

But I couldn’t really find much more detail than that. I’ve been doing some experiments of my own with computing eigenvalues in the band package, which is a new framework for handling banded matrices in R. In the course of my experiments, I made a wrapper for the D&C method, and ended up comparing it against R. My implementation was faster, by a margin that made me raise an eyebrow. Eventually I figured out that R uses the RRR method; but this seemed to run counter to the internet wisdom that D&C is slower.

The great thing about doing computational work is that you don’t have to accept anybody else’s bullshit. We can test this, and I don’t need a team of researchers or grant money or IRB signatures or anything. We just need a free afternoon and a little elbow grease.

The Benchmark

The quick summary is that we’ll be measuring RRR versus D&C wall clock run times with 5 replications with the following variations:

  1. 9 different problem sizes (“n”), namely 25, 50, 100, 250, 500, 1000, 2000, 4000, and 8000.
  2. 4 different BLAS libraries: Reference, Atlas, OpenBLAS, and MKL.

Hardware is a 4 core Intel Core i5-2500K CPU @ 3.30GHz. Additional software is as follows:

  • R 3.1.1
  • gcc 6.2.0-5
  • Microsoft R Open 3.3.1 and Intel MKL
  • Whatever versions of Reference, Atlas, and OpenBLAS are in Ubuntu 16.10 at the time of writing.

Reference and Atlas timings are with one core; OpenBLAS and MKL are with 4.

Results

Displaying all of this is kind of difficult, but ggplot2 certainly does make the process less painful. And I’ve never claimed to be a vis expert; so I’d better not see a bunch of vis hipsters posting sassy bullshit on twitter over my plots.

There are several things that are striking in this first plot:

  • MKL is only slightly faster than OpenBLAS; both are much faster than Atlas and Reference BLAS (old news).
  • There’s very little variability in the run times, particularly for the larger problem sizes.
  • We still haven’t really answered our original question.

To better address the RRR vs D&C bit, let’s focus in only on the two largest problem sizes:

For both the 4000 and 8000 problem sizes, it’s pretty obvious that D&C is faster with OpenBLAS and MKL, and slower with Reference and Atlas. So it feels like we have our answer. But feels can be deceiving.

Now after seeing this second plot, I was just about ready to call it a day. But then I started thinking: the two benchmarks where D&C consistently won were OpenBLAS and MKL, each of which was using 4 cores. By contrast, Atlas and the Reference BLAS (the two where D&C lost) were strictly serial. What if those libraries didn’t actually have faster D&C implementations, but D&C was simply more scalable?

Turns out my hunch was correct:

This run was with OpenBLAS only; we would almost certainly see the same results with MKL. But I also use this machine to watch Netflix, so I’m gonna stop heating up my house with benchmarks and go watch Star Trek.
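If you want to try the scaling experiment yourself, here is a rough sketch. I’m assuming the RhpcBLASctl package to change the BLAS thread count at runtime (the post doesn’t say how the core counts were set), and eig() is the divide-and-conquer wrapper from the author’s band package.

library(RhpcBLASctl)
library(band)   # provides eig(), the D&C wrapper used in the post

n <- 4000
x <- crossprod(matrix(rnorm(n * n), n, n)) / n   # a symmetric test matrix

for (threads in 1:4) {
  blas_set_num_threads(threads)     # takes effect with OpenBLAS/MKL
  rrr <- system.time(eigen(x))[3]   # R's default symmetric solver (RRR / dsyevr)
  dnc <- system.time(eig(x))[3]     # divide and conquer (dsyevd)
  cat(sprintf("threads = %d  RRR = %.1fs  D&C = %.1fs\n", threads, rrr, dnc))
}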

So there. We settled it…

But Wait! There’s More!

We still haven’t looked at the memory allocation issue. This is actually much easier than you might expect if you’ve never worked with LAPACK before. The reason is, LAPACK was originally written in Fortran 77 — a language that does not support dynamic memory allocation. So any time a function (subroutine in Fortran) needs some local workspace, you have to allocate it yourself and pass it in. So all LAPACK functions are very up front in the documentation about allocation sizes. They even have a cute way of telling you the allocation requirements at run time (pass lwork = -1 and the optimal workspace size is returned in the work argument).

Anyway, D&C uses the dsyevd() function, and RRR uses dsyevr() (links are to the original Fortran source code). You can skim the source to see if I’m lying to you, or take my word for it that the memory requirements are:

library(memuse)

dnc = function(n) mu(1+6*n + 2*n*n)*8 + mu(3 + 5*n)*4
rrr = function(n) mu(26*n)*8 + mu(10*n)*4

Now, if you’re willing to destroy the input data, then this calculation is a little unfair to D&C; the two get much closer in terms of the total amount of memory required to perform the calculation. But in R, we can’t destroy the input data, so for any R bindings that is ultimately irrelevant. Also, the RRR figure is a lower bound on the additional memory requirements. I went to the trouble of checking what it actually uses on my machine, and it’s never more than 2x the lower bound. You’re about to see why this factor of 2 is irrelevant.

Here are some figures for our n=8000 problem size:

n = 8000

# Size of an 8000x8000 matrix
howbig(n, n)
## 488.281 MiB

# Extra memory use for D&C method
dnc(n)
## 977.081 MiB

# Extra memory use for RRR method
rrr(n)
## 1.892 MiB

Pretty striking. If we had a problem size that pushed the RAM limits of most modern workstations, it doesn’t get any better:

n = 46340

# Size of a 46340x46340 matrix
howbig(n, n)
## 15.999 GiB

# Extra memory use for D&C method
dnc(n)
## 32.002 GiB

# Extra memory use for RRR method
rrr(n)
## 10.960 MiB

Conclusions

  1. If you don’t know how to manage the BLAS that you’re using, for god’s sake just use the Revolution/Microsoft distribution of R.
  2. RRR has better runtime performance than D&C in serial, and worse in parallel.
  3. D&C gobbles up RAM at an astounding rate.
  4. People who say that the D&C method is slower are not quite right, or at least not completely right.
  5. People who say that the D&C method uses more RAM are underselling their case.
  6. It would be really interesting to see how this scales with accelerator cards, like GPU’s and Intel MIC.

Data and R Scripts

Here’s the data:

And here’s the benchmarking script. I gave it the very descriptive name x.r:

type = commandArgs(trailingOnly=TRUE)

library(band)
library(coop)

data = function(n)
{
  x = rnorm(n*n)
  dim(x) = c(n, n)
  covar(x)
}


nreps = 5

N = c(25, 50, 100, (2^(0:5))*250)

rowlen = length(N)*nreps
df = data.frame(numeric(rowlen), character(rowlen), numeric(rowlen), numeric(rowlen), stringsAsFactors=FALSE)
colnames(df) = c("n", "type", "RRR", "D&C")

row = 1L
for (n in N){
  x = data(n)
  
  for (rep in 1:nreps){
    eigen_r = system.time(eigen(x))[3]
    eigen_me = system.time(eig(x))[3]
    df[row, ] = c(n, type, eigen_r, eigen_me)
    row = row+1L
  }
}

Invoke this via Rscript x.r NameOfBLASLib. So for example, if we were running with OpenBLAS, we’d use:

Rscript x.r OpenBLAS

Now unfortunately, this naming scheme doesn’t actually set the BLAS library. It’s only for plotting later. I ran this on an Ubuntu machine, where changing BLAS is trivial. Simply run:

sudo update-alternatives --config libblas.so.3

from the command line and choose the one you want. On other systems you have to swap things out manually, or possibly even rebuild R.

As for the plots, if you read in the eig_bench.csv data above to the R object df, then this is straightforward:

library(reshape2)
library(plyr)
dfm = melt(df, id.vars=c("n", "type"))

se = ddply(dfm, c("n", "type", "variable"), summarise, mean=mean(value), se=sd(value)/sqrt(length(value)))
se = transform(se, lower=mean-se, upper=mean+se)

library(ggplot2)
library(scales)
ggplot(se, aes(n, mean, fill=type)) + 
  theme_bw() + 
  geom_bar(stat="identity") + 
  guides(fill=FALSE) +
  facet_grid(type ~ variable) + 
  xlab("Problem Size") +
  ylab("Log10 Average Wall Clock Time (5 Runs)") +
  scale_y_log10() +
  geom_errorbar(data=se, position=position_dodge(0.9), aes(ymin=lower, ymax=upper))

The other two plots are pretty obvious variations of this; subset for the second, use the other dataset for the third. Consider it an exercise.

A Note About Benchmarking

Choosing sizes for benchmarking is important. The problem sizes I chose aren’t arbitrary. Well…they’re kind of arbitrary, but not completely.

When you’re benchmarking, it’s very important to pick the right size of data. If you’re only interested in microsecond performance (say you need to execute something millions+ times per second), for one, stop using R immediately, you’ve made a terrible mistake. And two, you probably don’t care about large problem sizes. The same is true in reverse. If you only care about large problems, then who cares if it’s “slow” on small ones; you’re probably only thinking about something executing once.

So what if you care about both? A lot of people get this wrong for some reason, but the short version is that you should look at lots of varying problem sizes. In particular, you should make sure your data no longer comfortably fits in cache. If you haven’t measured the performance of your implementation at, say, 2x the size of your largest cache, then whatever claims you have about performance are complete bullshit. Period.

Fortunately, in R we can look up cache sizes very easily with the memuse package. This is the result from my machine (the one I ran the benchmarks on):

memuse::Sys.cachesize()
## L1I:   32.000 KiB 
## L1D:   32.000 KiB 
## L2:   256.000 KiB 
## L3:     6.000 MiB 

My biggest cache is L3, at 6 megs. Some server hardware even has L4 cache these days! Anyway, always make sure that your data doesn’t fit in cache, because this can hide terrible performance patterns.

There’s a very simple test we can do to prove this. We’ll use compiled code by way of Rcpp, because I need to have more control over the demonstration than I can get with pure R code (R functions have a lot of overhead and variability that get in the way, but this is still valid for pure R code). We’re going to create a function that takes an integer input n and outputs an nxn matrix of all 1’s. And we’ll do this in two ways: first by iterating over the rows then the columns of the matrix, and second by iterating over the columns first and then the rows. We’ll name them, oh I don’t know, bad() and good() respectively, for no particular reason. Also for no particular reason, you should know that matrices in R are column-major.

Anyway, here they are:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix bad(const int n)
{
  Rcpp::NumericMatrix x(n, n);
  
  for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
      x(i, j) = 1.;
  
  return x;
}

// [[Rcpp::export]]
Rcpp::NumericMatrix good(const int n)
{
  Rcpp::NumericMatrix x(n, n);
  
  for (int j=0; j<n; j++)
    for (int i=0; i<n; i++)
      x(i, j) = 1.;
  
  return x;
}

Save them as cache_eg.cpp and run the following in R to build them:

library(Rcpp)
library(microbenchmark)
library(memuse)

sourceCpp("cache_eg.cpp")

We can test on a small matrix:

n = 100L
howbig(n, n)
## 78.125 KiB

mb = microbenchmark(bad(n), good(n))
boxplot(mb)

And these look pretty close; maybe we shouldn’t have called one of them bad() after all?

Let’s try a larger data size…

n = 4000L
howbig(n, n)
## 122.070 MiB

mb = microbenchmark(bad(n), good(n))
boxplot(mb)

HOLY SHIT WHAT HAPPENED?!

This problem was deliberately cooked up to make this kind of comparison; we could actually make it worse by adding parallelism, but generally it won’t look this bad even if you screw your data accesses up. But if you’re only ever comparing two functions on teensy tiny data, you cannot possibly know how well they will perform on datasets that don’t fit in cache.

To better understand why this is so important, here’s a fun little interactive visualization showing the relative speeds in fetching data from RAM versus cache. You may also be interested in some of my older posts Cache Rules Everything Around Me and Advanced R Profiling with pbdPAPI.

To leave a comment for the author, please follow the link and comment on their blog: wrathematics.


Source:: R News