RMarkdown and Metropolis/Mtheme

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Nick Tierney asked on Twitter about rmarkdown and metropolis: had folks used RMarkdown-driven LaTeX Beamer presentations? And the answer is a firm hell yeah. I have been using mtheme (and/or a local variant I called 'm2') as well as its newer (renamed) release metropolis for the last year or two for all my RMarkdown-based presentations, as you can see from my presentations page.

And earlier this year I cleaned this up and wrote myself local Ubuntu packages, which are here on Launchpad. I also have two GitHub repos for the underlying .deb package code: the pkg-latex-metropolis package for the LaTeX part (which is also in TeX Live, in an older version), and the pkg-fonts-fira package for the underlying (free) font (which sadly cannot build on Launchpad as it needs a download step).

To round things out, I have now also created a public 'sample' repo on GitHub. It is complete for all but the custom per-presentation header.tex that modifies colours, adds local definitions, etc., as needed for each presentation.
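
For anyone who has not wired this up before, the YAML front matter of such a presentation can be as small as the sketch below (title, author, file names and engine choice are illustrative assumptions, not taken from the sample repo):

---
title: "My Talk"
author: "Jane Doe"
output:
  beamer_presentation:
    theme: "metropolis"
    latex_engine: xelatex
    includes:
      in_header: header.tex
---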

With that, Happy Canada Day (tomorrow, though) — never felt better to be part of something Glorious and Free, and also free of Brexit, Drumpf and other nonsense.


How to write good tests in R

By Brian Lee Yung Rowe

(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

Testing is an often overlooked yet critical component of any software system. In some ways this is more true of models than traditional software. The reason is that computational systems must function correctly at both the system level and the model level. This article provides some guidelines and tips to increase the certainty around the correctness of your models.

Testing is a critical part of any system

Guiding Principles

One of my mantras is that a good tool extends our ability and never gets in our way. I avoid many libraries and applications because the tool gets in my way more than it helps me. I look at tests the same way. If it takes too long to test, then the relative utility of the test is lower than the utility of my time. When that happens I stop writing tests. To ensure tests are worthwhile, I follow a handful of principles when writing tests.

In general, tests should be:

  • self contained – don’t rely on code outside of the test block;
  • isolated – independent and not commingled with other test cases;
  • unique – focus on a unique behavior or property of a function that has not been previously tested;
  • useful – focus on edge cases and inputs where the function may behave erratically or wildly;
  • scalable – easy to write variations without a lot of ceremony.

By following these principles, you can maximize the utility of the tests with minimal effort.

Testing as Metropolis-Hastings

I like to think about testing as an application of MCMC. Think about the function you want to test. If written without side effects, then for each input you get an output that you can examine. Covering every single input is typically untenable, so the goal is to optimize this process. That’s where the MCMC concept comes into play, specifically with the Metropolis-Hastings algorithm (M-H). This technique produces random samples that follow an arbitrary probability distribution. What this means is that where the distribution is dense, there will be more points than in an area of low probability.

Now think about testing. Usually we care about edge cases and boundary conditions, as this is where a function may behave unexpectedly. We need more tests for these conditions and fewer tests where we know values are well-behaved. In terms of M-H, given a probability distribution of likely inputs, we want to create test cases according to the inverse of this distribution! Following this approach, it’s actually possible to generate test cases randomly with a high degree of certainty that they cover the most critical inputs of the function.
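
As a rough illustration of that idea (my own sketch, not code from the post), here is one way to draw test inputs with weight inverted relative to the expected input density, assuming the function under test takes a single numeric input on [0, 100]:

set.seed(42)
grid <- seq(0, 100, by = 0.5)               # candidate inputs
likely <- dnorm(grid, mean = 50, sd = 10)   # where "real" inputs concentrate
weight <- 1 / (likely + 1e-6)               # inverse density, regularized to avoid division by zero
test_inputs <- sample(grid, size = 25, replace = TRUE, prob = weight / sum(weight))
summary(test_inputs)                        # mass piles up near 0 and 100, i.e. the edge cases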

Sane Test Creation and Management

It’s not uncommon for tests to be written at the get-go and then forgotten about. Remember that as code changes or incorrect behavior is found, new tests need to be written or existing tests need to be modified. Possibly worse than having no tests is having a bunch of tests spitting out false positives. This is because humans are prone to habituation and desensitization. It’s easy to become habituated to false positives to the point where we no longer pay attention to them.

Temporarily disabling tests may be acceptable in the short term. A more strategic solution is to optimize your test writing. The easier it is to create and modify tests, the more likely they will be correct and continue to provide value. For my testing, I generally write code to automate a lot of wiring to verify results programmatically.

The following example is from one of my interns. Most of our work at Pez.AI is in natural language processing, so we spend a lot of time constructing mini-models to study different aspects of language. One of our tools splits sentences into smaller phrases based on grammatical markers. The current test looks like

library(testthat)   # expectations used below come from testthat

df1 <- data.frame(forum = "Forums",
                  title = "My printer is offline, how do I get it back on line?",
                  thread.id = 5618300,
                  timestamp = "2016-05-13 08:50:00",
                  user.id = "Donal_M",
                  text = "My printer is always offline, how do I get it back on line?",
                  text.id = 1)

output_wds1 <- data.frame(thread.id = c(5618300, 5618300, 5618300),
                          text.id = unlist(lapply(c(1, 1, 1), as.integer)),
                          fragment.pos = c('C', 'C', 'P'),
                          fragment.type = c('how', 'it', 'on'),
                          start = unlist(lapply(c(1, 5, 7), as.numeric)),
                          end = unlist(lapply(c(4, 6, 8), as.numeric)),
                          text = c('how do i get', 'it back', 'on line'))

output_wds1$fragment.pos <- as.character(output_wds1$fragment.pos)
output_wds1$fragment.type <- as.character(output_wds1$fragment.type)

test_that("Test for HP forum", {
  expect_equivalent(mark_sentence(df1), output_wds1)
  expect_equivalent(mark_sentence(df2), output_wds2)
})

The original code actually contains a second test case, which is referenced in the test_that block. There are a number of issues with this construction. The first three principles are violated (can you explain why?), not to mention that it’s difficult to construct new tests easily. Fixing the violated principles is easy, since it just involves rearranging the code. Making the tests easier to write takes a bit more thought.

test_that("Output of mark_sentence is well-formed", {
  df <- data.frame(forum="Forums", 
    title="My printer is offline, how do I get it back on line?", 
    thread.id=5618300, 
    timestamp="2016-05-13 08:50:00",
    user.id="Donal_M",
    text="My printer is always offline, how do I get it back on line?",
    text.id=1)

  # act = actual, exp = expected
  act <- mark_sentence(df)

  expect_equal(act$fragment.pos, c('C','C','P'))
  expect_equal(act$fragment.type, c('how','it','on'))
  expect_equal(act$start, c(1,5,7))
  expect_equal(act$end, c(4,6,8))
})

Now the test is looking better. That said, it can still be a pain to construct the expected results. An alternative approach is to use a data structure that embodies the test conditions and use that directly.

exp <- read.csv(text='
fragment.pos,fragment.type,start,end
C,how,1,4
C,it,5,6
P,on,7,8
', header=TRUE, stringsAsFactors=FALSE)
fold(colnames(exp), function(col) expect_equal(act[, col], exp[, col]))

(Note that fold is in my package lambda.tools.)

The rationale behind this approach is that it can be tedious to construct complex data structures by hand. Instead, you can produce the result by calling the function to test directly, and then copy it back into the test case. The caveat is that doing this blindly doesn’t test anything, so you need to review the results before “blessing” them.

It’s easy to see how such a structure can extend to providing an input file and an expected output file, so even more scenarios can be run in a single shot.
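
As a hedged sketch of that extension (the file names and layout here are hypothetical, not from the post), the same pattern can read both sides from disk:

library(testthat)

test_that("mark_sentence matches the blessed output file", {
  input    <- read.csv("scenarios/input.csv", stringsAsFactors = FALSE)
  expected <- read.csv("scenarios/expected.csv", stringsAsFactors = FALSE)
  actual   <- mark_sentence(input)
  for (col in colnames(expected)) {
    expect_equal(actual[[col]], expected[[col]], info = col)   # report which column differs
  }
})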

Conclusion

For tests to work, we must remain sensitized to them. They need to cause us a bit of anxiety and motivate us to correct them. To maintain their effectiveness, it’s important that tests are easy to create, maintain, and produce correct results. Following the principles above will help you optimize the utility of your tests.


The useR! 2016 Tutorials

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Over the years I have seen several excellent tutorials at useR! conferences that were not only very satisfying “you had to be there” experiences but were also backed up with meticulously prepared materials of lasting value. This year, quite a few useR! 2016 tutorials measure up to this level of quality. My take on why things turned out this way is that GitHub, Markdown, and Jupyter notebooks have been universally adopted as workshop / tutorial creation tools, and that having the right tools encourages creativity and draws out one’s best efforts.

Jenny Bryan’s tutorial Using Git and GitHub with R, RStudio, and R Markdown and the tutorial by Andrie de Vries and Micheleen Harris, Using R with Jupyter notebooks for reproducible research, are two superb, Escheresque self-referencing examples of what I am talking about. Bryan’s tutorial, which uses GitHub and R Markdown to teach GitHub and R Markdown, is an impressive introduction to these two essential resources. And the tutorial by de Vries and Harris makes very effective use of GitHub and Jupyter Notebooks. Moreover, this tutorial sets the gold standard for how to set up a system for interactive user participation. Harris and de Vries staged their tutorial on Microsoft’s Azure Data Science VM. The Linux version of this VM comes provisioned with JupyterHub, a set of processes that enables a multi-user Jupyter Notebook server. Once the VM is loaded with the training materials, it’s only a matter of giving students a username and password to grant them immediate access to the interactive workshop materials. Have a look at notebook 06 to see how to set all of this up.

After seeing this, and comparing it to other tutorials where instructors wasted the better part of an hour trying to get students up and running with local copies of their course materials, I can’t see why everyone wouldn’t opt for a cloud solution to this problem. When word gets out, the Data Science VM is going to be the standard for delivering technical workshops.

Unfortunately, I couldn’t get around to see all of the tutorials, but two more that I can heartily recommend are MoRe than woRds, Text and Context: Language Analytics in Finance with R, the introduction to text mining by Sanjiv Das and Karthik Mokashi, and Machine Learning Algorithmic Deep Dive by Erin LeDell. Sanjiv Das is an inspired educator, and I have never seen a presentation, whether at R/Finance, useR! or even BARUG, the Bay Area useR Group, where he wasn’t on his game and super prepared. The tutorial he and Karthik gave this year at useR! 2016 is a self-contained course in text mining.
Erin LeDell also came prepared with more tutorial material than she could ever present in three hours. But because of her thoughtful use of GitHub, Markdown and notebooks, we have a machine learning resource that is well worth studying. Just being introduced to this incredible visualization of decision trees by Tony Chu and Stephanie Yee made my day.

LeDell is also a gifted teacher who anticipates where her audience may have difficulties. Her historical approach to understanding gradient boosting machines provides an opportunity to clarify the differences between various versions of the boosting algorithms. Sometimes understanding how something came to be is halfway towards understanding how it works.

The bar for presenting lectures, tutorials and workshops has been set pretty high. Anyone who is serious about delivering a high quality education probably needs to develop some skills with GitHub, Markdown and Notebooks. Studying the tutorial materials from useR! 2016 is a good place to start.


Join us at rstudio::conf 2017!

By Roger Oberg

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Following our initial and very gratifying Shiny Developer Conference this past January, which sold out in a few days, RStudio is very excited to announce a new and bigger conference today!

rstudio::conf, the conference about all things R and RStudio, will take place January 13 and 14, 2017 in Orlando, Florida. The conference will feature talks and tutorials from popular RStudio data scientists and developers like Hadley Wickham, Yihui Xie, Joe Cheng, Winston Chang, Garrett Grolemund, and J.J. Allaire, along with lightning talks from RStudio partners and customers.

Preceding the conference, on January 11 and 12, RStudio will offer two days of optional training. Training attendees can choose from Hadley Wickham’s Master R training, a new Intermediate Shiny workshop from Shiny creator Joe Cheng or a new workshop from Garrett Grolemund that is based on his soon-to-be-published book with Hadley: Introduction to Data Science with R.

rstudio::conf is for R and RStudio users who want to learn how to write better Shiny applications in a better way, explore all the new capabilities of the R Markdown authoring framework, apply R to big data and work effectively with Spark, understand the RStudio toolchain for data science with R, discover best practices and tips for coding with RStudio, and investigate enterprise-scale development and deployment practices and tools, including the new RStudio Connect.

Not to be missed, RStudio has also reserved Universal Studio’s The Wizarding World of Harry Potter on Friday night, January 13, for the exclusive use of conference attendees!

Conference attendance is limited to 400. Training is limited to 70 students for each of the three 2-day workshops. All seats are available on a first-come, first-served basis.

Please go to http://www.rstudio.com/conference to purchase.

We hope to see you in Florida at rstudio::conf 2017!

For questions or issues registering, please email conf@rstudio.com. To ask about sponsorship opportunities contact anne@rstudio.com.


Case Study: Customized R Training and a “Day 1” Curriculum

By Ari Lamstein

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Earlier this year I had the honor of training the research division of a financial services firm in R. I’ve been meaning to write a case study on this project for a while, but have put it off due to the size and complexity of the engagement.

In this post I’ll limit myself to talking about two aspects of the engagement: the Project Assessment and the “Day 1 Curriculum” that I created for the client.

Bonus: Download My “Day 1” Curriculum for R!

Project Assessment

After our initial discussions the client and I decided to move forward with a Project Assessment. A Project Assessment is more formal than a simple conversation in that:

  1. I sign a Non-Disclosure Agreement (NDA), which allows us to speak frankly about the changes that the client is seeking to make
  2. I submit a proposal after the meeting, which provides my recommendations to them

During the assessment I had three goals:

  1. Learn the current situation of the team
  2. Learn where the team wanted to be, and why
  3. Develop a mental model of how to get the team from A to B

During the assessment I learned that each team member had their own analytical tool of choice (e.g. R, Excel and SAS). For a variety of reasons, the team wanted all of the members to start using R for the majority of their work. In my opinion, the largest problem they faced was quickly getting all of their members to a point where they could effectively work in R.

I also learned that this wasn’t the team’s first attempt at migrating to R. Unfortunately, though, previous attempts to transition had led to some frustration.

Proposal

My proposal was to get each member of the team to a point where they could do their normal work in R. Because some of the team members had trepidation about the change, I decided to rely heavily on pair programming: that allowed me to work one-on-one with people to address any concerns that they had.

I also wanted to avoid teaching abstract techniques to people, while leaving it up to them to apply those techniques to their “real” work. To avoid this I familiarized myself with their data, and developed an R curriculum based on their actual data and normal analyses.

A “Day 1” Curriculum

While most of the training material I created was geared specifically towards the client’s data, my introductory material was plain R, and might be useful to other people as well. It’s an R script that I went through when pair programming on my first day with the analysts. It covers the basics of R, and includes both examples and exercises. If you’d like, you can download the file below.

Bonus: Download My “Day 1” Curriculum for R!

Please contact me if you are interested in me running a similar training session with you or your team.

The post Case Study: Customized R Training and a “Day 1” Curriculum appeared first on AriLamstein.com.


Boost Your Data Munging with R

By Jan Górecki – R

Figure: monthly Stack Overflow questions tagged data.table (only questions tagged data.table, not those with accepted data.table answers).

(This article was first published on Jan Gorecki – R, and kindly contributed to R-bloggers)

This article was first published on the toptal.com blog.

Additionally, please note that my blog is migrating to a new host because GitHub Pages is dropping support for the RDiscount, Redcarpet, and RedCloth (Textile) markup engines. The old host will still be available, but new posts will be published on jangorecki.gitlab.io, a drop-in replacement after changing github.io to gitlab.io.


The R language is often perceived as a language for statisticians and data scientists. Quite a long time ago, this was mostly true. However, over the years the flexibility R provides via packages has made R into a more general-purpose language. R was open sourced in 1995, and since that time repositories of R packages have been constantly growing. Still, compared to languages like Python, R remains strongly oriented around data.

Speaking about data, tabular data deserves particular attention, as it’s one of the most commonly used data types. It is the data type that corresponds to a table structure known from databases, where each column can be of a different type, and processing performance for that particular data type is the crucial factor for many applications.

In this article, we are going to present how to achieve tabular data transformation in an efficient manner. Many people who use R already for machine learning are not aware that data munging can be done faster in R, and that they do not need to use another tool for it.

High-performance Solution in R

Base R introduced the data.frame class in 1997, based on the earlier S-PLUS. Unlike commonly used databases which store data row by row, R’s data.frame stores the data in memory as a column-oriented structure, making it more cache-efficient for the column operations common in analytics. Additionally, even though R is a functional programming language, it does not enforce that on the developer. Both opportunities have been well addressed by the data.table R package, which is available in the CRAN repository. It performs grouping operations quite fast, and is particularly memory efficient by being careful about materializing intermediate data subsets, for example materializing only those columns necessary for a certain task. It also avoids unnecessary copies through its reference semantics when adding or updating columns. The first version of the package was published in April 2006, significantly improving data.frame performance at that time. The initial package description was:

This package does very little. The only reason for its existence is that the white book specifies that data.frame must have rownames. This package defines a new class data.table which operates just like a data.frame, but uses up to 10 times less memory, and can be up to 10 times faster to create (and copy). It also takes the opportunity to allow subset() and with() like expressions inside the []. Most of the code is copied from base functions with the code manipulating row.names removed.

Since then, both the data.frame and data.table implementations have been improved, but data.table remains considerably faster than base R. In fact, data.table isn’t just faster than base R; it appears to be one of the fastest open-source data wrangling tools available, competing with tools like Python’s pandas and columnar storage databases or big data apps like Spark. Its performance over distributed shared infrastructure hasn’t yet been benchmarked, but being able to handle up to two billion rows on a single instance gives promising prospects. Outstanding performance goes hand-in-hand with the functionality. Additionally, with recent efforts at parallelizing time-consuming parts for incremental performance gains, one direction for pushing the performance limit seems quite clear.

Data Transformation Examples

Learning R gets a little bit easier because of the fact that it works interactively, so we can follow examples step by step and look at the results of each step at any time. Before we start, let’s install the data.table package from CRAN repository.

install.packages("data.table")

Useful hint: we can open the manual of any function just by typing its name with a leading question mark, e.g. ?install.packages.

Loading Data into R

There are tons of packages for extracting data from a wide range of formats and databases, often including native drivers. We will load data from a CSV file, the most common format for raw tabular data. The file used in the following examples can be found here. We don’t have to worry about CSV reading performance, as the fread function is highly optimized for that.

In order to use any function from a package, we need to load it with the library call.

library(data.table)
DT <- fread("flights14.csv")
print(DT)
##         year month day dep_delay arr_delay carrier origin dest air_time
##      1: 2014     1   1        14        13      AA    JFK  LAX      359
##      2: 2014     1   1        -3        13      AA    JFK  LAX      363
##      3: 2014     1   1         2         9      AA    JFK  LAX      351
##      4: 2014     1   1        -8       -26      AA    LGA  PBI      157
##      5: 2014     1   1         2         1      AA    JFK  LAX      350
##     ---                                                                
## 253312: 2014    10  31         1       -30      UA    LGA  IAH      201
## 253313: 2014    10  31        -5       -14      UA    EWR  IAH      189
## 253314: 2014    10  31        -8        16      MQ    LGA  RDU       83
## 253315: 2014    10  31        -4        15      MQ    LGA  DTW       75
## 253316: 2014    10  31        -5         1      MQ    LGA  SDF      110
##         distance hour
##      1:     2475    9
##      2:     2475   11
##      3:     2475   19
##      4:     1035    7
##      5:     2475   13
##     ---              
## 253312:     1416   14
## 253313:     1400    8
## 253314:      431   11
## 253315:      502   11
## 253316:      659    8

If our data is not well modeled for further processing and needs to be reshaped from long-to-wide or wide-to-long (also known as pivot and unpivot) format, we may look at the ?dcast and ?melt functions, known from the reshape2 package. However, data.table implements faster and more memory-efficient methods for the data.table/data.frame classes.
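
As a small sketch of what that looks like on the flights data loaded above (the column choices here are only illustrative):

long <- melt(DT,
             id.vars = c("carrier", "origin", "dest"),
             measure.vars = c("dep_delay", "arr_delay"),
             variable.name = "delay_type",
             value.name = "delay")     # wide-to-long
wide <- dcast(long,
              carrier + origin + dest ~ delay_type,
              value.var = "delay",
              fun.aggregate = mean)    # long-to-wide, aggregating duplicates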

Querying with data.table Syntax

If You’re Familiar with data.frame

Querying a data.table is very similar to querying a data.frame. While filtering in the i argument, we can use column names directly without accessing them with the $ sign, as in df[df$col > 1, ]. When providing the next argument, j, we provide an expression to be evaluated within the scope of our data.table. To pass a non-expression j argument, use with=FALSE. The third argument, not present in the data.frame method, defines the groups, making the expression in j be evaluated by group.

# data.frame
DF[DF$col1 > 1L, c("col2", "col3")]
# data.table
DT[col1 > 1L, .(col2, col3), ...] # by group using: `by = col4`

If You’re Familiar with Databases

Querying a data.table in many respects corresponds to SQL queries that more people may be familiar with. DT below represents the data.table object and corresponds to SQL’s FROM clause.

DT[ i = where,
    j = select | update,
    by = group by]
  [ having, ... ]
  [ order by, ... ]
  [ ... ] ... [ ... ]

Sorting Rows and Re-Ordering Columns

Sorting data is a crucial transformation for time series, and it is also important for data extraction and presentation. Sorting can be achieved by providing the integer vector of row order to the i argument, the same way as with a data.frame. The first argument in the query, order(carrier, -dep_delay), will select data in ascending order on the carrier field and descending order on the dep_delay measure. The second argument, j, as described in the previous section, defines the columns (or expressions) to be returned and their order.

ans <- DT[order(carrier, -dep_delay),
          .(carrier, origin, dest, dep_delay)]
head(ans)
##    carrier origin dest dep_delay
## 1:      AA    EWR  DFW      1498
## 2:      AA    JFK  BOS      1241
## 3:      AA    EWR  DFW      1071
## 4:      AA    EWR  DFW      1056
## 5:      AA    EWR  DFW      1022
## 6:      AA    EWR  DFW       989

To re-order data by reference, instead of querying data in a specific order, we use the set* functions.

setorder(DT, carrier, -dep_delay)
leading.cols <- c("carrier","dep_delay")
setcolorder(DT, c(leading.cols, setdiff(names(DT), leading.cols)))
print(DT)
##         carrier dep_delay year month day arr_delay origin dest air_time
##      1:      AA      1498 2014    10   4      1494    EWR  DFW      200
##      2:      AA      1241 2014     4  15      1223    JFK  BOS       39
##      3:      AA      1071 2014     6  13      1064    EWR  DFW      175
##      4:      AA      1056 2014     9  12      1115    EWR  DFW      198
##      5:      AA      1022 2014     6  16      1073    EWR  DFW      178
##     ---                                                                
## 253312:      WN       -12 2014     3   9       -21    LGA  BNA      115
## 253313:      WN       -13 2014     3  10       -18    EWR  MDW      112
## 253314:      WN       -13 2014     5  17       -30    LGA  HOU      202
## 253315:      WN       -13 2014     6  15        10    LGA  MKE      101
## 253316:      WN       -13 2014     8  19       -30    LGA  CAK       63
##         distance hour
##      1:     1372    7
##      2:      187   13
##      3:     1372   10
##      4:     1372    6
##      5:     1372    7
##     ---              
## 253312:      764   16
## 253313:      711   20
## 253314:     1428   17
## 253315:      738   20
## 253316:      397   16

Most often, we don’t need both the original dataset and the ordered/sorted dataset. By default, the R language, like other functional programming languages, returns sorted data as a new object, and thus requires twice as much memory as sorting by reference.

Subset Queries

Let’s create a subset dataset for flight origin “JFK” and months from 6 to 9. In the second argument, we subset the results to the listed columns, adding one calculated variable, sum_delay.

ans <- DT[origin == "JFK" & month %in% 6:9,
          .(origin, month, arr_delay, dep_delay, sum_delay = arr_delay + dep_delay)]
head(ans)
##    origin month arr_delay dep_delay sum_delay
## 1:    JFK     7       925       926      1851
## 2:    JFK     8       727       772      1499
## 3:    JFK     6       466       451       917
## 4:    JFK     7       414       450       864
## 5:    JFK     6       411       442       853
## 6:    JFK     6       333       343       676

By default, when subsetting a dataset on a single column, data.table will automatically create an index for that column. This results in much faster answers on any further filtering calls on that column.
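
A minimal sketch of that behavior (my own illustration, assuming a reasonably recent data.table version where auto-indexing is enabled by default):

ans <- DT[dest == "LAX"]   # first single-column filter builds a secondary index on `dest`
indices(DT)                # the index is now listed
ans <- DT[dest == "ORD"]   # later filters on `dest` can reuse it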

Update Dataset

Adding a new column by reference is performed using the := operator, which assigns a variable into the dataset in place. This avoids an in-memory copy of the dataset, so we don’t need to assign the result to a new variable.

DT[, sum_delay := arr_delay + dep_delay]
head(DT)
##    carrier dep_delay year month day arr_delay origin dest air_time
## 1:      AA      1498 2014    10   4      1494    EWR  DFW      200
## 2:      AA      1241 2014     4  15      1223    JFK  BOS       39
## 3:      AA      1071 2014     6  13      1064    EWR  DFW      175
## 4:      AA      1056 2014     9  12      1115    EWR  DFW      198
## 5:      AA      1022 2014     6  16      1073    EWR  DFW      178
## 6:      AA       989 2014     6  11       991    EWR  DFW      194
##    distance hour sum_delay
## 1:     1372    7      2992
## 2:      187   13      2464
## 3:     1372   10      2135
## 4:     1372    6      2171
## 5:     1372    7      2095
## 6:     1372   11      1980

To add more variables at once, we can use the DT[, `:=`(sum_delay = arr_delay + dep_delay)] syntax, similar to .(sum_delay = arr_delay + dep_delay) when querying from the dataset.
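
For instance, a short sketch (illustrative column names, applied to a copy so the post's running example is untouched) adding two columns in one call by reference:

tmp <- copy(DT)            # work on a copy; `:=` would otherwise modify DT in place
tmp[, `:=`(dep_delay_hours = dep_delay / 60,
           arr_delay_hours = arr_delay / 60)]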

It is possible to sub-assign by reference, updating only particular rows in place, just by combining with i argument.

DT[origin=="JFK",
   distance := NA]
head(DT)
##    carrier dep_delay year month day arr_delay origin dest air_time
## 1:      AA      1498 2014    10   4      1494    EWR  DFW      200
## 2:      AA      1241 2014     4  15      1223    JFK  BOS       39
## 3:      AA      1071 2014     6  13      1064    EWR  DFW      175
## 4:      AA      1056 2014     9  12      1115    EWR  DFW      198
## 5:      AA      1022 2014     6  16      1073    EWR  DFW      178
## 6:      AA       989 2014     6  11       991    EWR  DFW      194
##    distance hour sum_delay
## 1:     1372    7      2992
## 2:       NA   13      2464
## 3:     1372   10      2135
## 4:     1372    6      2171
## 5:     1372    7      2095
## 6:     1372   11      1980

Aggregate Data

To aggregate data, we provide the third argument by to the square bracket. Then, in j we need to provide aggregate function calls, so the data can be actually aggregated. The .N symbol used in the j argument corresponds to the number of all observations in each group. As previously mentioned, aggregates can be combined with subsets on rows and selecting columns.

ans <- DT[,
          .(m_arr_delay = mean(arr_delay),
            m_dep_delay = mean(dep_delay),
            count = .N),
          .(carrier, month)]
head(ans)
##    carrier month m_arr_delay m_dep_delay count
## 1:      AA    10    5.541959    7.591497  2705
## 2:      AA     4    1.903324    3.987008  2617
## 3:      AA     6    8.690067   11.476475  2678
## 4:      AA     9   -1.235160    3.307078  2628
## 5:      AA     8    4.027474    8.914054  2839
## 6:      AA     7    9.159886   11.665953  2802

Often, we may need to compare a value of a row to its aggregate over a group. In SQL, we apply aggregates over partition by: AVG(arr_delay) OVER (PARTITION BY carrier, month).

ans <- DT[,
          .(arr_delay, carrierm_mean_arr = mean(arr_delay),
            dep_delay, carrierm_mean_dep = mean(dep_delay)),
          .(carrier, month)]
head(ans)
##    carrier month arr_delay carrierm_mean_arr dep_delay carrierm_mean_dep
## 1:      AA    10      1494          5.541959      1498          7.591497
## 2:      AA    10       840          5.541959       848          7.591497
## 3:      AA    10       317          5.541959       338          7.591497
## 4:      AA    10       292          5.541959       331          7.591497
## 5:      AA    10       322          5.541959       304          7.591497
## 6:      AA    10       306          5.541959       299          7.591497

If we don’t want to query data with those aggregates, and instead just want to put them into the actual table, updating it by reference, we can accomplish that with the := operator. This avoids an in-memory copy of the dataset, so we don’t need to assign results to a new variable.

DT[,
   `:=`(carrierm_mean_arr = mean(arr_delay),
        carrierm_mean_dep = mean(dep_delay)),
   .(carrier, month)]
head(DT)
##    carrier dep_delay year month day arr_delay origin dest air_time
## 1:      AA      1498 2014    10   4      1494    EWR  DFW      200
## 2:      AA      1241 2014     4  15      1223    JFK  BOS       39
## 3:      AA      1071 2014     6  13      1064    EWR  DFW      175
## 4:      AA      1056 2014     9  12      1115    EWR  DFW      198
## 5:      AA      1022 2014     6  16      1073    EWR  DFW      178
## 6:      AA       989 2014     6  11       991    EWR  DFW      194
##    distance hour sum_delay carrierm_mean_arr carrierm_mean_dep
## 1:     1372    7      2992          5.541959          7.591497
## 2:       NA   13      2464          1.903324          3.987008
## 3:     1372   10      2135          8.690067         11.476475
## 4:     1372    6      2171         -1.235160          3.307078
## 5:     1372    7      2095          8.690067         11.476475
## 6:     1372   11      1980          8.690067         11.476475

Join Datasets

Consistent with base R subsetting, joining and merging of datasets in data.table is treated as a special type of subset operation. We provide the dataset to which we want to join in the first square-bracket argument, i. For each row in the dataset provided to i, we match rows from the dataset on which we use [. If we want to keep only matching rows (an inner join), we pass an extra argument, nomatch = 0L. We use the on argument to specify the columns on which we want to join the two datasets.

# create reference subset
carrierdest <- DT[, .(count=.N), .(carrier, dest) # count by carrier and dest
                  ][1:10                        # just 10 first groups
                    ]                           # chaining `[...][...]` as subqueries
print(carrierdest)
##     carrier dest count
##  1:      AA  DFW  5877
##  2:      AA  BOS  1173
##  3:      AA  ORD  4798
##  4:      AA  SEA   298
##  5:      AA  EGE    85
##  6:      AA  LAX  3449
##  7:      AA  MIA  6058
##  8:      AA  SFO  1312
##  9:      AA  AUS   297
## 10:      AA  DCA   172
# outer join
ans <- carrierdest[DT, on = c("carrier","dest")]
print(ans)
##         carrier dest count dep_delay year month day arr_delay origin
##      1:      AA  DFW  5877      1498 2014    10   4      1494    EWR
##      2:      AA  BOS  1173      1241 2014     4  15      1223    JFK
##      3:      AA  DFW  5877      1071 2014     6  13      1064    EWR
##      4:      AA  DFW  5877      1056 2014     9  12      1115    EWR
##      5:      AA  DFW  5877      1022 2014     6  16      1073    EWR
##     ---                                                             
## 253312:      WN  BNA    NA       -12 2014     3   9       -21    LGA
## 253313:      WN  MDW    NA       -13 2014     3  10       -18    EWR
## 253314:      WN  HOU    NA       -13 2014     5  17       -30    LGA
## 253315:      WN  MKE    NA       -13 2014     6  15        10    LGA
## 253316:      WN  CAK    NA       -13 2014     8  19       -30    LGA
##         air_time distance hour sum_delay carrierm_mean_arr
##      1:      200     1372    7      2992          5.541959
##      2:       39       NA   13      2464          1.903324
##      3:      175     1372   10      2135          8.690067
##      4:      198     1372    6      2171         -1.235160
##      5:      178     1372    7      2095          8.690067
##     ---                                                   
## 253312:      115      764   16       -33          6.921642
## 253313:      112      711   20       -31          6.921642
## 253314:      202     1428   17       -43         22.875845
## 253315:      101      738   20        -3         14.888889
## 253316:       63      397   16       -43          7.219670
##         carrierm_mean_dep
##      1:          7.591497
##      2:          3.987008
##      3:         11.476475
##      4:          3.307078
##      5:         11.476475
##     ---                  
## 253312:         11.295709
## 253313:         11.295709
## 253314:         30.546453
## 253315:         24.217560
## 253316:         17.038047
# inner join
ans <- DT[carrierdest,                # for each row in carrierdest
          nomatch = 0L,               # return only matching rows from both tables
          on = c("carrier","dest")]   # joining on columns carrier and dest
print(ans)
##        carrier dep_delay year month day arr_delay origin dest air_time
##     1:      AA      1498 2014    10   4      1494    EWR  DFW      200
##     2:      AA      1071 2014     6  13      1064    EWR  DFW      175
##     3:      AA      1056 2014     9  12      1115    EWR  DFW      198
##     4:      AA      1022 2014     6  16      1073    EWR  DFW      178
##     5:      AA       989 2014     6  11       991    EWR  DFW      194
##    ---                                                                
## 23515:      AA        -8 2014    10  11       -13    JFK  DCA       53
## 23516:      AA        -9 2014     5  21       -12    JFK  DCA       52
## 23517:      AA        -9 2014     6   5        -6    JFK  DCA       53
## 23518:      AA        -9 2014    10   2       -21    JFK  DCA       51
## 23519:      AA       -11 2014     5  27        10    JFK  DCA       55
##        distance hour sum_delay carrierm_mean_arr carrierm_mean_dep count
##     1:     1372    7      2992          5.541959          7.591497  5877
##     2:     1372   10      2135          8.690067         11.476475  5877
##     3:     1372    6      2171         -1.235160          3.307078  5877
##     4:     1372    7      2095          8.690067         11.476475  5877
##     5:     1372   11      1980          8.690067         11.476475  5877
##    ---                                                                  
## 23515:       NA   15       -21          5.541959          7.591497   172
## 23516:       NA   15       -21          4.150172          8.733665   172
## 23517:       NA   15       -15          8.690067         11.476475   172
## 23518:       NA   15       -30          5.541959          7.591497   172
## 23519:       NA   15        -1          4.150172          8.733665   172

Be aware that, because of the consistency with base R subsetting, the outer join is by default a RIGHT OUTER join. If we are looking for a LEFT OUTER join, we need to swap the tables, as in the example above. The exact behavior can also be easily controlled via the merge data.table method, which uses the same API as base R’s merge for data.frames.
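
For example, a brief sketch (my own, reusing the objects defined above) of the same joins expressed through merge, where all.x/all.y control the outer behavior:

ans <- merge(DT, carrierdest, by = c("carrier", "dest"))                # inner join
ans <- merge(DT, carrierdest, by = c("carrier", "dest"), all.x = TRUE)  # left outer join: keep every row of DT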

If we simply want to look up a column (or columns) from another dataset, we can do it efficiently with the := operator in the j argument while joining. In the same way as we sub-assign by reference, as described in the Update Dataset section, we now just add a column by reference from the dataset to which we join. This avoids an in-memory copy of the data, so we don’t need to assign results into new variables.

DT[carrierdest,                     # data.table to join with
   lkp.count := count,              # lookup `count` column from `carrierdest`
   on = c("carrier","dest")]        # join by columns
head(DT)
##    carrier dep_delay year month day arr_delay origin dest air_time
## 1:      AA      1498 2014    10   4      1494    EWR  DFW      200
## 2:      AA      1241 2014     4  15      1223    JFK  BOS       39
## 3:      AA      1071 2014     6  13      1064    EWR  DFW      175
## 4:      AA      1056 2014     9  12      1115    EWR  DFW      198
## 5:      AA      1022 2014     6  16      1073    EWR  DFW      178
## 6:      AA       989 2014     6  11       991    EWR  DFW      194
##    distance hour sum_delay carrierm_mean_arr carrierm_mean_dep lkp.count
## 1:     1372    7      2992          5.541959          7.591497      5877
## 2:       NA   13      2464          1.903324          3.987008      1173
## 3:     1372   10      2135          8.690067         11.476475      5877
## 4:     1372    6      2171         -1.235160          3.307078      5877
## 5:     1372    7      2095          8.690067         11.476475      5877
## 6:     1372   11      1980          8.690067         11.476475      5877

To aggregate while joining, use by = .EACHI. It performs a join that won’t materialize intermediate join results and will apply aggregates on the fly, making it memory efficient.
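
A minimal sketch (mine, reusing DT and carrierdest from above): compute the mean arrival delay for each row of carrierdest during the join itself:

ans <- DT[carrierdest,
          .(m_arr_delay = mean(arr_delay)),   # aggregate evaluated per row of carrierdest
          on = c("carrier", "dest"),
          by = .EACHI]
head(ans)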

Rolling join is an uncommon feature, designed for dealing with ordered data. It fits perfectly for processing temporal data, and time series in general. When there is no exact match, it basically rolls the previous observation forward onto the query value. Use it by providing the roll argument when joining.
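
A toy sketch (mine, not from the article) of a rolling join: for each query time, take the most recent price observed at or before it:

prices <- data.table(time = c(1L, 5L, 10L), price = c(100, 101, 99))
query  <- data.table(time = c(3L, 7L, 12L))
prices[query, on = "time", roll = TRUE]   # last observation rolled forward: 100, 101, 99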

The fast overlap join joins datasets based on periods, handling their overlaps using various overlap operators: any, within, start, end.

A non-equi join feature, for joining datasets on non-equality conditions, is currently being developed.

Profiling Data

When exploring our dataset, we may sometimes want to collect technical information on the subject, to better understand the quality of the data.

Descriptive Statistics

summary(DT)
##    carrier            dep_delay            year          month       
##  Length:253316      Min.   :-112.00   Min.   :2014   Min.   : 1.000  
##  Class :character   1st Qu.:  -5.00   1st Qu.:2014   1st Qu.: 3.000  
##  Mode  :character   Median :  -1.00   Median :2014   Median : 6.000  
##                     Mean   :  12.47   Mean   :2014   Mean   : 5.639  
##                     3rd Qu.:  11.00   3rd Qu.:2014   3rd Qu.: 8.000  
##                     Max.   :1498.00   Max.   :2014   Max.   :10.000  
##                                                                      
##       day          arr_delay           origin              dest          
##  Min.   : 1.00   Min.   :-112.000   Length:253316      Length:253316     
##  1st Qu.: 8.00   1st Qu.: -15.000   Class :character   Class :character  
##  Median :16.00   Median :  -4.000   Mode  :character   Mode  :character  
##  Mean   :15.89   Mean   :   8.147                                        
##  3rd Qu.:23.00   3rd Qu.:  15.000                                        
##  Max.   :31.00   Max.   :1494.000                                        
##                                                                          
##     air_time        distance           hour         sum_delay      
##  Min.   : 20.0   Min.   :  80.0   Min.   : 0.00   Min.   :-224.00  
##  1st Qu.: 86.0   1st Qu.: 529.0   1st Qu.: 9.00   1st Qu.: -19.00  
##  Median :134.0   Median : 762.0   Median :13.00   Median :  -5.00  
##  Mean   :156.7   Mean   : 950.4   Mean   :13.06   Mean   :  20.61  
##  3rd Qu.:199.0   3rd Qu.:1096.0   3rd Qu.:17.00   3rd Qu.:  23.00  
##  Max.   :706.0   Max.   :4963.0   Max.   :24.00   Max.   :2992.00  
##                  NA's   :81483                                     
##  carrierm_mean_arr carrierm_mean_dep   lkp.count     
##  Min.   :-22.403   Min.   :-4.500    Min.   :  85    
##  1st Qu.:  2.676   1st Qu.: 7.815    1st Qu.:3449    
##  Median :  6.404   Median :11.354    Median :5877    
##  Mean   :  8.147   Mean   :12.465    Mean   :4654    
##  3rd Qu.: 11.554   3rd Qu.:17.564    3rd Qu.:6058    
##  Max.   : 86.182   Max.   :52.864    Max.   :6058    
##                                      NA's   :229797

Cardinality

We can check the uniqueness of data by using the uniqueN function and applying it to every column. The object .SD in the query below corresponds to the Subset of the Data.table:

DT[, lapply(.SD, uniqueN)]
##    carrier dep_delay year month day arr_delay origin dest air_time
## 1:      14       570    1    10  31       616      3  109      509
##    distance hour sum_delay carrierm_mean_arr carrierm_mean_dep lkp.count
## 1:      152   25      1021               134               134        11

NA Ratio

To calculate the ratio of unknown values (NA in R, NULL in SQL) for each column, we provide the desired function to apply to every column.

DT[, lapply(.SD, function(x) sum(is.na(x))/.N)]
##    carrier dep_delay year month day arr_delay origin dest air_time
## 1:       0         0    0     0   0         0      0    0        0
##     distance hour sum_delay carrierm_mean_arr carrierm_mean_dep lkp.count
## 1: 0.3216654    0         0                 0                 0 0.9071555

Exporting Data

Fast export tabular data to CSV format is also provided by the data.table package.

tmp.csv <- tempfile(fileext=".csv")
fwrite(DT, tmp.csv)
# preview exported data
cat(system(paste("head -3",tmp.csv), intern=TRUE), sep="\n")
## carrier,dep_delay,year,month,day,arr_delay,origin,dest,air_time,distance,hour,sum_delay,carrierm_mean_arr,carrierm_mean_dep,lkp.count
## AA,1498,2014,10,4,1494,EWR,DFW,200,1372,7,2992,5.54195933456561,7.59149722735674,5877
## AA,1241,2014,4,15,1223,JFK,BOS,39,,13,2464,1.90332441727168,3.98700802445548,1173

At the time of writing, the fwrite function had not yet been published to the CRAN repository. To use it we need to install the data.table development version; otherwise we can use the base R write.csv function, but don’t expect it to be fast.

Resources

There are plenty of resources available. Besides the manuals available for each function, there are also package vignettes, which are tutorials focused on a particular subject. Those can be found on the Getting started page. Additionally, the Presentations page lists more than 30 materials (slides, video, etc.) from data.table presentations around the globe. Also, community support has grown over the years, recently reaching the 4000th question under the data.table tag on Stack Overflow, still with a high ratio (91.9%) of answered questions. The plot below presents the number of data.table-tagged questions on Stack Overflow over time.

Summary

This article provides selected examples of efficient tabular data transformation in R using the data.table package. The actual figures on performance can be examined by looking for reproducible benchmarks. I published a summarized blog post about data.table solutions for the top 50 rated Stack Overflow questions for the R language, called Solve common R problems efficiently with data.table, where you can find many figures and reproducible code. The data.table package uses a native implementation of fast radix ordering for its grouping operations, and binary search for fast subsets/joins. This radix ordering has been incorporated into base R as of version 3.3.0. Additionally, the algorithm was recently implemented in the H2O machine learning platform and parallelized over the H2O cluster, enabling efficient big joins on 10B x 10B rows.


Computerworld’s advanced beginner’s guide to R

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Many newcomers to R got their start learning the language with Computerworld’s Beginner’s Guide to R, a 6-part introduction to the basics of the language. Now, budding R users who want to take their skills to the next level have a new guide to help them: Computerworld’s Advanced Beginner’s Guide to R. Written by Sharon Machlis, author of the prior Beginner’s Guide and regular reporter of R news at Computerworld, this new 72-page guide dives into some trickier topics related to R: extracting data via API, data wrangling, and data visualization.

On the data wrangling front, the guide provides some recipes for handling messy data. You’ll learn how to transform data and add the resulting data as a new column to a data frame. There’s also an extended look at restructuring data: transforming “wide” data to “long” data, and vice versa.

For visualizing data, there’s a basic intro to the ggplot2 package and its grammar of graphics. There’s also an in-depth tutorial on creating choropleths: geographics maps with regions shaded by data values.

And as an in-depth example of importing data, you’ll learn how to use the Google Analytics API to download and prepare data on traffic to your website.

Finally, don’t miss the comprehensive index of R packages for data import, data wrangling, and visualization.

To download the 72-page PDF (free by providing your email address) visit Computerworld at the link below.

Computerworld Crash Course: Advanced Beginner’s Guide to R


Simulation and power analysis of generalized linear mixed models

By Educate-R – R

(This article was first published on Educate-R – R, and kindly contributed to R-bloggers)

Simulation and power analysis of generalized linear mixed models

Brandon LeBeau

University of Iowa

Overview

  1. (G)LMMs
  2. Power
  3. simglm package
  4. Demo Shiny App!

Linear Mixed Model (LMM)

Power

  • Power is the ability to statistically detect a true effect (i.e. a non-zero population effect).
  • For simple models (e.g. t-tests, regression) there are closed-form equations for computing power.
    • R has routines for these: power.t.test, power.anova.test
    • G*Power 3

Power Example

n <- seq(4, 1000, 2)   # per-group sample sizes to evaluate
power <- sapply(seq_along(n), function(i) 
  power.t.test(n = n[i], delta = .15, sd = 1, type = 'two.sample')$power)   # power at each n

Power for (G)LMM

Power is hard

  • In practice, power is hard.
  • Need to make many assumptions on data that has not been collected.
    • Therefore, data assumptions made for power computations will likely differ from the collected sample.
  • A power analysis needs to be flexible, exploratory, and well thought out.

simglm Overview

  • simglm aims to simulate (G)LMMs with up to three levels of nesting (with the aim to add more later).
  • Flexible data generation allows:
    • any number of covariates and discrete covariates
    • change random distribution
    • unbalanced data
    • missing data
    • serial correlation.
  • Also has routines to generate power.

Demo Shiny App

shiny::runGitHub('simglm', username = 'lebebr01', subdir = 'inst/shiny_examples/demo')

or

devtools::install_github('lebebr01/simglm')
library(simglm)
run_shiny()
  • Must have the following packages installed: simglm, shiny, shinydashboard, ggplot2, lme4, DT.

Questions?


Express Intro to dplyr

By Stevie P

(This article was first published on Rolling Your Rs, and kindly contributed to R-bloggers)

Working The Data Like a Boss !

I recently introduced the data.table package, which provides a nice way to manage and aggregate large data sources using the standard bracket notation that is commonly employed when manipulating data frames in R. As data sources grow larger, one must be prepared with a variety of approaches to efficiently handle this information. Using databases (both SQL and NoSQL) is a possibility, wherein one queries for a subset of information, although this assumes that the database is pre-existing or that you are prepared to create it yourself. The dplyr package offers ways to read in large files, interact with databases, and accomplish aggregation and summary. Some feel that dplyr is a competitor to the data.table package, though I do not share that view. I think that each offers a well-conceived philosophy and approach and does a good job of delivering on its respective design goals. That there is overlap in their potential applications simply means to me that there is another way to do something. They are just great tools in a larger toolbox, so I have no complaints. Let’s dig into dplyr to learn what it can do. Note that this post is part one of two. The second dplyr post will apply the knowledge learned in this one.

Upcoming Class

Before we get too deep into this I wanted to indicate that I will be teaching a 3-day Intro to R BootCamp in the Atlanta, GA area of the US sometime in August or September. I say “sometime” because the logistics are still under development. If interested please feel free to email me and once I get everything lined up I will get back to you with the details. You can also visit my home page. Thanks for indulging my self-promotion. Steve – Now Back to the Action…

Verbs in Action !

dplyr is based on the idea that when working with data there are a number of common activities one will pursue: reading, filtering rows on some condition, selecting or excluding columns, arranging/sorting, grouping, summarizing, merging/joining, and mutating/transforming columns. There are other activities, but these describe the main categories. dplyr presents a number of commands or “verbs” that help you accomplish the work. Note that dplyr does not replace any existing commands – it simply gives you new commands:

Command – Purpose
select() – Select columns from a data frame
filter() – Filter rows according to some condition(s)
arrange() – Sort / re-order rows in a data frame
mutate() – Create new columns or transform existing ones
group_by() – Group a data frame by some factor(s), usually in conjunction with summarize
summarize() – Summarize some values from the data frame or across groups
inner_join(x, y, by="col") – Return all rows from 'x' where there are matching values in 'y', and all columns from 'x' and 'y'. If there are multiple matches between 'x' and 'y', all combinations of the matches are returned.
left_join(x, y, by="col") – Return all rows from 'x', and all columns from 'x' and 'y'. Rows in 'x' with no match in 'y' will have 'NA' values in the new columns. If there are multiple matches between 'x' and 'y', all combinations of the matches are returned.
right_join(x, y, by="col") – Return all rows from 'y', and all columns from 'x' and 'y'. Rows in 'y' with no match in 'x' will have 'NA' values in the new columns. If there are multiple matches between 'x' and 'y', all combinations of the matches are returned.
anti_join(x, y, by="col") – Return all rows from 'x' where there are no matching values in 'y', keeping just the columns from 'x'.

readr

There is also an associated package called readr that is more efficient at ingesting CSV files than the base R functions such as read.csv. While it is not part of the dplyr package itself, it does produce a dplyr-style structure as it reads in files. readr provides the read_csv function to do the work. It is also pretty smart and can figure things out, such as whether there is a header or not, so you don’t have to provide a lot of additional arguments. Here is an example using a file that contains information on weather station measurements in the year 2013.

install.packages("readr")  # one time only 
library(readr)

url <- "http://steviep42.bitbucket.org/YOUTUBE.DIR/weather.csv"
download.file(url,"weather.csv")

system("head -5 weather.csv")  # Take a peak at the first 5 lines

"origin","year","month","day","hour","temp","dewp","humid","wind_dir","wind_speed","wind_gust","precip","pressure","visib"
"EWR",2013,1,1,0,37.04,21.92,53.97,230,10.35702,11.9186514756,0,1013.9,10
"EWR",2013,1,1,1,37.04,21.92,53.97,230,13.80936,15.8915353008,0,1013,10
"EWR",2013,1,1,2,37.94,21.92,52.09,230,12.65858,14.5672406924,0,1012.6,10
"EWR",2013,1,1,3,37.94,23,54.51,230,13.80936,15.8915353008,0,1012.7,10

weather <- read_csv("weather.csv")

weather
Source: local data frame [8,719 x 14]

   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
    (chr) (int) (int) (int) (int) (dbl) (dbl) (dbl)    (int)      (dbl)
1     EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
2     EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
3     EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
4     EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
5     EWR  2013     1     1     4 37.94 24.08 57.04      240   14.96014
6     EWR  2013     1     1     6 39.02 26.06 59.37      270   10.35702
7     EWR  2013     1     1     7 39.02 26.96 61.63      250    8.05546
8     EWR  2013     1     1     8 39.02 28.04 64.43      240   11.50780
9     EWR  2013     1     1     9 39.92 28.04 62.21      250   12.65858
10    EWR  2013     1     1    10 39.02 28.04 64.43      260   12.65858
..    ...   ...   ...   ...   ...   ...   ...   ...      ...        ...
Variables not shown: wind_gust (dbl), precip (dbl), pressure (dbl), visib (dbl)
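
If readr’s guesses are ever not what you want, you can also declare column types explicitly. Here is a small sketch of that option (the col_types argument and the cols() helpers are part of readr; the original example simply relies on the automatic guessing):

# Declare a few column types explicitly instead of relying on readr's guesses
weather <- read_csv("weather.csv",
                    col_types = cols(origin = col_character(),
                                     year   = col_integer(),
                                     temp   = col_double()))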

tbl_df

It is important to note that dplyr works transparently with existing R data frames though ideally one should explicitly create or transform an existing data frame to a dplyr structure to get the full benefit of the package. Let’s use the dplyr tbl_df command to wrap an existing data frame. We’ll convert the infamous mtcars data frame into a dplyr table since it is a small data frame that is easy to understand. The main advantage in using a ‘tbl_df’ over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text.

dp_mtcars <- tbl_df(mtcars)

# dp_mtcars is a data frame as well as a dplyr object

class(dp_mtcars)
[1] "tbl_df"     "tbl"        "data.frame"

In the example below (as with the readr example above) notice how only a subset of the data gets printed by default. This is actually very nice especially if you have ever accidentally typed the name of a really, really large native data frame. R will dutifully try to print a large portion of the data even if it locks up your R session. So wrapping the data frame in a dplyr table will prevent this. Also notice how you get a summary of the number of rows and columns as well as the type of each column.

dp_mtcars
Source: local data frame [32 x 11]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
..

Now we could start to operate on this data frame / dplyr table by using some of the commands on offer from dplyr. They do pretty much what the name implies and you could use them in isolation though the power of dplyr comes through when using the piping operator to chain together commands. We’ll get there soon enough. Here are some basic examples:

filter()


# Find all rows where MPG is >= 30 and weight is over 1,800 lbs (wt is in units of 1,000 lbs)

filter(dp_mtcars, mpg >= 30 & wt > 1.8)
Source: local data frame [2 x 11]

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1

select()

The following example illustrates how the select() function works. We will select all columns whose name begins with the letter “m”. This is more useful when you have lots of columns that are named according to some pattern. For example some Public Health data sets can have many, many columns (hundreds even) so counting columns becomes impractical which is why select() supports a form of regular expressions to find columns by name. Other helpful arguments in this category include:

Argument Purpose
ends_with(x, ignore.case=TRUE) Finds columns whose name ends with “x”
contains(x, ignore.case=TRUE) Finds columns whose name contains “x”
matches(x, ignore.case=TRUE) Finds columns whose name matches the regular expression “x”
num_range("x", 1:5, width=2) Selects all variables (numerically) from x01 to x05
one_of("x", "y", "z") Selects variables provided in a character vector

select(dp_mtcars,starts_with("m"))
Source: local data frame [32 x 1]

     mpg
   (dbl)
1   21.0
2   21.0
3   22.8
4   21.4
5   18.7
6   18.1
7   14.3
8   24.4
9   22.8
10  19.2

# Get all columns except columns 5 through 10 

select(dp_mtcars,-(5:10))
Source: local data frame [32 x 5]

     mpg   cyl  disp    hp  carb
   (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110     4
2   21.0     6 160.0   110     4
3   22.8     4 108.0    93     1
4   21.4     6 258.0   110     1
5   18.7     8 360.0   175     2
6   18.1     6 225.0   105     1
7   14.3     8 360.0   245     4
8   24.4     4 146.7    62     2
9   22.8     4 140.8    95     2
10  19.2     6 167.6   123     4
..   ...   ...   ...   ...   ...
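
The other helper functions listed in the table above work the same way as starts_with(). Here is a quick sketch using the same dp_mtcars table (contains(), ends_with(), and one_of() are standard dplyr select helpers; the particular letters chosen are just for illustration):

# Columns whose names contain the letter "a" (drat, am, gear, carb)
select(dp_mtcars, contains("a"))

# Columns whose names end with "t" (drat, wt)
select(dp_mtcars, ends_with("t"))

# Columns named in a character vector
select(dp_mtcars, one_of(c("mpg", "cyl", "hp")))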

mutate()

Here we use the mutate() function to transform the wt variable by multiplying it by 1,000, and then we create a new variable called “good_mpg” which takes on a value of “good” or “bad” depending on whether a given row’s MPG value is > 25.

mutate(dp_mtcars, wt=wt*1000, good_mpg=ifelse(mpg > 25,"good","bad"))
Source: local data frame [32 x 12]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb good_mpg
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)    (chr)
1   21.0     6 160.0   110  3.90  2620 16.46     0     1     4     4      bad
2   21.0     6 160.0   110  3.90  2875 17.02     0     1     4     4      bad
3   22.8     4 108.0    93  3.85  2320 18.61     1     1     4     1      bad
4   21.4     6 258.0   110  3.08  3215 19.44     1     0     3     1      bad
5   18.7     8 360.0   175  3.15  3440 17.02     0     0     3     2      bad
6   18.1     6 225.0   105  2.76  3460 20.22     1     0     3     1      bad
7   14.3     8 360.0   245  3.21  3570 15.84     0     0     3     4      bad
8   24.4     4 146.7    62  3.69  3190 20.00     1     0     4     2      bad
9   22.8     4 140.8    95  3.92  3150 22.90     1     0     4     2      bad
10  19.2     6 167.6   123  3.92  3440 18.30     1     0     4     4      bad
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...      ...

arrange()

Next we could sort or arrange the data according to some column values. This is usually to make visual inspection of the data easier. Let’s sort the data frame by MPG (worst first) and then, within each MPG value, by weight from heaviest to lightest.

arrange(dp_mtcars,mpg,desc(wt))
Source: local data frame [32 x 11]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4
2   10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4
3   13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4
4   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
5   14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
6   15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
7   15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3
8   15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2
9   15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2
10  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Ceci n’est pas une pipe

While the above examples are instructive they are not, at least in my opinion, the way to best use dplyr. Once you get up to speed with dplyr functions I think you will soon agree that using “pipes” to create chains of commands is the way to go. If you come from a UNIX background you will no doubt have heard of “pipes” which is a construct allowing you to take the output of one command and route it or “pipe” it to the input of another program. This can be done for several commands thus creating a chain of piped commands. One can save typing while creating, in effect, a “one line program” that does a lot. Here is an example of a UNIX pipeline. I’m using Apple OSX but this should work on a Linux machine just as well. This example will pipe the output of the /etc/passwd file into the input of the awk command (a program used to parse files) and the output of the awk command will go into the input of the tail command which lists the last 10 lines of the final result.

 $ cat /etc/passwd | awk -F: '{print $1}' | tail
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_xcsbuildagent
_xcscredserver
_launchservicesd

$ 

This is a great paradigm for working on UNIX that also maps well onto what one does in data exploration. When first encountering data you rarely know what it is you want to get from it (unless you are a student and your teacher told you specifically what she or he wants). So you embark on some exploratory work and start to interrogate the data, which might first require some filtering, maybe exclusion of incomplete data, or maybe some imputation for missing values. Until you have worked with it for a while you don’t want to change the data – you just want to experiment with various transformed and grouped versions of it, which is much easier if you use dplyr. Just pipe various commands together to clean up your data, make some visualizations, and perhaps generate some hypotheses about your data. You will find yourself generating some pretty involved ad hoc command chains without having to create a standalone script file.

The dplyr package uses the magrittr package to enable this piping capability within R. The “pipe” character is “%>%”, which is different from the traditional UNIX pipe, the vertical bar “|”. But don’t let the visual difference confuse you: conceptually, pipes in R work just like they do in UNIX. The magrittr package has a motto, “Ceci n’est pas une pipe”, presumably in acknowledgement of the noted difference and also as a tribute to the painter René Magritte’s work La trahison des images.
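
Before chaining several steps together, it may help to see the equivalence spelled out. This is just a minimal sketch: the pipe passes the object on its left as the first argument of the function on its right, so the two calls below produce identical results.

# Classic function call
filter(dp_mtcars, mpg >= 25)

# Same thing expressed with the pipe
dp_mtcars %>% filter(mpg >= 25)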


# Here we filter rows where MPG is >= 25 and then select only columns 1-4
# and 10-11 (dropping columns 5-9).

dp_mtcars %>% filter(mpg >= 25) %>% select(-c(5:9)) 
Source: local data frame [6 x 6]

    mpg   cyl  disp    hp  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  32.4     4  78.7    66     4     1
2  30.4     4  75.7    52     4     2
3  33.9     4  71.1    65     4     1
4  27.3     4  79.0    66     4     1
5  26.0     4 120.3    91     5     2
6  30.4     4  95.1   113     5     2

Next we filter rows where MPG is >= 25 and then select only columns 1-4 and 10-11, after which we sort the result by MPG from highest to lowest. You can keep adding as many pipes as you wish. At first, while you are becoming familiar with the idea, it is best to keep the pipeline relatively short so you can check your work. But it will not be long before you are stringing together lots of different commands. dplyr enables and encourages this type of activity so don’t be shy.

dp_mtcars %>% filter(mpg >= 25) %>% select(-c(5:9)) %>% arrange(desc(mpg))
Source: local data frame [6 x 6]

    mpg   cyl  disp    hp  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  33.9     4  71.1    65     4     1
2  32.4     4  78.7    66     4     1
3  30.4     4  75.7    52     4     2
4  30.4     4  95.1   113     5     2
5  27.3     4  79.0    66     4     1
6  26.0     4 120.3    91     5     2

That was pretty cool, wasn’t it? We don’t need to alter dp_mtcars at all to explore it. We could change our minds about how and whether we want to filter, select, or sort. The way this works is that the output of the dp_mtcars data frame/table gets sent to the input of the filter function, which is aware of the source, which is why we don’t need to explicitly reference dp_mtcars by name. The output of the filter step gets sent to the select function, which in turn pipes or chains its output into the input of the arrange function, which sends its output to the screen. We could even pipe the output of these operations to the ggplot2 package. But first let’s convert some of the columns into factors so the resulting plot will look better.

# Turn the cyl and am variables into factors. Notice that the resulting
# output reflects the change

dp_mtcars %>%
mutate(cyl=factor(cyl,levels=c(4,6,8)),
am=factor(am,labels=c("Auto","Manual" )))
mpg    cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
(dbl) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (fctr) (dbl) (dbl)
1   21.0      6 160.0   110  3.90 2.620 16.46     0 Manual     4     4
2   21.0      6 160.0   110  3.90 2.875 17.02     0 Manual     4     4
3   22.8      4 108.0    93  3.85 2.320 18.61     1 Manual     4     1
4   21.4      6 258.0   110  3.08 3.215 19.44     1   Auto     3     1
5   18.7      8 360.0   175  3.15 3.440 17.02     0   Auto     3     2
6   18.1      6 225.0   105  2.76 3.460 20.22     1   Auto     3     1
7   14.3      8 360.0   245  3.21 3.570 15.84     0   Auto     3     4
8   24.4      4 146.7    62  3.69 3.190 20.00     1   Auto     4     2
9   22.8      4 140.8    95  3.92 3.150 22.90     1   Auto     4     2
10  19.2      6 167.6   123  3.92 3.440 18.30     1   Auto     4     4
..   ...    ...   ...   ...   ...   ...   ...   ...    ...   ...   ...

But that was kind of boring – let’s visualize this using the ggplot2 package, whose author, Hadley Wickham, is also the author of dplyr.

dp_mtcars %>% mutate(cyl=factor(cyl,levels=c(4,6,8)),
                     am=factor(am,labels=c("Auto","Manual" ))) %>%
                     ggplot(aes(x=wt,y=mpg,color=cyl)) +
                     geom_point() + facet_wrap(~am)

Okay well that might have been too much for you and that’s okay if it is. Let’s break this down into two steps. First let’s save the results of the mutate operation into a new data frame.

new_dp_mtcars <- dp_mtcars %>% mutate(cyl=factor(cyl,levels=c(4,6,8)),
                     am=factor(am,labels=c("Auto","Manual" )))

# Now we can call the ggplot command separately

ggplot(new_dp_mtcars,aes(x=wt,y=mpg,color=cyl)) +
                     geom_point() + facet_wrap(~am)

Pick whatever approach you want to break things down to the level you need. However, I guarantee that after a while you will wind up writing lots of one-line programs.

Split-Apply-Combine

There are two more commands from the dplyr package that are particularly useful in aggregating data. The group_by() and summarize() functions help us group a data frame according to some factors and then apply some summary functions across those groups. The idea is to first “split” the data into groups, “apply” some functions (e.g. mean()) to some continuous quantity relating to each group, and then combine those group specific results back into an integrated result. In the next example we will group (or split) the data frame by the cylinder variable and then summarize the mean MPG for each group and then combine that into a final aggregated result.

dp_mtcars %>% group_by(cyl) %>% summarize(avg_mpg=mean(mpg))
Source: local data frame [3 x 2]

    cyl  avg_mpg
  (dbl)    (dbl)
1     4 26.66364
2     6 19.74286
3     8 15.10000

# Let's group by cylinder then by transmission type and then apply the mean
# and sd functions to mpg

dp_mtcars %>% group_by(cyl,am) %>% summarize(avg_mpg=mean(mpg),sd=sd(mpg))
Source: local data frame [6 x 4]
Groups: cyl [?]

    cyl    am  avg_mpg        sd
  (dbl) (dbl)    (dbl)     (dbl)
1     4     0 22.90000 1.4525839
2     4     1 28.07500 4.4838599
3     6     0 19.12500 1.6317169
4     6     1 20.56667 0.7505553
5     8     0 15.05000 2.7743959
6     8     1 15.40000 0.5656854

# Note that just grouping a data frame without summary doesn't appear to do 
# much from a visual point of view. 

dp_mtcars %>% group_by(cyl)
Source: local data frame [32 x 11]
Groups: cyl [3]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
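
One more small sketch before moving on: summarize() can compute several statistics at once, and the n() helper (part of dplyr) counts the rows in each group. The particular summaries below are just for illustration:

# Count cars and compute average MPG and weight per cylinder group
dp_mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), avg_mpg = mean(mpg), avg_wt = mean(wt))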

Merging Data Frames

One of the strengths of dplyr is its ability to do merges via various “joins” like those associated with databases. There is already a built-in R command called merge that can handle merging duties, but dplyr offers flexible and extended capabilities in this regard. Moreover it does so in a way that is consistent (for the most part) with SQL, which you can use for a wide variety of data mining tasks. If you already know SQL then you will understand these commands without much effort. Let’s set up two simple example data frames to explain the concept of joining.

df1 <- data.frame(id=c(1,2,3),m1=c(0.98,0.45,0.22))
df2 <- data.frame(id=c(3,4),m1=c(0.17,0.66))

df1
  id   m1
1  1 0.98
2  2 0.45
3  3 0.22

df2
  id   m1
1  3 0.17
2  4 0.66

Left Join

Think about what it means to merge these data frames. It makes sense to want to join the data frames with respect to some common column name. In this case it is clear that the id column is in both data frames. So let’s join the data frames using “id” as a “key”. The question is what to do about the fact that ids 1 and 2 in df1 have no match in df2, and id 4 in df2 has no match in df1. This is why different types of joins exist. Let’s see how they work. We’ll start with the left join:

left_join(df1,df2,by="id")
  id m1.x m1.y
1  1 0.98   NA
2  2 0.45   NA
3  3 0.22 0.17

So the left join looks at the first data frame, df1, and then attempts to find corresponding “id” values in df2 that match each id value in df1. Of course there are no ids matching 1 or 2 in df2, so what happens? The left join inserts NAs in the m1.y column for those rows since there are no matching values in df2. Note that there is in fact an id of value 3 in both data frames, so for that row both measurement columns are filled in. Also note that both data frames have a column named “m1”, so the join has to create unique names to accommodate both columns. The “x” and “y” come from the fact that df1 comes before df2 in the calling sequence to left_join(). Thus “x” matches df1 and “y” matches df2.
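
For comparison, the built-in merge command mentioned earlier can produce the same left join. A quick sketch using df1 and df2 from above (merge() is base R, not dplyr):

# all.x = TRUE keeps every row of df1, filling unmatched df2 values with NA
merge(df1, df2, by = "id", all.x = TRUE, suffixes = c(".x", ".y"))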

Inner Join

Let’s join the two data frames in a way that yields only the intersection of the two data structures based on “id”. Using visual examination we can see that there is only one id in common to both data frames – id 3.

inner_join(df1,df2,by="id")
  id m1.x m1.y
1  3 0.22 0.17

More Involved Join Examples

Now we’ll look at a more advanced example. Let’s create two data frames where the first, (we’ll call it “authors”), presents a list of, well, authors. The second data frame presents a list of books published by various authors. Each data frame has some additional attributes of interest.

# For reference sake - these data frames come from the examples contained in 
# the help pages for the built-in R merge command

authors <- data.frame(
         surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
         nationality = c("US", "Australia", "US", "UK", "Australia"),
         deceased = c("yes", rep("no", 4)))
     
books <- data.frame(
         name = I(c("Tukey", "Venables", "Tierney",
                  "Ripley", "Ripley", "McNeil", "R Core")),
         title = c("Exploratory Data Analysis",
                   "Modern Applied Statistics ...",
                   "LISP-STAT",
                   "Spatial Statistics", "Stochastic Simulation",
                   "Interactive Data Analysis",
                   "An Introduction to R"),
         other.author = c(NA, "Ripley", NA, NA, NA, NA,
                          "Venables & Smith"))

authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
 
books
      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith

At first glance it appears that there is nothing in common between these two data frames in terms of column names. However, it is fairly obvious that the “surname” column in the authors data frame matches the “name” column in books so we could probably use those as keys to join the two data frames. We also see that there is an author, “R Core” (meaning the R Core Team), who appears in the books table but is not listed as an author in the authors data frame. This kind of thing happens all the time in real life so better get used to it. Let’s do some reporting using these two data frames:

Let’s find all authors listed in the authors table who published a book along with their book titles, other authors, nationality, and living status. Let’s try an inner join on this. Because we don’t have any common column names between books and authors we have to tell the join what columns to use for matching. The by argument exists for this purpose. Note also that the author “R Core” listed in books isn’t printed here because that author does not also exist in the authors table. This is because the inner join looks for the intersection of the tables.


inner_join(books,authors,by=c("name"="surname"))
      name                         title other.author nationality deceased
1    Tukey     Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics ...       Ripley   Australia       no
3  Tierney                     LISP-STAT         <NA>          US       no
4   Ripley            Spatial Statistics         <NA>          UK       no
5   Ripley         Stochastic Simulation         <NA>          UK       no
6   McNeil     Interactive Data Analysis         <NA>   Australia       no

# We could have also done a right join since this will require a result that has
# all rows from the "right" data frame (in the "y" position) which in this case is 
# authors

right_join(books,authors,by=c("name"="surname"))
      name                         title other.author nationality deceased
1    Tukey     Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics ...       Ripley   Australia       no
3  Tierney                     LISP-STAT         <NA>          US       no
4   Ripley            Spatial Statistics         <NA>          UK       no
5   Ripley         Stochastic Simulation         <NA>          UK       no
6   McNeil     Interactive Data Analysis         <NA>   Australia       no

Next, find any and all authors who published a book even if they do not appear in the authors table. The result should show names, titles, other authors, nationality, and living status. Let’s do a left join, which will pull in all rows from “x” (books); where there is no matching key/name in authors, NAs will be inserted for the columns coming from the “y” (authors) table.

left_join(books,authors,by=c("name"="surname"))
      name                         title     other.author nationality deceased
1    Tukey     Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics ...           Ripley   Australia       no
3  Tierney                     LISP-STAT             <NA>          US       no
4   Ripley            Spatial Statistics             <NA>          UK       no
5   Ripley         Stochastic Simulation             <NA>          UK       no
6   McNeil     Interactive Data Analysis             <NA>   Australia       no
7   R Core          An Introduction to R Venables & Smith        <NA>     <NA>

Do the same as above but have the result show only the book title and name columns, in that order. This is simply a matter of doing the previous join and piping the result to a select() statement.

left_join(books,authors,by=c("name"="surname")) %>% select(title,name)
                          title     name
1     Exploratory Data Analysis    Tukey
2 Modern Applied Statistics ... Venables
3                     LISP-STAT  Tierney
4            Spatial Statistics   Ripley
5         Stochastic Simulation   Ripley
6     Interactive Data Analysis   McNeil
7          An Introduction to R   R Core

Find the book titles of all US authors who are not deceased. First we filter the authors table to keep only the rows matching the specified conditions. Then we pass the result to an inner_join() to pull in the book titles, and finally we select() only the title column. Note that because we are piping the filter() output we don’t need to name it explicitly in the call to inner_join(); the filter() result takes the “x” position in the join.

authors %>% filter(deceased == "no" & nationality == "US") %>%
            inner_join(books,by=c("surname"="name")) %>% select(title)
title
1 LISP-STAT

Find any book titles for authors who do not appear in the authors data frame. Here we use an anti_join(), which returns all rows from books where there are no matching values in authors, keeping just the columns from books – and then we pass that result to select() for title and name.

anti_join(books,authors,by=c("name"="surname")) %>% select(title,name)
                 title   name
1 An Introduction to R R Core

Up Next – Biking in San Francisco

That’s it for now. We have covered a lot of ground in one go, although once you invest some time in playing with dplyr (especially the pipes) it becomes difficult to go back to the “old ways” of doing things. Next up we will look at some “real” data and apply our newfound knowledge to working with it. The data set actually comes from the Kaggle Project page for the San Francisco Bay Area Bike Share Service. The data is about 660MB and you can download it, though you will need a Kaggle account. You might want to go ahead and download that data in anticipation of the next posting.


To leave a comment for the author, please follow the link and comment on their blog: Rolling Your Rs.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Making “Time Rivers” in R

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Once again, @albertocairo notices an interesting chart and spurs pondering in the visualization community with his post covering an unusual “vertical time series” chart produced for the print version of the NYTimes:

I’m actually less concerned about the vertical time series chart component here since I agree with TAVE* Cairo that folks are smart enough to grok it and that it will be a standard convention soon enough given the prevalence of our collective tiny, glowing rectangles. The Times folks plotted Martin-Quinn (M-Q) scores for the U.S. Supreme Court justices which are estimates of how liberal or conservative a justice was in a particular term. Since they are estimates they aren’t exact and while it’s fine to plot the mean value (as suggested by the M-Q folks), if we’re going to accept the intelligence of the reader to figure out the nouveau time series layout, perhaps we can also show them some of the uncertainty behind these estimates.

What I’ve done below is take the data provided by the M-Q folks and make what I’ll call a vertical time series river plot using the mean, median and one standard deviation. This shows the possible range of real values the estimates can take and provides a less-precise but more forthright view of the values (in my opinion). You can see right away that the estimates are not so precise, but there is still an overall trend for the justices to become more liberal in modern times.

The ggplot2 code is a bit intricate, which is one reason I’m posting it. You need to reorient your labeling mind due to the need to use coord_flip(). I also added an arrow on the Y-axis to show how time flows. I think the vis community will need to help standardize on some good practices for how to deal with these vertical time series charts to help orient readers more quickly. In a more dynamic visualization, either using something like D3 or even just stop-motion animation, the flow could actually draw in the direction time flows, which would definitely make it easier to orient the reader immediately.

However, the main point here is to not be afraid to show uncertainty. In fact, the more we all work at it, the better we’ll all be able to come up with effective ways to show it.

* == “The Awesome Visualization Expert” since he winced at my use of “Dr. Cairo” 🙂

library(dplyr)
library(readr)
library(ggplot2)  # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(grid)
library(scales)

URL <- "http://mqscores.berkeley.edu/media/2014/justices.csv"
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)

justices <- read_csv(fil)

justices %>%
  filter(term>=1980,
         justiceName %in% c("Thomas", "Scalia", "Alito", "Roberts", "Kennedy",
                            "Breyer", "Kagan", "Ginsburg", "Sotomayor")) %>%
  mutate(col=ifelse(justiceName %in% c("Breyer", "Kagan", "Ginsburg", "Sotomayor"),
                    "Democrat", "Republican")) -> recent

just_labs <- data_frame(
  label=c("Thomas", "Scalia", "Alito", "Roberts", "Kennedy", "Breyer", "Kagan", "Ginsburg", "Sotomayor"),
      x=c(  1990.5,   1985.5,  2004.5,    2004.5,    1986.5,      1994,   2010,     1992.5,      2008.5),
      y=c(     2.9,      1.4,    1.35,       1.7,       1.0,      -0.1,   -0.9,       -0.1,          -2)
)

gg <- ggplot(recent)
gg <- gg + geom_hline(yintercept=0, alpha=0.5)
gg <- gg + geom_label(data=data.frame(x=c(0.1, -0.1),
                                      label=c("More →nconservative", "← Morenliberal"),
                                      hjust=c(0, 1)), aes(y=x, x=1982, hjust=hjust, label=label),
                      family="Arial Narrow", fontface="bold", size=4, label.size=0, vjust=1)
gg <- gg + geom_ribbon(aes(ymin=post_mn-post_sd, ymax=post_mn+post_sd, x=term,
                             group=justice, fill=col, color=col), size=0.1, alpha=0.3)
gg <- gg + geom_line(aes(x=term, y=post_med, color=col, group=justice), size=0.1)
gg <- gg + geom_text(data=just_labs, aes(x=x, y=y, label=label),
                     family="Arial Narrow", size=2.5)
gg <- gg + scale_x_reverse(expand=c(0,0), limits=c(2014, 1982),
                           breaks=c(2014, seq(2010, 1990, -10), 1985, 1982),
                           labels=c(2014, seq(2010, 1990, -10), "1985\nTERM\n↓", ""))
gg <- gg + scale_y_continuous(expand=c(0,0), labels=c(-2, "0\nM-Q Score", 2, 4))
gg <- gg + scale_color_manual(name=NULL, values=c(Democrat="#2166ac", Republican="#b2182b"), guide=FALSE)
gg <- gg + scale_fill_manual(name="Nominated by a", values=c(Democrat="#2166ac", Republican="#b2182b"))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL,
                title="Martin-Quinn scores for selected justices, 1985-2014",
                subtitle="Ribbon band derived from mean plus one standard deviation. Inner line is the M-Q median.",
                caption="Data source: http://mqscores.berkeley.edu/measures.php")
gg <- gg + theme_hrbrmstr_an(grid="XY")
gg <- gg + theme(plot.subtitle=element_text(margin=margin(b=15)))
gg <- gg + theme(legend.title=element_text(face="bold"))
gg <- gg + theme(legend.position=c(0.05, 0.6))
gg <- gg + theme(plot.margin=margin(20,20,20,20))
gg

Yes, I manually positioned the names of the justices, hence the weird spacing for those lines. Also, after publishing this post, I tweaked the line-height of the “More Liberal”/”More Conservative” top labels a bit and would definitely suggest doing that to anyone attempting to reproduce this code (the setting I used was 0.9).

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News