Using R to detect fraud at 1 million transactions per second

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In Joseph Sirosh’s keynote presentation at the Data Science Summit on Monday, Wee Hyong Tok demonstrated using R in SQL Server 2016 to detect fraud in real-time credit card transactions at a rate of 1 million transactions per second. The demo (which starts at the 17:00 mark) used a gradient-boosted tree model to predict the probability of a credit card transaction being fraudulent, based on attributes like the charge amount and the country of origin. A stored procedure in SQL Server 2016 was then used to score transactions streaming into the database at a rate of 3.6 billion per hour.
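The scoring step described above can be sketched in plain R. This is a hypothetical stand-in with simulated data — the demo itself used a gradient-boosted tree inside SQL Server, whereas here a base-R logistic regression plays the role of the fitted model so the example is self-contained:

```r
# Toy sketch of model-based fraud scoring (simulated data; a logistic
# regression stands in for the demo's gradient-boosted tree).
set.seed(42)
n <- 1000
transactions <- data.frame(
  amount  = rexp(n, rate = 1/100),  # charge amount
  foreign = rbinom(n, 1, 0.2)       # country-of-origin flag
)

# Simulate labels: large foreign charges are more likely to be fraudulent
p <- plogis(-4 + 0.01 * transactions$amount + 2 * transactions$foreign)
transactions$fraud <- rbinom(n, 1, p)

# Fit a scoring model, then attach a fraud probability to each transaction
model <- glm(fraud ~ amount + foreign, data = transactions, family = binomial)
transactions$score <- predict(model, type = "response")

# Highest-risk transactions first
head(transactions[order(-transactions$score), ])
```

In production the `predict` step is what runs inside the stored procedure, applied to each incoming batch of transactions.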

Later in the keynote (starting at 25:00), John Salch, VP of Technology and Platforms at PROS, describes using R to determine prices for airline tickets, hotel rooms, and laptops. PROS had been using R in development for a while, but found that running R within SQL Server 2016 was 100 times (not 100%, 100x!) faster for price optimization. “This really woke us up that we can use R in a production setting … it’s truly amazing,” he says.

It’s great to see these global-scale applications of R, driving the intelligence of businesses behind the scenes. As Joseph said in the opening, “If there’s one language you should learn today … it’s R.”

Channel 9: Microsoft Machine Learning & Data Science Summit 2016

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Introducing the eRum 2016 sponsors

By Tal Galili


Guest post by Maciej Beręsewicz.

It has been a very active semester for R meetings in Europe. After two successful major events, satRday in Budapest and EARL in London, it is now the turn of eRum 2016, the first European R users meeting. More than 250 data scientists will meet in Poznań, Poland, from 12 to 14 October to discuss and present innovative uses of the statistical language R across industry, academia, and government.

Sponsors are a crucial part of an event like this, for many reasons: first, they make it possible to keep registration fees low, so more people can attend; they help validate the meeting within the global R community; they contribute sessions on innovative applications in real business environments; and they add further value to the conference through direct interaction with the other attendees during coffee breaks and social events, or at their exhibition stands.

As organizers of eRum 2016 we are very proud and grateful to have well known and innovative companies who believe in our contribution to the R community and will make eRum 2016 a great meeting of useRs. Let us introduce our current sponsors!

  • McKinsey & Company is a leading international management consulting firm with more than 100 offices in over 60 countries. They are a trusted advisor to the world’s leading businesses, governments, and institutions on issues of strategy, organization and operations. McKinsey Analytics brings the latest analytical techniques plus a deep understanding of industry dynamics and corporate functions to help clients create the most value from data. The company has rapidly growing Analytics teams tackling tough challenges in this area for clients across all industries and geographies. Peek inside a team room and learn about life at McKinsey from some of their Analytics colleagues.
  • Microsoft. It’s hard for us to tell you something about Microsoft you don’t know, but let us try: it has the most open source contributors on GitHub, it is responsible for developing the Microsoft R Open and Microsoft R Server editions, and its commitment to R is also reflected in R’s integration with its tools SQL Server, Power BI, Azure, and Cortana Analytics.
  • Analyx stands for marketing analytics at CMO level. They combine decades of industry experience with state-of-the-art big data methods, and cast this into easy-to-operate software tools for improving the success of their customers’ marketing activities.
  • eoda provides ongoing assistance to clients in handling their data effectively – from collection, through analysis, to deriving recommendations for action. Their portfolio of solutions includes consulting, analytic services, the R Academy, and data science products.
  • DataCamp is the easiest way to learn data science online. Their R courses range from general topics such as “Introduction to R” to more specific ones like “Credit Risk Modelling”, and they also offer on-demand data science training for business teams.
  • Quantide specializes in IT services for statistical purposes. The company is a leading expert in R, offering data analysis, reporting, dashboards, and advanced development and analytics to its clients, and also offers R training on topics such as data mining, data manipulation, and statistical models, among others.
  • RStudio is the home of open source and enterprise-ready professional software for R. Not only is their RStudio product the most popular R IDE among users, but their packages, including shiny, ggplot2, dplyr, R Markdown, and many more, are also some of the most used by the R community.
  • WLOG Solutions offers mathematical modelling methods to help their clients take advantage of the data they possess to solve complex decision-making problems. Their solutions include optimization, data analysis and simulations.

If you would like to add your company to this list, please write to us by the end of September. And for those of you attending eRum 2016: prepare your questions, comments, or even your CV, since they are always accepting applications from R experts like you.

Thanks again to our wonderful sponsors!

On behalf of the eRum organizing committee,

Maciej Beręsewicz

Source:: R News

The Simpsons by the Data

By Todd Schneider


(This article was first published on Category: R | Todd W. Schneider, and kindly contributed to R-bloggers)

The Simpsons needs no introduction. At 27 seasons and counting, it’s the longest-running scripted series in the history of American primetime television.

The show’s longevity, and the fact that it’s animated, provides a vast and relatively unchanging universe of characters to study. It’s easier for an animated show to scale to hundreds of recurring characters; without live-action actors to grow old or move on to other projects, the denizens of Springfield remain mostly unchanged from year to year.

As a fan of the show, I present a few short analyses about Springfield, from the show’s dialogue to its TV ratings. All code used for this post is available on GitHub.

The Simpsons characters who have spoken the most words

Simpsons World provides a delightful trove of content for fans. In addition to streaming every episode, the site includes episode guides, scripts, and audio commentary. I wrote code to parse the available episode scripts and attribute every word of dialogue to a character, then ranked the characters by number of words spoken in the history of the show.
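The attribution step can be illustrated with a toy base-R sketch — the real post parses the Simpsons World episode scripts, whereas this uses a made-up three-line excerpt purely to show the word-counting logic:

```r
# Minimal sketch of attributing dialogue to characters and ranking
# them by total words spoken, on an invented script fragment.
script <- data.frame(
  character = c("Homer", "Marge", "Homer"),
  line      = c("Mmm donuts", "Homer please", "Woo hoo"),
  stringsAsFactors = FALSE
)

# Count words in each line, then sum per character
words_per_line <- lengths(strsplit(script$line, "\\s+"))
word_counts <- tapply(words_per_line, script$character, sum)
sort(word_counts, decreasing = TRUE)
```

Applied to all ~600 episode scripts, the same per-character totals produce the ranking discussed below.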

The top four are, not surprisingly, the Simpson nuclear family.

If you want to quiz yourself, pause here and try to name the next 5 biggest characters in order before looking at the answers…

Of course Homer ranks first: he’s the undisputed most iconic character, and he accounts for 21% of the show’s 1.3 million words spoken through season 26. Marge, Bart, and Lisa—in that order—combine for another 26%, giving the Simpson family a 47% share of the show’s dialogue.

If we exclude the Simpson nuclear family and focus on the top 50 supporting characters, the results become a bit less predictable, if not exactly surprising.

[Chart: words spoken by the top 50 supporting characters]

Mr. Burns speaks the most words among supporting cast members, followed by Moe, Principal Skinner, and Ned Flanders, with Krusty rounding out the top 5.

Gender imbalance on The Simpsons

The colors of the bars in the above graphs represent gender: blue for male characters, red for female. If we look at the supporting cast, the 14 most prominent characters are all male before we get to the first woman, Mrs. Krabappel, and only 5 of the top 50 supporting cast members are women.

Women account for only 25% of the dialogue on The Simpsons, even including Marge and Lisa, two of the show’s main characters. If we remove the Simpson nuclear family, things look even more lopsided: women account for less than 10% of the supporting cast’s dialogue.

A look at the show’s list of writers reveals that 9 of the top 10 writers are male. I did not collect data on which writers wrote which episodes, but it would make for an interesting follow-up to see if the episodes written by women have a more equal distribution of dialogue between male and female characters.

Eye on Springfield

The scripts also include each scene’s setting, which I used to compute the locations with the most dialogue.


The location data is a bit messy to work with—should “Simpson Living Room” really be treated differently from “Simpson Home”?—but nevertheless it paints a picture of where people spend time in Springfield: at home, school, work, and the local bar.

The Bart-to-Homer transition?

Per Wikipedia:

While later seasons would focus on Homer, Bart was the lead character in most of the first three seasons

I’ve heard this argument before, that the show was originally about Bart before switching its focus to Homer, but the actual scripts only seem to partially support it.


Bart accounted for a significantly larger share of the show’s dialogue in season 1 than in any future season, but Homer’s share has always been higher than Bart’s. Dialogue share might not tell the whole story about a character’s prominence, but the fact is that Homer has always been the most talkative character on the show.

The Simpsons TV ratings are in decline

Historical Nielsen ratings data is hard to come by, so I relied on Wikipedia for Simpsons episode-level television viewership data.


Viewership appears to jump in 2000, between seasons 11 and 12, but closer inspection reveals that’s when the Wikipedia data switches from reporting households to individuals. I don’t know the reason for the switch—it might have something to do with Nielsen’s measurement or reporting—but without any other data sources it’s difficult to confirm.

Aside from that bump, which is most likely a data artifact, not a real trend, it’s clear that the show’s ratings are trending lower. The early seasons averaged over 20 million viewers per episode, including Bart Gets an “F”, the first episode of season 2, which is still the most-watched episode in the show’s history with an estimated 33.6 million viewers. The more recent seasons have averaged less than 5 million viewers per episode, more than an 80% decline since the show’s beginnings.



TV ratings have declined everywhere, not just on The Simpsons

Although the ratings data looks bad for The Simpsons, it doesn’t tell the whole story: TV ratings for individual shows have been broadly declining for over 60 years.

When The Simpsons came out in 1989, the 30 highest-rated shows on TV averaged a 17.7 Nielsen rating, meaning that 17.7% of television-equipped households tuned in to the average top 30 show. In 2014–15, the 30 highest-rated shows managed an 8.7 average rating, a decline of roughly 50% over that 25-year span.

If we go all the way back to 1951, the top 30 shows averaged a 38.2 rating, which is more than triple the single highest-rated program of 2014–15 (NBC’s Sunday Night Football, which averaged a 12.3 rating).
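The arithmetic behind those two comparisons is easy to check in R, using only the ratings figures quoted above:

```r
# Checking the top-30 ratings arithmetic quoted in the text
avg_1989 <- 17.7  # average top-30 Nielsen rating when The Simpsons debuted
avg_2015 <- 8.7   # average top-30 rating in 2014-15
avg_1951 <- 38.2  # average top-30 rating in 1951
snf_2015 <- 12.3  # highest-rated single program of 2014-15 (Sunday Night Football)

# Decline in the average top-30 rating since 1989
decline <- 1 - avg_2015 / avg_1989
round(decline, 3)         # roughly a 50% decline

# 1951's top-30 average vs. triple the best single program of 2014-15
avg_1951 > 3 * snf_2015   # TRUE
```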


Full data for the top 30 shows by season is available here on GitHub

I have no proof for the cause of this decline in the average Nielsen rating of a top 30 show, but intuitively it must be related to the proliferation of channels. TV viewers in the 1950s had a small handful of channels to choose from, while modern viewers have hundreds if not thousands of choices, not to mention streaming options, which present their own ratings measurement challenges.



We could normalize Simpsons episode ratings by the declining top 30 curve to adjust for the fact that it’s more difficult for any one show to capture as large a share of the TV audience over time. But as mentioned earlier, the normalization would only account for about a 50% decline in ratings since 1989, while The Simpsons ratings have declined more like 80-85% over that horizon.
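A sketch of that normalization, with illustrative numbers rather than the actual series data: dividing each season's viewership by the top-30 index for its year separates the industry-wide decline from the show-specific one.

```r
# Sketch of normalizing viewership by the declining top-30 curve
# (the Simpsons viewer figures here are illustrative, not the real data)
top30_avg <- c(`1989` = 17.7, `2015` = 8.7)  # average top-30 Nielsen rating
simpsons  <- c(`1989` = 20.0, `2015` = 3.5)  # millions of viewers (illustrative)

# Express each season relative to the TV environment of its year
environment_index <- top30_avg / top30_avg[["1989"]]
normalized <- simpsons / environment_index

# Raw decline is ~82%; the environment-adjusted decline is smaller
round(1 - simpsons[["2015"]] / simpsons[["1989"]], 2)
round(1 - normalized[["2015"]] / normalized[["1989"]], 2)
```

With these numbers the adjustment explains only part of the drop, which is the point made above: the show's ratings fell faster than television's overall.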

Alas, I must confess, I stopped watching the show around season 12, and Simpsons World’s episode view counts suggest that modern streaming viewers are more interested in the early seasons too, so it could just be that people are losing interest.

As I write this, The Simpsons is under contract to be produced for one more season, though it’s entirely possible it will be renewed. But ultimately Troy McClure said it best at the conclusion of The Simpsons 138th Episode Spectacular, which, hard as it is to believe, now covers less than 25% of the show’s history:

[Screencap: Troy McClure]


Automated episode summaries using tf–idf

Term frequency–inverse document frequency is a popular technique used to determine which words are most significant to a document that is itself part of a larger corpus. In our case, the documents are individual episode scripts, and the corpus is the collection of all scripts.

The idea behind tf–idf is to find words or phrases that occur frequently within a single document, but rarely within the overall corpus. To use a specific example from The Simpsons, the phrase “dental plan” appears 19 times in Last Exit to Springfield, but only once throughout the rest of the show, and sure enough the tf–idf algorithm identifies “dental plan” as the most relevant phrase from that episode.
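That idea can be demonstrated with a tiny base-R tf–idf computation on an invented two-document corpus (the post itself uses tidytext on the full scripts; the vocabulary here is made up to echo the "dental plan" example):

```r
# Tiny tf-idf in base R: a term scores high when it is frequent in one
# document but rare across the corpus. Toy two-document corpus.
docs <- list(
  last_exit = c("dental", "plan", "dental", "plan", "strike"),
  other_ep  = c("donut", "beer", "plan")
)
terms <- unique(unlist(docs))

# Term frequency per document, one column per document
tf <- sapply(docs, function(d) table(factor(d, levels = terms)) / length(d))

# Inverse document frequency: penalize terms that appear everywhere
idf <- log(length(docs) / rowSums(tf > 0))

tfidf <- tf * idf

# The most distinctive term of the first "episode"
rownames(tfidf)[which.max(tfidf[, "last_exit"])]
```

Note that "plan" appears in both documents, so its idf is zero and it drops out; "dental", concentrated in one document, wins.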

I used R’s tidytext package to pull out the single word or phrase with the highest tf–idf rank for each episode; here’s the relevant section of code.

The results are pretty good, and should be at least slightly entertaining to fans of the show. Beyond “dental plan”, there are fan-favorites including “kwyjibo”, “down the well”, “monorail”, “I didn’t do it”, and “Dr. Zaius”, though to be fair, there are also some less iconic results.

You can see the full list of episodes and “most relevant phrases” here.

Another interesting follow-up could be to use more sophisticated techniques to write more complete episode summaries based on the scripts, but I was pleasantly surprised by the relevance of the comparatively simple tf–idf approach.



Code on GitHub

All code used in this post is available on GitHub, and the screencaps come from the amazing Frinkiac.



[Interactive chart removed: “The Simpsons TV ratings by episode” — a scatter plot of US viewers in millions by original air date, data via Wikipedia. Hover to view individual episode data, click and drag to zoom.]

To leave a comment for the author, please follow the link and comment on their blog: Category: R | Todd W. Schneider.

Source:: R News

Efficient Data Manipulation with R Course | Milan

By Quantide srl

data manipulation

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Efficient Data Manipulation with R is our second course of the Fall Term. It will take place on October 17-18 in Legnano (Milan).

This class will be a good fit for you if you have a working knowledge of R and regularly work with data and databases.

You will learn to organize your data manipulation tasks in a standard and clear way, write clean and efficient code, and build reproducible data management processes, using the most modern R tools: tidyr, dplyr and lubridate packages.


Data manipulation in a data flow

  • Tidying data with tidyr
  • The fundamental verbs of data manipulation: select, filter, arrange, mutate and summarize
  • Group-wise calculations
  • Date handling with lubridate
  • Joining tables
  • Chain operators
  • do() as a generic data manipulation tool
  • Programming with dplyr and tidyr: NSE vs SE
  • Working with backend databases

The course is open to a maximum of 6 attendees: FAQ, detailed program and tickets here.

Efficient Data Manipulation with R is organized by the R training and consulting company Quantide and is taught in Italian.

All the course materials are in English.

If you want to know more about Quantide, check out Quantide’s website.


Legnano is about 30 min by train from Milano. Trains from Milano to Legnano are scheduled every 30 minutes, and Quantide premises are 3 walking minutes from Legnano train station.

Other R courses | Autumn term

October 25-26: Statistical Models with R. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now

November 7-8: Data Mining with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now

November 15-16: R for Developers. Move forward from being an R user to becoming an R developer. Discover R’s working mechanisms and master your R programming skills. Reserve now

For further information contact us at training[at]quantide[dot]com

The post Efficient Data Manipulation with R Course | Milan appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

Source:: R News

Warsaw R Enthusiast Meetups Season Finale

By Marcin Kosiński

The Warsaw R and Data Analytics Enthusiasts group aims to bring together users of the R language in Warsaw, Poland. Our group has over 970 members on its meetup page. In this post I summarize the group and our last 2 meetups before the season's end, and present plans for the future. Read on to find out who we are, what we talk about, what our future meetings will cover, and how you can become a member or co-organizer of Warsaw R Enthusiasts events.

We have been meeting for 3 years now and have organized 19 regular meetups with speaker sessions (2×30–35 min talks, sometimes 3×30 min). At every gathering we provide free pizza and organize an afterparty to encourage networking. Our local group is strong, and our speakers are always up to date with the newest R inventions and discoveries. Together with the local Data Science Academic Society and the MI2 group we have also organized 6 R workshops and the Team Data Analytics MiNI-Marathon.

Last Season Finale

Before the holidays, at our 18th meeting, we had the pleasure of listening to Data Scientist Michał Janusz, who talked about using R to improve the global paper industry. We also had the opportunity to learn more about the Rcpp package from Data Analyst Zygmunt Zawadzki, who during his presentation mentioned his new Rcpp reimplementation of the FSelector package – the FSelectorRcpp package. Later, at the 19th meetup, called SER 19 %>% PwC <- XGBoost + SQLServer2016, we were pleased to listen to 2 Data Scientists from PwC Poland, Mateusz Luczko and Piotr Bochnia, who presented the XGBoost package and R integration with SQL Server 2016. Extreme Gradient Boosting is a new approach to predictive analytics and an algorithm that has become a benchmark of performance in Kaggle competitions. SQL Server integration is a great opportunity for R users working with Microsoft R Open and MRAN.

All past presentations can be found on our website (still only in Polish) or in our GitHub repository.

Future Plans – New Season

We are co-operating with R meetup organizers from various Polish cities, such as eRka (Cracow), Estymator (Poznań), Data Science Wrocław, Data Science Łódź and Academic Data Science (Gdańsk).

This October we are all co-organizing the European R Users Meeting – eRum 2016.

But just before that, you can meet us at the 10th Cracow R Users Group meetup to listen to Marta Karaś talk about Convex Clustering and Biclustering with application in R, and to me trying to explain Docker – Rocker: explanation and motivation for Docker containers usage in applications development [by R user, for R users]. You are welcome to join the meeting!

In Warsaw

Shortly after eRum2016 we are hosting R Ladies Warsaw – a workshop event with 4 simultaneous sessions about basics of R and ggplot2 package. You are still welcome to participate!

There are also plans for an R and Bioinformatics/Medical Statistics meetup in November, and we are still looking for presenters for an R and data visualization meetup in December. Feel free to write to me if you are interested in presenting any state-of-the-art graphics tips and tricks!

Source:: R News

The Financial Times uses R for Quantitative Journalism

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the 2016 EARL London conference, senior data-visualisation journalist John Burn-Murdoch described how the Financial Times uses R to produce high-quality, striking data visualisations. Until recently, charts were the realm of an information designer using tools like Adobe Illustrator: the output was beautiful, but the process was a long and winding one. The FT needed to be able to “audition” several different visual treatments quickly, to be able to create stunning visuals before deadline. That’s where R and the ggplot2 package come in.

John presented a case study (you can see the slides with animations here) on creating this FT article, Explore the changing tides of European footballing power. The final work included 128 charts in total, telling the story of dozens of soccer teams in four countries. John also shared these animated versions (along with the ggplot2 code to produce them):

You can find the data and R code behind the animations here.

John B Murdoch: Ggplot2 as a creative engine … and other ways R is transforming the FT’s quantitative journalism

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Source:: R News

First 10 Speakers announced for EARL Boston in November

By Angela Roberts

Keynote speakers

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

We are thrilled to announce our first batch of speakers for EARL Boston in November 2016.


Jenny Bryan and Ricardo Bion

Jenny is an Associate Professor of Statistics at the University of British Columbia and part of the leadership of rOpenSci; Ricardo Bion is a Data Science manager at Airbnb, where he leads two teams focused on product research and experimentation.

Other speakers announced are:

Danielle Dean & Jaya Mathews – Microsoft

Building an on-prem SQL Server Predictive Maintenance Solution

Amar Dhand – Harvard Medical School

Understanding hospital networks to improve patient outcomes and reduce healthcare costs.


Daniel Hadley – Sorenson Impact Center

Using R to Save Taxpayer Dollars


Kaori Ito – Pfizer

Data cleaning, visualization, and meta-analysis for literature data using Shiny


Jared Lander – Lander Analytics

R for Everything


David Smith – Microsoft

Microsoft and the R Ecosystem


Jeff Schneider – Department of Defense Research, Surveys and Statistics Center

Computing Domain-Based Stratified Sample Allocations using R Shiny

Loads more exciting speakers to be announced shortly…


The conference runs from Monday 7th to Wednesday 9th November 2016 and will be held at the Boston Science Museum.


The Workshops are on Monday 7th November and are as follows:

Monday 7th November

All Day – Workshop 1: Advanced Shiny Workshop
Morning – Workshop 2: Introduction to ggplot2
Afternoon – Workshop 3: Using R with Microsoft Office Products

Spaces are limited, so book your workshops and conference pass now.

Don’t miss out on this amazing R Conference!

Book your ticket for EARL Boston

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Source:: R News

sparklyr — R interface for Apache Spark

By jjallaire

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited today to announce sparklyr, a new package that provides an interface between R and Apache Spark.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

  • Interactively manipulate Spark data using both dplyr and SQL (via DBI).
  • Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
  • Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water.
  • Create extensions that call the full Spark API and provide interfaces to Spark packages.
  • Integrated support for establishing Spark connections and browsing Spark data frames within the RStudio IDE.

We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into their Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of their enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.

Getting Started

You can install sparklyr from CRAN as follows:

install.packages("sparklyr")

You should also install a local version of Spark for development purposes:

spark_install(version = "1.6.2")

If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark.

Extensive documentation and examples are available on the sparklyr website.

Connecting to Spark

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark:

sc <- spark_connect(master = "local")

The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.

Reading Data

You can copy R data frames into Spark using the dplyr copy_to function (more typically though you’ll read data within the Spark cluster using the spark_read family of functions). For the examples below we’ll copy some datasets from R into Spark (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")

Using dplyr

We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:

# filter by departure delay
flights_tbl %>% filter(dep_delay == 2)

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, ! %>%

# plot delays
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
Note that while the dplyr functions shown above look identical to the ones you use with R data frames, with sparklyr they use Spark as their back end and execute remotely in the cluster.

Window Functions

dplyr window functions are also supported, for example:

batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)

For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

Using SQL

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The spark_connection object implements a DBI interface for Spark, so you can use dbGetQuery to execute SQL and return the result as an R data frame:

iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")

Machine Learning

You can orchestrate machine learning algorithms in a Spark cluster via either Spark MLlib or via the H2O Sparkling Water extension package. Both provide a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Spark MLlib

In this example we’ll use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset, and see if we can predict a car’s fuel consumption (mpg) based on its weight (wt) and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.

# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.


Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above it’s easy to chain these functions together with dplyr pipelines. To learn more see the Spark MLlib section of the sparklyr website.
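As a local point of comparison (not part of the original post), the same mpg ~ wt + cyl regression can be fit on the full built-in mtcars data with base R's lm() — note the Spark example above additionally filters to hp >= 100 and partitions the data, so the coefficients will differ:

```r
# Local base-R analogue of the Spark MLlib model above:
# predict fuel consumption (mpg) from weight (wt) and cylinders (cyl)
local_fit <- lm(mpg ~ wt + cyl, data = mtcars)
summary(local_fit)

# Intercept plus one coefficient per feature; heavier cars and more
# cylinders are both associated with lower mpg
coef(local_fit)
```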

H2O Sparkling Water

Let’s walk through the same mtcars example, but in this case using H2O’s machine learning algorithms via the H2O Sparkling Water extension. The dplyr code used to prepare the data is the same, but after partitioning into test and training data we call h2o.glm rather than ml_linear_regression:

# convert to h2o_frame (uses the same underlying rdd)
training <- as_h2o_frame(partitions$training)
test <- as_h2o_frame(partitions$test)

# fit a linear model to the training dataset
fit <- h2o.glm(x = c("wt", "cyl"),
               y = "mpg",
               training_frame = training,
               lambda_search = TRUE)

# inspect the model
print(fit)

For linear regression models produced by H2O, we can use either print() or summary() to learn a bit more about the quality of our fit. The summary() method returns some extra information about scoring history and variable importance.

To learn more see the H2O Sparkling Water section of the sparklyr website.


Extensions

The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. Since Spark is a general purpose cluster computing system there are many potential applications for extensions (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).

The sas7bdat extension enables parallel reading of SAS datasets in the sas7bdat format into Spark data frames. The rsparkling extension provides a bridge between sparklyr and H2O’s Sparkling Water.

We’re excited to see what other sparklyr extensions the R community creates. To learn more see the Extensions section of the sparklyr website.

RStudio IDE

The latest RStudio Preview Release of the RStudio IDE includes integrated support for Spark and the sparklyr package, including tools for:

  • Creating and managing Spark connections
  • Browsing the tables and columns of Spark DataFrames
  • Previewing the first 1,000 rows of Spark DataFrames

Once you’ve installed the sparklyr package, you should find a new Spark pane within the IDE. This pane includes a New Connection dialog which can be used to make connections to local or remote Spark instances:

Once you’ve connected to Spark you’ll be able to browse the tables contained within the Spark cluster:

The Spark DataFrame preview uses the standard RStudio data viewer:

The RStudio IDE features for sparklyr are available now as part of the RStudio Preview Release. The final version of RStudio IDE that includes integrated support for sparklyr will ship within the next few weeks.


Partners

We’re very pleased to be joined in this announcement by IBM, Cloudera, and H2O, who are working with us to ensure that sparklyr meets the requirements of enterprise customers and is easy to integrate with current and future deployments of Spark.


“With our latest contributions to Apache Spark and the release of sparklyr, we continue to emphasize R as a primary data science language within the Spark community. Additionally, we are making plans to include sparklyr in Data Science Experience to provide the tools data scientists are comfortable with to help them bring business-changing insights to their companies faster,” said Ritika Gunnar, vice president of Offering Management, IBM Analytics.


“At Cloudera, data science is one of the most popular use cases we see for Apache Spark as a core part of the Apache Hadoop ecosystem, yet the lack of a compelling R experience has limited data scientists’ access to available data and compute,” said Charles Zedlewski, vice president, Products at Cloudera. “We are excited to partner with RStudio to help bring sparklyr to the enterprise, so that data scientists and IT teams alike can get more value from their existing skills and infrastructure, all with the security, governance, and management our customers expect.”


“At H2O, we’ve been focused on bringing best-of-breed open source machine learning to data scientists working in R & Python. However, the lack of robust tooling in the R ecosystem for interfacing with Apache Spark has made it difficult for the R community to take advantage of the distributed data processing capabilities of Apache Spark.

We’re excited to work with RStudio to bring the ease of use of dplyr and the distributed machine learning algorithms from H2O’s Sparkling Water to the R community via the sparklyr & rsparkling packages”

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Source:: R News

EARL London 2016 revisited – Day 2

By Mango Blogger


(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

This year’s conference was a great opportunity to catch up with friends and meet new people who are eager to talk about their work with R. We enjoyed the sessions we attended and thought that the presenters were entertaining as well as interesting.
On the second day stream 1 was packed because of two very interesting speakers. Jerome Durussel from Catapult showed us how difficult it is to detect dives by a goalkeeper. His random forests did the job in the end and the results were presented with some cool animations. We wonder how well his algorithms work on strikers.

Jerome was followed by a much-anticipated presentation from Tim Paulden. Tim’s talk explored several methods for predicting the result of a football match via mathematical modelling. He began with a very simple model to predict the result of a match between the London rivals Arsenal and Fulham from several seasons ago. The model was based on the mean number of goals scored by both teams over the (then) current season. Under this model Arsenal, with a mean of 2.1, far exceeded Fulham’s meagre 1.1, and so were the hot favourites to win the match. The model was clearly oversimplified, so Tim ploughed on and introduced several other, more complex, models that were much more accurate. We found Tim’s talk to be a real insight into how data science can be applied to the sporting world; we just can’t wait to hear him again next year.
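The simple "mean goals" model described above can be sketched in a few lines of R. This is purely our illustration, not Tim's actual code: assume each team's goals follow an independent Poisson distribution with the season-mean rate, and estimate outcome probabilities by simulation:

```r
# Hypothetical sketch of the 'mean goals' model: independent Poisson goal
# counts with the season means quoted in the talk (2.1 vs 1.1).
set.seed(1)
n <- 100000
arsenal <- rpois(n, lambda = 2.1)  # simulated Arsenal goal counts
fulham  <- rpois(n, lambda = 1.1)  # simulated Fulham goal counts

c(arsenal_win = mean(arsenal > fulham),
  draw        = mean(arsenal == fulham),
  fulham_win  = mean(arsenal < fulham))
```

With these rates the simulation makes Arsenal the clear favourites, matching the talk's framing, though as noted the real models are considerably more sophisticated.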


Stream 1 ended with a really interesting talk from John Burn-Murdoch about how the FT are using ggplot2 to quickly test out possible graphics for publication. What amazed us most was that they are producing around 50 graphics a day for publication, so you can immediately see the benefit of working with ggplot2. The ease of the layering system makes it so simple to prototype graphics, and it’s eye-opening to hear that one of the most well-known producers of visualisations, the Financial Times, is making the most of it.

More generally, we were also looking forward to checking out talks on how to deploy your analysis into production in a developer-friendly way. Louis Vine’s experiences with data scientists talking to developers at Funding Circle were a valuable lesson in how not to “throw it over the wall”. Louis had to strike a balance between demanding software best practice and giving the data scientists the freedom to try new things. Ben Downe at BCA had solved the problem a different way using AzureML’s built-in web API services. Again, this allowed a data scientist to provide fully documented, production-ready APIs that you can simply hand over to the development team. Last but not least, the always brilliant Vincent Warmerdam updated us on the latest advancements with H2O on Spark (Sparkling Water) and how this can be used to generate a .jar file that, once again, can be handed over neatly to a development team, in a format they understand, ready for deployment.

All presentations can be viewed here. We hope you enjoyed this year’s conference as much as we did and hope to see you again next year.


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Source:: R News

Tagged NA values and labelled data #rstats

By Daniel

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

sjmisc-package: Working with labelled data

A major update of my sjmisc-package was just released on CRAN. A major change (see the changelog for all changes) is support for the latest release of the haven-package, a package to import and export SPSS, SAS or Stata files.

The sjmisc-package mainly addresses three domains:

  • reading and writing data between other statistical packages and R
  • functions to make working with labelled data easier
  • frequently applied recoding and variable transformation tasks, also with support for labelled data

In this post, I want to introduce the topic of labelled data and give some examples of what the sjmisc-package can do, with a special focus on tagged NA values.

Introduction to Labelled Data

Labelled data (or labelled vectors) is a common data structure in other statistical environments, used to store meta-information about variables, like variable names, value labels or multiple defined missing values. Labelled data not only extends R’s capabilities to deal with proper value and variable labels, but also facilitates the representation of different types of missing values, as in other statistical software packages. Typically, R cannot represent multiple declared missing values the way ‘SPSS’ or ‘SAS’ can with their regular missing values. However, Hadley Wickham’s haven package introduced tagged_na values, which can do this. Tagged NA’s work exactly like regular R missing values, except that they store one additional byte of information: a tag, which is usually a letter (“a” to “z”) or a character digit (“0” to “9”). This makes it possible to distinguish different kinds of missing values.

library(haven)

x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, 
    "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), 
    "Not home" = tagged_na("z")))

# <Labelled double>
#  [1]     1     2     3 NA(a) 
#  [5] NA(c) NA(z)     4     3     2     1
# Labels:
#  value        label
#      1    Agreement
#      4 Disagreement
#  NA(c)        First
#  NA(a)      Refused
#  NA(z)     Not home
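As an aside (not covered above), haven also provides helpers for working with these tags: is_tagged_na() tests whether a value is a tagged missing, and na_tag() extracts the tag itself:

```r
library(haven)

# a small vector mixing values, tagged NAs and a regular NA
y <- c(1, tagged_na("a"), 2, tagged_na("z"), NA)

is_tagged_na(y)  # TRUE only for the two tagged missings
na_tag(y)        # the tag ("a", "z") where present, NA otherwise
```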

Value Labels

Getting value labels

The get_labels() method is a generic method to return the value labels of a vector or data frame.

get_labels(efc$e42dep)
# [1] "independent"          "slightly dependent"
# [3] "moderately dependent" "severely dependent"

You can prefix the value labels with the associated values, or return them as a named vector, using the include.values argument.

get_labels(efc$e42dep, include.values = "p")
# [1] "[1] independent"          "[2] slightly dependent"  
# [3] "[3] moderately dependent" "[4] severely dependent"

get_labels() also returns “labels” of factors, even if the factor has no label attributes. This is useful if you need a generic method in your functions to get value labels, either for labelled data or for factors.

x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x)
# [1] "hi"  "low" "mid"

Tagged missing values can also be included in the output, using the drop.na = FALSE argument.

# get labels, including tagged NA values
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, 
    "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), 
    "Not home" = tagged_na("z")))

get_labels(x, include.values = "n", drop.na = FALSE)
#              1              4          
#    "Agreement" "Disagreement"
#      NA(c)          NA(a)          NA(z) 
#    "First"      "Refused"     "Not home"

Getting labelled values

The get_values() method returns the values for labelled values (i.e. values that have an associated label). We still use the vector x from the above examples.

print(x)
#  [1]  1  2  3 NA NA NA  4  3  2  1
# attr(,"labels")
#    Agreement Disagreement        First      Refused     Not home 
#            1            4           NA           NA           NA

get_values(x)
# [1] "1"     "4"     "NA(c)" "NA(a)" "NA(z)"

With the drop.na argument you can omit those values from the return values that are defined as missing.

get_values(x, drop.na = TRUE)
# [1] 1 4

Setting value labels

With set_labels() you can add label attributes to any vector. You can either return a new labelled vector, or label an existing vector.

x <- sample(1:4, 20, replace = TRUE)

# return new labelled vector
x <- set_labels(x, c("very low", "low", "mid", "hi"))
#  [1] 4 2 1 2 4 1 3 1 3 1 1 4 2 4 2 4 3 4 4 4
# attr(,"labels")
# very low      low      mid       hi 
#        1        2        3        4

# label existing vector
set_labels(x) <- c("too low", "less low", 
                   "mid", "very hi")
#  [1] 4 2 1 2 4 1 3 1 3 1 1 4 2 4 2 4 3 4 4 4
# attr(,"labels")
#  too low less low      mid  very hi 
#        1        2        3        4

To add explicit labels for values, use a named vector of labels as argument.

x <- c(1, 2, 3, 2, 4, 5)
x <- set_labels(x, c("strongly agree" = 1, 
                     "totally disagree" = 4, 
                     "refused" = 5,
                     "missing" = 9))
# [1] 1 2 3 2 4 5
# attr(,"labels")
#   strongly agree totally disagree          refused          missing 
#                1                4                5                9

Missing Values

Defining missing values

set_na() converts values of a vector or of multiple vectors in a data frame into tagged NAs, which means that these missing values get an information tag and a value label (which is, by default, the former value that was converted to NA). You can either return a new vector/data frame, or set NAs into an existing vector/data frame.

x <- sample(1:8, 100, replace = TRUE)
table(x)
# x
#  1  2  3  4  5  6  7  8 
# 10 12  6 13 12 17 18 12

set_na(x) <- c(1, 8)
#   [1]  2  6  6 NA  7  4 NA  3  6 NA  4 NA  5  4  2  5  2  2  3  2  5  6 NA
#  [24]  7  6  4  6  3  4 NA NA  5 NA  6 NA  7  7  7  6  6 NA  7  2  2 NA  6
#  [47]  4  6  5  7  5 NA NA  7  4  7  4  3  7  2  6  5  5  7  2 NA  6  6 NA
#  [70]  2  5  7  4  7 NA  2  7  7  7  4  6  3 NA  5  5 NA  7  4  3  4 NA  6
#  [93]  4  2 NA NA  6  7  5 NA
# attr(,"labels")
#  1  8 
# NA NA

table(x, useNA = "always")
# x
#    2    3    4    5    6    7 <NA> 
#   12    6   13   12   17   18   22

print_tagged_na(x)
#   [1]     2     6     6 NA(8)     7     4 NA(1)     3     6 NA(1)     4
#  [12] NA(8)     5     4     2     5     2     2     3     2     5     6
#  [23] NA(8)     7     6     4     6     3     4 NA(1) NA(1)     5 NA(1)
#  [34]     6 NA(8)     7     7     7     6     6 NA(1)     7     2     2
#  [45] NA(1)     6     4     6     5     7     5 NA(8) NA(8)     7     4
#  [56]     7     4     3     7     2     6     5     5     7     2 NA(8)
#  [67]     6     6 NA(8)     2     5     7     4     7 NA(1)     2     7
#  [78]     7     7     4     6     3 NA(1)     5     5 NA(8)     7     4
#  [89]     3     4 NA(8)     6     4     2 NA(8) NA(8)     6     7     5
# [100] NA(1)

x <- factor(c("a", "b", "c"))
# [1] a b c
# Levels: a b c

set_na(x) <- "b" 
# [1] a    <NA> c   
# attr(,"labels")
#  b 
# NA 
# Levels: a b c

Getting missing values

The get_na() function returns all tagged NA values.

set_na(efc$c87cop6) <- 3

get_na(efc$c87cop6)
# Often 
#    NA

get_na(efc$c87cop6, as.tag = TRUE)
#   Often 
# "NA(3)"

Replacing specific NA with values

While set_na() allows you to replace values with specific tagged NA’s, replace_na() allows you to replace either all NA values of a vector or specific tagged NA values with a non-NA value.

str(efc$c84cop3)
#  atomic [1:908] 2 3 1 3 1 3 4 2 3 1 ...
#  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#  - attr(*, "labels")= Named num [1:4] 1 2 3 4
#   ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"

set_na(efc$c84cop3) <- c(2, 3)
str(efc$c84cop3)
#  atomic [1:908] NA NA 1 NA 1 NA 4 NA NA 1 ...
#  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#  - attr(*, "labels")= Named num [1:4] 1 NA NA 4
#   ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"

get_na(efc$c84cop3)
# Sometimes     Often 
#        NA        NA

replace_na(efc$c84cop3, na.label = "restored NA", tagged.na = "2") <- 2
str(efc$c84cop3)
#  atomic [1:908] 2 NA 1 NA 1 NA 4 2 NA 1 ...
#  - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#  - attr(*, "labels")= Named num [1:4] 1 2 4 NA
#   ..- attr(*, "names")= chr [1:4] "Never" "restored NA" "Always" "Often"

get_na(efc$c84cop3)
# Often 
#    NA

get_labels(efc$c84cop3, include.values = "p")
# [1] "[1] Never"       "[2] restored NA" "[4] Always"


Labelled data vastly extends R’s capabilities to deal with value and variable labels. The sjmisc-package offers a collection of convenient functions for working with labelled data, which might be especially interesting for users coming to R from other statistical packages like SPSS. Packages like sjPlot take advantage of labelled data, making it easy to produce well-annotated plots (see these vignettes for various examples). A slightly more comprehensive introduction to the sjmisc-package can be found here.


To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.

Source:: R News