colourpicker package v1.0: You can now select semi-transparent colours in R (& more!)

By Dean Attali's R Blog

new colour input with alpha

(This article was first published on Dean Attali’s R Blog, and kindly contributed to R-bloggers)

For those who aren’t familiar with the colourpicker package, it provides a colour picker for R that can be used in Shiny, as well as other related tools. Today it’s leaving behind its 0.x days and moving on to version 1.0!

colourpicker has gone through a few major milestones since its inception. It began as merely a colour selector input in an unrelated package (shinyjs), simply because I didn’t think a colour picker input warranted its own package. After gaining a gadget and an RStudio addin (as well as some popularity!), it graduated to become its own package. Earlier this year, the Plot Helper tool was added. And now colourpicker is taking its next big step – an upgrade to version 1.0.

Due credit

Before describing the amazing new features, I have to give credit to David Griswold who made substantial contributions to this update.

New feature #1: Transparent colours

The most important new feature is the new alpha selector that lets you select a transparency value for any colour. Instead of being able to only select opaque colours (such as “green”), you can now select “green” with 50% transparency for example. To use this feature, use the allowTransparent = TRUE argument in the colourInput() function. Here is a screenshot of the new colour input with an alpha bar:
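In code, enabling the alpha selector is a single argument; here is a minimal sketch (the input ID and label are arbitrary):

```r
library(colourpicker)

# Colour input with the alpha bar enabled, so semi-transparent
# colours such as 50%-transparent green can be selected
colourInput(
  inputId = "col", label = "Select colour",
  value = "green", allowTransparent = TRUE
)
```

With transparency enabled, the returned value includes the alpha channel (e.g. an eight-digit HEX code such as "#00FF0080").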

While I’m very excited about this transparency feature, it also caused me to introduce my first big breaking change, unfortunately. The colour input previously had an allowTransparent argument which, when set to TRUE, added a checkbox inside the input:

There was also a parameter that could control the text inside the checkbox. This was a useful, albeit fairly awkward, feature. After much consideration and discussion with many people, I decided against supporting both the old and new features simultaneously, and instead overrode the old allowTransparent behaviour with the current form. Even though it is technically a breaking change, it was welcomed by most people I asked who had used that checkbox 🙂

New feature #2: Flexible colour specification

Previously, colours supplied to the colour input had to be specified by either using a colour name (such as blue), or using a HEX code (such as #0000FF or #00F).

In version 1.0, you can also specify colours using RGB codes (such as rgb(0, 0, 255)) or HSL codes (such as hsl(240, 100, 50)).

If your colour input uses the transparency feature described above, then you can also specify the colour to be transparent. All four methods of specifying colours support transparent colours using the alpha channel. For example, transparent, #0000FF80, rgba(0, 0, 255, 0.5) and hsla(240, 100, 50, 0.5) are all legal colour values for colourpicker.
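To make that concrete, here is a sketch showing each specification form as the initial value of an input (the IDs and labels are arbitrary):

```r
library(colourpicker)

# Any of these values is a legal starting colour; the alpha forms
# require allowTransparent = TRUE
colourInput("col1", "Named colour", value = "blue")
colourInput("col2", "HEX code",     value = "#0000FF")
colourInput("col3", "RGB code",     value = "rgb(0, 0, 255)")
colourInput("col4", "HSL code",     value = "hsl(240, 100, 50)")
colourInput("col5", "RGBA code",    value = "rgba(0, 0, 255, 0.5)",
            allowTransparent = TRUE)
```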

New feature #3: Type colour directly into input box

This one is less impressive, but still a very convenient and useful feature.

Previously, the only way to pick a colour was to click on a colour input box with the mouse and select a colour by clicking on the desired colour in the palette. There is now another way to change the value of a colour input if you know the precise colour that you want to choose: you can type in the value of a colour directly into the input box. For example, you can click on the colour input and literally type the word “blue”, and the colour blue will get selected. This will work with any of the four methods for specifying colours that are mentioned in Feature 2.

Existing features

If you’ve never used the colour input in R before, here is a quick rundown of the features it has.

Simple and familiar

Using colourInput is trivial if you’ve used Shiny, and it’s as easy to use as any other input control. It was implemented to very closely mimic all other Shiny inputs so that using it will feel familiar. You can add a simple colour input to your Shiny app with colourInput("col", "Select colour", value = "red"). The return value from a colourInput is an uppercase HEX colour, so in the previous example the value of input$col would be #FF0000.
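As a sketch, a complete Shiny app using the input could look like this (the text output binding is just for illustration):

```r
library(shiny)
library(colourpicker)

ui <- fluidPage(
  colourInput("col", "Select colour", value = "red"),
  textOutput("value")
)

server <- function(input, output, session) {
  # input$col is an uppercase HEX string, e.g. "#FF0000"
  output$value <- renderText(input$col)
}

shinyApp(ui, server)
```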

Retrieving the colour names

If you use the returnName = TRUE parameter, then the return value will be a colour name instead of a HEX value, when possible. For example, if the chosen colour is red, the return value will be red instead of #FF0000. For any colour that does not have a standard name, its HEX value will be returned.
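In code (hypothetical ID and label):

```r
library(colourpicker)

# Returns "red" instead of "#FF0000" when the chosen colour has a
# standard name; otherwise the HEX value is returned
colourInput("col", "Select colour", value = "red", returnName = TRUE)
```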

Limited colour selection

If you want to only allow the user to select a colour from a specific list of colours, rather than any possible colour, you can use the palette = "limited" parameter. By default, the limited palette will contain 40 common colours, but you can supply your own list of colours using the allowedCols parameter. Here is an image of the default limited colour palette.

limited palette
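A sketch of both variants (the custom colour list here is made up):

```r
library(colourpicker)

# Default limited palette of 40 common colours
colourInput("col1", "Select colour", palette = "limited")

# Restrict the user to an explicit set of allowed colours
colourInput("col2", "Select colour", value = "red",
            palette = "limited",
            allowedCols = c("red", "green", "blue", "#FFA500"))
```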

How the chosen colour is shown inside the input box

By default, the colour input’s background will match the selected colour and the text inside the input field will be the colour’s HEX value. If that’s too much for you, you can customize the input with the showColour parameter to either only show the text or only show the background colour.

Here is what a colour input with each of the possible values for showColour looks like:
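In code, the three settings look like this (IDs and labels are arbitrary):

```r
library(colourpicker)

# "both" is the default: coloured background plus the HEX text
colourInput("col1", "Both (default)",  value = "red", showColour = "both")
colourInput("col2", "Text only",       value = "red", showColour = "text")
colourInput("col3", "Background only", value = "red", showColour = "background")
```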


Works on any device

If you’re worried that someone viewing your Shiny app on a phone won’t be able to use this input properly – don’t you worry. I haven’t quite checked every single device out there, but I did spend extra time making sure the colour selection JavaScript works on most devices I could think of. colourInput() will work fine in Shiny apps that are viewed on Android cell phones, iPhones, iPads, and even Internet Explorer 8+.

As always, any feedback is more than welcomed!

To leave a comment for the author, please follow the link and comment on their blog: Dean Attali’s R Blog. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more…

Source:: R News

My interview with ROpenSci

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The ROpenSci team has started publishing a new series of interviews with the goal of “demystifying the creative and development processes of R community members”. I had the great pleasure of being interviewed by Kelly O’Briant earlier this year, and the interview was published on Friday.

Thanks for being a great interviewer, Kelly! I’m looking forward to hearing from other R community members as the rest of the series is published.

ROpenSci blog: .rprofile: David Smith


A Newbie’s Install of Keras & Tensorflow on Windows 10 with R

By Nicole Radziwill

(This article was first published on R – Quality and Innovation, and kindly contributed to R-bloggers)

This weekend, I decided it was time: I was going to update my Python environment and get Keras and TensorFlow installed so I could start doing tutorials (particularly for deep learning) using R. Although I used to be a systems administrator (about 20 years ago), I don’t do much installing or configuring, so I guess that’s why I’ve put this task off for so long. And it wasn’t unwarranted: it took me the whole weekend to get the install working. Here are the steps I used to get things running on Windows 10, leveraging clues in about 15 different online resources — and yes (I found out the hard way), the order of operations is very important. I do not claim to have nailed the order of operations here, but this is definitely one that works.

Step 0: I had already installed the tensorflow and keras packages within R, and had been wondering why they wouldn’t work. “Of course!” I finally realized, a few weeks later. “I don’t have Python on this machine, and both of these packages depend on a Python install.” Turns out they also depend on the reticulate package, so run install.packages("reticulate") if you have not already.

Step 1: Installed Anaconda3 to C:/Users/User/Anaconda3.

Step 2: Opened “Anaconda Prompt” from the Windows Start Menu. First, to create an “environment” called “tf-keras” specifically for use with tensorflow and keras in R, with a 64-bit version of Python 3.5, I typed:

conda create -n tf-keras python=3.5 anaconda

… and then after it was done, I did this:

activate tf-keras

Step 3: Install TensorFlow from the Anaconda prompt. Following the installation instructions, I typed this:

pip install --ignore-installed --upgrade

I didn’t know whether this worked or not — it gave me an error saying that it “can not import html5lib”, so I did this next:

conda install -c conda-forge html5lib

I tried to run the pip command again, but there was an error, so I consulted another resource, which told me to do this:

pip install --ignore-installed --upgrade tensorflow

This failed, and told me that the pip command had an error. I searched the web for an alternative to that command, and found this, which worked!!

conda install -c conda-forge tensorflow

Step 4: From inside the Anaconda prompt, I opened python by typing “python”. Next, I did this, line by line:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

It said “b’Hello, TensorFlow!’” which I believe means it works. (Ctrl-Z then Enter will then get you out of Python and back to the Anaconda prompt.) This means that my Python installation of TensorFlow was functional.

Step 5: Install Keras. I tried this:

pip install keras

…but I got the same error message that pip could not be installed or found or imported or something. So I tried this, which seemed to work:

conda install -c conda-forge keras

Step 6: Load them up from within R. First, I opened a 64-bit version of R v3.4.1 and did this:
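The snippet itself didn’t survive the formatting; presumably it amounted to loading the packages and pointing them at the new conda environment, roughly like this (the use_condaenv() call is my assumption, not necessarily what was actually run):

```r
library(reticulate)
library(tensorflow)
library(keras)

# Point reticulate at the "tf-keras" conda environment created in Step 2
use_condaenv("tf-keras")
```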


It took a couple minutes but it seemed to work.


Step 7: Try a tutorial! I decided to go with one that guides you through developing a classifier for the MNIST handwritten digit database — a very popular data science resource. I loaded up my dataset and checked to make sure it loaded properly:

List of 2
 $ train:List of 2
 ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
 ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
 $ test :List of 2
 ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
 ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...

Step 8: Here is the code I used to prepare the data and create the neural network model. This didn't take long to run at all.

model <- keras_model_sequential() %>%
  layer_dense(units = 784, input_shape = 784) %>%
  layer_activation(activation = 'relu') %>%
  layer_dense(units = 10) %>%
  layer_activation(activation = 'softmax')

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
Step 9: Train the network. THIS TOOK ABOUT 12 MINUTES on a powerful machine with 64GB high-performance RAM. It looks like it worked, but I don’t know how to find or evaluate the results yet.

model %>% fit(train_x, train_y, epochs = 100, batch_size = 128)
loss_and_metrics <- model %>% evaluate(test_x, test_y, batch_size = 128)

Layer (type)                     Output Shape        Param #
============================================================
dense_1 (Dense)                  (None, 784)         615440
dropout_1 (Dropout)              (None, 784)         0
activation_1 (Activation)        (None, 784)         0
dense_2 (Dense)                  (None, 10)          7850
activation_2 (Activation)        (None, 10)          0
============================================================
Total params: 623,290
Trainable params: 623,290
Non-trainable params: 0

Step 10: Next, I wanted to try another tutorial. Turns out this one uses the kerasR package, not the keras package:

dim(X_train)
[1] 60000 28 28


To check and see what's in any individual image, type:


At this point, the to_categorical function stopped working. I was supposed to do this but got an error:


So I did this instead:


But then I tried this, and it was clear I was stuck again — it wouldn't work:

mod$add(Dense(units = 512, input_shape = dim(X_train)[2]))

Stack Overflow recommended grabbing a version of kerasR from GitHub, so that’s what I did next:


I got an error in R which told me to go to the Anaconda prompt (which I did), and type this:

conda install m2w64-toolchain

Then I went back into R and this worked fantastically:


mod$add would still not work though, and this is where my patience expired for the evening. I'm pretty happy though — Python is up, keras and tensorflow are up in Python, all three (keras, tensorflow, and kerasR) are up in R, and some tutorials seem to be working.


Citibike Business Opportunity: Advertising

By Josh Yoon

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)


It is hard to wander around New York City without seeing rows of dozens of bright blue Citibikes planted in the busiest nooks and crannies of the city. These bikes belong to Citibike, a bike-sharing program that allows users to conveniently rent a bike to travel to their destinations without having to worry about the hassles of parking and locking their own bicycle. Citibike has quickly become the preferred mode of transportation for many New Yorkers who are tired of the laundry list of issues with public transportation and are looking to get some fresh air as they travel around the city.

The premise of the program is quite simple: you can choose between an annual pass for year-round access or a 3- or 7-day pass as a more temporary option. Pass holders can pick up a bike from a station near them, ride, and park the bike at the station closest to their final destination. With over 622 stations in Manhattan, Brooklyn, and Queens, New Yorkers can easily find a convenient location nearby.

As a New Yorker who sees Citibike stations on what seems like just about every other corner, I came to the interesting realization that Citibike has a great opportunity to sell advertising space at the stations located in prime spots with mass exposure to pedestrian and vehicular traffic. For my project, I built an app that organizes existing Citibike user data to help guide sponsors looking to purchase advertising space with Citibike.

Data Summary

Citibike’s public data set is a fantastic source of information because every single ride is documented and released in the data set. Across the 12-month span from Aug 2016 to July 2017, there were over 15 million observations of Citibike rides. Due to timing and the limitations of R, I will be focusing on the 1.5 million observations for the month of July 2017 (the newest month released by Citibike). Citibike’s raw data set includes the following categories:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
  • Gender (Zero=unknown; 1=male; 2=female)
  • Year of Birth

For the sake of my project which attempts to better understand and observe trends in the Citibike customer, I will be focusing on the following categories:

  • Gender
  • Age
  • Station Locations
  • Number of Visits To Each Station(Both Start and Finish)
  • Date and Time of Citibike Ridership

Exploration And Visualization

Left: Gender Breakout Chart Right: Age Group Breakout Chart

Upon simple analysis of Citibike users, you can find a couple of interesting facts. First, male riders outnumber female riders by a 2:1 ratio. The cause of the discrepancy may be attributable to male vs. female transportation preferences, but cannot be accurately determined within the scope of this project. Next, you will notice that two age groups (25 to 30 and 31 to 36) account for just under 50% of all Citibike riders. This speaks to the popularity of Citibike among young professionals. Understanding the population of Citibike users is extremely important in considering advertising opportunities with Citibike.
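The aggregation behind these breakouts is straightforward; here is a toy sketch in base R (the data frame and column names are made up, not Citibike’s actual headers):

```r
# Toy ride data standing in for the Citibike extract
rides <- data.frame(
  station = c("A", "B", "B", "B", "C", "C"),
  gender  = c(2, 2, 2, 1, 2, 1)   # Zero = unknown; 1 = male; 2 = female
)

# Gender breakout as proportions of all rides
round(prop.table(table(rides$gender)), 2)

# Visits per station for female riders, sorted to get a "top stations" list
female <- rides[rides$gender == 2, ]
counts <- sort(table(female$station), decreasing = TRUE)
head(counts, 10)
```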

Shiny App Layout

The premise of the app I have built is for potential sponsors looking to buy advertising space with Citibike to use Citibike’s user data to better strategize where and when to invest in advertising. As a result, the Shiny app lets the user interact with the data to gain valuable information. My app consists of three tabs, outlined below:

1. Gender Map

In this tab, the user (a potential sponsor) is able to select the gender they would like to filter by in the top left box. Once the user selects a filter, the map on the right automatically adjusts to show the density of Citibike users by the selected criteria. For example, if the “Female” button is selected, the map will shade the neighborhoods in NYC (represented by the colored polygons) by how many female riders are active, with darker colors representing a higher density of female riders. In addition, ten markers are shown on the map representing the locations of the Top 10 Most Visited Bike Stations for the chosen gender.

Top: Gender Map (Both Males and Females) Bottom: Top Ten Stations (Both Genders)

2. Age Range Map

The second tab in the app mirrors the functionality of the Gender Map tab, only now the user is able to visualize New York City according to age range. Once again, the user selects the age range he/she would like to study and the map shows the density of Citibike riders in the selected age range in each neighborhood. The Top 10 Most Visited Citibike Stations are displayed for each age range through markers.

Top: Age Range Map (All Ages) Bottom: Top Ten Stations (All Genders)

3. Date & Time

In the final tab, the user is able to select both gender as well as the age range he/she would like to examine. The charts below output the hourly activity of the selected gender/age range combination as well as activity by day of the week.

Example Hourly and Weekday ridership visualization in the “Time and Date” Tab

Case Study Example

In order to better understand a real-world application of the app, it is useful to examine a sample case study:

What if an up-and-coming makeup brand, Company X, looking to target younger female customers, wanted to advertise with Citibike? Using the gender map, Company X can see which stations/neighborhoods have the most exposure to female riders. Similarly, selecting the 19-24 age range in the Age Range tab will generate a map showing which stations/neighborhoods have the most exposure to 19-24 year old Citibike users. Finally, the Date and Time tab can be filtered to show the time of day and day of the week when female riders are most active! Using the Top 10 stations for females and for the 19-24 age group, we can deduce the top 3 stations that Company X should target.

Top: Map Filtered For Females Right: Map Filtered For 18 to 25 Age Group

Hour and Day Activity of Female Riders in 18-25 year old range

Left: Top 10 Stations: Females Right: Top 10 Stations: 18 to 25 Age Group

Best Stations:

1) West St & Chambers St

2) Broadway & E 22nd St

3) Broadway & E 14th St

Best Hour:


Best Day:


Future Application for the App

As shown in the case study example, the app takes Citibike ridership data and organizes the information in a way that can be extremely useful for advertising. Although the app itself is somewhat narrow in scope (limited to Citibike users and the NYC region), it is a great example of using data sets to glean information about customer behavior. With simple analysis and visualization techniques, the app is able to home in on the locations and times that will be most advantageous for marketing purposes. As the technology for acquiring customer-centric data becomes more and more powerful (e.g., credit card/shopping cart analysis, viewership analysis, email subscription analysis), it will become extremely important for companies to take massive amounts of data and translate them into useful contributions to business strategy and innovation.

The post Citibike Business Opportunity: Advertising appeared first on NYC Data Science Academy Blog.


Can we use B-splines to generate non-linear data?

By Keith Goldfeld

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I’m exploring the idea of adding a function or set of functions to the simstudy package that would make it possible to easily generate non-linear data. One way to do this would be using B-splines. Typically, one uses splines to fit a curve to data, but I thought it might be useful to switch things around a bit to use the underlying splines to generate data. This would facilitate exploring models where we know the assumption of linearity is violated. It would also make it easy to explore spline methods, because as with any other simulated data set, we would know the underlying data generating process.


A B-spline is a linear combination of a set of basis functions that are determined by the number and location of specified knots or cut-points, as well as the (polynomial) degree of curvature. A degree of one implies a set of straight lines, a degree of two implies a quadratic curve, a degree of three a cubic curve, etc. This nice quick intro provides much more insight into B-splines than I can give here. Or if you want even more detail, check out this book. It is a very rich topic.

Within a cut-point region, the sum of the basis functions always equals 1. This is easy to see by looking at a plot of the basis functions, several of which are provided below. The definition and shape of the basis functions do not in any way depend on the data, only on the degree and cut-points. Of course, these functions can be added together in infinitely many ways using weights. If one is trying to fit a B-spline to data, those weights can be estimated using regression models.

Splines in R

The bs function in the splines package returns values of these basis functions based on the specification of knots and degree of curvature. I wrote a wrapper function that uses the bs function to generate the basis functions, and then does a linear transformation of these functions by multiplying by the vector parameter theta, which is just a vector of coefficients. The linear combination at each value of x (the support of the basis functions) generates a value (which I call y.spline) on the desired curve. The wrapper returns a list of objects, including a data.table that contains x and y.spline, as well as the basis functions and knots.
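The wrapper itself isn’t shown in this excerpt, but its core is small; here is a sketch (the function name genSpline and its argument names are my own, not necessarily those used in the post or in simstudy):

```r
library(splines)
library(data.table)

genSpline <- function(x, knots, degree, theta) {
  # Basis matrix: one row per value of x, one column per basis function
  basis <- bs(x = x, knots = knots, degree = degree,
              Boundary.knots = c(0, 1), intercept = TRUE)

  # The spline curve is the linear combination of the basis functions
  y.spline <- as.vector(basis %*% theta)

  list(dt = data.table(x = x, y.spline = y.spline),
       basis = basis, knots = knots)
}

# Linear spline with quartile cut-points; theta taken from the example below
x <- seq(0, 1, length.out = 1000)
sdata <- genSpline(x, knots = c(0.25, 0.50, 0.75), degree = 1,
                   theta = c(0.2, 0.3, 0.8, 0.2, 0.1))
```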



I’ve also written two functions that make it easy to print the basis function and the spline curve. This will enable us to look at a variety of splines.


Linear spline with quartile cut-points

Here is a simple linear spline that has four regions defined by three cut-points, and the slope of the line in each region varies. The first value of theta is essentially the intercept. When you look at the basis plot, you will see that any single region has two “active” basis functions (represented by two colors), the other functions are all 0 in that region. The slope of the line in each is determined by the relevant values of theta. It is probably just easier to take a look:


For this example, I am printing out the basis function for the first few values of x.

round( head(cbind(x = sdata$dt$x, sdata$basis)), 4 )
##          x     1     2 3 4 5
## [1,] 0.000 1.000 0.000 0 0 0
## [2,] 0.001 0.996 0.004 0 0 0
## [3,] 0.002 0.992 0.008 0 0 0
## [4,] 0.003 0.988 0.012 0 0 0
## [5,] 0.004 0.984 0.016 0 0 0
## [6,] 0.005 0.980 0.020 0 0 0


Same knots (cut-points) but different theta (coefficients)

If we use the same knot and degree specification but change the vector theta, we change the slope of the lines in each of the four regions:

theta = c(0.2, 0.3, 0.8, 0.2, 0.1)


Quadratic spline with quartile cut-points

The basis functions get a little more elaborate with a quadratic spline. With this added degree, we get an additional basis function in each region, so you should see 3 colors instead of 2. The resulting spline is parabolic in each region, but with a different shape, each of which is determined by theta.



Quadratic spline with two cut-points (three regions)



Cubic spline with two cut-points (three regions)

And in this last example, we generate basis functions for a cubic spline that differs across three regions. The added curvature is apparent:



Generating new data from the underlying spline

It is a simple step to generate data from the spline. Each value on the line is treated as the mean, and “observed” data can be generated by adding variation. In this case, I use the normal distribution, but there is no reason other distributions can’t be used. I’m generating data based on the parameters in the previous example. And this time, the spline plot includes the randomly generated data:
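As a self-contained sketch of that generation step (the curve here is a stand-in parabola; in the post, the y.spline values produced by the wrapper play that role):

```r
library(data.table)

# Stand-in curve: each y.spline value is the mean of the outcome at x
spline.dt <- data.table(x = seq(0, 1, length.out = 1000))
spline.dt[, y.spline := x^2]

set.seed(5)

# Sample 250 points along the curve and add normal variation around the mean
dx <- spline.dt[sample(.N, 250)]
dx[, y := rnorm(.N, mean = y.spline, sd = 0.1)]
```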


Now that we have generated new data, why don’t we go ahead and fit a model to see if we can recover the coefficients specified in theta? We are interested in the relationship of x and y, but the relationship is not linear and changes across x. To estimate a model, we regress the outcome data y on the values of the basis functions that correspond to each value of x:

##         x1    x2    x3    x4    x5    x6     y
##   1: 0.063 0.557 0.343 0.036 0.000 0.000 0.443
##   2: 0.000 0.000 0.140 0.565 0.295 0.000 0.542
##   3: 0.000 0.000 0.003 0.079 0.495 0.424 0.634
##   4: 0.003 0.370 0.523 0.104 0.000 0.000 0.232
##   5: 0.322 0.553 0.120 0.005 0.000 0.000 0.269
##  ---                                          
## 246: 0.000 0.023 0.442 0.494 0.041 0.000 0.520
## 247: 0.613 0.356 0.031 0.001 0.000 0.000 0.440
## 248: 0.246 0.584 0.161 0.009 0.000 0.000 0.236
## 249: 0.000 0.000 0.014 0.207 0.597 0.182 0.505
## 250: 0.002 0.344 0.539 0.115 0.000 0.000 0.313
# fit the model - explicitly exclude intercept since x1 is intercept
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 - 1, data = dxbasis)

##   term   estimate  std.error true
## 1   x1 0.16465186 0.03619581  0.2
## 2   x2 0.57855125 0.03996219  0.6
## 3   x3 0.09093425 0.04267027  0.1
## 4   x4 0.94938718 0.04395370  0.9
## 5   x5 0.13579559 0.03805510  0.2
## 6   x6 0.85867619 0.03346704  0.8

Using the parameter estimates (estimated here using OLS), we can get predicted values and plot them to see how well we did:

# get the predicted values so we can plot

dxbasis[ , y.pred := predict(object = lmfit)]
dxbasis[ , x := sdata$dt$x]

# blue line represents predicted values

plot.spline(sdata, points = TRUE) + 
  geom_line(data=dxbasis, aes(x=x, y=y.pred), color = "blue", size = 1 )

The model did quite a good job, because we happened to assume the correct underlying assumptions of the spline. However, let’s say we suspected that the data were generated by a quadratic spline. We need to get the basis function assuming the same cut-points for the knots but now using a degree equal to two. Since a reduction in curvature reduces the number of basis functions by one, the linear model changes slightly. (Note that this model is not quite nested in the previous (cubic) model, because the values of the basis functions are different.)


If we compare the two models in terms of model fit, the cubic model does only slightly better in terms of R^2: 0.96 vs. 0.94. In this case, it probably wouldn’t be so obvious which model to use.


Why Use Docker with R? A DevOps Perspective

By Jeroen Ooms

opencpu logo

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

There have been several blog posts going around about why one would use Docker with R.
In this post I’ll try to add a DevOps point of view and explain how containerizing
R is used in the context of the OpenCPU system for building and deploying R servers.

Has anyone in the #rstats world written really well about the *why* of their use of Docker, as opposed to the the *how*?

— Jenny Bryan (@JennyBryan) September 29, 2017

1: Easy Development

The flagship of the OpenCPU system is the OpenCPU server:
a mature and powerful Linux stack for embedding R in systems and applications.
Because OpenCPU is completely open source, we can build and ship on DockerHub. A ready-to-go Linux server with both OpenCPU and RStudio
can be started using the following (use port 8004 or 80):

docker run -t -p 8004:8004 opencpu/rstudio

Now simply open http://localhost:8004/ocpu/ and
http://localhost:8004/rstudio/ in your browser!
Login via rstudio with user: opencpu (passwd: opencpu) to build or install apps.
See the readme for more info.

Docker makes it easy to get started with OpenCPU. The container gives you the full
flexibility of a Linux box, without the need to install anything on your system.
You can install packages or apps via RStudio Server, or use docker exec to get a
root shell on the running server:

# Lookup the container ID
docker ps

# Drop a shell
docker exec -i -t eec1cdae3228 /bin/bash

From the shell you can install additional software in the server, customize the apache2 httpd
config (auth, proxies, etc), tweak R options, optimize performance by preloading data or
packages, etc.

2: Shipping and Deployment via DockerHub

The most powerful use of Docker is shipping and deploying applications via DockerHub. To create a fully standalone
application container, simply use a standard opencpu image
and add your app.

For the purpose of this blog post I have wrapped up some of the example apps as docker containers by adding a very simple Dockerfile to each repository. For example the nabel app has a Dockerfile that contains the following:

FROM opencpu/base

RUN R -e 'devtools::install_github("rwebapps/nabel")'

It takes the standard opencpu/base
image and then installs the nabel app from the Github repository.
The result is a completely isolated, standalone application. The application can be
started by anyone using e.g.:

docker run -d -p 8004:8004 rwebapps/nabel

The -d flag daemonizes the container, and -p 8004:8004 maps the port.
Obviously you can tweak the Dockerfile to install whatever extra software or settings you need
for your application.

Containerized deployment shows the true power of Docker: it allows for shipping fully
self-contained applications that work out of the box, without installing any software or
relying on paid hosting services. If you do prefer professional hosting, there are
many companies that will gladly host docker applications for you on scalable infrastructure.

3: Cross-Platform Building

There is a third way Docker is used for OpenCPU. At each release we build
the opencpu-server installation package for half a dozen operating systems, which
get published on the archive server. This process has been fully automated using
DockerHub. The following images automatically build the entire stack from source:

DockerHub automatically rebuilds these images when a new release is published on Github.
All that is left to do is run a script
which pulls down the images and copies the opencpu-server binaries to the archive server.

To leave a comment for the author, please follow the link and comment on their blog: OpenCPU.

Source:: R News

Sales Analytics: How to Use Machine Learning to Predict and Optimize Product Backorders

By – Articles

Precision and Recall Vs Cutoff

(This article was first published on – Articles, and kindly contributed to R-bloggers)

Sales, customer service, supply chain and logistics, manufacturing… no matter which department you’re in, you more than likely care about backorders. Backorders are products that are temporarily out of stock but that a customer is still permitted to order against future inventory. Backorders are both good and bad: strong demand can drive them, but so can suboptimal planning. The problem is that when a product is not immediately available, customers may not have the luxury or patience to wait. This translates into lost sales and low customer satisfaction.

The good news is that machine learning (ML) can be used to identify products at risk of backorders. In this article we use the new H2O automated ML algorithm to implement Kaggle-quality predictions on the Kaggle dataset, “Can You Predict Product Backorders?”.

This is an advanced tutorial, which can be difficult for newer learners. We have good news: see our announcement below if you are interested in a machine learning course from Business Science. If you love this tutorial, please connect with us on social media to stay up on the latest Business Science news, events and information! Good luck and happy analyzing!

Challenges and Benefits

This tutorial covers two challenges. One challenge with this problem is dataset imbalance, when the majority class significantly outweighs the minority class. We implement a special technique for dealing with unbalanced data sets called SMOTE (synthetic minority over-sampling technique) that improves modeling accuracy and efficiency (win-win). The second challenge is optimizing for the business case. To do so, we explore cutoff (threshold) optimization which can be used to find the cutoff that maximizes expected profit.

One of the important visualizations (benefits to your understanding) is the effect of precision and recall on inventory strategy. A lever the analyst controls is the cutoff (threshold). The threshold has an effect on precision and recall. By selecting the optimal threshold we can maximize expected profit.

Quick Navigation

This is a longer analysis, and certain sections may be more interesting based on your role. Are you an Executive or a Data Scientist?

Others should just read the whole thing. 😉

Background on Backorder Prediction

What is a backorder?

According to Investopedia, a backorder is…

A customer order that has not been fulfilled. A backorder generally indicates that customer demand for a product or service exceeds a company’s capacity to supply it. Total backorders, also known as backlog, may be expressed in terms of units or dollar amount.

Why do we care?

Apple iPhone X


A backorder can be both a good thing and a bad thing. Consider a company like Apple that recently rolled out several new iPhone models including iPhone 8 and iPhone X. Because the initial supply cannot possibly keep up with the magnitude of demand, Apple is immediately in a “backorder” situation that requires customer service management.

On the positive side, backorders indicate healthy demand for a product. For Apple, this is a credit to strong brand recognition. Customers are likely to wait several months to get the latest model because of the new and innovative technology and incredible branding. Conversely, Apple’s competitors don’t have this luxury. If new models cannot be provided immediately, their customers cancel orders and go elsewhere.

Companies constantly strive for a balance in managing backorders. It’s a fine line: Too much supply increases inventory costs while too little supply increases the risk that customers may cancel orders. Why not just keep 100,000 of everything on the shelves at all time? For most retailers and manufacturers, this strategy will drive inventory costs through the roof considering they likely have a large number of SKUs (unique product IDs).

A predictive analytics program can identify which products are most likely to experience backorders giving the organization information and time to adjust. Machine learning can identify patterns related to backorders before customers order. Production can then adjust to minimize delays while customer service can provide accurate dates to keep customers informed and happy. The predictive analytics approach enables the maximum product to get in the hands of customers at the lowest cost to the organization. The result is a win-win: organizations increase sales while customers get to enjoy the products they demand.

Challenges with Predicting Backorders

One of the first steps in developing a backorder strategy is to understand which products are at risk of being backordered. Sounds easy right? Wrong. A number of factors can make this a challenging business problem.

Demand-Event Dependencies

Home Depot Snow Blower

Demand for existing products is constantly changing, often dependent on non-business data (holidays, weather, and other events). Consider demand for snow blowers at Home Depot. Geographically speaking, in the northern USA demand rises in winter and falls in spring. However, this demand is highly dependent on the level of snowfall. If snowfall is minimal, demand will be low regardless of it being winter. Therefore, Home Depot needs to know the likelihood of snowfall to best predict demand surges and shortfalls in order to optimize inventory.

Small Numbers (Anomalies)

If backorders are very infrequent but highly important, it can be very difficult to predict the minority class accurately because of the imbalance between backorders and non-backorders within the data set. Accuracy for the model will look great, but the actual predictive quality may be very poor. Consider the NULL error rate, or the rate of just picking the majority class: just picking “non-backorder” may be as accurate as, or more accurate than, the model. Special strategies exist for dealing with unbalanced data, and we’ll implement SMOTE.
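To make the NULL error rate concrete, here is a tiny base R sketch (the 99.3%/0.7% split mirrors this dataset, but the vector itself is illustrative):

```r
# Accuracy of the naive "always predict the majority class" model
null_accuracy <- function(y) {
  max(table(y)) / length(y)
}

# A 1,000-row toy vector with a 99.3% / 0.7% split similar to this dataset
y <- c(rep("No", 993), rep("Yes", 7))
null_accuracy(y)  # 0.993 -- any useful model must beat this baseline
```

A model reporting 99% accuracy on data like this is therefore no better than always guessing “No”.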

Time Dependency

Often this problem is viewed as a cross-sectional problem rather than as a temporal (or time-based) problem. The problem is time can play a critical role. Just because a product sold out this time last year, will it sell out again this year? Maybe. Maybe not. Time-based demand forecasting can be necessary to augment cross-sectional binary classification.

Case Study: Predicting Backorder Risk and Optimizing Profit

The Problem

A hypothetical manufacturer has a data set that identifies whether or not a backorder has occurred. The challenge is to accurately predict future backorder risk using predictive analytics and machine learning and then to identify the optimal strategy for inventorying products with high backorder risk.

The Data

The data comes from Kaggle’s Can You Predict Product Backorders? dataset. If you have a Kaggle account, you can download the data, which includes both a training and a test set. We’ll use the training set for developing our model and the test set for determining the final accuracy of the best model.

The data file contains the historical data for the 8 weeks prior to the week we are trying to predict. The data were taken as weekly snapshots at the start of each week. The target (or response) is the went_on_backorder variable. To model and predict the target, we’ll use the other features, which include:

  • sku – Random ID for the product
  • national_inv – Current inventory level for the part
  • lead_time – Transit time for product (if available)
  • in_transit_qty – Amount of product in transit from source
  • forecast_3_month – Forecast sales for the next 3 months
  • forecast_6_month – Forecast sales for the next 6 months
  • forecast_9_month – Forecast sales for the next 9 months
  • sales_1_month – Sales quantity for the prior 1 month time period
  • sales_3_month – Sales quantity for the prior 3 month time period
  • sales_6_month – Sales quantity for the prior 6 month time period
  • sales_9_month – Sales quantity for the prior 9 month time period
  • min_bank – Minimum recommended amount to stock
  • potential_issue – Source issue for part identified
  • pieces_past_due – Parts overdue from source
  • perf_6_month_avg – Source performance for prior 6 month period
  • perf_12_month_avg – Source performance for prior 12 month period
  • local_bo_qty – Amount of stock orders overdue
  • deck_risk – Part risk flag
  • oe_constraint – Part risk flag
  • ppap_risk – Part risk flag
  • stop_auto_buy – Part risk flag
  • rev_stop – Part risk flag
  • went_on_backorder – Product actually went on backorder. This is the target value.

Using Machine Learning to Predict Backorders


We’ll use the following libraries:

  • tidyquant: Used to quickly load the “tidyverse” (dplyr, tidyr, ggplot, etc) along with custom, business-report-friendly ggplot themes. Also great for time series analysis (not featured)
  • unbalanced: Contains various methods for working with unbalanced data. We’ll use the ubSMOTE() function.
  • h2o: Go-to package for implementing professional grade machine learning.

Note on H2O: You’ll want to install the latest stable release of H2O for R. If you run into issues in this post, you probably did not follow these steps:

  1. Go to the H2O download page
  2. Under H2O, select “Latest Stable Release”
  3. Click on the “Install in R” tab
  4. Follow the instructions exactly.

Load the libraries.

# You can install these from CRAN using install.packages()
library(tidyquant)  # Loads tidyverse and custom ggplot themes
library(unbalanced) # Methods for dealing with unbalanced data sets

# Follow instructions for latest stable release, don't just install from CRAN
library(h2o)        # Professional grade ML

Load Data

Use read_csv() to load the training and test data.

# Load training and test sets 
train_raw_df <- read_csv("data/Kaggle_Training_Dataset_v2.csv")
test_raw_df  <- read_csv("data/Kaggle_Test_Dataset_v2.csv")

Inspect Data

Inspecting the head and tail, we can get an idea of the data set. There are sporadic NA values in the “lead_time” column along with -99 values in the two supplier performance columns.

# Head
train_raw_df %>% head() %>% knitr::kable()
sku national_inv lead_time in_transit_qty forecast_3_month forecast_6_month forecast_9_month sales_1_month sales_3_month sales_6_month sales_9_month min_bank potential_issue pieces_past_due perf_6_month_avg perf_12_month_avg local_bo_qty deck_risk oe_constraint ppap_risk stop_auto_buy rev_stop went_on_backorder
1026827 0 NA 0 0 0 0 0 0 0 0 0 No 0 -99.00 -99.00 0 No No No Yes No No
1043384 2 9 0 0 0 0 0 0 0 0 0 No 0 0.99 0.99 0 No No No Yes No No
1043696 2 NA 0 0 0 0 0 0 0 0 0 No 0 -99.00 -99.00 0 Yes No No Yes No No
1043852 7 8 0 0 0 0 0 0 0 0 1 No 0 0.10 0.13 0 No No No Yes No No
1044048 8 NA 0 0 0 0 0 0 0 4 2 No 0 -99.00 -99.00 0 Yes No No Yes No No
1044198 13 8 0 0 0 0 0 0 0 0 0 No 0 0.82 0.87 0 No No No Yes No No

The tail shows another issue: the last row contains NA values.

# Tail
train_raw_df %>% tail() %>% knitr::kable()
sku national_inv lead_time in_transit_qty forecast_3_month forecast_6_month forecast_9_month sales_1_month sales_3_month sales_6_month sales_9_month min_bank potential_issue pieces_past_due perf_6_month_avg perf_12_month_avg local_bo_qty deck_risk oe_constraint ppap_risk stop_auto_buy rev_stop went_on_backorder
1407754 0 2 0 10 10 10 0 5 7 7 0 No 0 0.69 0.69 5 Yes No No Yes No No
1373987 -1 NA 0 5 7 9 1 3 3 8 0 No 0 -99.00 -99.00 1 No No No Yes No No
1524346 -1 9 0 7 9 11 0 8 11 12 0 No 0 0.86 0.84 1 Yes No No No No Yes
1439563 62 9 16 39 87 126 35 63 153 205 12 No 0 0.86 0.84 6 No No No Yes No No
1502009 19 4 0 0 0 0 2 7 12 20 1 No 0 0.73 0.78 1 No No No Yes No No

If we take a look at the distribution of the target, train_raw_df$went_on_backorder, we see that the data set is severely imbalanced. We’ll need a strategy to balance the data set if we want to get maximum model performance and efficiency.

# Unbalanced data set
train_raw_df$went_on_backorder %>% table() %>% prop.table()
## .
##          No         Yes 
## 0.993309279 0.006690721

We can also inspect missing values. Because both the train and test sets have missing values, we’ll need to come up with a strategy to deal with them.

# train set: Percentage of complete cases
train_raw_df %>% complete.cases() %>% sum() / nrow(train_raw_df)
## [1] 0.9402238

The test set has a similar distribution.

# test set: Percentage of complete cases
test_raw_df %>% complete.cases() %>% sum() / nrow(test_raw_df)
## [1] 0.939172

Finally, it’s also worth taking a glimpse of the data. We can quickly see:

  • We have some character columns with Yes/No values.
  • The “perf_” columns have -99 values, which reflect missing data.
glimpse(train_raw_df)
## Observations: 1,687,861
## Variables: 23
## $ sku                1026827, 1043384, 1043696, 1043852, 104...
## $ national_inv       0, 2, 2, 7, 8, 13, 1095, 6, 140, 4, 0, ...
## $ lead_time          NA, 9, NA, 8, NA, 8, NA, 2, NA, 8, 2, N...
## $ in_transit_qty     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ forecast_3_month   0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0,...
## $ forecast_6_month   0, 0, 0, 0, 0, 0, 0, 0, 114, 0, 0, 0, 0...
## $ forecast_9_month   0, 0, 0, 0, 0, 0, 0, 0, 152, 0, 0, 0, 0...
## $ sales_1_month      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sales_3_month      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sales_6_month      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sales_9_month      0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ min_bank           0, 0, 0, 1, 2, 0, 4, 0, 0, 0, 0, 0, 0, ...
## $ potential_issue    "No", "No", "No", "No", "No", "No", "No...
## $ pieces_past_due    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ perf_6_month_avg   -99.00, 0.99, -99.00, 0.10, -99.00, 0.8...
## $ perf_12_month_avg  -99.00, 0.99, -99.00, 0.13, -99.00, 0.8...
## $ local_bo_qty       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deck_risk          "No", "No", "Yes", "No", "Yes", "No", "...
## $ oe_constraint      "No", "No", "No", "No", "No", "No", "No...
## $ ppap_risk          "No", "No", "No", "No", "No", "No", "No...
## $ stop_auto_buy      "Yes", "Yes", "Yes", "Yes", "Yes", "Yes...
## $ rev_stop           "No", "No", "No", "No", "No", "No", "No...
## $ went_on_backorder  "No", "No", "No", "No", "No", "No", "No...

Data Setup

Validation Set

We have a training and a test set, but we’ll need a validation set to assist with modeling. We can perform a random 85/15 split of the training data.

# Train/Validation Set Split
split_pct   <- 0.85
n           <- nrow(train_raw_df)
sample_size <- floor(split_pct * n)

idx_train <- sample(1:n, size = sample_size)

valid_raw_df <- train_raw_df[-idx_train, ]
train_raw_df <- train_raw_df[idx_train, ]

Data Pre-Processing

Pre-processing includes a number of steps to get the raw data ready for modeling. We can make a pre-process function that drops unnecessary columns, deals with NA values, converts Yes/No data to 1/0, and converts the target to a factor. These are all standard pre-processing steps that will need to be applied to each of the data sets.

# Custom pre-processing function
preprocess_raw_data <- function(data) {
    # data = data frame of backorder data
    data %>%
        select(-sku) %>%
        drop_na(national_inv) %>%
        mutate(lead_time = ifelse(, -99, lead_time)) %>%
        mutate_if(is.character, .funs = function(x) ifelse(x == "Yes", 1, 0)) %>%
        mutate(went_on_backorder = as.factor(went_on_backorder))
}

# Apply the preprocessing steps
train_df <- preprocess_raw_data(train_raw_df)
valid_df <- preprocess_raw_data(valid_raw_df)
test_df  <- preprocess_raw_data(test_raw_df)

# Inspect the processed data
glimpse(train_df)
## Observations: 1,434,680
## Variables: 22
## $ national_inv       275, 9, 6, 5, 737, 3, 318, 12, 66, 16, ...
## $ lead_time          8, 8, 2, 4, 2, 8, 8, 2, 8, 9, 8, 2, 4, ...
## $ in_transit_qty     0, 0, 0, 0, 0, 0, 16, 0, 21, 0, 8, 5, 0...
## $ forecast_3_month   0, 0, 0, 0, 0, 0, 234, 0, 29, 0, 40, 14...
## $ forecast_6_month   0, 0, 0, 0, 0, 0, 552, 0, 62, 0, 80, 21...
## $ forecast_9_month   0, 0, 0, 0, 0, 0, 780, 0, 106, 0, 116, ...
## $ sales_1_month      0, 0, 0, 0, 1, 0, 129, 0, 10, 0, 19, 4,...
## $ sales_3_month      0, 0, 0, 0, 4, 1, 409, 0, 35, 0, 50, 10...
## $ sales_6_month      0, 0, 0, 0, 5, 1, 921, 0, 57, 0, 111, 1...
## $ sales_9_month      0, 0, 0, 0, 9, 1, 1404, 2, 99, 0, 162, ...
## $ min_bank           0, 2, 0, 0, 3, 1, 108, 1, 22, 0, 72, 2,...
## $ potential_issue    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ pieces_past_due    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ perf_6_month_avg   0.96, 0.42, 1.00, 0.77, 0.00, 0.75, 0.7...
## $ perf_12_month_avg  0.94, 0.36, 0.98, 0.80, 0.00, 0.74, 0.4...
## $ local_bo_qty       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deck_risk          0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ oe_constraint      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ppap_risk          0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ stop_auto_buy      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ rev_stop           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ went_on_backorder  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Data Transformation

A step may be needed to transform (normalize, center, scale, apply PCA, etc.) the data, especially when using deep learning as a classification method. In our case, overall performance actually decreased when transformation was performed, so we skip this step. Interested readers may wish to explore the recipes package for creating and applying ML pre-processing recipes, which has great documentation.

Balance Dataset with SMOTE

As we saw previously, the data is severely unbalanced with only 0.7% of cases being positive (went_on_backorder = Yes). To deal with this class imbalance, we’ll implement a technique called SMOTE (synthetic minority over-sampling technique), which oversamples the minority class by generating synthetic minority examples in the neighborhood of observed ones. In other words, it shrinks the prevalence of the majority class (under sampling) while simultaneously synthetically increasing the prevalence of the minority class using a k-nearest neighbors approach (over sampling via knn). The great thing is that SMOTE can improve classifier performance (due to better classifier balance) and improve efficiency (due to smaller but focused training set) at the same time (win-win)!

The ubSMOTE() function from the unbalanced package implements SMOTE (along with a number of other sampling methods such as over-sampling, under-sampling, and several advanced methods). We’ll set up the following arguments:

  • perc.over = 200: This is the percent of new instances generated for each rare instance. For example, if there were only 5 rare cases, setting perc.over to 200% will synthetically generate an additional 10 rare cases.
  • perc.under = 200: The percentage of majority-class cases sampled for each SMOTE observation. If 10 additional observations were created through the SMOTE process, 20 majority cases would be sampled.
  • k = 5: The number of nearest neighbors to select when synthetically generating new observations.
# Use SMOTE sampling to balance the dataset
input  <- train_df %>% select(-went_on_backorder)
output <- train_df$went_on_backorder
train_balanced <- ubSMOTE(input, output, perc.over = 200, perc.under = 200, k = 5)
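As a sanity check, the class balance these settings should produce can be worked out by hand (illustrative arithmetic only, counted per original rare case):

```r
# perc.over = 200 -> 2 synthetic cases generated per original rare case
synthetic <- 200 / 100 * 1
# Minority class = original + synthetic = 3
minority  <- 1 + synthetic
# perc.under = 200 -> 2 majority cases sampled per synthetic case = 4
majority  <- 200 / 100 * synthetic
minority / (minority + majority)  # 3/7, i.e. ~43% "Yes" after SMOTE
```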

We need to recombine the results into a tibble. Notice there are now only 68K observations versus 1.4M previously. This is a good indication that SMOTE worked correctly. As an added benefit, the training set size has shrunk which will make the model training significantly faster.

# Recombine the synthetic balanced data
train_df <- bind_cols(as.tibble(train_balanced$X), tibble(went_on_backorder = train_balanced$Y))
train_df
## # A tibble: 67,669 x 22
##    national_inv  lead_time in_transit_qty forecast_3_month
##  1   141.000000 -99.000000              0          0.00000
##  2   108.000000   8.000000              0          0.00000
##  3    38.000000   8.000000              0         26.00000
##  4     0.000000 -99.000000              0          0.00000
##  5     1.417421  11.165158              0         40.60803
##  6   253.000000   8.000000             11        175.00000
##  7    67.000000  52.000000              0          0.00000
##  8     0.000000   2.000000              0          5.00000
##  9     5.000000  12.000000              0          0.00000
## 10     0.000000   6.068576              0         24.19806
## # ... with 67,659 more rows, and 18 more variables:
## #   forecast_6_month , forecast_9_month ,
## #   sales_1_month , sales_3_month , sales_6_month ,
## #   sales_9_month , min_bank , potential_issue ,
## #   pieces_past_due , perf_6_month_avg ,
## #   perf_12_month_avg , local_bo_qty , deck_risk ,
## #   oe_constraint , ppap_risk , stop_auto_buy ,
## #   rev_stop , went_on_backorder 

We can also check the new balance of Yes/No: It’s now 43% Yes to 57% No, which in theory should enable the classifier to better detect relationships with the positive (Yes) class.

# Inspect class balance after SMOTE
train_df$went_on_backorder %>% table() %>% prop.table() 
## .
##         0         1 
## 0.5714286 0.4285714

Modeling with h2o

Now we are ready to model. Let’s fire up h2o.

# Fire up h2o and turn off progress bars
h2o.init()
h2o.no_progress()
##  Connection successful!
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         27 seconds 221 milliseconds 
##     H2O cluster version: 
##     H2O cluster version age:    1 month and 18 days  
##     H2O cluster name:           H2O_started_from_R_mdanc_bpa702 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.51 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.0 (2017-04-21)

The h2o package deals with H2OFrame (instead of data frame). Let’s convert.

# Convert to H2OFrame
train_h2o <- as.h2o(train_df)
valid_h2o <- as.h2o(valid_df)
test_h2o  <- as.h2o(test_df)

We use the h2o.automl() function to automatically try a wide range of classification models. The most important arguments are:

  • training_frame: Supply our training data set to build models.
  • validation_frame: Supply our validation data set to prevent models from overfitting.
  • leaderboard_frame: Supply our test data set, and models will be ranked based on performance on the test set.
  • max_runtime_secs: Controls the amount of runtime for each model. Setting to 45 seconds will keep things moving but give each model a fair shot. If you want even higher accuracy you can try increasing the run time. Just be aware that there tends to be diminishing returns.
# Automatic Machine Learning
y <- "went_on_backorder"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 45
)

Extract our leader model.

automl_leader <- automl_models_h2o@leader

Making Predictions

We use h2o.predict() to make our predictions on the test set. The data is still stored as an h2o object, but we can easily convert to a data frame with as.tibble(). Note that the left column (“predict”) is the class prediction, and columns “p0” and “p1” are the probabilities.

pred_h2o <- h2o.predict(automl_leader, newdata = test_h2o)
as.tibble(pred_h2o)
## # A tibble: 242,075 x 3
##    predict        p0         p1
##  1       0 0.9838573 0.01614266
##  2       0 0.9378156 0.06218435
##  3       0 0.9663659 0.03363410
##  4       0 0.9233651 0.07663488
##  5       0 0.9192454 0.08075463
##  6       0 0.9159785 0.08402145
##  7       0 0.8662870 0.13371302
##  8       0 0.9833785 0.01662152
##  9       0 0.9805942 0.01940576
## 10       0 0.9162502 0.08374984
## # ... with 242,065 more rows

Assessing Performance

We use h2o.performance() to get performance-related information by passing the function the leader model and the test set in the same fashion as we did for the prediction. We’ll use the performance output to visualize ROC, AUC, precision and recall.

perf_h2o <- h2o.performance(automl_leader, newdata = test_h2o)

From the performance output we can get a number of key classifier metrics by threshold using h2o.metric().

# Getting performance metrics
h2o.metric(perf_h2o) %>%
    as.tibble() %>%
    glimpse()
## Observations: 400
## Variables: 20
## $ threshold                0.9933572, 0.9855898, 0.9706335, ...
## $ f1                       0.006659267, 0.008856089, 0.01174...
## $ f2                       0.004179437, 0.005568962, 0.00741...
## $ f0point5                 0.01637555, 0.02161383, 0.0282485...
## $ accuracy                 0.9889084, 0.9889043, 0.9888795, ...
## $ precision                0.6000000, 0.5454545, 0.4444444, ...
## $ recall                   0.003348214, 0.004464286, 0.00595...
## $ specificity              0.9999749, 0.9999582, 0.9999165, ...
## $ absolute_mcc             0.04423925, 0.04861468, 0.0504339...
## $ min_per_class_accuracy   0.003348214, 0.004464286, 0.00595...
## $ mean_per_class_accuracy  0.5016616, 0.5022113, 0.5029344, ...
## $ tns                      239381, 239377, 239367, 239353, 2...
## $ fns                      2679, 2676, 2672, 2660, 2653, 264...
## $ fps                      6, 10, 20, 34, 45, 56, 77, 97, 10...
## $ tps                      9, 12, 16, 28, 35, 39, 46, 56, 63...
## $ tnr                      0.9999749, 0.9999582, 0.9999165, ...
## $ fnr                      0.9966518, 0.9955357, 0.9940476, ...
## $ fpr                      2.506402e-05, 4.177336e-05, 8.354...
## $ tpr                      0.003348214, 0.004464286, 0.00595...
## $ idx                      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical method that pits the true positive rate (y-axis) against the false positive rate (x-axis). The benefit to the ROC curve is two-fold:

  1. We can visualize how the binary classification model compares to randomly guessing
  2. We can calculate AUC (Area Under the Curve), which is a method to compare models (perfect classification = 1).
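For intuition, AUC is simply the area under the (fpr, tpr) curve, which can be approximated with the trapezoid rule. h2o.auc() computes this for us, but a base R sketch shows the idea:

```r
# Approximate AUC from ROC points via the trapezoid rule
auc_trapezoid <- function(fpr, tpr) {
  o   <- order(fpr)
  fpr <- fpr[o]
  tpr <- tpr[o]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

auc_trapezoid(c(0, 0, 1), c(0, 1, 1))      # 1   (perfect classifier)
auc_trapezoid(c(0, 0.5, 1), c(0, 0.5, 1))  # 0.5 (random guessing)
```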

Let’s review the ROC curve for our model using h2o. The red dotted line is what you could theoretically achieve by randomly guessing.

# Plot ROC Curve
left_join(h2o.tpr(perf_h2o), h2o.fpr(perf_h2o)) %>%
    mutate(random_guess = fpr) %>%
    select(-threshold) %>%
    ggplot(aes(x = fpr)) +
    geom_area(aes(y = tpr, fill = "AUC"), alpha = 0.5) +
    geom_point(aes(y = tpr, color = "TPR"), alpha = 0.25) +
    geom_line(aes(y = random_guess, color = "Random Guess"), size = 1, linetype = 2) +
    theme_tq() +
    scale_color_manual(
        name = "Key",
        values = c("TPR" = palette_dark()[[1]],
                   "Random Guess" = palette_dark()[[2]])
        ) +
    scale_fill_manual(name = "Fill", values = c("AUC" = palette_dark()[[5]])) +
    labs(title = "ROC Curve", 
         subtitle = "Model is performing much better than random guessing") +
    annotate("text", x = 0.25, y = 0.65, label = "Better than guessing") +
    annotate("text", x = 0.75, y = 0.25, label = "Worse than guessing")

plot of chunk unnamed-chunk-26

We can also get the AUC of the test set using h2o.auc(). To give you an idea, the best Kaggle data scientists are getting AUC = 0.95. At AUC = 0.92, our automatic machine learning model is in the same ball park as the Kaggle competitors, which is quite impressive considering the minimal effort to get to this point.

# AUC Calculation
h2o.auc(perf_h2o)
## [1] 0.9242604

The Cutoff (Threshold)

The cutoff (also known as threshold) is the value that divides the predictions. Recall that the prediction output has three columns: “predict”, “p0”, and “p1”. The first column, “predict” (the model predictions), uses an automatically determined threshold value as the cutoff: everything above the cutoff is classified as “Yes” and everything below as “No”.
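Applying a different cutoff yourself is straightforward; a minimal base R sketch (the p1 probabilities below are made up for illustration):

```r
# Re-classify predicted probabilities at a custom cutoff
classify_at <- function(p1, cutoff) {
  ifelse(p1 >= cutoff, "Yes", "No")
}

p1 <- c(0.016, 0.062, 0.700, 0.076)
classify_at(p1, 0.63)  # "No" "No" "Yes" "No"
classify_at(p1, 0.05)  # "No" "Yes" "Yes" "Yes" -- a lower cutoff flags more "Yes"
```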

# predictions are based on p1_cutoff
as.tibble(pred_h2o)
## # A tibble: 242,075 x 3
##    predict        p0         p1
##  1       0 0.9838573 0.01614266
##  2       0 0.9378156 0.06218435
##  3       0 0.9663659 0.03363410
##  4       0 0.9233651 0.07663488
##  5       0 0.9192454 0.08075463
##  6       0 0.9159785 0.08402145
##  7       0 0.8662870 0.13371302
##  8       0 0.9833785 0.01662152
##  9       0 0.9805942 0.01940576
## 10       0 0.9162502 0.08374984
## # ... with 242,065 more rows

How is the Cutoff Determined?

The algorithm uses the threshold at the best F1 score to determine the optimal value for the cutoff.

# Algorithm uses p1_cutoff that maximizes F1
h2o.F1(perf_h2o) %>%
    as.tibble() %>%
    filter(f1 == max(f1))
## # A tibble: 1 x 2
##   threshold        f1
## 1 0.6300685 0.2844262

We could use a different threshold if we like, which can change depending on your goal (e.g. maximize recall, maximize precision, etc). For a full list of thresholds, we can inspect the “metrics” slot that is inside the perf_h2o object.

# Full list of thresholds at various performance metrics
perf_h2o@metrics$max_criteria_and_metric_scores
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.630068 0.284426  64
## 2                       max f2  0.560406 0.366587  86
## 3                 max f0point5  0.719350 0.271677  41
## 4                 max accuracy  0.993357 0.988908   0
## 5                max precision  0.993357 0.600000   0
## 6                   max recall  0.000210 1.000000 399
## 7              max specificity  0.993357 0.999975   0
## 8             max absolute_mcc  0.603170 0.287939  72
## 9   max min_per_class_accuracy  0.254499 0.850965 207
## 10 max mean_per_class_accuracy  0.222498 0.853464 222

Is cutoff at max F1 what we want? Probably not as we’ll see next.

Optimizing the Model for Expected Profit

In the business context we need to decide what cutoff (threshold of probability to assign yes/no) to use: Is the “p1” cutoff >= 0.63 (probability above which predict “Yes”) adequate? The answer lies in the balance between the cost of inventorying incorrect product (low precision) versus the cost of the lost customer (low recall):

  • Precision: When the model predicts yes, how often the actual value is yes. If we implement a high-precision (low-recall) strategy, we accept the tradeoff of letting the model misclassify some actual yes values in order to decrease the number of incorrectly predicted yes values.

  • Recall: When the actual value is yes, how often the model predicts yes. If we implement a high-recall (low-precision) strategy, we accept the tradeoff of letting the model incorrectly predict more no values as yes.
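Both metrics come straight from confusion-matrix counts; a minimal base R sketch (the counts are hypothetical, not taken from the model output):

```r
# tp = true positives, fp = false positives, fn = false negatives
precision <- function(tp, fp) tp / (tp + fp)
recall    <- function(tp, fn) tp / (tp + fn)

precision(tp = 40, fp = 60)  # 0.4 -- of predicted backorders, 40% were real
recall(tp = 40, fn = 160)    # 0.2 -- of actual backorders, 20% were caught
```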

Effect of Precision and Recall on Business Strategy

By shifting the cutoff, we can control precision and recall, and this has a major effect on the business strategy. As the cutoff increases, the model becomes more conservative, growing very picky about what it classifies as went_on_backorder = Yes.

# Plot recall and precision vs threshold, visualize inventory strategy effect
left_join(h2o.recall(perf_h2o), h2o.precision(perf_h2o)) %>%
    rename(recall = tpr) %>%
    gather(key = key, value = value, -threshold) %>%
    ggplot(aes(x = threshold, y = value, color = key)) +
    geom_point(alpha = 0.5) +
    scale_color_tq() +
    theme_tq() +
    labs(title = 'Precision and Recall vs Cutoff ("Yes" Threshold)',
         subtitle = "As the cutoff increases from zero, inventory strategy becomes more conservative",
         caption = "Deciding which cutoff to call YES is highly important! Don't just blindly use 0.50.",
         x = 'Cutoff (Probability above which we predict went_on_backorder = "Yes")',
         y = "Precision and Recall Values"
         ) +
    # p>=0
    geom_vline(xintercept = 0, color = palette_light()[[3]], size = 1) +
    annotate("text", x = 0.12, y = 0.75, size = 3,
             label = 'p1 >= 0: "Yes"\nInventory\nEverything') +
    geom_segment(x = 0, y = 0.7, xend = 0.02, yend= 0.72, color = palette_light()[[3]], size = 1) +
    # p>=0.25
    geom_vline(xintercept = 0.25, color = palette_light()[[3]], size = 1) +
    annotate("text", x = 0.37, y = 0.35, size = 3,
             label = 'p1 >= 0.25: "Yes"\nInventory Anything\nWith Chance\nof Backorder') +
    geom_segment(x = 0.25, y = 0.30, xend = 0.27, yend= 0.32, color = palette_light()[[3]], size = 1) +
    # p>=0.5
    geom_vline(xintercept = 0.5, color = palette_light()[[3]], size = 1) +
    annotate("text", x = 0.62, y = 0.75, size = 3,
             label = 'p1 >= 0.50: "Yes"\nInventory\nProbability\nSplit 50/50') +
    geom_segment(x = 0.5, y = 0.70, xend = 0.52, yend= 0.72, color = palette_light()[[3]], size = 1) +
    # p>=0.75
    geom_vline(xintercept = 0.75, color = palette_light()[[3]], size = 1) +
    annotate("text", x = 0.87, y = 0.75, size = 3,
             label = 'p1 >= 0.75: "Yes"\nInventory Very\nConservatively\n(Most Likely Backorder)') +
    geom_segment(x = 0.75, y = 0.70, xend = 0.77, yend= 0.72, color = palette_light()[[3]], size = 1) +
    # p>=1
    geom_vline(xintercept = 1, color = palette_light()[[3]], size = 1) +
    annotate("text", x = 0.87, y = 0.22, size = 3,
             label = 'p1 >= 1.00: "Yes"\nInventory Nothing') +
    geom_segment(x = 1.00, y = 0.23, xend = 0.98, yend= 0.21, color = palette_light()[[3]], size = 1) 


Expected Value Framework

We need another function to help us determine which values of cutoff to use, and this comes from the Expected Value Framework described in Data Science for Business.

Expected Value Framework

Source: Data Science for Business

To implement this framework we need two things:

  1. Expected Rates (matrix of probabilities): Needed for each threshold
  2. Cost/Benefit Information: Needed for each observation

Expected Rates

We have the class probabilities and rates from the confusion matrix, which we can retrieve using the h2o.confusionMatrix() function.

# Confusion Matrix, p1_cutoff = 0.63
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.630068453397666:
##             0    1    Error          Rate
## 0      235796 3591 0.015001  =3591/239387
## 1        1647 1041 0.612723    =1647/2688
## Totals 237443 4632 0.021638  =5238/242075

The problems are:

  1. The confusion matrix is for just one specific cutoff. We want to evaluate for all of the potential cutoffs to determine which strategy is best.
  2. We need to convert to expected rates, as shown in the Figure 7-2 diagram.

Luckily, we can retrieve the rates by cutoff conveniently using h2o.metric(). Huzzah!

# Get expected rates by cutoff
expected_rates <- h2o.metric(perf_h2o) %>%
    as.tibble() %>%
    select(threshold, tpr, fpr, fnr, tnr)
## # A tibble: 400 x 5
##    threshold         tpr          fpr       fnr       tnr
##  1 0.9933572 0.003348214 2.506402e-05 0.9966518 0.9999749
##  2 0.9855898 0.004464286 4.177336e-05 0.9955357 0.9999582
##  3 0.9706335 0.005952381 8.354673e-05 0.9940476 0.9999165
##  4 0.9604183 0.010416667 1.420294e-04 0.9895833 0.9998580
##  5 0.9513270 0.013020833 1.879801e-04 0.9869792 0.9998120
##  6 0.9423699 0.014508929 2.339308e-04 0.9854911 0.9997661
##  7 0.9320206 0.017113095 3.216549e-04 0.9828869 0.9996783
##  8 0.9201693 0.020833333 4.052016e-04 0.9791667 0.9995948
##  9 0.9113555 0.023437500 4.511523e-04 0.9765625 0.9995488
## 10 0.9061218 0.025297619 5.472311e-04 0.9747024 0.9994528
## # ... with 390 more rows

Cost/Benefit Information

The cost-benefit matrix is a business assessment of the cost and benefit for each of four potential outcomes:

  • True Negative (benefit): This is the benefit from a SKU that is correctly predicted as not backordered. The benefit is zero because a customer simply does not buy this item (lack of demand).

  • True Positive (benefit): This is the benefit from a SKU that is correctly predicted as backorder and the customer needs this now. The benefit is the profit from revenue. For example, if the item price is $1000 and the cost is $600, the unit profit is $400.

  • False Positive (cost): This is the cost associated with incorrectly classifying a SKU as backordered when demand is not present. This is the warehousing and inventory-related cost: for example, $10/unit to inventory an item, a cost that would otherwise not have occurred.

  • False Negative (cost): This is the cost associated with incorrectly classifying a SKU as not backordered when demand is present. In our case, no money was spent and nothing was gained.
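Before codifying this per item, it can help to lay the four outcomes out as a 2×2 cost/benefit matrix. This is a sketch using the example figures from the bullets above ($400/unit profit for a true positive, $10/unit inventory cost for a false positive, zero for the other two outcomes):

```r
# Cost/benefit matrix for one SKU, using the example figures above
cb_matrix <- matrix(
    c(0,  -10,    # actual No:  TN benefit = 0,  FP cost    = -10
      0,  400),   # actual Yes: FN cost    = 0,  TP benefit = 400
    nrow = 2, byrow = TRUE,
    dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes"))
)
cb_matrix
```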

The cost-benefit information is needed for each decision pair. If we take one SKU, say the first prediction from the test set, we have an item that was predicted to NOT be backordered (went_on_backorder == "No"). If, hypothetically, the True Positive (benefit) is $400/unit in profit from correctly predicting a backorder, and the False Positive (cost) of accidentally inventorying an item that was not backordered is $10/unit, then a data frame can be structured like so.

# Cost/benefit info codified for first item
first_item <- pred_h2o %>%
    as.tibble() %>%
    slice(1) %>%
    mutate(
        cb_tn = 0,
        cb_tp = 400,
        cb_fp = -10,
        cb_fn = 0
    )
first_item
## # A tibble: 1 x 7
##   predict        p0         p1 cb_tn cb_tp cb_fp cb_fn
## 1       0 0.9838573 0.01614266     0   400   -10     0

Optimization: Single Item

The expected value equation generalizes to:

EV = sum(p_i * v_i)

where:

  • p_i is the probability of outcome i for one observation
  • v_i is the value of outcome i for one observation

The general form isn’t very useful by itself, but from it we can create an Expected Profit equation using a basic rule of probability, p(x,y) = p(y) * p(x|y), that combines both the Expected Rates (a 2×2 matrix of probabilities from the normalized Confusion Matrix) and a Cost/Benefit Matrix (a 2×2 matrix of expected costs and benefits). Refer to Data Science for Business for the proof.

Expected Profit = p(p) * [p(Y|p) * b(Y,p) + p(N|p) * c(N,p)] + p(n) * [p(N|n) * b(N,n) + p(Y|n) * c(Y,n)]

  • p(p) is the confusion matrix positive class prior (probability of actual yes / total)
  • p(n) is the confusion matrix negative class prior (probability of actual no / total = 1 – positive class prior)
  • p(Y|p) is the True Positive Rate (tpr)
  • p(N|p) is the False Negative Rate (fnr)
  • p(N|n) is the True Negative Rate (tnr)
  • p(Y|n) is the False Positive Rate (fpr)
  • b(Y,p) is the benefit from true positive (cb_tp)
  • c(N,p) is the cost from false negative (cb_fn)
  • b(N,n) is the benefit from true negative (cb_tn)
  • c(Y,n) is the cost from false positive (cb_fp)

This equation simplifies even further since we have a case where cb_tn and cb_fn are both zero.
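Since cb_tn = cb_fn = 0, only the true positive and false positive terms survive, which is exactly the two-term expression used in the calc_expected_profit() function. A quick numeric sanity check, with made-up rates and the $400 / -$10 figures from the example:

```r
# Simplified expected profit: only the TP benefit and FP cost terms remain
# (p1, tpr, fpr are illustrative values; cb_tp, cb_fp are from the example)
p1    <- 0.10    # probability this SKU goes on backorder
tpr   <- 0.80    # true positive rate at some cutoff
fpr   <- 0.05    # false positive rate at the same cutoff
cb_tp <- 400     # benefit per unit for a true positive
cb_fp <- -10     # cost per unit for a false positive

expected_profit <- p1 * (tpr * cb_tp) + (1 - p1) * (fpr * cb_fp)
expected_profit
## [1] 31.55
```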

We create a function to calculate the expected profit using the probability of a positive case (positive prior, p1), the cost/benefit of a true positive (cb_tp), and the cost/benefit of a false positive (cb_fp). We’ll take advantage of the expected_rates data frame we previously created, which contains the true positive rate and false positive rate for each threshold (400 thresholds between 0 and 1).

# Function to calculate expected profit
calc_expected_profit <- function(p1, cb_tp, cb_fp) {
    # p1    = probability the item goes on backorder
    # cb_tp = benefit (profit) from true positive (correctly identifying backorder)
    # cb_fp = cost (expense) from false positive (incorrectly inventorying)
    tibble(
        p1    = p1,
        cb_tp = cb_tp,
        cb_fp = cb_fp
        ) %>%
        # Add in expected rates
        mutate(expected_rates = list(expected_rates)) %>%
        unnest() %>%
        mutate(
            expected_profit = p1 * (tpr * cb_tp) + (1 - p1) * (fpr * cb_fp)
        ) %>%
        select(threshold, expected_profit)
}

We can test the function for a hypothetical prediction that is unlikely to go on backorder (the most common class). Setting p1 = 0.01 indicates the hypothetical SKU has a low probability of backorder. Continuing with the example case of $400/unit profit and $10/unit inventory cost, we can see that the optimal threshold is approximately 0.4. Note that an inventory-everything strategy (threshold = 0) would cause the company to lose money on low-probability items (-$6/unit), while an inventory-nothing strategy (threshold = 1) would result in no benefit but no loss ($0/unit).

# Investigate expected profit for an item with low probability of backorder
hypothetical_low <- calc_expected_profit(p1 = 0.01, cb_tp = 400, cb_fp = -10)
hypothetical_low_max <- filter(hypothetical_low, expected_profit == max(expected_profit))

hypothetical_low %>%
    ggplot(aes(threshold, expected_profit, color = expected_profit)) +
    geom_point() +
    geom_hline(yintercept = 0, color = "red") +
    geom_vline(xintercept = hypothetical_low_max$threshold, color = palette_light()[[1]]) +
    theme_tq() +
    scale_color_continuous(low = palette_light()[[1]], high = palette_light()[[2]]) +
    labs(title = "Expected Profit Curve, Low Probability of Backorder",
         subtitle = "When probability of backorder is low, threshold increases inventory conservatism",
         caption  = paste0('threshold @ max = ', hypothetical_low_max$threshold %>% round(2)),
         x = "Cutoff (Threshold)",
         y = "Expected Profit per Unit"
         )


Conversely, if we investigate a hypothetical item with a high probability of backorder, we can see that it’s much more advantageous to have a loose strategy with respect to inventory conservatism. Notice the profit per unit is 80% of the theoretical maximum profit (80% of $400/unit = $320/unit when "p1" = 0.8). The profit decreases to zero as the inventory strategy becomes more conservative.

# Investigate expected profit for an item with high probability of backorder
hypothetical_high <- calc_expected_profit(p1 = 0.8, cb_tp = 400, cb_fp = -10)
hypothetical_high_max <- filter(hypothetical_high, expected_profit == max(expected_profit))

hypothetical_high %>%
    ggplot(aes(threshold, expected_profit, color = expected_profit)) +
    geom_point() +
    geom_hline(yintercept = 0, color = "red") +
    geom_vline(xintercept = hypothetical_high_max$threshold, color = palette_light()[[1]]) +
    theme_tq() +
    scale_color_continuous(low = palette_light()[[1]], high = palette_light()[[2]]) +
    labs(title = "Expected Profit Curve, High Probability of Backorder",
         subtitle = "When probability of backorder is high, a loose inventory strategy is advantageous",
         caption  = paste0('threshold @ max = ', hypothetical_high_max$threshold %>% round(2))
         )


Let’s take a minute to digest what’s going on in the high and low expected profit curves. Units with low probability of backorder (the majority class) tend to push the optimal cutoff up, while units with high probability tend to pull it down. It’s this tradeoff that we need to scale up in order to understand the full picture.

Optimization: Multi-Item

Now for the fun part: scaling the optimization to multiple products. Let’s analyze a simplified case: 10 items with varying backorder probabilities, benefits, costs, and safety stock levels. In addition, we’ll include a backorder-prevention purchase quantity equal to 100% of the safety stock (meaning items predicted to be backordered will have an additional quantity purchased equal to the safety stock level). We’ll investigate the optimal stocking level for this subset of items to illustrate scaling the analysis to find the globally optimized cutoff (threshold).

# Ten Hypothetical Items
ten_items <- tribble(
    ~"item", ~"p1",  ~"cb_tp", ~"cb_fp", ~"safety_stock",
    1,       0.02,      10,    -0.75,    100,
    2,       0.09,      7.5,   -0.75,    35,
    3,       0.65,      8.5,   -0.75,    75,
    4,       0.01,      25,    -2.5,     50,
    5,       0.10,      15,    -0.5,     150,
    6,       0.09,      400,   -25,      5,
    7,       0.05,      17.5,  -5,       25,
    8,       0.01,      200,   -9,       75,
    9,       0.11,      25,    -2,       50,
    10,      0.13,      11,    -0.9,     150
)
ten_items
## # A tibble: 10 x 5
##     item    p1 cb_tp  cb_fp safety_stock
##  1     1  0.02  10.0  -0.75          100
##  2     2  0.09   7.5  -0.75           35
##  3     3  0.65   8.5  -0.75           75
##  4     4  0.01  25.0  -2.50           50
##  5     5  0.10  15.0  -0.50          150
##  6     6  0.09 400.0 -25.00            5
##  7     7  0.05  17.5  -5.00           25
##  8     8  0.01 200.0  -9.00           75
##  9     9  0.11  25.0  -2.00           50
## 10    10  0.13  11.0  -0.90          150

We use purrr to map calc_expected_profit() to each item, returning a data frame of expected profits per unit by threshold value. We unnest() to expand the data frame to one level, enabling us to work with the expected profits and quantities. We then “extend” the unit expected profit (multiply it by the backorder-prevention purchase quantity, which is 100% of the safety stock level per our logic) to get the total expected profit per item.

# purrr to calculate expected profit for each of the ten items at each threshold
extended_expected_profit_ten_items <- ten_items %>%
    # pmap to map calc_expected_profit() to each item
    mutate(expected_profit = pmap(.l = list(p1, cb_tp, cb_fp), .f = calc_expected_profit)) %>%
    unnest() %>%
    rename(expected_profit_per_unit = expected_profit) %>%
    # Calculate 100% safety stock repurchase and sell
    mutate(expected_profit_extended = expected_profit_per_unit * 1 * safety_stock) %>%
    select(item, p1, threshold, expected_profit_extended) 
## # A tibble: 4,000 x 4
##     item    p1 threshold expected_profit_extended
##  1     1  0.02 0.9933572               0.06512208
##  2     1  0.02 0.9855898               0.08621537
##  3     1  0.02 0.9706335               0.11290693
##  4     1  0.02 0.9604183               0.19789417
##  5     1  0.02 0.9513270               0.24660013
##  6     1  0.02 0.9423699               0.27298466
##  7     1  0.02 0.9320206               0.31862027
##  8     1  0.02 0.9201693               0.38688435
##  9     1  0.02 0.9113555               0.43559030
## 10     1  0.02 0.9061218               0.46573090
## # ... with 3,990 more rows

We can visualize the expected profit curves for each item extended for backorder-prevention quantity to be purchased and sold (note that selling 100% is a simplifying assumption).

# Visualizing Expected Profit 
extended_expected_profit_ten_items %>%
    ggplot(aes(threshold, expected_profit_extended, 
               color = factor(item),
               group = item)) +
    geom_line(size = 1) +
    theme_tq() +
    scale_color_tq() +
    labs(
        title = "Expected Profit Curves, Modeling Extended Expected Profit",
        subtitle = "Taking backorder-prevention purchase quantity into account weights curves",
        color = "Item No." 
    )


Finally, we can aggregate the expected profit using a few dplyr operations. Visualizing the final curve exposes the optimal threshold.

# Aggregate extended expected profit by threshold 
total_expected_profit_ten_items <- extended_expected_profit_ten_items %>%
    group_by(threshold) %>%
    summarise(expected_profit_total = sum(expected_profit_extended)) 

# Get maximum (optimal) threshold
max_expected_profit <- total_expected_profit_ten_items %>%
    filter(expected_profit_total == max(expected_profit_total))

# Visualize the total expected profit curve
total_expected_profit_ten_items %>%
    ggplot(aes(threshold, expected_profit_total)) +
    geom_line(size = 1, color = palette_light()[[1]]) +
    geom_vline(xintercept = max_expected_profit$threshold, color = palette_light()[[1]]) +
    theme_tq() +
    scale_color_tq() +
    labs(
        title = "Expected Profit Curve, Modeling Total Expected Profit",
        subtitle = "Summing up the curves by threshold yields optimal strategy",
        caption  = paste0('threshold @ max = ', max_expected_profit$threshold %>% round(2))
    )



This was a very technical and detailed post, and if you made it through, congratulations! We covered automated machine learning with H2O, an efficient and highly accurate tool for prediction. We worked with an extremely unbalanced data set, showing how to use SMOTE to synthetically improve dataset balance and, ultimately, model performance. We spent a considerable amount of effort optimizing the cutoff (threshold) selection to maximize expected profit, which ultimately matters most to the bottom line. Hopefully you can see how data science and machine learning can be very beneficial to the business, enabling better decisions and ROI.


Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to this article and our article, HR Analytics: Using Machine Learning to Predict Employee Turnover. The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.

About Business Science

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!

Follow Business Science on Social Media


Source:: R News

Creating interactive SVG tables in R

By Riddhiman

In this post we will explore how to make SVG tables in R using plotly. The tables are visually appealing and can be modified on the fly using simple drag and drop. Make sure you install the latest development version of plotly from GitHub using devtools::install_github("ropensci/plotly").

The easiest way to create a SVG table in R using plotly is by manually specifying headers and individual cell values. The following code snippet highlights this:



plot_ly(
  type = 'table',
  # Specify header row (labels here are assumed: an expenses column plus four quarters)
  header = list(
    values = c("EXPENSES", "Q1", "Q2", "Q3", "Q4"),
    # Formatting 
    line = list(color = '#DFE8F3'),
    align = c('left','left','left','left','left'),
    font = list(color = c('#506784', '#506784', '#506784', '#506784', '#ab63fa'), size = 14)
  ),
  # Specify individual cells
  cells = list(
    # Now specify each cell content
    values = list(
      c('Salaries', 'Office', 'Merchandise', 'Legal', 'TOTAL'),
      c(1200000, 20000, 80000, 2000, 12120000),
      c(1300000, 20000, 70000, 2000, 130902000),
      c(1300000, 20000, 120000, 2000, 131222000),
      c(1400000, 20000, 90000, 2000, 14102000)),
    # Formatting
    line = list(color = '#DFE8F3'),
    align = c('left', 'left', 'left', 'left', 'left'),
    font = list(color = c('#506784', '#506784', '#506784', '#506784', '#ab63fa'), size = 14),
    height = 48
    )) %>% 
  # Layout is needed to remove gridlines, axis zero lines and ticktext 
  # or else they will also show up
  layout(xaxis = list(zeroline = F, showgrid = F, showticklabels = F, domain = c(0, 0.5)),
         yaxis = list(zeroline = F, showgrid = F, showticklabels = F))


We can also write a helper function to create these tables from data frames.


createTable ", x, ""))    
  nms Rows", after = 0)
  headerValues % 
    layout(xaxis = list(zeroline = F, showgrid = F, showticklabels = F),
           yaxis = list(zeroline = F, showgrid = F, showticklabels = F))


Note that columns can easily be rearranged by dragging them around. You can find more information on individual attributes here.
Hope this post was useful – happy table making!

Source:: R News

New Zealand fatal traffic crashes by @ellis2013nz

By Peter's stats stuff – R

(This article was first published on Peter’s stats stuff – R, and kindly contributed to R-bloggers)

Traffic fatalities in the news

Fatalities from traffic crashes have been in the news in New Zealand recently, for tragic reasons. For once, data beyond the short term have come into the debate. Sam Warburton has written two good articles on the topic, with good consideration of the data. He has also, via Twitter, made his compilation of NZTA and other data and his considerable workings available.

Traffic, and crashes in particular, are characterised by large volumes of high quality data. Sam Warburton’s workings can be supplemented with a range of data from the NZTA. I know little about the subject and thought it seemed a good one to explore. This is a long blog post but I still feel I’ve barely touched the surface. I nearly split this into a series of posts but decided I won’t get back to this for months at least, so best to get down what I fiddled around with straight away.

Long term

I always like to see as long term a picture as I can in a new area. In one of the tabs in Sam Warburton’s Excel book is a table with annual data back to 1950. Check this out:


Here’s the R code that draws that graph, plus what’s needed in terms of R packages for the rest of the post:

library(tidyverse)      # data manipulation and ggplot2
library(openxlsx)       # read.xlsx
library(ggthemes)       # for theme_map
library(ggmap)          # get_map / ggmap
library(forecastHybrid) # hybridModel, forecast
library(rstan)          # stan

#=================long term picture========================
longterm <- read.xlsx("../data/vkt2.1.xlsx", sheet = "safety (2)", 
                      cols = c(1, 3), rows = 2:69)

ggplot(longterm, aes(x = Year, y = Deaths)) +
  geom_line() +
  ggtitle("Road deaths in New Zealand 1950 - 2016",
          "Not adjusted for population or vehicle kilometres travelled") +
  labs(caption = "Source: NZTA data compiled by Sam Warburton")

Spatial elements

There have been a few charts going around of crashes and fatalities over time aggregated by region, but I was curious to see a more granular picture. It turns out the NZTA publish a CSV of all half a million recorded crashes (fatal and non-fatal) back to 2000. The data is cut back a bit – it doesn’t have comments on what may have caused the crash for example, or the exact date – but it does have the exact NZTM coordinates of the crash location, which we can convert to latitude and longitude and use for maps.

So I started as always with the big picture – where have all the fatal crashes been since 2000?

Unsurprisingly, they are heavily concentrated in the population centres and along roads. What happens when we look at this view over time?

So, turns out a map isn’t a great way to see subtle trends. In fact, I doubt there’s an effective simple visualisation that would show changing spatial patterns here; it’s definitely a job for a sophisticated model first, followed by a visualisation of its results. Outside today’s scope.

A map is a good way to just get to understand what is in some data though. Here are another couple of variables the NZTA provide. First, whether the accidents happened on a public holiday:

… and combination of vehicles they involved:

Here’s the code for downloading the individual crash data, converting coordinates from NZTM to latitude and longitude, and building those maps. This is a nice example of the power of the ggplot2 universe. Once I’ve defined my basic map and its theme, redrawing it with new faceting variables is just one line of code in each case.

#============download data and tidy up a bit======================
# The individual crash data come from 

# caution, largish file - 27 MB
# crash_data_url is the NZTA disaggregated crash data CSV link (URL not preserved here)
download.file(crash_data_url,
              destfile = "disaggregated-crash-data.csv", mode = "wb")

crash <- read.csv("disaggregated-crash-data.csv", check.names = FALSE)
dim(crash) # 586,189 rows

# I prefer lower case names, and underscores rather than dots:
names(crash) <- gsub(" ", "_", str_trim(tolower(names(crash))), fixed = TRUE)

# Convert the NZTM coordinates to latitude and longitude:
# see
p4str <- "+proj=tmerc +lat_0=0.0 +lon_0=173.0 +k=0.9996 +x_0=1600000.0 +y_0=10000000.0 +datum=WGS84 +units=m"
p <- proj4::project(crash[ , c("easting", "northing")], proj = p4str, inverse = TRUE)
crash$longitude <- p$x
crash$latitude <- p$y

#====================NZ maps========================
my_map_theme <- theme_map(base_family = "myfont") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(1), face = "bold"),
        plot.caption = element_text(colour = "grey50")) 

m <- crash %>%
  as_tibble() %>%
  filter(fatal_count > 0) %>%
  ggplot(aes(x = longitude, y = latitude, size = fatal_count))  +
  borders("nz", colour = NA, fill = "darkgreen", alpha = 0.1) +
  geom_point(alpha = 0.2) +
  coord_map() +
  my_map_theme +
  theme(legend.position = c(0.9, 0.05)) +
  labs(size = "Fatalities",
       caption = "Source: NZTA Disaggregated Crash Data") +
  ggtitle("Fatal crashes in New Zealand, 2000 - 2017")


m + facet_wrap(~crash_year)

m + facet_wrap(~multi_veh, ncol = 4)

m + facet_wrap(~holiday)  

Closer up maps

I live in New Zealand’s capital, Wellington, and was interested in a closer view of accidents in there. ggmap makes it easy to do this with a street map in the background:

wtn <- get_map("Wellington, New Zealand", maptype = "roadmap", zoom = 11)

ggmap(wtn) +
  geom_point(aes(x = longitude, y = latitude), data = crash, alpha = 0.2) +
  my_map_theme +
  ggtitle("Crashes in Wellington, 2000 to 2017",
          "(Including non-fatal)")

But clearly there’s a limit to how many small, static maps I’ll want to make. So I made an interactive version covering the whole country:

You can zoom in and out, move the map around, and hover or click on points to get more information. Full screen version also available.

Is there a trend over time?

Much of the public discussion is about whether there is significant evidence of the recent uptick in deaths being more than the usual year-on-year variation. The most thorough discussion was in Thomas Lumley’s StatsChat. He fit a state space model to the fatality counts, allowing latent tendency for fatalities per kilometres travelled to change at random over time as a Markov chain, with extra randomness each year on top of the randomness coming from a Poisson distribution of the actual count of deaths. He concluded that there is strong evidence of a minimum some time around 2013 or 2014, and growth since then.
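To make the structure of that model concrete before the Stan version later in the post, here is a minimal simulation sketch of the same idea. All parameter values below are invented for illustration, not estimated from the crash data: a latent log-rate follows a random walk, and observed deaths are Poisson draws around exposure times the latent rate.

```r
# Simulation sketch of a state space model for deaths (illustrative values only)
set.seed(42)

n   <- 30            # number of periods
vkt <- rep(10, n)    # exposure: billions of vehicle km per period (made up)

# Latent log-rate of deaths per billion km follows a random walk
log_mu <- cumsum(c(log(7), rnorm(n - 1, mean = 0, sd = 0.05)))
mu     <- exp(log_mu)

# Observed deaths are Poisson, with mean = exposure * latent rate
deaths <- rpois(n, lambda = vkt * mu)
head(deaths)
```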

To keep the Bayesian modelling part of my brain in practice I set out to repeat and expand this, using six-monthly data rather than annual. The disaggregated crash data published by NZTA has both the calendar year and the financial year for each crash, which we can use to deduce the six-month period it falls in. The vehicle kilometres travelled in Sam Warburton’s data is at a quarterly grain, although it looks like the quarters are rolling annual averages.
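The six-month deduction from those two fields works because the NZ financial year ends mid-year: if a crash’s financial year ends in the same calendar year the crash occurred, the crash must have happened in the first half of that year. A tiny sketch of the logic (the sample values here are made up; the real column names appear in the code further down):

```r
# Hypothetical records: calendar year and financial-year label for two crashes
crash_year     <- c(2005, 2005)
crash_fin_year <- c("2004/2005", "2005/2006")

# Financial year end = last four characters of the label
fin_year_end <- as.numeric(substr(crash_fin_year,
                                  nchar(crash_fin_year) - 3, nchar(crash_fin_year)))

# "a" = first half of the calendar year, "b" = second half
six_month_period <- ifelse(fin_year_end == crash_year, "a", "b")
six_month_period
## [1] "a" "b"
```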

First, to get that data into shape and check out the vehicle kilometres travelled data. I decided to divide the original numbers by four, on the assumption they are rolling annual averages, to convert them to a quarterly grain. I might be missing something, but it looked to me that the kilometres travelled data stop in the first quarter of 2016, so I forecast forward another couple of years using the forecastHybrid R package as a simple way to get an average of ARIMA and exponential state space smoothing models. The forecast looks like this:

#------------------vehicle kilometres travelled--------------------------
vkt <- read.xlsx("../data/vkt2.1.xlsx", sheet = "VKT forecasts (2)",
                 cols = 3, rows = 6:62, colNames = FALSE) / 4

vkt_ts <- ts(vkt$X1, start = c(2001, 4), frequency = 4)
vkt_mod <- hybridModel(vkt_ts, models = c("ae"))
vkt_fc <- forecast(vkt_mod, h = 8)

autoplot(vkt_fc) +
  ggtitle("Vehicle kilometres travelled per quarter in New Zealand, 2001 to 2017",
          "Forecasts from 2016 second quarter are an average of auto.arima and ets") +
  labs(x = "", y = "Billions of vehicle kilometres travelled",
       fill = "Predictionnintervals",
       caption = "Source: NZTA data compiled by Sam Warburton;nForecasts by Peter's Stats Stuff")

Then there’s a bit of mucking about to get the data into a six-monthly state ready for modelling, but it wasn’t too fiddly. I adapted Professor Lumley’s model to Stan (which I’m more familiar with than the JAGS he used) for six-monthly observations. Here’s the end result:

Consistent with Professor Lumley, I find that 88% of simulations show the first half of 2017 (where my data stops) having a higher latent crash rate per km travelled than the second half of 2013.

Fitting a model like this takes three steps:

  • data management in R
  • model fitting in Stan
  • presentation of results in R

First, here’s the code for the two R stages. The trick here is that I make one column with year (e.g. 2005) and one called six_month_period, which is “a” for the first six months of the year and “b” for the rest. I do this separately for the vehicle kilometres travelled (which need to be aggregated up from quarterly) and the deaths (which need to be aggregated up from individual records, taking advantage of the fact that we know both the calendar and financial years in which each crash occurred).

#==================combining vkt and crashes============
vkt_df <- data_frame(vkt = c(vkt[ ,1, drop = TRUE], vkt_fc$mean),
                     t = c(time(vkt_ts), time(vkt_fc$mean))) %>%
  mutate(six_month_period = as.character(ifelse(t %% 1 < 0.4, "a", "b")),
         year = as.integer(t %/% 1)) %>%
  group_by(year, six_month_period) %>%
  summarise(vkt = sum(vkt))

crash_six_month <- crash %>%
  as_tibble() %>%
  filter(fatal_count > 0) %>%
  mutate(fin_year_end = as.numeric(str_sub(crash_fin_year, start = -4)),
         six_month_period = ifelse(fin_year_end == crash_year, "a", "b")) %>%
  rename(year = crash_year) %>%
  group_by(year, six_month_period) %>%
  summarise(fatal_count = sum(fatal_count)) %>%
  left_join(vkt_df, by = c("year", "six_month_period")) %>%
  mutate(deaths_per_billion_vkt = fatal_count / vkt) %>%
  filter(year > 2001) %>%
  arrange(year, six_month_period)

dpbv_ts <- ts(crash_six_month$deaths_per_billion_vkt, frequency = 2, start = c(2002, 1))

# high autocorrelation, as we'd expect:
acf(dpbv_ts,
    main = "Crash deaths per billion vehicle kilometres travelled, six monthly aggregation") 

# fit an ARIMA model just to see what we get:
auto.arima(dpbv_ts) # ARIMA(1,1,0)(0,0,1)[2] with drift

#===============bayesian model=====================
d <- list(
  n      = nrow(crash_six_month),
  deaths = crash_six_month$fatal_count,
  vkt    = crash_six_month$vkt
)
# fit model
stan_mod <- stan(file = "0113-crashes.stan", data = d, 
                 control = list(adapt_delta = 0.99, max_treedepth = 15))

# extract and reshape results:
mu <- t(as.data.frame(extract(stan_mod, "mu")))

mu_gathered <- mu %>%
  as_tibble() %>%
  mutate(period = as.numeric(time(dpbv_ts))) %>%
  gather(run, mu, -period)

mu_summarised <- mu_gathered %>%
  group_by(period) %>%
  summarise(low80 = quantile(mu, 0.1),
            high80 = quantile(mu, 0.9),
            low95 = quantile(mu, 0.025),
            high95 = quantile(mu, 0.975),
            middle = mean(mu))

# chart results:
mu_gathered %>%
  ggplot(aes(x = period)) +
  geom_line(alpha = 0.02, aes(group = run, y = mu)) +
  geom_ribbon(data = mu_summarised, aes(ymin = low80, ymax = high80), 
              fill = "steelblue", alpha = 0.5) +
  geom_line(data = mu_summarised, aes(y = middle), colour = "white", size = 3) +
  geom_point(data = crash_six_month, 
             aes(x = year + (six_month_period == "b") * 0.5, 
                 y = deaths_per_billion_vkt), 
             fill = "white", colour = "black", shape = 21, size = 4) +
  labs(x = "", y = "Latent rate of deaths per billion km",
       caption = "Source: data from NZTA and Sam Warburton;
Analysis by Peter's Stats Stuff") +
  ggtitle("State space model of crash deaths in New Zealand",
          "Six monthly observations")

mu_gathered %>%
  filter(period %in% c(2017.0, 2013.5)) %>%
  summarise(mean(mu[period == 2017.0] > mu[period == 2013.5])) # 88%

…and here’s the model specification in Stan:

// save as 0113-crashes.stan

// Simple Bayesian model for traffic deaths in New Zealand at six monthly intervals
// Peter Ellis, 15 October 2017

// State space model with random walk in the latent variable and actual measurements
// are log normal - Poisson combination.
// Adapted from Thomas Lumley's model for the same data (but annual) at 

data {
  int n;          // number of six-monthly observations
  int deaths[n];  // deaths (actual count), six month period
  real vkt[n];    // vehicle kilometres travelled in billions
}

parameters {
  real mu1;                // value of mu in the first period
  real delta[n - 1];       // innovation in mu from period to period
  vector[n] epsilon;       // the 'bonus' deviation in each period
  real<lower=0> tau;       // standard deviation of the innovations delta
  real<lower=0> sigma;     // standard deviation of epsilon
}

transformed parameters {
  real mu[n];        // the amount to multiply km travelled by to get lambda, i.e. deaths per billion vkt
  real lambda[n];    // lambda is the expected mean death count in each six month period
  mu[1] = mu1;
  for(i in 2:n) mu[i] = mu[i - 1] * exp(delta[i - 1] * tau);
  for(i in 1:n) lambda[i] = mu[i] * vkt[i] * exp(epsilon[i] * sigma);
}

model {
  // priors
  mu1 ~ gamma(2, 1.0 / 10);
  tau ~ gamma(1, 1);
  sigma ~ gamma(1, 1);
  // innovation each period:
  delta ~ normal(0, 1);
  // bonus deviation each period:
  epsilon ~ normal(0, 1);
  // observations:
  deaths ~ poisson(lambda);
}
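Written out as equations rather than code, the Stan program above amounts to the following (a LaTeX transcription, with t indexing six month periods; Gamma distributions are parameterised by shape and rate):

```latex
\begin{aligned}
\mu_1 &\sim \mathrm{Gamma}(2,\ 0.1), & \tau, \sigma &\sim \mathrm{Gamma}(1, 1)\\
\mu_t &= \mu_{t-1}\, e^{\tau \delta_t}, & \delta_t &\sim \mathrm{N}(0, 1)\\
\lambda_t &= \mu_t \cdot \mathrm{vkt}_t \cdot e^{\sigma \varepsilon_t}, & \varepsilon_t &\sim \mathrm{N}(0, 1)\\
\mathrm{deaths}_t &\sim \mathrm{Poisson}(\lambda_t)
\end{aligned}
```

So tau controls how quickly the latent death rate mu can drift from period to period, while sigma allows extra lognormal noise around the Poisson mean — the "lognormal – Poisson combination" referred to in the comments.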

Objects hit during crashes

Finally, there were a few things I spotted along the way I wanted to explore.

Here is a graphic of the objects (excluding other vehicles and pedestrians) hit during crashes.

crash %>%
  as_tibble() %>%
  filter(fatal_count > 0) %>%
  mutate(fin_year_end = as.numeric(str_sub(crash_fin_year, start = -4))) %>%
  select(fin_year_end, post_or_pole:other) %>%
  mutate(id = 1:n()) %>%
  gather(cause, value, -fin_year_end, -id) %>%
  group_by(fin_year_end, cause) %>%
  summarise(prop_accidents = sum(value > 0) / length(unique(id))) %>%
  ungroup() %>%
  mutate(cause = gsub("_", " ", cause),
         cause = fct_reorder(cause, -prop_accidents)) %>%
  ggplot(aes(x = fin_year_end, y = prop_accidents)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~cause) +
  scale_y_continuous("Proportion of accidents with this as a factor\n", label = percent) +
  ggtitle("Factors relating to fatal crashes 2000 - 2017",
          "Ordered from most commonly recorded to least") +
  labs(caption = "Source: NZTA disaggregated crash data",
       x = "Year ending June")

Regional relationship with holidays

I expected there to be a relationship between region and holidays – some regions would see a higher proportion of their accidents in particular periods than others. As with everything else in this post, of course, I was only able to scratch the surface.

I started by creating a three-way cross tab of year, region and holiday (four holidays plus “None”). I used the frequency count in each cell of this cross tab as the response in a generalized linear model and tested for an interaction effect between region and holiday. It’s there but it’s marginal: a Chi square test shows the interaction as definitely significant, while AIC favours the simpler model without it.

I extracted the interaction effects and presented them in this chart:

So we can see, for example, that the West Coast (on the South Island) has a particularly strong Christmas time effect (expected); Gisborne gets a disproportionate share of its fatal accidents on Labour Day Weekend (not expected, by me anyway); and so on. The horizontal lines indicate 95% confidence intervals.

To check this wasn’t out to lunch, I also did the simpler calculation of just looking at what percentage of each region’s fatal accidents were on each of the four holidays:

This is much easier to interpret and explain!

Here’s the code for that region-holiday analysis:

#=========================changing relationship between holidays and numbers==============

crash_counts <- crash %>%
  group_by(crash_fin_year, lg_region_desc, holiday) %>%
  summarise(fatal_count = sum(fatal_count)) %>%
  filter(lg_region_desc != "") %>%
  mutate(holiday = relevel(holiday, "None"))

mod1 <- glm(fatal_count ~ crash_fin_year + lg_region_desc + holiday, 
            family = "poisson", 
            data = crash_counts)

mod2 <- glm(fatal_count ~ crash_fin_year + lg_region_desc * holiday, 
            family = "poisson", data = crash_counts)

# marginal choice between these two models - there's a lot of extra degrees of freedom
# used up in mod2 for marginal improvement in deviance.  AIC suggests use the simpler 
# model; Chi square test says the interaction effect is "significant" and use the complex one.
anova(mod1, mod2, test = "Chi")
AIC(mod1, mod2)

# tidy up coefficients:
cfs <- tidy(mod2)

cfs %>%
  filter(grepl(":", term)) %>%
  extract(term, into = c("region", "holiday"), regex = "(.+):(.+)") %>%
  mutate(region = gsub("lg_region_desc", "", region),
         holiday = gsub("holiday", "", holiday)) %>%
  ggplot(aes(x = estimate, y = region, 
             label = ifelse(p.value < 0.05, as.character(region), ""))) +
  geom_vline(xintercept = 0, colour = "grey50") +
  geom_segment(aes(yend = region, 
                   x = estimate - 2 * std.error, 
                   xend = estimate + 2 * std.error), colour = "grey20") +
  geom_text(size = 3.5, nudge_y = 0.34, colour = "steelblue") +
  ggtitle("Distinctive region - holiday interactions from a statistical model",
          "Higher values indicate the holiday has a higher relative accident rate in that particular region than in Auckland") +
  labs(y = "", caption = "Source: NZTA Disaggregated Crash Data",
       x = "Estimated interaction effect in a generalized linear model with Poisson response:
frequency ~ year + holiday * region")

crash %>%
  filter(lg_region_desc != "") %>%
  group_by(lg_region_desc, holiday) %>%
  summarise(fatal_count = sum(fatal_count)) %>%
  group_by(lg_region_desc) %>%
  mutate(share = fatal_count / sum(fatal_count)) %>%
  filter(holiday != "None") %>%
  ggplot(aes(x = share, y = lg_region_desc)) +
  facet_wrap(~holiday) +
  geom_point() +
  scale_x_continuous("\nProportion of all fatalities in region that occur in this holiday",
                     label = percent) +
  labs(y = "", caption = "Source: NZTA Disaggregated Crash Data") +
  ggtitle("Proportion of each region's fatalities on various holidays",
          "This simpler calculation is designed as a check on the modelled results in the previous chart")

#==========clean up===========

That will do for now. Interesting data, very unsatisfactory state of the world.

To leave a comment for the author, please follow the link and comment on their blog: Peter’s stats stuff – R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more…

Source:: R News

An AI pitches startup ideas

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Take a look at this list of 13 hot startups, from a list compiled by Alex Bresler. Perhaps one of them is the next Juicero?

[The 13 generated startup names and descriptions appeared here as images.]

As you’ve probably guessed by now, these are not real startups. (Though given Juicero, you’d be forgiven if you didn’t.) Alex Bresler created a generative neural network (GNN) from YCombinator’s archive of startup names and descriptions, and the GNN created almost 5000 new ones. (You can download an Excel file with all of them here, to save load on Alex’s server.) I picked a few by hand to create the list above, but there was no shortage of funny ones (and plenty of all-too-real ones as well).

Alex used his fundmanageR package (available on Github) to download the YCombinator database into R. The generative neural network was trained using the keras package for R on these data, using Tensorflow as the back end. You can find all the details on how the list was generated, including the R code, at Alex’s blog linked below.
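For readers curious about the general shape of this kind of model, here is a minimal sketch of a character-level generator using the keras R package. This is not Alex’s actual code: maxlen, n_chars, X and y are all illustrative, with X and y assumed to be one-hot arrays of name fragments and next-character targets built from the YCombinator names.

```r
# Minimal sketch (not Alex's code) of a character-level generative model
# with the keras R package, Tensorflow backend.
library(keras)

maxlen  <- 20   # characters of context used to predict the next character
n_chars <- 40   # size of the character alphabet (illustrative)

# An LSTM reads a window of one-hot encoded characters and a softmax
# layer gives a probability for each possible next character:
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, n_chars)) %>%
  layer_dense(units = n_chars, activation = "softmax")

model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = "adam"
)

# model %>% fit(X, y, batch_size = 128, epochs = 30)
# New names are then generated one character at a time, by sampling from
# the predicted distribution and sliding the context window forward.
```

The fit and generation steps are commented out since they depend on the (assumed) one-hot training arrays X and y.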

ASBC LLC: Juice, Err, Ohh…

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
