While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additonal tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

… and this chart on gender pay gaps at large British banks:

Meanwhile, the chart below was made for the Financial Times using just R and the ggplot2 package, “down to the custom FT font and the white bar in the top left”, according to data journalist John Burn-Murdoch.

There are also entire collections devoted to recreating Tufte’s own visualizations in R, presumably meeting his typography standards. Tufte later clarified saying “Problem is not code, problem is published practice”, which is true of any programming environment, which is why it was strange that he’d call out R in particular.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Your favourite package for getting model outputs directly into publication ready tables is now available on CRAN. They make you work for it! Thank you to all that helped. The development version will continue to be available from github.

To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg.

This is your chance to dive into the worlds of marketing and business analytics using R. Day by day, there are a multitude of decisions that companies have to face. With the help of statistical models, you’re going to be able to support the business decision-making process based on data, not your gut feeling. Let us show you what a great impact statistical modeling can have on the performance of businesses. You’re going to learn about and apply strategies to communicate your results and help them make a difference.

Chapter 1: Modeling Customer Lifetime Value with Linear Regression

How can you decide which customers are most valuable for your business? Learn how to model the customer lifetime value using linear regression.

Chapter 2: Logistic Regression for Churn Prevention

Predicting if a customer will leave your business, or churn, is important for targeting valuable customers and retaining those who are at risk. Learn how to model customer churn using logistic regression.

Chapter 3: Modeling Time to Reorder with Survival Analysis

Learn how to model the time to an event using survival analysis. This could be the time until next order or until a person churns.

Chapter 4: Reducing Dimensionality with Principal Component Analysis

Learn how to reduce the number of variables in your data using principal component analysis. Not only does this help to get a better understanding of your data. PCA also enables you to condense information to single indices and to solve multicollinearity problems in a regression analysis with many intercorrelated variables.

As an illustration of the lack of convergence of the Gibbs sampler applied to the two “conditionals” defined in the Bayesian GANs paperdiscussed yesterday, I took the simplest possible example of a Normal mean generative model (one parameter) with a logistic discriminator (one parameter) and implemented the scheme (during an ISBA 2018 session). With flat priors on both parameters. And a Normal random walk as Metropolis-Hastings proposal. As expected, since there is no stationary distribution associated with the Markov chain, simulated chains do not exhibit a stationary pattern,

And they eventually reach an overflow error or a trapping state as the log-likelihood gets approximately to zero (red curve).

Too bad I missed the talk by Shakir Mohammed yesterday, being stuck on the Edinburgh by-pass at rush hour!, as I would have loved to hear his views about this rather essential issue…

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.

The world is flat The new R package flatxml provides functions to easily deal with XML files. When parsing an XML document fxml_importXMLFlat produces a special dataframe that is ‘flat’ by its very nature but contains all necessary information about the hierarchical structure of the underlying XML document (for details on the dataframe see the reference for the fxml_importXMLFlat function). flatxml offers a set of functions to work with this dataframe. Apart from representing the XML document in a dataframe structure, there is yet another way in which flatxml relates to dataframes: the fxml_toDataFrame function can be used to extract data from an XML document into a dataframe, e.g. to work on the data with statistical functions. Because in this case there is no need to represent the XML document structure as such (it’s all about the data contained in the document), there is no representation of the hierarchical structure of the document any more, it’s just a normal dataframe. Each XML element, for example Here is some text has certain characteristics that can be accessed via the flatxml interface functions, after an XML document has been imported with fxml_importXMLFlat. These characteristics are:

value: The (text) value of the element, “Here is some text” in the example above

attributes: The XML attributes of the element, attribute with its value “some value” in the example above

children: The elements on the next lower hierarchical level

parent: The element of the next higher hierarchical level, i.e. the element to which the current element is a child

siblings: The elements on the same hierarchical level as the current element

Structure of the flatxml interface The flatxml interface to access these characteristics follows a simple logic: For each of the characteristics there are typically three functions available:

fxml_has…(): Determines if the current XML element has (at least one instance of) the characteristic

fxml_num…(): Returns the number of the characteristics of the current XML (e.g. the number of children elements

fxml_get…(): Returns (the IDs of) the respective characteristics of the current XML element (e.g. the children of the current element)

This is a significant release for Shiny, with a major new feature that was nearly a year in the making: support for asynchronous operations!

Without this capability, when Shiny performs long-running calculations or tasks on behalf of one user, it stalls progress for all other Shiny users that are connected to the same process. Therefore, Shiny apps that feature long-running calculations or tasks have generally been deployed using many R processes, each serving a small number of users; this works, but is not the most efficient approach. Such applications now have an important new tool in the toolbox to improve performance under load.

Shiny async is implemented via integration with the future and promises packages. These two packages are used together:

Use future to perform long-running operations in a worker process that runs in the background, leaving Shiny processes free to serve other users in the meantime. This yields much better responsiveness under load, and much more predictable latency.

Use promises to handle the result of each long-running background operation back in the Shiny process, where additional processing can occur, such as further data manipulation, or displaying to the user via a reactive output.

If your app has a small number of severe performance bottlenecks, you can use this technique to get massively better responsiveness under load. For example, if the httr::GET call in this server function takes 30 seconds to complete:

Even if the httr::GET(url) takes 30 seconds, the r reactive executes almost instantly, and returns control to the caller. The code inside future(...) is executed in a different R process that runs in the background, and whenever its result becomes available (i.e. in 30 seconds), the right-hand side of %...>% will be executed with that result. (%...>% is called a “promise pipe”; it works similarly to a magrittr pipe that knows how to wait for and “unwrap” promises.)

If the original (synchronous) code appeared in a Shiny app, then during that 30 seconds, the R process is stuck dealing with the download and can’t respond to any requests being made by other users. But with the async version, the R process only needs to kick off the operation, and then is free to service other requests. This means other users will only have to wait milliseconds, not minutes, for the app to respond.

Case study

We’ve created a detailed case study that walks through the async conversion of a realistic example app. This app processes low-level logging data from RStudio’s CRAN mirrors, to let us explore the heaviest downloaders for each day.

To load test this example app, we launched 50 sessions of simulated load, with a 5 second delay between each launch, and directed this traffic to a single R process. We then rewrote the app to use futures and promises, and reran the load test with this async version. (The tools we used to perform the load testing are not yet publicly available, but you can refer to Sean Lopp’s talk at rstudio::conf 2018 for a preview.)

Under these conditions, the finished async version displays significantly lower (mean) response times than the original. In the table below, “HTTP traffic” refers to requests that are made during page load time, and “reactive processing” refers to the time between the browser sending a reactive input value and the server returning updated reactive outputs.

Over the last year, we created or enhanced several other packages to support async Shiny:

The promises package (released 2018-04-13) mentioned above provides the actual API you’ll use to do async programming in R. We implemented this as a separate package so that other parts of the R community, not just Shiny users, can take advantage of these techniques. The promises package was inspired by the basic ideas of JavaScript promises, but also have significantly improved syntax and extensibility to make them work well with R and Shiny. Currently, promises is most useful when used with the future package by Henrik Bengtsson.

later (released 2017-06-25) adds a low-level feature to R that is critical to async programming: the ability to schedule R code to be executed in the future, within the same R process. You can do all sorts of cool stuff on top of this, as some people are discovering.

httpuv (1.4.0 released 2018-04-19) has long been the HTTP web server that Shiny, and most other web frameworks for R, sit on top of. Version 1.4.0 adds support for asynchronous handling of HTTP requests, and also adds a dedicated I/O-handling thread for greatly improved performance under load.

In the coming weeks, you can also expect updates for async compatibility to htmlwidgets, plotly, and DT. Most other HTML widgets will automatically become async compatible once htmlwidgets is updated.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Data == knowledge! Much of the data we use, whether it be from
government repositories, social media, GitHub, or e-commerce sites comes
from public-facing APIs. The quantity of data available is truly
staggering, but munging JSON output into a format that is easily
analyzable in R is an equally staggering undertaking. When JSON is
turned into an R object, it usually becomes a deeply nested list riddled
with missing values that is difficult to untangle into a tidy format.
Moreover, every API presents its own challenges; code you’ve written to
clean up data from GitHub isn’t necessarily going to work on Twitter
data, as each API spews data out in its own unique, headache-inducing
nested list structure. To ease and generalize this process, Amanda
Dobbyn proposed an unconf18 project for a general API response tidier! Welcome roomba,
our first stab at easing the process of tidying nested lists!

roomba will eventually be able to walk nested lists in a variety of
different structures from JSON output, replace NULL or .empty values
with NAs or a user-specified value, and return a tibble with names
matching a user-specified list. Of course, in two days we haven’t fully achieved this vision, but we’re off to a promising start.

The birth of roomba

It was clear Amanda was on to something good by the lively discussion in
the #runconf18 issues
repository leading up to the unconf. Thanks to input from Jenny Bryan,
Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam
Ross, we had a lot of ideas to work with when the unconf began.
Fortunately, Jim already had a function called dfs_idx()
(here)
written to perform depth-first searches of nested lists from the GitNub
GraphQL API. With the core
list-traversal code out of the way, we split our efforts between
developing a usable interface, stockpiling .JSON files to test on, and
developing a Shiny app.

What’s working

We’ve got the basic structure of roomba sorted out, and you should install it from GitHub to try out! Here are a few of the examples we’ve
put together.

library(roomba)
#load twitter data example
data(twitter_data)
#roomba-fy!
roomba(twitter_data, c("created_at", "name"))
## # A tibble: 24 x 2
## name created_at
##
## 1 Code for America Mon Aug 10 18:59:29 +0000 2009
## 2 Ben Lorica Mon Dec 22 22:06:18 +0000 2008
## 3 Dan Sholler Thu Apr 03 20:09:24 +0000 2014
## 4 Code for America Mon Aug 10 18:59:29 +0000 2009
## 5 FiveThirtyEight Tue Jan 21 21:39:32 +0000 2014
## 6 Digital Impact Wed Oct 07 21:10:53 +0000 2009
## 7 Drew Williams Thu Aug 07 18:41:29 +0000 2014
## 8 joe Fri May 29 13:25:25 +0000 2009
## 9 Data Analysts 4 Good Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows

And just the first element of the twitter_data list will show you
that roomba has simplified this process quite a bit.

We created a Shiny app too, which in its current state allows you to
select a .Rda or .JSON file, pick two variables, and create a
scatterplot of them.

Run the app like this:

shiny_roomba()

What’s not

Of course, in two days we weren’t able to build a magical
one-size-fits-all solution to every API response data headache. Right
now, the main barrier to usability is that both the roomba() function
and shiny_roomba() app only work on sub-list items of the same length
and same data type stored at the same depth. To illustrate on the twitter_data:

#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))
## # A tibble: 1,007 x 1
## user
##
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## # ... with 997 more rows
#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))
## # A tibble: 0 x 0

In addition, we’ve got some features we want to add, such as handling a
larger variety of column names (i.e. passing a string for a single
column name, keeping all values even if they are all NULL). We would
love your feedback on other things we can add (open an issue in our Git repository)!

The team

Amanda Dobbyn Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for dfs_idx()

Jim Hester Job: Software Engineer at RStudio Project
contributions: dfs_idx() and remove_nulls() functions, package
building, README, and debugging

Christine Stawitz Job: Postdoctoral researcher at University of Washington’s School
of Aquatic and Fishery Sciences Project contributions: Shiny app,
README and blog post writing

Laura DeCicco Job: Data Scientist at U.S. Geological Survey

Project contributions: Fixing merge conflicts

Isabella Velasquez Job: Data Analyst at the Bill & Melinda
Gates Foundation Project contributions: hex sticker!

Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn. My collaborator suspected that the relationship between EK and academics would be different for immigrant and non-immigrant children, so we agreed that this would be a key focus of the analysis.

In model terms, we would describe the relationship for each student (i) as;

[ T_i = beta_0 + beta_1 I_i + beta_2 EK_i + beta_3 I_i times EK_i + epsilon_i,]
where (T) is the academic outcome, (I) is an indicator for immigrant status (either 0 or 1), and (EK) is a continuous measure of emotion knowledge. By including the (I times EK) interaction term, we allow for the possibility that the effect of emotion knowledge will be different for immigrants. In particular, if we code (I=0) for non-immigrant kids and (I=1) for immigrant kids, (beta_2) represents the relationship of EK and academic performance for non-immigrant kids, and (beta_2 + beta_3) is the relationship for immigrant kids. In this case, non-immigrant kids are the reference category.

## id I EK T fI
## 1: 1 1 -1.9655562 5.481254 Imm
## 2: 2 1 0.9230118 16.140710 Imm
## 3: 3 0 -2.5315312 9.443148 not Imm
## 4: 4 1 0.9103722 15.691873 Imm
## 5: 5 0 -0.2126550 9.524948 not Imm
## ---
## 246: 246 0 -1.2727195 7.546245 not Imm
## 247: 247 0 -1.2025184 6.658869 not Imm
## 248: 248 0 -1.7555451 11.027569 not Imm
## 249: 249 0 2.2967681 10.439577 not Imm
## 250: 250 1 -0.3056299 11.673933 Imm

Let’s say our primary interest in this exploration is point estimates of (beta_2) and (beta_2 + beta_3), along with 95% confidence intervals of the estimates. (We could have just as easily reported (beta_3), but we decided the point estimates would be more intuitive to understand.) The point estimates are quite straightforward: we can estimate them directly from the estimates of (beta_2) and (beta_3). And the standard error (and confidence interval) for (beta_2) can be read directly off of the model output table. But what about the standard error for the relationship of EK and academic performance for the immigrant kids? How do we handle that?

I’ve always done this the cumbersome way, using this definition:

In R, this is relatively easy (though maybe not super convenient) to do manually, by extracting the information from the estimated parameter variance-covariance matrix.

First, fit a linear model with an interaction term:

The estimate for the relationship of EK and academic performance for non-immigrant kids is 0.46 (se = 0.093). And the point estimate for the relationship for immigrant kids is (2.06=0.46 + 1.60)

The standard error can be calculated from the variance-covariance matrix that is derived from the linear model:

The standard error of the estimate is (sqrt{0.0109} = 0.105).

So?

OK – so maybe that isn’t really all that interesting. Why am I even talking about this? Well, in the actual study, we have a fair amount of missing data. In some cases we don’t have an EK measure, and in others we don’t have an outcome measure. And since the missingness is on the order of 15% to 20%, we decided to use multiple imputation. We used the mice package in R to impute the data, and we pooled the model estimates from the completed data sets to get our final estimates. mice is a fantastic package, but one thing that is does not supply is some sort of pooled variance-covariance matrix. Looking for a relatively quick solution, I decided to use bootstrap methods to estimate the confidence intervals.

(“Relatively quick” is itself a relative term, since bootstrapping and imputing together is not exactly a quick process – maybe something to work on. I was also not fitting standard linear models but mixed effect models. Needless to say, it took a bit of computing time to get my estimates.)

Seeking credit (and maybe some sympathy) for all of my hard work, I mentioned this laborious process to my collaborator. She told me that you can easily estimate the group specific effects merely by changing the reference group and refitting the model. I could see right away that the point estimates would be fine, but surely the standard errors would not be estimated correctly? Of course, a few simulations ensued.

First, I just changed the reference group so that (beta_2) would be measuring the relationship of EK and academic performance for immigrant kids, and (beta_2 + beta_3) would represent the relationship for the non-immigrant kids. Here are the levels before the change:

head(dT$fI)

## [1] Imm Imm not Imm Imm not Imm not Imm
## Levels: not Imm Imm

And after:

dT$fI

## [1] Imm Imm not Imm Imm not Imm not Imm
## Levels: Imm not Imm

The estimate for this new (beta_2) is 2.07 (se=0.105), pretty much aligned with our estimate that required a little more work. While this is not a proof by any means, I did do variations on this simulation (adding other covariates, changing the strength of association, changing sample size, changing variation, etc.) and both approaches seem to be equivalent. I even created 10000 samples to see if the coverage rates of the 95% confidence intervals were correct. They were. My collaborator was right. And I felt a little embarrassed, because it seems like something I should have known.

But …

Would this still work with missing data? Surely, things would go awry in the pooling process. So, I did one last simulation, generating the same data, but then added missingness. I imputed the missing data, fit the models, and pooled the results (including pooled 95% confidence intervals). And then I looked at the coverage rates.

First I added some missingness into the data

defM

## varname formula logit.link baseline monotonic
## 1: EK 0.05 + 0.10*I FALSE FALSE FALSE
## 2: T 0.05 + 0.05*I FALSE FALSE FALSE

And then I generated 500 data sets, imputed the data, and fit the models. Each iteration, I stored the final model results for both models (in one where the reference is non-immigrant and the the other where the reference group is immigrant).

library(mice)
nonRes

The proportion of confidence intervals that contain the true values is pretty close to 95% for both estimates:

mean(nonRes[term == "EK", conf.low 0.5])

## [1] 0.958

mean(immRes[term == "EK", conf.low 2.0])

## [1] 0.948

And the estimates of the mean and standard deviations are also pretty good:

Because I like to include at least one visual in a post, here is a plot of the 95% confidence intervals, with the CIs not covering the true values colored blue:

The re-reference approach seems to work quite well (in the context of this simulation, at least). My guess is the hours of bootstrapping may have been unnecessary, though I haven’t fully tested all of this out in the context of clustered data. My guess is it will turn out OK in that case as well.

Appendix: ggplot code

nonEK 0.5))]
immEK 2))]
EK

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

A great many R users will have to run some python code from time to time, and this short video from our Head of Data Engineering, Mark Sellors outlines one approach you can take that doesn’t mean leaving the RStudio IDE.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

Data science tools are getting better and better, which is improving the predictive performance of machine learning models in business. With new, high-performance tools like, H2O for automated machine learning and Keras for deep learning, the performance of models are increasing tremendously. There’s one catch: Complex models are unexplainable… that is until LIME came along!LIME, which stands for Local Interpretable Model-agnostic Explanations, has opened the doors to black-box (complex, high-performance, but unexplainable) models in business applications! Explanations are MORE CRITICAL to the business than PERFORMANCE. Think about it. What good is a high performance model that predicts employee attrition if we can’t tell what features are causing people to quit? We need explanations to improve business decision making. Not just performance.

Explanations are MORE CRITICAL to the business than PERFORMANCE. Think about it. What good is a high performance model that predicts employee attrition if we can’t tell what features are causing people to quit? We need explanations to improve business decision making. Not just performance.

In this Machine Learning Tutorial, Brad Boehmke, Director of Data Science at 84.51°, shows us how to use LIME for machine learning interpretability on a Human Resources Employee Turnover Problem, specifically showing the value of developing interpretablity visualizations. He shows us options for Global Importance and compares it to LIME for Local Importance. We use machine learning R packages h2o, caret, and ranger in the tutorial, showcasing how to use lime for local explanations. Let’s get started!

LIME: A Secret Weapon For ROI-Driven Data Science

Introduction by Matt Dancho, Founder of Business Science

Business success is dependent on the ability for managers, process stakeholders, and key decision makers to make the right decisions often using data to understand what’s going on. This is where machine learning can help. Machine learning can analyze vast amounts of data, creating highly predictive models that tell managers key information such as how likely someone is likely to leave an organization. However, machine learning alone is not enough. Business leaders require explanations so they can determine adjustments that will improve results. These explanations require a different tool: LIME. Let’s find out why LIME is truly a secret weapon for ROI-driven data science!

In the HR Employee Attrition example discussed in this article, the machine learning model predicts the probability of someone leaving the company. This probability is then converted to a prediction of either leave or stay through a process called Binary Classification. However, this doesn’t solve the main objective, which is to make better decisions. It only tells us if someone is a high flight risk (i.e. has high attrition probability).

Employee Attrition: Machine Learning Predicts Which Employees Are Likely To Leave

How do we change decision making and therefore improve? It comes down to levers and probability. Machine learning tells us which employees are highest risk and therefore high probability. We can hone in on these individuals, but we need a different tool to understand why an individual is leaving. This is where LIME comes into play. LIME uncovers the levers or features we can control to make business improvements.

LIME: Uncovers Levers or Features We Can Control

In our HR Employee Attrition Example, LIME detects “Over Time” (lever) as a key feature that supports employee turnover. We can control the “Over Time” feature by implementing a “limited-overtime” or “no-overtime” policy.

Analyzing A Policy Change: Targeting Employees With Over Time

Toggling the “OverTime” feature to “No” enables calculating an expected value or benefit of reducing overtime by implementing a new OT policy. For the individual employee, a expected savings results. When applied to the entire organization, this process of adjusting levers can result in impactful policy changes that save the organization millions per year and generate ROI.

Adjusting The Over Time Results In Expected Savings

Interested in Learning LIME While Solving A Real-World Churn Problem?

If you want to solve this real-world employee churn problem developing models with H2O Automated Machine Learning, using LIME For Black-Box ML Model Explanation, and analyzing the impact of a policy change through optimization and sensitivity analysis, get started today with Data Science For Business (DS4B 201 / HR 201). You’ll learn ROI-Driven Data Science, implementing the tools (H2O + LIME) and our data science framework (BSPF) under my guidance (Matt Dancho, Instructor and Founder of Business Science) in our new, self-paced course part of the Business Science University virtual data science workshop.

Learning Trajectory

Now that we have a flavor for what LIME does, let’s get on with learning how to use it! In this machine learning tutorial, you will learn:

In fact, one of the coolest things you’ll learn is how to create a visualization that compares multiple H2O modeling algorithms that examine employee attrition. This is akin to getting different perspectives for how each of the models view the problem:

Random Forest (RF)

Generalized Linear Regression (GLM)

Gradient Boosted Machine (GBM).

Comparing LIME results of different H2O modeling algorithms

About The Author

This MACHINE LEARNING TUTORIAL comes from Brad Boehmke, Director of Data Science at 84.51°, where he and his team develops algorithmic processes, solutions, and tools that enable 84.51° and its analysts to efficiently extract insights from data and provide solution alternatives to decision-makers. Brad is not only a talented data scientist, he’s an adjunct professor at the University of Cincinnati, Wake Forest, and Air Force Institute of Technology. Most importantly, he’s an active contributor to the Data Science Community and he enjoys giving back via advanced machine learning education available at the UC Business Analytics R Programming Guide!

Machine Learning Tutorial: Visualizing Machine Learning Models with LIME

By Brad Boehmke, Director of Data Science at 84.51°

Machine learning (ML) models are often considered “black boxes” due to their complex inner-workings. More advanced ML models such as random forests, gradient boosting machines (GBM), artificial neural networks (ANN), among others are typically more accurate for predicting nonlinear, faint, or rare phenomena. Unfortunately, more accuracy often comes at the expense of interpretability, and interpretability is crucial for business adoption, model documentation, regulatory oversight, and human acceptance and trust. Luckily, several advancements have been made to aid in interpreting ML models.

Moreover, it’s often important to understand the ML model that you’ve trained on a global scale, and also to zoom into local regions of your data or your predictions and derive local explanations. Global interpretations help us understand the inputs and their entire modeled relationship with the prediction target, but global interpretations can be highly approximate in some cases. Local interpretations help us understand model predictions for a single row of data or a group of similar rows.

This post demonstrates how to use the lime package to perform local interpretations of ML models. This will not focus on the theoretical and mathematical underpinnings but, rather, on the practical application of using lime. ^{1}

Libraries

This tutorial leverages the following packages.

To demonstrate model visualization techniques we’ll use the employee attrition data that has been included in the rsample package. This demonstrates a binary classification problem (“Yes” vs. “No”) but the same process that you’ll observe can be used for a regression problem. Note: I force ordered factors to be unordered as h2o does not support ordered categorical variables.

For this exemplar I retain most of the observations in the training data sets and retain 5 observations in the local_obs set. These 5 observations are going to be treated as new observations that we wish to understand why the particular predicted response was made.

We will explore how to visualize a few of the more popular machine learning algorithms and packages in R. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces:

Random forest model using ranger via the caret package

Random forest model using h2o

Elastic net model using h2o

GBM model using h2o

Random forest model using ranger directly

Global Interpretation

The most common ways of obtaining global interpretation is through:

variable importance measures

partial dependence plots

Variable importance quantifies the global contribution of each input variable to the predictions of a machine learning model. Variable importance measures rarely give insight into the average direction that a variable affects a response function. They simply state the magnitude of a variable’s relationship with the response as compared to other variables used in the model. For example, the ranger random forest model identified monthly income, overtime, and age as the top 3 variables impacting the objective function.

After the most globally relevant variables have been identified, the next step is to attempt to understand how the response variable changes based on these variables. For this we can use partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. These techniques plot the change in the predicted value as specified feature(s) vary over their marginal distribution. Consequently, we can gain some local understanding how the reponse variable changes across the distribution of a particular variable but this still only provides a global understanding of this relationships across all observed data.

For example, if we plot the PDP of the monthly income variable we see that the probability of an employee attriting decreases, on average, as their monthly income approaches $5,000 and then remains relatively flat.

We can gain further insight by using centered ICE curves which can help draw out further details. For example, the following ICE curves show a similar trend line as the PDP above but by centering we identify the decrease as monthly income approaches $5,000 followed by an increase in probability of attriting once an employee’s monthly income approaches $20,000. Futhermore, we see some turbulence in the flatlined region between $5-$20K) which means there appears to be certain salary regions where the probability of attriting changes.

These visualizations help us to understand our model from a global perspective: identifying the variables with the largest overall impact and the typical influence of a feature on the response variable across all observations. However, what these do not help us understand is given a new observation, what were the most influential variables that determined the predicted outcome. Say we obtain information on an employee that makes about $10,000 per month and we need to assess their probabilty of leaving the firm. Although monthly income is the most important variable in our model, it may not be the most influential variable driving this employee to leave. To retain the employee, leadership needs to understand what variables are most influential for that specific employee. This is where lime can help.

Local Interpretation

Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. As the name implies, it is model agnostic so it can be applied to any supervised regression or classification model. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale and asserting that it is possible to fit a simple model around a single observation that will mimic how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.

The generalized algorithm LIME applies is:

Given an observation, permute it to create replicated feature data with slight value modifications.

Compute similarity distance measure between original observation and permuted observations.

Apply selected machine learning model to predict outcomes of permuted data.

Select m number of features to best describe predicted outcomes.

Fit a simple model to the permuted data, explaining the complex model outcome with m features from the permuted data weighted by its similarity to the original observation .

Use the resulting feature weights to explain local behavior.

Each of these steps will be discussed in further detail as we proceed.

lime::lime

The application of the LIME algorithm via the lime package is split into two operations: lime::lime() and lime::explain(). The lime::lime() function creates an “explainer” object, which is just a list that contains the machine learning model and the feature distributions for the training data. The feature distributions that it contains includes distribution statistics for each categorical variable level and each continuous variable split into n bins (default is 4 bins). These feature attributes will be used to permute data.

The following creates our lime::lime() object and I change the number to bin our continuous variables into to 5.

lime::explain

Once we created our lime objects, we can now perform the generalized LIME algorithm using the lime::explain() function. This function has several options, each providing flexibility in how we perform the generalized algorithm mentioned above.

x: Contains the one or more single observations you want to create local explanations for. In our case, this includes the 5 observations that I included in the local_obs data frame. Relates to algorithm step 1.

explainer: takes the explainer object created by lime::lime(), which will be used to create permuted data. Permutations are sampled from the variable distributions created by the lime::lime() explainer object. Relates to algorithm step 1.

n_permutations: The number of permutations to create for each observation in x (default is 5,000 for tabular data). Relates to algorithm step 1.

dist_fun: The distance function to use. The default is Gower’s distance but can also use euclidean, manhattan, or any other distance function allowed by ?dist(). To compute similarity distance of permuted observations, categorical features will be recoded based on whether or not they are equal to the actual observation. If continuous features are binned (the default) these features will be recoded based on whether they are in the same bin as the observation. Using the recoded data the distance to the original observation is then calculated based on a user-chosen distance measure. Relates to algorithm step 2.

kernel_width: To convert the distance measure to a similarity value, an exponential kernel of a user defined width (defaults to 0.75 times the square root of the number of features) is used. Smaller values restrict the size of the local region. Relates to algorithm step 2.

n_features: The number of features to best describe predicted outcomes. Relates to algorithm step 4.

feature_select: To select the best n features, lime can use forward selection, ridge regression, lasso, or a tree to select the features. In this example I apply a ridge regression model and select the m features with highest absolute weights. Relates to algorithm step 4.

For classification models we also have two additional features we care about and one of these two arguments must be given:

labels: Which label do we want to explain? In this example, I want to explain the probability of an observation to attrit (“Yes”).

n_labels: The number of labels to explain. With this data I could select n_labels = 2 to explain the probability of “Yes” and “No” responses.

The explain() function above first creates permutations, then calculates similarities, followed by selecting the m features. Lastly, explain() will then fit a model (algorithm steps 5 & 6). lime applies a ridge regression model with the weighted permuted observations as the simple model. [^{2}] If the model is a regressor, the simple model will predict the output of the complex model directly. If the complex model is a classifier, the simple model will predict the probability of the chosen class(es).

The explain() output is a data frame containing different information on the simple model predictions. Most importantly, for each observation in local_obs it contains the simple model fit (model_r2) and the weighted importance (feature_weight) for each important feature (feature_desc) that best describes the local relationship.

Visualizing results

However the simplest approach to interpret the results is to visualize them. There are several plotting functions provided by lime but for tabular data we are only concerned with two. The most important of which is plot_features(). This will create a visualization containing an individual plot for each observation (case 1, 2, …, n) in our local_obs data frame. Since we specified labels = "Yes" in the explain() function, it will provide the probability of each observation attriting. And since we specified n_features = 10 it will plot the 10 most influential variables that best explain the linear model in that observations local region and whether the variable is causes an increase in the probability (supports) or a decrease in the probability (contradicts). It also provides us with the model fit for each model (“Explanation Fit: XX”), which allows us to see how well that model explains the local region.

Consequently, we can infer that case 3 has the highest liklihood of attriting out of the 5 observations and the 3 variables that appear to be influencing this high probability include working overtime, being single, and working as a lab tech.

The other plot we can create is a heatmap showing how the different variables selected across all the observations influence each case. We use the plot_explanations() function. This plot becomes useful if you are trying to find common features that influence all observations or if you are performing this analysis across many observations which makes plot_features() difficult to discern.

Tuning LIME

As you saw in the above plot_features() plot, the output provides the model fit. In this case the best simple model fit for the given local regions was R^2 = 0.59 for case 3. Considering there are several knobs we can turn when performing the LIME algorithm, we can treat these as tuning parameters to try find the best fit model for the local region. This helps to maximize the amount of trust we can have in the local region explanation.

As an example, the following changes the distance function to use the manhattan distance algorithm, we increase the kernel width substantially to create a larger local region, and we change our feature selection approach to a LARS lasso model. The result is a fairly substantial increase in our explanation fits.

Supported vs Non-support models

Currently, lime supports supervised models produced in caret, mlr, xgboost, h2o, keras, and MASS::lda. Consequently, any supervised models created with these packages will function just fine with lime.

However, any models that do not have built in support will produce an error. For example, the model we created directly with ranger is not supported and produces an error.

We can work with this pretty easily by building two functions that make lime compatible with an unsupported package. First, we need to create a model_type() function that specifies what type of model this unsupported package is using. model_type() is a lime specific function, we just need to create a ranger specific method. We do this by taking the class name for our ranger object and creating the model_type.ranger method and simply return the type of model (“classification” for this example).

We then need to create a predict_model() method for ranger as well. The output for this function should be a data frame. For a regression problem it should produce a single column data frame with the predicted response and for a classification problem it should create a column containing the probabilities for each categorical class (binary “Yes” “No” in this example).

Now that we have those methods developed and in our global environment we can run our lime functions and produce our outputs.^{3}

Learning More

At Business Science, we’ve been using the lime package with clients to help explain our machine learning models – It’s been our secret weapon. Our primary use cases are with h2o and keras, both of which are supported in lime. In fact, we actually built the h2o integration to gain the beneifts of LIME with stacked ensembles, deep learning, and other black-box algorithms. We’ve used it with clients to help them detect which employees should be considered for executive promotion. We’ve even provided previous real-world business problem / machine learning tutorials:

In fact, those that want to learn lime while solving a real world data science problem can get started today with our new course: Data Science For Business (DS4B 201)

Additional Resources

LIME provides a great, model-agnostic approach to assessing local interpretation of predictions. To learn more I would start with the following resources:

Original paper: Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 1135-1144. DOI: https://doi.org/10.1145/2939672.2939778

Matt was recently on Episode 165 of the SuperDataScience Podcast. In his second appearance (also was on Episode 109 where he spoke about the transition to data science), he talks about giving back to the data science community if the form of education, open source software, and blogging!

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), Business Science University is for you! Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You’ll learn:

TO SOLVE A REAL WORLD CHURN PROBLEM: Employee Turnover!

Tying data science to financial improvement (ROI-Driven Data Science)

Special Autographed “Deep Learning With R” Giveaway!!!

One lucky student that enrolls in June will receive an autographed copy of Deep Learning With R, signed by JJ Allaire, Founder of Rstudio and DLwR co-author.

Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

DS4B 301 (Coming Soon): Building A Shiny Web Application

DS4B 302 (EST Q4): Data Communication With RMarkdown Reports and Presentations

DS4B 303 (EST Q4): Building An R Package For Your Organization, tidyattrition

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.

To this end, you are encouraged to read through the

If you’ve never applied a weighted ridge regression model you can see some details on its application in the glmnet vignette

If you are curious as to why simply creating the model_type.ranger and predict_model.ranger methods and hosting them in your global environment causes the lime functions to work then I suggest you read chapter 6 of Advanced R.

## Recent Comments