The Financial Times and BBC use R for publication graphics

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additional tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

… and this chart on gender pay gaps at large British banks:

Meanwhile, the chart below was made for the Financial Times using just R and the ggplot2 package, “down to the custom FT font and the white bar in the top left”, according to data journalist John Burn-Murdoch.

[Figure: FT World Cup chart]
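
Neither newsroom’s in-house code appears in the post, but custom typography of this kind comes down to ggplot2 theme settings. A minimal sketch (the font family and colours here are assumptions, not the FT’s or BBC’s actual values):

library(ggplot2)

# Publication-style typography via a reusable theme object.
# "Financier Display" is an assumed font name; swap in any installed font.
pub_theme <- theme_minimal(base_family = "Financier Display") +
  theme(
    plot.title       = element_text(size = 16, face = "bold"),
    plot.subtitle    = element_text(size = 11, colour = "grey30"),
    panel.grid.minor = element_blank()
  )

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(colour = "#0f5499") +
  labs(title = "Custom fonts and styling, purely in R",
       subtitle = "A ggplot2 theme sketch") +
  pub_theme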

There are also entire collections devoted to recreating Tufte’s own visualizations in R, presumably meeting his typography standards. Tufte later clarified that the “Problem is not code, problem is published practice”, which is true of any programming environment and makes it all the stranger that he called out R in particular.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Finalfit now in CRAN

By Ewen Harrison

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

Your favourite package for getting model outputs directly into publication-ready tables is now available on CRAN. They make you work for it! Thank you to all who helped. The development version will continue to be available from GitHub.
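
For readers new to the package, the basic pattern from the finalfit documentation looks roughly like this (the dataset and variables are the package’s bundled colon_s example, not anything specific to this post):

library(finalfit)
library(dplyr)

# Choose an outcome and a set of explanatory variables from the bundled data
explanatory <- c("age.factor", "sex.factor", "obstruct.factor")
dependent   <- "mort_5yr"

# One call produces a publication-ready summary/regression table
colon_s %>%
  finalfit(dependent, explanatory)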

To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg.


New Course: Marketing Analytics in R

By Ryan Sheehy

(This article was first published on DataCamp Community – r programming, and kindly contributed to R-bloggers)

Course Description

Here is a link to our new R course.

This is your chance to dive into the worlds of marketing and business analytics using R. Every day, companies face a multitude of decisions. With the help of statistical models, you’re going to be able to support the business decision-making process based on data, not gut feeling. Let us show you what a great impact statistical modeling can have on the performance of businesses. You’re going to learn about and apply strategies to communicate your results so that they make a difference.

Chapter 1: Modeling Customer Lifetime Value with Linear Regression

How can you decide which customers are most valuable for your business? Learn how to model the customer lifetime value using linear regression.

Chapter 2: Logistic Regression for Churn Prevention

Predicting if a customer will leave your business, or churn, is important for targeting valuable customers and retaining those who are at risk. Learn how to model customer churn using logistic regression.
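
As a flavour of what that looks like in practice, here is a small sketch of churn modelling with logistic regression (synthetic data; the variable names are made up and are not from the course):

set.seed(1)
customers <- data.frame(
  tenure_months = rpois(500, 24),
  n_complaints  = rpois(500, 1)
)
customers$churned <- rbinom(
  500, 1,
  plogis(-1 - 0.05 * customers$tenure_months + 0.8 * customers$n_complaints)
)

# Logistic regression for churn: glm() with a binomial family
fit <- glm(churned ~ tenure_months + n_complaints,
           data = customers, family = binomial)
head(predict(fit, type = "response"))   # predicted churn probabilities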

Chapter 3: Modeling Time to Reorder with Survival Analysis

Learn how to model the time to an event using survival analysis. This could be the time until next order or until a person churns.

Chapter 4: Reducing Dimensionality with Principal Component Analysis

Learn how to reduce the number of variables in your data using principal component analysis. Not only does this help you get a better understanding of your data; PCA also enables you to condense information into single indices and to solve multicollinearity problems in a regression analysis with many intercorrelated variables.
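
A small illustrative sketch of that idea with base R’s prcomp() (synthetic data; the column names are made up and are not from the course):

set.seed(2)
metrics <- data.frame(
  n_orders    = rpois(100, 10),
  total_spend = rgamma(100, shape = 2, scale = 150),
  web_visits  = rpois(100, 30)
)

pca <- prcomp(metrics, center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1])    # first component: a single condensed index per customer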

If you are interested in learning more about marketing and data science, check out this tutorial for Python, Data Science for Search Engine Marketing.

The following R course is a prerequisite for Marketing Analytics in R: Statistical Modeling.

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community – r programming.


Bayesian GANs [#2]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

As an illustration of the lack of convergence of the Gibbs sampler applied to the two “conditionals” defined in the Bayesian GANs paper discussed yesterday, I took the simplest possible example of a Normal mean generative model (one parameter) with a logistic discriminator (one parameter) and implemented the scheme (during an ISBA 2018 session). With flat priors on both parameters. And a Normal random walk as Metropolis-Hastings proposal. As expected, since there is no stationary distribution associated with the Markov chain, simulated chains do not exhibit a stationary pattern.

And they eventually reach an overflow error or a trapping state as the log-likelihood gets approximately to zero (red curve).
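
A minimal sketch of such a scheme (this is not the author’s code; the sample size, true mean, proposal scales, and iteration count are all assumptions):

# Generator: x = theta + z, z ~ N(0,1); discriminator: D(x) = plogis(omega * x);
# flat priors; Normal random-walk Metropolis within Gibbs on the two
# Bayesian-GAN "conditionals".
set.seed(1)
n    <- 100
xobs <- rnorm(n, mean = 2)     # observed data, true mean set to 2 here

# log "conditional" of theta given omega (fresh generator noise at each call)
ltheta <- function(theta, omega) {
  z <- rnorm(n)
  sum(plogis(omega * (theta + z), log.p = TRUE))
}
# log "conditional" of omega given theta
lomega <- function(omega, theta) {
  z <- rnorm(n)
  sum(plogis(omega * xobs, log.p = TRUE)) +
    sum(plogis(omega * (theta + z), lower.tail = FALSE, log.p = TRUE))
}

niter <- 1e4
theta <- omega <- numeric(niter)
for (t in 2:niter) {
  prop <- theta[t - 1] + rnorm(1, sd = 0.1)
  theta[t] <- if (log(runif(1)) < ltheta(prop, omega[t - 1]) -
                    ltheta(theta[t - 1], omega[t - 1])) prop else theta[t - 1]
  prop <- omega[t - 1] + rnorm(1, sd = 0.1)
  omega[t] <- if (log(runif(1)) < lomega(prop, theta[t]) -
                    lomega(omega[t - 1], theta[t])) prop else omega[t - 1]
}

matplot(cbind(theta, omega), type = "l", lty = 1,
        xlab = "iteration", ylab = "parameter")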

Too bad I missed the talk by Shakir Mohammed yesterday, being stuck on the Edinburgh by-pass at rush hour, as I would have loved to hear his views about this rather essential issue…

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.


New R package flatxml: working with XML files as R dataframes

By Topics in R

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

The world is flat

The new R package flatxml provides functions to easily deal with XML files. When parsing an XML document, fxml_importXMLFlat produces a special dataframe that is ‘flat’ by its very nature but contains all necessary information about the hierarchical structure of the underlying XML document (for details on the dataframe see the reference for the fxml_importXMLFlat function). flatxml offers a set of functions to work with this dataframe. Apart from representing the XML document in a dataframe structure, there is yet another way in which flatxml relates to dataframes: the fxml_toDataFrame function can be used to extract data from an XML document into a dataframe, e.g. to work on the data with statistical functions. Because in this case there is no need to represent the XML document structure as such (it’s all about the data contained in the document), there is no representation of the hierarchical structure of the document any more; it’s just a normal dataframe.

Each XML element, for example <tag attribute="some value">Here is some text</tag>, has certain characteristics that can be accessed via the flatxml interface functions, after an XML document has been imported with fxml_importXMLFlat. These characteristics are:

  • value: The (text) value of the element, “Here is some text” in the example above
  • attributes: The XML attributes of the element, attribute with its value “some value” in the example above
  • children: The elements on the next lower hierarchical level
  • parent: The element of the next higher hierarchical level, i.e. the element to which the current element is a child
  • siblings: The elements on the same hierarchical level as the current element

Structure of the flatxml interface

The flatxml interface to access these characteristics follows a simple logic: for each of the characteristics there are typically three functions available (a short usage sketch follows the list below):

  • fxml_has…(): Determines if the current XML element has (at least one instance of) the characteristic
  • fxml_num…(): Returns the number of instances of the characteristic for the current XML element (e.g. the number of child elements)
  • fxml_get…(): Returns (the IDs of) the respective characteristics of the current XML element (e.g. the children of the current element)
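
A short usage sketch of how these pieces fit together (the calling convention for the has/num/get functions is inferred from the description above, so treat the argument order as an assumption and check the reference pages):

library(flatxml)

# Import an XML document into the 'flat' dataframe representation
# ("books.xml" is a placeholder path)
books <- fxml_importXMLFlat("books.xml")

# Interface functions follow the has/num/get pattern; the assumed convention
# is (flat dataframe, element ID)
if (fxml_hasChildren(books, 1)) {
  fxml_numChildren(books, 1)   # how many child elements the first element has
  fxml_getChildren(books, 1)   # the IDs of those children
}

# Pull the data itself into an ordinary dataframe for analysis
# (the argument name below is an assumption; see the fxml_toDataFrame reference)
df <- fxml_toDataFrame(books, siblings.of = 2)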

Learn more

For more information on the flatxml package please go to http://www.zuckarelli.de/flatxml/index.html.

To leave a comment for the author, please follow the link and comment on their blog: Topics in R.


Shiny 1.1.0: Scaling Shiny with async

By Joe Cheng

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

This is a significant release for Shiny, with a major new feature that was nearly a year in the making: support for asynchronous operations!

Without this capability, when Shiny performs long-running calculations or tasks on behalf of one user, it stalls progress for all other Shiny users that are connected to the same process. Therefore, Shiny apps that feature long-running calculations or tasks have generally been deployed using many R processes, each serving a small number of users; this works, but is not the most efficient approach. Such applications now have an important new tool in the toolbox to improve performance under load.

Shiny async is implemented via integration with the future and promises packages. These two packages are used together:

  1. Use future to perform long-running operations in a worker process that runs in the background, leaving Shiny processes free to serve other users in the meantime. This yields much better responsiveness under load, and much more predictable latency.
  2. Use promises to handle the result of each long-running background operation back in the Shiny process, where additional processing can occur, such as further data manipulation, or displaying to the user via a reactive output.

If your app has a small number of severe performance bottlenecks, you can use this technique to get massively better responsiveness under load. For example, if the httr::GET call in this server function takes 30 seconds to complete:

server <- function(input, output, session) {
  r <- reactive({
    httr::GET(url) %>%
      httr::content("parsed")
  })

  output$plot <- renderPlot({
    r() %>%
      ggplot(aes(speed, dist)) + geom_point()
  })
}

then the entire R process is stalled for those 30 seconds.

We can rewrite it asynchronously like this:

library(promises)
library(future)
plan(multisession)

server <- function(input, output, session) {
  r <- reactive({
    future(httr::GET(url)) %...>%
      httr::content("parsed")
  })

  output$plot <- renderPlot({
    r() %...>% {
      ggplot(., aes(speed, dist)) + geom_point()
    }
  })
}

Even if the httr::GET(url) takes 30 seconds, the r reactive executes almost instantly, and returns control to the caller. The code inside future(...) is executed in a different R process that runs in the background, and whenever its result becomes available (i.e. in 30 seconds), the right-hand side of %...>% will be executed with that result. (%...>% is called a “promise pipe”; it works similarly to a magrittr pipe that knows how to wait for and “unwrap” promises.)

If the original (synchronous) code appeared in a Shiny app, then during that 30 seconds, the R process is stuck dealing with the download and can’t respond to any requests being made by other users. But with the async version, the R process only needs to kick off the operation, and then is free to service other requests. This means other users will only have to wait milliseconds, not minutes, for the app to respond.

Case study

We’ve created a detailed case study that walks through the async conversion of a realistic example app. This app processes low-level logging data from RStudio’s CRAN mirrors, to let us explore the heaviest downloaders for each day.

To load test this example app, we launched 50 sessions of simulated load, with a 5 second delay between each launch, and directed this traffic to a single R process. We then rewrote the app to use futures and promises, and reran the load test with this async version. (The tools we used to perform the load testing are not yet publicly available, but you can refer to Sean Lopp’s talk at rstudio::conf 2018 for a preview.)

Under these conditions, the finished async version displays significantly lower (mean) response times than the original. In the table below, “HTTP traffic” refers to requests that are made during page load time, and “reactive processing” refers to the time between the browser sending a reactive input value and the server returning updated reactive outputs.

Response type          Original    Async      Delta
HTTP traffic           605 ms      139 ms     -77%
Reactive processing    10.7 sec    3.48 sec   -67%

Learn more

Visit the promises website to learn more, or watch my recent webinar on Shiny async.

See the full changelog for Shiny v1.1.0.

Related packages

Over the last year, we created or enhanced several other packages to support async Shiny:

  • The promises package (released 2018-04-13) mentioned above provides the actual API you’ll use to do async programming in R. We implemented this as a separate package so that other parts of the R community, not just Shiny users, can take advantage of these techniques. The promises package was inspired by the basic ideas of JavaScript promises, but also has significantly improved syntax and extensibility to make them work well with R and Shiny. Currently, promises is most useful when used with the future package by Henrik Bengtsson.
  • later (released 2017-06-25) adds a low-level feature to R that is critical to async programming: the ability to schedule R code to be executed in the future, within the same R process (a tiny illustration follows this list). You can do all sorts of cool stuff on top of this, as some people are discovering.
  • httpuv (1.4.0 released 2018-04-19) has long been the HTTP web server that Shiny, and most other web frameworks for R, sit on top of. Version 1.4.0 adds support for asynchronous handling of HTTP requests, and also adds a dedicated I/O-handling thread for greatly improved performance under load.
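
A tiny illustration of that later primitive (the two-second delay is arbitrary):

library(later)

later(function() cat("runs about 2 seconds from now, in this same R process\n"),
      delay = 2)
cat("prints immediately; scheduling the callback did not block R\n")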

In the coming weeks, you can also expect updates for async compatibility to htmlwidgets, plotly, and DT. Most other HTML widgets will automatically become async compatible once htmlwidgets is updated.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


A package for tidying nested lists

By rOpenSci – open tools for open science


(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

Data == knowledge! Much of the data we use, whether it be from
government repositories, social media, GitHub, or e-commerce sites comes
from public-facing APIs. The quantity of data available is truly
staggering, but munging JSON output into a format that is easily
analyzable in R is an equally staggering undertaking. When JSON is
turned into an R object, it usually becomes a deeply nested list riddled
with missing values that is difficult to untangle into a tidy format.
Moreover, every API presents its own challenges; code you’ve written to
clean up data from GitHub isn’t necessarily going to work on Twitter
data, as each API spews data out in its own unique, headache-inducing
nested list structure. To ease and generalize this process, Amanda Dobbyn
proposed an unconf18 project for a general API response tidier! Welcome
roomba, our first stab at easing the process of tidying nested lists!

roomba will eventually be able to walk nested lists in a variety of
different structures from JSON output, replace NULL or .empty values
with NAs or a user-specified value, and return a tibble with names
matching a user-specified list. Of course, in two days we haven’t
fully achieved this vision, but we’re off to a promising start.

The birth of roomba

It was clear Amanda was on to something good by the lively discussion in
the #runconf18 issues
repository leading up to the unconf. Thanks to input from Jenny Bryan,
Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam
Ross, we had a lot of ideas to work with when the unconf began.
Fortunately, Jim already had a function called dfs_idx() (here) written to
perform depth-first searches of nested lists from the GitHub GraphQL API.
With the core
list-traversal code out of the way, we split our efforts between
developing a usable interface, stockpiling .JSON files to test on, and
developing a Shiny app.

What’s working

We’ve got the basic structure of roomba sorted out, and you should install it
from GitHub to try it out! Here are a few of the examples we’ve put together.

library(roomba)
#load twitter data example
data(twitter_data)

#roomba-fy!
roomba(twitter_data, c("created_at", "name"))

## # A tibble: 24 x 2
##    name                 created_at                    
##    <chr>                <chr>
##  1 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  2 Ben Lorica     Mon Dec 22 22:06:18 +0000 2008
##  3 Dan Sholler          Thu Apr 03 20:09:24 +0000 2014
##  4 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  5 FiveThirtyEight      Tue Jan 21 21:39:32 +0000 2014
##  6 Digital Impact       Wed Oct 07 21:10:53 +0000 2009
##  7 Drew Williams        Thu Aug 07 18:41:29 +0000 2014
##  8 joe                  Fri May 29 13:25:25 +0000 2009
##  9 Data Analysts 4 Good Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick       Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows

And just the first element of the twitter_data list will show you
that roomba has simplified this process quite a bit.

twitter_data[[1]]

## $created_at
## [1] "Mon May 21 17:58:09 +0000 2018"
## 
## $id
## [1] 9.98624e+17
## 
## $id_str
## [1] "998623997397876743"
## 
## $text
## [1] "Could a program like food stamps have a Cambridge Analytica moment? How do we allow for the innovation that data pl… https://t.co/7tVf1qmNmq"
## 
## $truncated
## [1] TRUE
## 
## $entities
## $entities$hashtags
## list()
## 
## $entities$symbols
## list()
## 
## $entities$user_mentions
## list()
## 
## $entities$urls
## $entities$urls[[1]]
## $entities$urls[[1]]$url
## [1] "https://t.co/7tVf1qmNmq"
## 
## $entities$urls[[1]]$expanded_url
## [1] "https://twitter.com/i/web/status/998623997397876743"
## 
## $entities$urls[[1]]$display_url
## [1] "twitter.com/i/web/status/9…"
## 
## $entities$urls[[1]]$indices
## $entities$urls[[1]]$indices[[1]]
## [1] 117
## 
## $entities$urls[[1]]$indices[[2]]
## [1] 140
## 
## 
## 
## 
## 
## $source
## [1] "TweetDeck"
## 
## $in_reply_to_status_id
## NULL
## 
## $in_reply_to_status_id_str
## NULL
## 
## $in_reply_to_user_id
## NULL
## 
## $in_reply_to_user_id_str
## NULL
## 
## $in_reply_to_screen_name
## NULL
## 
## $user
## $user$id
## [1] 64482503
## 
## $user$id_str
## [1] "64482503"
## 
## $user$name
## [1] "Code for America"
## 
## $user$screen_name
## [1] "codeforamerica"
## 
## $user$location
## [1] "San Francisco, California"
## 
## $user$description
## [1] "Government can work for the people, by the people, in the 21st century. Help us make it so."
## 
## $user$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities
## $user$entities$url
## $user$entities$url$urls
## $user$entities$url$urls[[1]]
## $user$entities$url$urls[[1]]$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities$url$urls[[1]]$expanded_url
## [1] "http://codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$display_url
## [1] "codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$indices
## $user$entities$url$urls[[1]]$indices[[1]]
## [1] 0
## 
## $user$entities$url$urls[[1]]$indices[[2]]
## [1] 23
## 
## 
## 
## 
## 
## $user$entities$description
## $user$entities$description$urls
## list()
## 
## 
## 
## $user$protected
## [1] FALSE
## 
## $user$followers_count
## [1] 49202
## 
## $user$friends_count
## [1] 1716
## 
## $user$listed_count
## [1] 2659
## 
## $user$created_at
## [1] "Mon Aug 10 18:59:29 +0000 2009"
## 
## $user$favourites_count
## [1] 4490
## 
## $user$utc_offset
## [1] -25200
## 
## $user$time_zone
## [1] "Pacific Time (US & Canada)"
## 
## $user$geo_enabled
## [1] TRUE
## 
## $user$verified
## [1] TRUE
## 
## $user$statuses_count
## [1] 15912
## 
## $user$lang
## [1] "en"
## 
## $user$contributors_enabled
## [1] FALSE
## 
## $user$is_translator
## [1] FALSE
## 
## $user$is_translation_enabled
## [1] FALSE
## 
## $user$profile_background_color
## [1] "EBEBEB"
## 
## $user$profile_background_image_url
## [1] "http://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_image_url_https
## [1] "https://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_tile
## [1] FALSE
## 
## $user$profile_image_url
## [1] "http://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_image_url_https
## [1] "https://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_banner_url
## [1] "https://pbs.twimg.com/profile_banners/64482503/1497895952"
## 
## $user$profile_link_color
## [1] "CF1B41"
## 
## $user$profile_sidebar_border_color
## [1] "FFFFFF"
## 
## $user$profile_sidebar_fill_color
## [1] "F3F3F3"
## 
## $user$profile_text_color
## [1] "333333"
## 
## $user$profile_use_background_image
## [1] FALSE
## 
## $user$has_extended_profile
## [1] FALSE
## 
## $user$default_profile
## [1] FALSE
## 
## $user$default_profile_image
## [1] FALSE
## 
## $user$following
## [1] TRUE
## 
## $user$follow_request_sent
## [1] FALSE
## 
## $user$notifications
## [1] FALSE
## 
## $user$translator_type
## [1] "none"
## 
## 
## $geo
## NULL
## 
## $coordinates
## NULL
## 
## $place
## NULL
## 
## $contributors
## NULL
## 
## $is_quote_status
## [1] FALSE
## 
## $retweet_count
## [1] 0
## 
## $favorite_count
## [1] 0
## 
## $favorited
## [1] FALSE
## 
## $retweeted
## [1] FALSE
## 
## $possibly_sensitive
## [1] FALSE
## 
## $possibly_sensitive_appealable
## [1] FALSE
## 
## $lang
## [1] "en"

We created a Shiny app too, which in its current state allows you to
select a .Rda or .JSON file, pick two variables, and create a
scatterplot of them.

Run the app like this:

shiny_roomba()

What’s not

Of course, in two days we weren’t able to build a magical
one-size-fits-all solution to every API response data headache. Right
now, the main barrier to usability is that both the roomba() function
and shiny_roomba() app only work on sub-list items of the same length
and same data type stored at the same depth. To illustrate on the
twitter_data:

#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))

## # A tibble: 1,007 x 1
##    user      
##        
##  1  
##  2  
##  3  
##  4  
##  5  
##  6  
##  7  
##  8 
##  9  
## 10  
## # ... with 997 more rows

#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))

## # A tibble: 0 x 0

In addition, we’ve got some features we want to add, such as handling a
larger variety of column names (i.e. passing a string for a single
column name, keeping all values even if they are all NULL). We would
love your feedback on other things we can add (open an issue in our Git repository)!

The team

Amanda Dobbyn
Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for
dfs_idx()

Jim Hester
Job: Software Engineer at RStudio
Project
contributions: dfs_idx() and remove_nulls() functions, package
building, README, and debugging

Christine Stawitz
Job: Postdoctoral researcher at University of Washington’s School
of Aquatic and Fishery Sciences
Project contributions: Shiny app,
README and blog post writing

Laura DeCicco
Job: Data Scientist at U.S. Geological Survey

Project contributions: Fixing merge conflicts 🙂

Isabella Velasquez
Job: Data Analyst at the Bill & Melinda
Gates Foundation
Project contributions: hex sticker!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.


Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution

By Keith Goldfeld

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn. My collaborator suspected that the relationship between EK and academics would be different for immigrant and non-immigrant children, so we agreed that this would be a key focus of the analysis.

In model terms, we would describe the relationship for each student \(i\) as:

\[ T_i = \beta_0 + \beta_1 I_i + \beta_2 EK_i + \beta_3 I_i \times EK_i + \epsilon_i,\]
where \(T\) is the academic outcome, \(I\) is an indicator for immigrant status (either 0 or 1), and \(EK\) is a continuous measure of emotion knowledge. By including the \(I \times EK\) interaction term, we allow for the possibility that the effect of emotion knowledge will be different for immigrants. In particular, if we code \(I=0\) for non-immigrant kids and \(I=1\) for immigrant kids, \(\beta_2\) represents the relationship of EK and academic performance for non-immigrant kids, and \(\beta_2 + \beta_3\) is the relationship for immigrant kids. In this case, non-immigrant kids are the reference category.

Here’s the data generation:

library(simstudy)
library(broom)

set.seed(87265145)

def 
##       id I         EK         T      fI
##   1:   1 1 -1.9655562  5.481254     Imm
##   2:   2 1  0.9230118 16.140710     Imm
##   3:   3 0 -2.5315312  9.443148 not Imm
##   4:   4 1  0.9103722 15.691873     Imm
##   5:   5 0 -0.2126550  9.524948 not Imm
##  ---                                   
## 246: 246 0 -1.2727195  7.546245 not Imm
## 247: 247 0 -1.2025184  6.658869 not Imm
## 248: 248 0 -1.7555451 11.027569 not Imm
## 249: 249 0  2.2967681 10.439577 not Imm
## 250: 250 1 -0.3056299 11.673933     Imm
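
For readers who want to reproduce something similar, here is one simstudy specification that generates data of this shape (the formulas, variances, and immigrant proportion are assumptions, chosen so the true EK slopes are 0.5 for non-immigrant kids and 2.0 for immigrant kids, the values checked later):

# Assumed data definitions (not the author's exact code): true model
# T = 10 + 1.5*I + 0.5*EK + 1.5*I*EK + e
def <- defData(varname = "I", formula = 0.5, dist = "binary")
def <- defData(def, varname = "EK", formula = 0, variance = 2, dist = "normal")
def <- defData(def, varname = "T",
               formula = "10 + 1.5 * I + 0.5 * EK + 1.5 * I * EK",
               variance = 4, dist = "normal")

dT <- genData(250, def)
dT[, fI := factor(I, levels = 0:1, labels = c("not Imm", "Imm"))]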

Let’s say our primary interest in this exploration is point estimates of \(\beta_2\) and \(\beta_2 + \beta_3\), along with 95% confidence intervals of the estimates. (We could have just as easily reported \(\beta_3\), but we decided the point estimates would be more intuitive to understand.) The point estimates are quite straightforward: we can estimate them directly from the estimates of \(\beta_2\) and \(\beta_3\). And the standard error (and confidence interval) for \(\beta_2\) can be read directly off of the model output table. But what about the standard error for the relationship of EK and academic performance for the immigrant kids? How do we handle that?

I’ve always done this the cumbersome way, using this definition:

\[
\begin{aligned}
se_{\beta_2 + \beta_3} &= [Var(\beta_2 + \beta_3)]^\frac{1}{2} \\
&= [Var(\beta_2) + Var(\beta_3) + 2 \times Cov(\beta_2, \beta_3)]^\frac{1}{2}
\end{aligned}
\]

In R, this is relatively easy (though maybe not super convenient) to do manually, by extracting the information from the estimated parameter variance-covariance matrix.

First, fit a linear model with an interaction term:

lm1 <- lm(T ~ fI * EK, data = dT)
tidy(lm1)
##          term  estimate  std.error statistic       p.value
## 1 (Intercept) 10.161842 0.16205385 62.706574 2.651774e-153
## 2       fIImm  1.616929 0.26419189  6.120281  3.661090e-09
## 3          EK  0.461628 0.09252734  4.989098  1.147653e-06
## 4    fIImm:EK  1.603680 0.13960763 11.487049  9.808529e-25

The estimate for the relationship of EK and academic performance for non-immigrant kids is 0.46 (se = 0.093). And the point estimate for the relationship for immigrant kids is \(2.06 = 0.46 + 1.60\).

The standard error can be calculated from the variance-covariance matrix that is derived from the linear model:

vcov(lm1)
##              (Intercept)        fIImm           EK     fIImm:EK
## (Intercept)  0.026261449 -0.026261449 -0.000611899  0.000611899
## fIImm       -0.026261449  0.069797354  0.000611899 -0.006838297
## EK          -0.000611899  0.000611899  0.008561309 -0.008561309
## fIImm:EK     0.000611899 -0.006838297 -0.008561309  0.019490291

\[Var(\beta_2+\beta_3) = 0.0086 + 0.0195 + 2\times(-0.0086) = 0.0109\]

The standard error of the estimate is \(\sqrt{0.0109} = 0.105\).
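
The same number can be pulled out of the matrix programmatically, which is less error-prone than copying values by hand:

V <- vcov(lm1)
sqrt(V["EK", "EK"] + V["fIImm:EK", "fIImm:EK"] + 2 * V["EK", "fIImm:EK"])
# approximately 0.105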

So?

OK – so maybe that isn’t really all that interesting. Why am I even talking about this? Well, in the actual study, we have a fair amount of missing data. In some cases we don’t have an EK measure, and in others we don’t have an outcome measure. And since the missingness is on the order of 15% to 20%, we decided to use multiple imputation. We used the mice package in R to impute the data, and we pooled the model estimates from the completed data sets to get our final estimates. mice is a fantastic package, but one thing that it does not supply is some sort of pooled variance-covariance matrix. Looking for a relatively quick solution, I decided to use bootstrap methods to estimate the confidence intervals.

(“Relatively quick” is itself a relative term, since bootstrapping and imputing together is not exactly a quick process – maybe something to work on. I was also not fitting standard linear models but mixed effect models. Needless to say, it took a bit of computing time to get my estimates.)

Seeking credit (and maybe some sympathy) for all of my hard work, I mentioned this laborious process to my collaborator. She told me that you can easily estimate the group specific effects merely by changing the reference group and refitting the model. I could see right away that the point estimates would be fine, but surely the standard errors would not be estimated correctly? Of course, a few simulations ensued.

First, I just changed the reference group so that \(\beta_2\) would be measuring the relationship of EK and academic performance for immigrant kids, and \(\beta_2 + \beta_3\) would represent the relationship for the non-immigrant kids. Here are the levels before the change:

head(dT$fI)
## [1] Imm     Imm     not Imm Imm     not Imm not Imm
## Levels: not Imm Imm

And after:

dT$fI <- relevel(dT$fI, ref = "Imm")
head(dT$fI)
## [1] Imm     Imm     not Imm Imm     not Imm not Imm
## Levels: Imm not Imm

And the model:

lm2 <- lm(T ~ fI * EK, data = dT)
tidy(lm2)
##           term  estimate std.error  statistic       p.value
## 1  (Intercept) 11.778770 0.2086526  56.451588 8.367177e-143
## 2    fInot Imm -1.616929 0.2641919  -6.120281  3.661090e-09
## 3           EK  2.065308 0.1045418  19.755813  1.112574e-52
## 4 fInot Imm:EK -1.603680 0.1396076 -11.487049  9.808529e-25

The estimate for this new \(\beta_2\) is 2.07 (se = 0.105), pretty much aligned with our estimate that required a little more work. While this is not a proof by any means, I did do variations on this simulation (adding other covariates, changing the strength of association, changing sample size, changing variation, etc.) and both approaches seem to be equivalent. I even created 10000 samples to see if the coverage rates of the 95% confidence intervals were correct. They were. My collaborator was right. And I felt a little embarrassed, because it seems like something I should have known.

But …

Would this still work with missing data? Surely, things would go awry in the pooling process. So, I did one last simulation, generating the same data, but then added missingness. I imputed the missing data, fit the models, and pooled the results (including pooled 95% confidence intervals). And then I looked at the coverage rates.

First I added some missingness into the data

defM <- defMiss(varname = "EK", formula = "0.05 + 0.10*I", logit.link = FALSE)
defM <- defMiss(defM, varname = "T", formula = "0.05 + 0.05*I", logit.link = FALSE)
defM
##    varname       formula logit.link baseline monotonic
## 1:      EK 0.05 + 0.10*I      FALSE    FALSE     FALSE
## 2:       T 0.05 + 0.05*I      FALSE    FALSE     FALSE

And then I generated 500 data sets, imputed the data, and fit the models. Each iteration, I stored the final model results for both models (one where the reference is non-immigrant and the other where the reference group is immigrant).

library(mice)

nonRes 
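
A minimal sketch of what one such iteration could look like, reusing def and defM from above (the number of imputations and other details here are assumptions, not the author’s exact loop):

one_iteration <- function() {
  dT <- genData(250, def)
  dT[, fI := factor(I, levels = 0:1, labels = c("not Imm", "Imm"))]

  missMat <- genMiss(dT, defM, idvars = "id")
  dObs    <- genObs(dT, missMat, idvars = "id")

  # non-immigrant reference; the immigrant-reference fit repeats these steps
  # after dObs$fI <- relevel(dObs$fI, ref = "Imm")
  imp <- mice(as.data.frame(dObs[, .(fI, EK, T)]), m = 5, printFlag = FALSE)
  fit <- with(imp, lm(T ~ fI * EK))
  summary(pool(fit), conf.int = TRUE)   # pooled estimates and 95% CIs
}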

The proportion of confidence intervals that contain the true values is pretty close to 95% for both estimates:

mean(nonRes[term == "EK", conf.low < 0.5 & conf.high > 0.5])
## [1] 0.958
mean(immRes[term == "EK", conf.low < 2.0 & conf.high > 2.0])
## [1] 0.948

And the estimates of the mean and standard deviations are also pretty good:

nonRes[term == "EK", .(mean = round(mean(estimate),3), 
                       obs.SD = round(sd(estimate),3), 
                       avgEst.SD = round(sqrt(mean(std.error^2)),3))]
##     mean obs.SD avgEst.SD
## 1: 0.503  0.086     0.088
immRes[term == "EK", .(mean = round(mean(estimate),3), 
                       obs.SD = round(sd(estimate),3), 
                       avgEst.SD = round(sqrt(mean(std.error^2)),3))]
##     mean obs.SD avgEst.SD
## 1: 1.952  0.117     0.124

Because I like to include at least one visual in a post, here is a plot of the 95% confidence intervals, with the CIs not covering the true values colored blue:

The re-reference approach seems to work quite well (in the context of this simulation, at least). My guess is the hours of bootstrapping may have been unnecessary, though I haven’t fully tested all of this out in the context of clustered data. My guess is it will turn out OK in that case as well.

Appendix: ggplot code

nonEK  0.5))]

immEK  2))]

EK 
To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


Running Python inside the RStudio Server IDE

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

A great many R users will have to run some Python code from time to time, and this short video from our Head of Data Engineering, Mark Sellors, outlines one approach you can take that doesn’t mean leaving the RStudio IDE.
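
The video shows one route; another common option for staying inside RStudio is the reticulate package (a generic illustration, not necessarily the approach taken in the video):

library(reticulate)

# Run a snippet of Python and pull the result back into R
py_run_string("import math; roots = [math.sqrt(i) for i in range(5)]")
py$roots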

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Data Science For Business Tutorial: Using Machine Learning With LIME To Understand Employee Churn

By business-science.io – Articles


(This article was first published on business-science.io – Articles, and kindly contributed to R-bloggers)

Data science tools are getting better and better, which is improving the predictive performance of machine learning models in business. With new, high-performance tools like H2O for automated machine learning and Keras for deep learning, the performance of models is increasing tremendously. There’s one catch: complex models are unexplainable… that is, until LIME came along! LIME, which stands for Local Interpretable Model-agnostic Explanations, has opened the doors to black-box (complex, high-performance, but unexplainable) models in business applications!

Explanations are MORE CRITICAL to the business than PERFORMANCE. Think about it. What good is a high performance model that predicts employee attrition if we can’t tell what features are causing people to quit? We need explanations to improve business decision making. Not just performance.

In this Machine Learning Tutorial, Brad Boehmke, Director of Data Science at 84.51°, shows us how to use LIME for machine learning interpretability on a Human Resources Employee Turnover Problem, specifically showing the value of developing interpretability visualizations. He shows us options for Global Importance and compares it to LIME for Local Importance. We use the machine learning R packages h2o, caret, and ranger in the tutorial, showcasing how to use lime for local explanations. Let’s get started!

LIME: A Secret Weapon For ROI-Driven Data Science

Introduction by Matt Dancho, Founder of Business Science

Business success is dependent on the ability of managers, process stakeholders, and key decision makers to make the right decisions, often using data to understand what’s going on. This is where machine learning can help. Machine learning can analyze vast amounts of data, creating highly predictive models that tell managers key information, such as how likely someone is to leave an organization. However, machine learning alone is not enough. Business leaders require explanations so they can determine adjustments that will improve results. These explanations require a different tool: LIME. Let’s find out why LIME is truly a secret weapon for ROI-driven data science!

In the HR Employee Attrition example discussed in this article, the machine learning model predicts the probability of someone leaving the company. This probability is then converted to a prediction of either leave or stay through a process called Binary Classification. However, this doesn’t solve the main objective, which is to make better decisions. It only tells us if someone is a high flight risk (i.e. has high attrition probability).

Employee Attrition: Machine Learning Predicts Which Employees Are Likely To Leave

How do we change decision making and therefore improve? It comes down to levers and probability. Machine learning tells us which employees are highest risk and therefore high probability. We can hone in on these individuals, but we need a different tool to understand why an individual is leaving. This is where LIME comes into play. LIME uncovers the levers or features we can control to make business improvements.

LIME: Uncovering Levers or Features We Can Control

In our HR Employee Attrition Example, LIME detects “Over Time” (lever) as a key feature that supports employee turnover. We can control the “Over Time” feature by implementing a “limited-overtime” or “no-overtime” policy.

Analyzing A Policy Change: Targeting Employees With Over Time

Toggling the “OverTime” feature to “No” enables calculating an expected value or benefit of reducing overtime by implementing a new OT policy. For the individual employee, an expected savings results. When applied to the entire organization, this process of adjusting levers can result in impactful policy changes that save the organization millions per year and generate ROI.

Employee Attrition: Adjusting The Over Time Lever Results In Expected Savings

Interested in Learning LIME While Solving A Real-World Churn Problem?

If you want to solve this real-world employee churn problem developing models with H2O Automated Machine Learning, using LIME For Black-Box ML Model Explanation, and analyzing the impact of a policy change through optimization and sensitivity analysis, get started today with Data Science For Business (DS4B 201 / HR 201). You’ll learn ROI-Driven Data Science, implementing the tools (H2O + LIME) and our data science framework (BSPF) under my guidance (Matt Dancho, Instructor and Founder of Business Science) in our new, self-paced course part of the Business Science University virtual data science workshop.

Learning Trajectory

Now that we have a flavor for what LIME does, let’s get on with learning how to use it! In this machine learning tutorial, you will learn how to apply lime to explain and visualize individual predictions from models built with h2o, caret, and ranger.

In fact, one of the coolest things you’ll learn is how to create a visualization that compares how multiple H2O modeling algorithms examine employee attrition. This is akin to getting different perspectives on how each of the models views the problem:

  • Random Forest (RF)
  • Generalized Linear Regression (GLM)
  • Gradient Boosted Machine (GBM).

Comparing LIME results of different H2O modeling algorithms

About The Author

This MACHINE LEARNING TUTORIAL comes from Brad Boehmke, Director of Data Science at 84.51°, where he and his team develop algorithmic processes, solutions, and tools that enable 84.51° and its analysts to efficiently extract insights from data and provide solution alternatives to decision-makers. Brad is not only a talented data scientist, he’s an adjunct professor at the University of Cincinnati, Wake Forest, and Air Force Institute of Technology. Most importantly, he’s an active contributor to the Data Science Community and he enjoys giving back via advanced machine learning education available at the UC Business Analytics R Programming Guide!

Machine Learning Tutorial: Visualizing Machine Learning Models with LIME

By Brad Boehmke, Director of Data Science at 84.51°

Machine learning (ML) models are often considered “black boxes” due to their complex inner-workings. More advanced ML models such as random forests, gradient boosting machines (GBM), artificial neural networks (ANN), among others are typically more accurate for predicting nonlinear, faint, or rare phenomena. Unfortunately, more accuracy often comes at the expense of interpretability, and interpretability is crucial for business adoption, model documentation, regulatory oversight, and human acceptance and trust. Luckily, several advancements have been made to aid in interpreting ML models.

Moreover, it’s often important to understand the ML model that you’ve trained on a global scale, and also to zoom into local regions of your data or your predictions and derive local explanations. Global interpretations help us understand the inputs and their entire modeled relationship with the prediction target, but global interpretations can be highly approximate in some cases. Local interpretations help us understand model predictions for a single row of data or a group of similar rows.

This post demonstrates how to use the lime package to perform local interpretations of ML models. This will not focus on the theoretical and mathematical underpinnings but, rather, on the practical application of using lime.[1]

Libraries

This tutorial leverages the following packages.

# required packages
# install vip from github repo: devtools::install_github("koalaverse/vip")
library(lime)       # ML local interpretation
library(vip)        # ML global interpretation
library(pdp)        # ML global interpretation
library(ggplot2)    # visualization pkg leveraged by above packages
library(caret)      # ML model building
library(h2o)        # ML model building

# other useful packages
library(tidyverse)  # Use tibble, dplyr
library(rsample)    # Get HR Data via rsample::attrition
library(gridExtra)  # Plot multiple lime plots on one graph

# initialize h2o
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 minutes 17 seconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.19.0.4208 
##     H2O cluster version age:    4 months and 6 days !!! 
##     H2O cluster name:           H2O_started_from_R_mdancho_zbl980 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.50 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.4 (2018-03-15)
h2o.no_progress()

To demonstrate model visualization techniques we’ll use the employee attrition data that has been included in the rsample package. This demonstrates a binary classification problem (“Yes” vs. “No”) but the same process that you’ll observe can be used for a regression problem. Note: I force ordered factors to be unordered as h2o does not support ordered categorical variables.

For this exemplar I retain most of the observations in the training data sets and retain 5 observations in the local_obs set. These 5 observations are going to be treated as new observations that we wish to understand why the particular predicted response was made.

# create data sets
df <- rsample::attrition %>% 
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = factor(Attrition, levels = c("Yes", "No")))

index <- 1:5
train_obs <- df[-index, ]
local_obs <- df[index, ]

# create h2o objects for modeling
y <- "Attrition"
x <- setdiff(names(train_obs), y)
train_obs.h2o <- as.h2o(train_obs)
local_obs.h2o <- as.h2o(local_obs)

We will explore how to visualize a few of the more popular machine learning algorithms and packages in R. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces:

  • Random forest model using ranger via the caret package
  • Random forest model using h2o
  • Elastic net model using h2o
  • GBM model using h2o
  • Random forest model using ranger directly
# Create Random Forest model with ranger via caret
fit.caret <- train(
  Attrition ~ ., 
  data       = train_obs, 
  method     = 'ranger',
  trControl  = trainControl(method = "cv", number = 5, classProbs = TRUE),
  tuneLength = 1,
  importance = 'impurity'
  )

# create h2o models
h2o_rf  <- h2o.randomForest(x, y, training_frame = train_obs.h2o)
h2o_glm <- h2o.glm(x, y, training_frame = train_obs.h2o, family = "binomial")
h2o_gbm <- h2o.gbm(x, y, training_frame = train_obs.h2o)

# ranger model --> model type not built in to LIME
fit.ranger <- ranger::ranger(
  Attrition ~ ., 
  data        = train_obs, 
  importance  = 'impurity',
  probability = TRUE
)

Global Interpretation

The most common ways of obtaining global interpretation is through:

  • variable importance measures
  • partial dependence plots

Variable importance quantifies the global contribution of each input variable to the predictions of a machine learning model. Variable importance measures rarely give insight into the average direction that a variable affects a response function. They simply state the magnitude of a variable’s relationship with the response as compared to other variables used in the model. For example, the ranger random forest model identified monthly income, overtime, and age as the top 3 variables impacting the objective function.

vip(fit.ranger) + ggtitle("ranger: RF")

[Figure: ranger random forest variable importance plot]

After the most globally relevant variables have been identified, the next step is to attempt to understand how the response variable changes based on these variables. For this we can use partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. These techniques plot the change in the predicted value as specified feature(s) vary over their marginal distribution. Consequently, we can gain some local understanding of how the response variable changes across the distribution of a particular variable, but this still only provides a global understanding of this relationship across all observed data.

For example, if we plot the PDP of the monthly income variable we see that the probability of an employee attriting decreases, on average, as their monthly income approaches $5,000 and then remains relatively flat.

# built-in PDP support in H2O
h2o.partialPlot(h2o_rf, data = train_obs.h2o, cols = "MonthlyIncome")

[Figure: partial dependence of attrition probability on MonthlyIncome]

## PartialDependence: Partial Dependence Plot of model DRF_model_R_1529928665020_106 on column 'MonthlyIncome'
##    MonthlyIncome mean_response stddev_response
## 1    1009.000000      0.229433        0.229043
## 2    2008.473684      0.216096        0.234960
## 3    3007.947368      0.167686        0.234022
## 4    4007.421053      0.161024        0.231287
## 5    5006.894737      0.157775        0.231151
## 6    6006.368421      0.156628        0.231455
## 7    7005.842105      0.157734        0.230267
## 8    8005.315789      0.160137        0.229286
## 9    9004.789474      0.164437        0.229305
## 10  10004.263158      0.169652        0.227902
## 11  11003.736842      0.165502        0.226240
## 12  12003.210526      0.165297        0.225529
## 13  13002.684211      0.165051        0.225335
## 14  14002.157895      0.165051        0.225335
## 15  15001.631579      0.164983        0.225316
## 16  16001.105263      0.165051        0.225019
## 17  17000.578947      0.165556        0.224890
## 18  18000.052632      0.165556        0.224890
## 19  18999.526316      0.166498        0.224726
## 20  19999.000000      0.182348        0.219882

We can gain further insight by using centered ICE curves which can help draw out further details. For example, the following ICE curves show a similar trend line as the PDP above but, by centering, we identify the decrease as monthly income approaches $5,000 followed by an increase in the probability of attriting once an employee’s monthly income approaches $20,000. Furthermore, we see some turbulence in the flatlined region between $5K and $20K, which means there appear to be certain salary regions where the probability of attriting changes.

fit.ranger %>%
  pdp::partial(pred.var = "MonthlyIncome", grid.resolution = 25, ice = TRUE) %>%
  autoplot(rug = TRUE, train = train_obs, alpha = 0.1, center = TRUE)

[Figure: centered ICE curves for MonthlyIncome]

These visualizations help us to understand our model from a global perspective: identifying the variables with the largest overall impact and the typical influence of a feature on the response variable across all observations. However, what these do not help us understand is, for a given new observation, which variables were most influential in determining the predicted outcome. Say we obtain information on an employee that makes about $10,000 per month and we need to assess their probability of leaving the firm. Although monthly income is the most important variable in our model, it may not be the most influential variable driving this employee to leave. To retain the employee, leadership needs to understand what variables are most influential for that specific employee. This is where lime can help.

Local Interpretation

Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. As the name implies, it is model agnostic, so it can be applied to any supervised regression or classification model. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale, and the assertion that it is possible to fit a simple model around a single observation that will mimic how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.

The generalized algorithm LIME applies is:

  1. Given an observation, permute it to create replicated feature data with slight value modifications.
  2. Compute similarity distance measure between original observation and permuted observations.
  3. Apply selected machine learning model to predict outcomes of permuted data.
  4. Select m number of features to best describe predicted outcomes.
  5. Fit a simple model to the permuted data, explaining the complex model outcome with m features from the permuted data weighted by its similarity to the original observation.
  6. Use the resulting feature weights to explain local behavior.

Each of these steps will be discussed in further detail as we proceed.

lime::lime

The application of the LIME algorithm via the lime package is split into two operations: lime::lime() and lime::explain(). The lime::lime() function creates an “explainer” object, which is just a list that contains the machine learning model and the feature distributions for the training data. The feature distributions that it contains include distribution statistics for each categorical variable level and each continuous variable split into n bins (default is 4 bins). These feature attributes will be used to permute data.

The following creates our lime::lime() object, and I change the number of bins for our continuous variables to 5.

explainer_caret <- lime::lime(train_obs, fit.caret, n_bins = 5)

class(explainer_caret)
## [1] "data_frame_explainer" "explainer"           
## [3] "list"
summary(explainer_caret)
##                      Length Class  Mode     
## model                24     train  list     
## bin_continuous        1     -none- logical  
## n_bins                1     -none- numeric  
## quantile_bins         1     -none- logical  
## use_density           1     -none- logical  
## feature_type         31     -none- character
## bin_cuts             31     -none- list     
## feature_distribution 31     -none- list

lime::explain

Once we created our lime objects, we can now perform the generalized LIME algorithm using the lime::explain() function. This function has several options, each providing flexibility in how we perform the generalized algorithm mentioned above.

  • x: Contains the one or more single observations you want to create local explanations for. In our case, this includes the 5 observations that I included in the local_obs data frame. Relates to algorithm step 1.
  • explainer: takes the explainer object created by lime::lime(), which will be used to create permuted data. Permutations are sampled from the variable distributions created by the lime::lime() explainer object. Relates to algorithm step 1.
  • n_permutations: The number of permutations to create for each observation in x (default is 5,000 for tabular data). Relates to algorithm step 1.
  • dist_fun: The distance function to use. The default is Gower’s distance but can also use euclidean, manhattan, or any other distance function allowed by ?dist(). To compute similarity distance of permuted observations, categorical features will be recoded based on whether or not they are equal to the actual observation. If continuous features are binned (the default) these features will be recoded based on whether they are in the same bin as the observation. Using the recoded data the distance to the original observation is then calculated based on a user-chosen distance measure. Relates to algorithm step 2.
  • kernel_width: To convert the distance measure to a similarity value, an exponential kernel of a user defined width (defaults to 0.75 times the square root of the number of features) is used. Smaller values restrict the size of the local region. Relates to algorithm step 2.
  • n_features: The number of features to best describe predicted outcomes. Relates to algorithm step 4.
  • feature_select: To select the best n features, lime can use forward selection, ridge regression, lasso, or a tree to select the features. In this example I apply a ridge regression model and select the m features with highest absolute weights. Relates to algorithm step 4.

For classification models we also have two additional features we care about and one of these two arguments must be given:

  • labels: Which label do we want to explain? In this example, I want to explain the probability of an observation attriting (“Yes”).
  • n_labels: The number of labels to explain. With this data I could select n_labels = 2 to explain the probability of both “Yes” and “No” responses.

explanation_caret <- lime::explain(
  x               = local_obs, 
  explainer       = explainer_caret, 
  n_permutations  = 5000,
  dist_fun        = "gower",
  kernel_width    = .75,
  n_features      = 5, 
  feature_select  = "highest_weights",
  labels          = "Yes"
  )

The explain() function above first creates the permutations, then calculates similarities, and then selects the m features. Lastly, explain() fits a model (algorithm steps 5 & 6). lime applies a ridge regression with the weighted permuted observations as the simple model. [2] If the complex model is a regressor, the simple model predicts the output of the complex model directly. If the complex model is a classifier, the simple model predicts the probability of the chosen class(es).
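To make the weighting and local fitting steps more concrete, here is a minimal sketch of the idea using glmnet. The simulated perm_x, perm_pred, and obs objects and the exact exponential-kernel formula are illustrative assumptions (the kernel lime uses internally may differ slightly); the sketch only shows the shape of steps 2, 5, and 6, not the package’s actual code.

# Illustrative sketch of steps 2, 5 & 6 (not lime's internal code)
library(glmnet)

set.seed(123)
perm_x    <- matrix(rnorm(5000 * 10), ncol = 10)  # stand-in for recoded permutations
perm_pred <- runif(5000)                          # stand-in for complex-model predictions
obs       <- rnorm(10)                            # stand-in for the recoded observation

# step 2: distance of each permutation to the observation, then an
# exponential kernel converts distance into a similarity weight
d <- sqrt(colSums((t(perm_x) - obs)^2))
kernel_width <- 0.75 * sqrt(ncol(perm_x))   # the default width mentioned above
w <- sqrt(exp(-(d^2) / (kernel_width^2)))   # assumed kernel form

# steps 5 & 6: weighted ridge regression (alpha = 0) as the simple model
simple_fit <- glmnet(perm_x, perm_pred, weights = w, alpha = 0)

Conceptually, the coefficients of a weighted fit like this play the role of the feature weights that explain() reports.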

The explain() output is a data frame containing different information on the simple model predictions. Most importantly, for each observation in local_obs it contains the simple model fit (model_r2) and the weighted importance (feature_weight) for each important feature (feature_desc) that best describes the local relationship.

tibble::glimpse(explanation_caret)
## Observations: 25
## Variables: 13
## $ model_type        "classification", "classification", "cla...
## $ case              "1", "1", "1", "1", "1", "2", "2", "2", ...
## $ label             "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
## $ label_prob        0.216, 0.216, 0.216, 0.216, 0.216, 0.070...
## $ model_r2          0.5517840, 0.5517840, 0.5517840, 0.55178...
## $ model_intercept   0.1498396, 0.1498396, 0.1498396, 0.14983...
## $ model_prediction  0.3233514, 0.3233514, 0.3233514, 0.32335...
## $ feature           "OverTime", "MaritalStatus", "BusinessTr...
## $ feature_value     Yes, Single, Travel_Rarely, Sales, Very_...
## $ feature_weight    0.14996805, 0.05656074, -0.03929651, 0.0...
## $ feature_desc      "OverTime = Yes", "MaritalStatus = Singl...
## $ data              [[41, Yes, Travel_Rarely, 1102, Sales, ...
## $ prediction        [[0.216, 0.784], [0.216, 0.784], [0.216...

Visualizing results

The simplest approach to interpreting the results, however, is to visualize them. lime provides several plotting functions, but for tabular data we are only concerned with two. The most important is plot_features(), which creates a visualization containing an individual plot for each observation (case 1, 2, …, n) in our local_obs data frame. Since we specified labels = "Yes" in the explain() function, it shows the probability of each observation attriting. And since we specified n_features = 5, it plots the 5 most influential variables that best explain the linear model in that observation’s local region, along with whether each variable causes an increase in the probability (supports) or a decrease (contradicts). It also provides the model fit for each local model (“Explanation Fit: XX”), which lets us see how well that model explains the local region.

From the plot we can infer that case 3 has the highest likelihood of attriting of the 5 observations, and the three variables that appear to drive this high probability are working overtime, being single, and working as a lab technician.

plot_features(explanation_caret)

(plot: plot_features() output for the five local observations)

The other plot we can create is a heatmap showing how the different variables selected across all the observations influence each case; we create it with the plot_explanations() function. This plot is useful when you are trying to find common features that influence all observations, or when you are running the analysis across so many observations that plot_features() becomes difficult to read.

plot_explanations(explanation_caret)

(plot: plot_explanations() heatmap)

Tuning LIME

As you saw in the plot_features() plot above, the output provides the model fit. In this case the best simple-model fit across the local regions was R^2 = 0.59, for case 3. Since there are several knobs we can turn when performing the LIME algorithm, we can treat them as tuning parameters and try to find the simple model that best fits each local region. This helps maximize the amount of trust we can place in the local explanation.

As an example, the following changes the distance function to Manhattan distance, increases the kernel width substantially to create a larger local region, and switches the feature selection approach to a LARS lasso model. The result is a fairly substantial increase in our explanation fits.

# tune LIME algorithm
explanation_caret <- lime::explain(
  x               = local_obs, 
  explainer       = explainer_caret, 
  n_permutations  = 5000,
  dist_fun        = "manhattan",
  kernel_width    = 3,
  n_features      = 5, 
  feature_select  = "lasso_path",
  labels          = "Yes"
  )

plot_features(explanation_caret)

(plot: plot_features() output after tuning)
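Because the explanation fit comes back in the output (model_r2), one simple way to tune is to loop over candidate settings and compare the average fit. The grid of kernel widths and the mean-model_r2 criterion below are my own illustrative choices, not a feature of the lime package.

# Illustrative sketch: compare average explanation fit across kernel widths
kernel_widths <- c(0.5, 1, 2, 3, 5)

fits <- sapply(kernel_widths, function(kw) {
  expl <- lime::explain(
    x              = local_obs,
    explainer      = explainer_caret,
    n_permutations = 5000,
    dist_fun       = "manhattan",
    kernel_width   = kw,
    n_features     = 5,
    feature_select = "lasso_path",
    labels         = "Yes"
  )
  mean(expl$model_r2)  # average fit across the 5 local observations
})

data.frame(kernel_width = kernel_widths, mean_model_r2 = fits)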

Supported vs. unsupported models

Currently, lime supports supervised models produced in caret, mlr, xgboost, h2o, keras, and MASS::lda. Consequently, any supervised models created with these packages will function just fine with lime.

explainer_h2o_rf  <- lime(train_obs, h2o_rf, n_bins = 5)
explainer_h2o_glm <- lime(train_obs, h2o_glm, n_bins = 5)
explainer_h2o_gbm <- lime(train_obs, h2o_gbm, n_bins = 5)

explanation_rf  <- lime::explain(local_obs, 
                                 explainer_h2o_rf, 
                                 n_features      = 5, 
                                 labels          = "Yes", 
                                 kernel_width    = .1, 
                                 feature_select  = "highest_weights")
explanation_glm <- lime::explain(local_obs, 
                                 explainer_h2o_glm, 
                                 n_features      = 5, 
                                 labels          = "Yes", 
                                 kernel_width    = .1, 
                                 feature_select  = "highest_weights")
explanation_gbm <- lime::explain(local_obs, 
                                 explainer_h2o_gbm, 
                                 n_features      = 5, 
                                 labels          = "Yes", 
                                 kernel_width    = .1, 
                                 feature_select  = "highest_weights")

p1 <- plot_features(explanation_rf,  ncol = 1) + ggtitle("rf")
p2 <- plot_features(explanation_glm, ncol = 1) + ggtitle("glm")
p3 <- plot_features(explanation_gbm, ncol = 1) + ggtitle("gbm")

gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

(plot: side-by-side plot_features() output for the rf, glm, and gbm explainers)

However, any model that does not have built-in support will produce an error. For example, the model we created directly with ranger is not supported:

explainer_ranger <- lime(train, fit.ranger, n_bins = 5)
## Error in UseMethod("lime", x): no applicable method for 'lime' applied to an object of class "function"

We can work around this pretty easily by building two functions that make lime compatible with an unsupported package. First, we need a model_type() function that tells lime what type of model the unsupported package produces. model_type() is a lime-specific generic, so we just need to create a ranger-specific method. We do this by taking the class name of our ranger object, creating a model_type.ranger method, and simply returning the type of model (“classification” in this example).

# get the model class
class(fit.ranger)
## [1] "ranger"
# need to create custom model_type function
model_type.ranger <- function(x, ...) {
  # Function tells lime() what model type we are dealing with
  # 'classification', 'regression', 'survival', 'clustering', 'multilabel', etc
  
  return("classification")
}

model_type(fit.ranger)
## [1] "classification"

We then need to create a predict_model() method for ranger as well. The output of this function should be a data frame: for a regression problem, a single column with the predicted response; for a classification problem, a column of probabilities for each class (the binary “Yes”/“No” in this example).

# need to create custom predict_model function
predict_model.ranger <- function(x, newdata, ...) {
  # Function performs prediction and returns data frame with Response
  pred <- predict(x, newdata)
  return(as.data.frame(pred$predictions))
}

predict_model(fit.ranger, newdata = local_obs)
##         Yes        No
## 1 0.2915452 0.7084548
## 2 0.0701619 0.9298381
## 3 0.4406563 0.5593437
## 4 0.3465294 0.6534706
## 5 0.2525397 0.7474603

Now that we have those methods developed and in our global environment, we can run our lime functions and produce our outputs. [3]

explainer_ranger   <- lime(train_obs, fit.ranger, n_bins = 5)

explanation_ranger <- lime::explain(local_obs, 
                                    explainer_ranger, 
                                    n_features   = 5, 
                                    n_labels     = 2, 
                                    kernel_width = .1)

plot_features(explanation_ranger, ncol = 2) + ggtitle("ranger")

(plot: plot_features() output for the ranger explainer)

Learning More

At Business Science, we’ve been using the lime package with clients to help explain our machine learning models – It’s been our secret weapon. Our primary use cases are with h2o and keras, both of which are supported in lime. In fact, we actually built the h2o integration to gain the beneifts of LIME with stacked ensembles, deep learning, and other black-box algorithms. We’ve used it with clients to help them detect which employees should be considered for executive promotion. We’ve even provided previous real-world business problem / machine learning tutorials:

Those who want to learn lime while solving a real-world data science problem can get started today with our new course: Data Science For Business (DS4B 201)

Additional Resources

LIME provides a great, model-agnostic approach to local interpretation of model predictions. To learn more, I would start with the following resources:

Announcements

Matt was recently on Episode 165 of the SuperDataScience Podcast. In his second appearance (he was also on Episode 109, where he spoke about his transition to data science), he talks about giving back to the data science community in the form of education, open source software, and blogging!

SuperDataScience Episode 165

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), Business Science University is for you! Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You’ll learn:

  • TO SOLVE A REAL WORLD CHURN PROBLEM: Employee Turnover!
  • Data Science Framework: Business Science Problem Framework
  • Tidy Eval
  • H2O Automated Machine Learning
  • LIME Feature Explanations
  • Sensitivity Analysis
  • Tying data science to financial improvement (ROI-Driven Data Science)

Special Autographed “Deep Learning With R” Giveaway!!!

One lucky student who enrolls in June will receive an autographed copy of Deep Learning With R, signed by JJ Allaire, founder of RStudio and DLwR co-author.

Deep Learning With R

DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high-performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost: it goes unnoticed. What if you could use data science to predict and explain turnover in a way that helps managers make better decisions and shows executives results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


Get 15% OFF in June!

HR 301 Shiny Application: Employee Prediction

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.


Get 15% OFF in June!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

Footnotes

  1. To this end, you are encouraged to read through the

  2. If you’ve never applied a weighted ridge regression model you can see some details on its application in the glmnet vignette ↩

  3. If you are curious as to why simply creating the model_type.ranger and predict_model.ranger methods and hosting them in your global environment causes the lime functions to work then I suggest you read chapter 6 of Advanced R. ↩

To leave a comment for the author, please follow the link and comment on their blog: business-science.io – Articles.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News