Shiny: data presentation with an extra

By Pablo Bernabeu

(This article was first published on, and kindly contributed to R-bloggers)

A Shiny app with three tabs presenting different sections of the same data.


is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with drop-down lists and checkboxes—programming-free. The apps can be useful for both the data analyst and the public.

Shiny apps are based on the Internet: This allows for private consultation of the data on one’s own browser as well as for online publication. Free apps can handle quite a lot of data, which can be increased with a subscription.

The target user of Shiny is extremely broad. Let’s take science—open science. At a time when openly archiving all data is becoming standard practice (e.g.,,,, Shiny can be used to walk the extra mile by letting people tour the data at once without programming. It’s the right tool for acknowledging all aspects of the data. Needless to say, these apps do not replace raw data archiving.

The apps simply add. For instance, the data in the lower app is a little noisy, right? Indeed it shows none of the succession of waves that characterizes word reading. The app helped me in identifying this issue. Instead of running along a host of participants and electrodes through a heavy code score, I found that the drop-down lists of the app let me seamlessly surf the data. By Participant 7, though, my wave dropped me…

Those data were very poor—systematically poorer than those of any other participants. I then checked the EEG preprocessing logs, and confirmed that those data had to be discarded. So much for the analysis utility of such an app. On the open science utility, what I did on discovering the fault was maintain the discarded data in the app, with a note, so that any colleagues and reviewers could consider it too. Now, although this example of use concerns a rather salient trait in the data, some other Shiny app might help to spot patterns such as individual differences, third variables.

Building a Shiny app is not difficult. Apps basically draw on some data presentation code (tables, plots) that you already have. Then just add a couple of scripts into the folder: one for the user interface (named iu.R), one for the process (named server.R), and perhaps another one compiling the commands for deploying the app and checking any errors.

The steps to start a Shiny app from scratch are:

1: Tutorials. Being open-source software, the best manuals are available through a Google search.

2: User token. Signing up and reading in your private key—just once.

3: GO. Plunge into the ui and server scripts, and deployApp().

4: Bugs and logs. They are not bugs in fact—rather fancies. For instance, some special characters have to get even more special (technically, UTF-8 encoding). For a character such as μ, Shiny prefers Âμ. Just cling to error logs by calling:

showLogs(appPath = getwd(), appFile = NULL, appName = NULL, account = NULL, entries = 50, streaming = FALSE) 

The log output will mention any typos and unaccepted characters, pointing to specific lines in your code.

It may take a couple of intense days to get a first app running. As usual with programming, it’s easy to run into the traps which are there to spice up the way. The app’s been around for years, so tips and tricks abound on the Internet. For greater companionship, there are dedicated Google groups, and then there’s StackExchange etc., where you can post any needs/despair. Post your code, log, and explanation, and you’ll be rescued out in a couple of days. Long live those contributors.

It will often be enough to upload a bare app, but you might then think it can look better.

5 (optional): Pro up.
Use tabs to combine multiple apps in one, use different widgets, etc. Tutorials like this one on Youtube can take you there, especially those that provide the code, as in the description of that video. Use those scripts as templates. For example, see this script in which the function conditionalPanel() is used to modify the app’s sidebar based on which tab is selected. The utility of tabs is illustrated in the upper cover of this article and in the app shown in the text: When having multiple data sections, the tabs allow you to have all in one (cover screenshot), instead of linking to other apps in each (screenshot in text).

Time for logistics. You can include any text in your app’s webpage, such as explanations of any length, web links, and author information. Oh, also importantly: the Shiny application allows for the presentation of data in any of the forms available in R—notably, plots and tables. Costs: Shiny is free to use, in principle. You may only have to pay if you use your app(s) a lot—first month, commonly—, in which case you may pay 9 euros a month. There are different subscription plans. The free plan allows 25 hours of use per month, distributed among up to five apps.

There do exist alternatives to Shiny. One is fairly similar: It’s called Tableau. A nice comparison of the two apps is available here. Then, there’s a more advanced application called D3.js, which allows for lush graphics but proves more restrictive to newbies.

In sum, if you already program in R or even in Python, but have not tried online data presentation yet, consider it.

Feel free to share your ideas, experiences, or doubts in a comment on the original post.

To leave a comment for the author, please follow the link and comment on their blog: offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Watch presentations from R/Finance 2017

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It was another great year for the R/Finance conference, held earlier this month in Chicago. This is normally a fairly private affair: with attendance capped at around 300 people every year, it’s a somewhat exclusive gathering of the best and brightest minds from industry and academia in financial data analysis with R. But for the first time this year (and with thanks to sponsorship from Microsoft), videos of the presentations are available for viewing by everyone. I’ve included the complete list (copied from the R/Finance website) below, but here are a few of my favourites:

You can find an up-to-date version of the table below at the R/Finance website (click on the “Program” tab), and you can also browse the videos at Channel 9. Note that the lightning talk sessions (in orange) are bundled together in a single video, which you can find linked after the first talk in each session.

Friday, May 19th, 2017
09:30 – 09:35 Kickoff (video)
09:35 – 09:40 Sponsor Introduction
09:40 – 10:10 Marcelo Perlin: GetHFData: An R package for downloading and aggregating high frequency trading data from Bovespa (pdf) (video)
Jeffrey Mazar: The obmodeling Package (html)
Yuting Tan: Return Volatility, Market Microstructure Noise, and Institutional Investors: Evidence from High Frequency Market (pdf)
Stephen Rush: Adverse Selection and Broker Execution (pdf)
Jerzy Pawlowski: How Can Machines Learn to Trade? (html)
10:10 – 10:30 Michael Hirsch: Revealing High-Frequency Trading Provisions of Liquidity with Visualization in R (html) (video)
10:30 – 10:50 Eric Glass: Equity Factor Portfolio Case Study (html) (video)
10:50 – 11:10 Break
11:10 – 11:30 Seoyoung Kim: Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News (pdf) (video)
11:30 – 12:10 Szilard Pafka: No-Bullshit Data Science (pdf) (video)
12:10 – 13:30 Lunch
13:30 – 14:00 Francesco Bianchi: Measuring Risk with Continuous Time Generalized Autoregressive Conditional Heteroscedasticity Models (pdf) (video)
Eina Ooka: Bunched Random Forest in Monte Carlo Risk Simulation (pdf)
Matteo Crimella: Operational Risk Stress Testing: An Empirical Comparison of Machine Learning Algorithms and Time Series Forecasting Methods (pdf)
Thomas Zakrzewski: Using R for Regulatory Stress Testing Modeling (pdf)
Andy Tang: How much structure is best? (pptx)
14:00 – 14:20 Robert McDonald: Ratings and Asset Allocation: An Experimental Analysis (pdf)
14:20 – 14:50 Break
14:50 – 15:10 Dries Cornilly: Nearest Comoment Estimation with Unobserved Factors and Linear Shrinkage (pdf) (video)
15:10 – 15:30 Bernhard Pfaff: R package: mcrp: Multiple criteria risk contribution optimization (pdf) (video)
15:30 – 16:00 Oliver Haynold: Practical Options Modeling with the sn Package, Fat Tails, and How to Avoid the Ultraviolet Catastrophe (pdf) (video)
Shuang Zhou: A Nonparametric Estimate of the Risk-Neutral Density and Its Applications (pdf)
Luis Damiano: A Quick Intro to Hidden Markov Models Applied to Stock Volatility
Oleg Bondarenko: Rearrangement Algorithm and Maximum Entropy (pdf)
Xin Chen: Risk and Performance Estimator Standard Errors for Serially Correlated Returns (pdf)
16:00 – 16:20 Qiang Kou: Text analysis using Apache MxNet (pdf) (video)
16:20 – 16:40 Robert Krzyzanowski: Syberia: A development framework for R (pdf) (video)
16:40 – 16:52 Matt Dancho: New Tools for Performing Financial Analysis Within the ‘Tidy’ Ecosystem (pptx) (video)
Leonardo Silvestri: ztsdb, a time-series DBMS for R users (pdf)
Saturday, May 20th, 2017
09:05 – 09:35 Stephen Bronder: Integrating Forecasting and Machine Learning in the mlr Framework (pdf) (video)
Leopoldo Catania: Generalized Autoregressive Score Models in R: The GAS Package (pdf)
Guanhao Feng: Regularizing Bayesian Predictive Regressions (pdf)
Jonas Rende: partialCI: An R package for the analysis of partially cointegrated time series (pdf)
Carson Sievert: Interactive visualization for multiple time series (pdf)
09:35 – 09:55 Emanuele Guidotti: yuimaGUI: A graphical user interface for the yuima package (pptx) (video)
09:55 – 10:15 Daniel Kowal: A Bayesian Multivariate Functional Dynamic Linear Model (pdf) (video)
10:15 – 10:45 Break
10:45 – 11:05 Jason Foster: Scenario Analysis of Risk Parity using RcppParallel (pdf) (video)
11:05 – 11:35 Michael Weylandt: Convex Optimization for High-Dimensional Portfolio Construction (pdf) (video)
Lukas Elmiger: Risk Parity Under Parameter Uncertainty (pdf)
Ilya Kipnis: Global Adaptive Asset Allocation, and the Possible End of Momentum (pptx)
Vyacheslav Arbuzov: Dividend strategy: towards the efficient market (pdf)
Nabil Bouamara: The Alpha and Beta of Equity Hedge UCITS Funds – Implications for Momentum Investing (pdf)
11:35 – 12:15 Dave DeMers: Risk Fast and Slow (pdf) (video)
12:15 – 13:35 Lunch
13:35 – 13:55 Matthew Dixon: MLEMVD: A R Package for Maximum Likelihood Estimation of Multivariate Diffusion Models (pdf) (video)
13:55 – 14:15 Jonathan Regenstein: Reproducible Finance with R: A Global ETF Map (html) (video)
14:15 – 14:35 David Ardia: Markov-Switching GARCH Models in R: The MSGARCH Package (pdf) (video)
14:35 – 14:55 Keven Bluteau: Forecasting Performance of Markov-Switching GARCH Models: A Large-Scale Empirical Study (pdf) (video)
14:55 – 15:07 Riccardo Porreca: Efficient, Consistent and Flexible Credit Default Simulation (pdf) (video)
Maisa Aniceto: Machine Learning and the Analysis of Consumer Lending (pdf)
15:07 – 15:27 David Smith: Detecting Fraud at 1 Million Transactions per Second (pptx) (video)
15:27 – 15:50 Break
15:50 – 16:10 Thomas Harte: The PE package: Modeling private equity in the 21st century (pdf) (video)
16:10 – 16:30 Guanhao Feng: The Market for English Premier League (EPL) Odds (pdf) (video)
16:30 – 16:50 Bryan Lewis: Project and conquer (html) (video)

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

VIF and multicollinearity diagnostics

By Thom

(This article was first published on R code – Serious Stats, and kindly contributed to R-bloggers)

In the book I use the car package to get VIF and other multicollinearity diagnostics. I’ve occasionally found this breaks down (usually through mixing different versions of R on different machines at work home or on the move). I recently saw the mctest package and thought it would be useful to use that as a backup – and also because it offers a slightly different set of diagnostics.

If you’d like to try it out I’ve written a helper function that makes it easier apply directly to a linear model. You can find the function and a simple example here.

Filed under: R code, serious stats Tagged: multicollinearity, R, statistics, vif

To leave a comment for the author, please follow the link and comment on their blog: R code – Serious Stats. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Bland-Altman/Tukey Mean-Difference Plots using ggplot2

By S. Richter-Walsh

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

A very useful data visualisation tool in science, particularly in medical and sports settings, is the Bland-Altman/Tukey Mean-Difference plot. When comparing two sets of measurements for the same variable made by different instruments, it is often required to determine whether the instruments are in agreement or not.

Correlation and linear regression can tell us something about the bivariate relationship which exists between two sets of measurements. We can identify the strength, form and direction of a relationship but this approach is not recommended for comparative analyses.

The Bland-Altman plot’s first use was in 1983 by J.M Bland and D.G Altman who applied it to medical statistics. The Tukey Mean-Difference Plot was one of many exploratory data visualisation tools created by John Tukey who, interestingly, also created the beloved boxplot.

A simple implementation of the Bland-Altman/Tukey Mean-Difference plot in R is described below. The dataset used for this example is one of R’s built-in datasets, Loblolly, which contains data on the growth of Loblolly pine trees.

Prepare the workspace and have a look at the structure of the data.


Randomly split the data into two samples of equal length. These samples will be used as our two sets of pine tree height measurements. Arrange in order of descending height to mimic realistic dual measurements of single trees and combine into a dataframe. Note that these data are used simply to demonstrate and the measures in reality are NOT of the same tree using different instruments.

sample_df %
 select(height) %>% 
 sample_n(pine_df, size = nrow(pine_df) * 0.5) %>%
 select(height) %>%

Assign sensible names to the two pine tree height samples in the dataframe.


Each row in the dataframe consists of a pair of measurements. The Bland-Altman plot has the average of the two measures in a pair on the x-axis. The y-axis contains the difference between the two measures in each pair. Add the averages and differences data to the dataframe.


Finally, code the plot and add the mean difference (blue line) and a 95% confidence interval (red lines) for predictions of a mean difference. This prediction interval gives the level of agreement (1.96 * SD).

ggplot(sample_df, aes(x = Avg, y = Dif)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif), colour = "blue", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) - (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  geom_hline(yintercept = mean(sample_df$Dif) + (1.96 * sd(sample_df$Dif)), colour = "red", size = 0.5) +
  ylab("Diff. Between Measures") +
  xlab("Average Measure")

To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Evaluate your model with R Exercises

By Guillaume Touzin

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

There was a time where statistician had to manually crunch number when they wanted to fit their data to a model. Since this process was so long, those statisticians usually did a lot of preliminary work researching other model who worked in the past or looking for studies in other scientific field like psychology or sociology who can influence their model with the goal to maximize their chance to make a relevant model. Then they would create a model and an alternative model and choose the one which seem more efficient.

Now that even an average computer give us incredible computing power, it’s easy to make multiple models and choose the one that best fit the data. Even though it is better to have good prior knowledge of the process you are trying to analyze and of other model used in the past, coming to a conclusion using mostly only the data help you avoid bias and help you create better models.

In this set of exercises, we’ll see how to apply the most used error metrics on your models with the intention to rate them and choose the one that is the more appropriate for the situation.
Most of those errors metrics are not part of any R package, in consequence you have to apply the equation I give you on your data. Personally, I preferred to write a function which I can easily
use on everyone of my models, but there’s many ways to code those equations. If your code is different from the one in the solution, feel free to post your code in the comments.

Answers to the exercises are available here.

Exercise 1
We start by looking at error metrics we can use for regression model. For linear regression problem, the more used metrics are the coefficient of determination R2 which show what percentage of variance is explained by the model and the adjusted R2 which penalize model who use variables that doesn’t have an effective contribution to the model (see this page for more details). Load the attitude data set from the package of the same name and make three linear models with the goal to explain the rating variable. The first one use all the variables from the dataset, the second one use the variable complaints, privileges, learning and advance as independent variables and the third one use only the complaints, learning and advance variables. Then use the summary() function to print R2 and the adjusted R2.

Exercise 2
Another way to measure how your model fit your data is to use the Root Mean Squared Error (RMSE), which is defined as the square root of the average of the square of all the error made by your model. You can find the mathematical definition of the RMSE on this page.
Calculate the RMSE of the prediction made by your three models.

Exercise 3
The mean absolute error (MAE) is a good alternative to the RMSE if you don’t want to penalise too much the large estimation error of your model. The MAE is given by the equation
The mathematical definition of the MAE can be found here.
Calculate the MAE of the prediction made by the 3 models.

Exercise 4
Sometime some prediction error hurt your model than other. For example, if you are trying to predict the financial loss of a business over a period of time, under estimation of the loss would
put the business at risk of bankruptcy, while overestimation of the loss will result in a conservative model. In those case, using the Root Mean Squared Logarithmic Error (RMSLE) as an error
metric is useful since this metric penalize under estimation. The RMSLE is given by the equation given on this page.
Calculate the RMSLE of the prediction made by the three models.

Exercise 5
Now that we’ve seen some examples of error metrics which can be used in a regression context, let’s see five examples of error metrics which can be used when you perform clustering analysis. But
first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the kmeans algorithm. Since the iris dataset has three distinct
labels, use the kmeans algorithm with three centers. Also, use set the maximum number of iterations at 50 and use the “Lloyd” algorithm. Once it’s done, take time to rearrange the labels of your
prediction so they are compatible with the factors in the iris dataset.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

  • Avoid model over-fitting using cross-validation for optimal parameter selection
  • Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels.
  • And much more

Exercise 6
Print the confusion matrix of your model.

Exercise 7
The easiest way to measure how well your model did categorize the data, is to calculate the accuracy, the recall and the precision of your results. Write three functions which return those individual values and calculate those metrics for your models.

Exercise 8
The F-measure summarize the precision and recall value of your model by calculating the harmonic mean of those two values.
Write a function which return the F-measure of your model and compute twice this measure for your data: once with a parameter of 2 and then with a parameter of 0.5.

Exercise 9
The Purity is a measure of the homogeneity of your cluster: if all your cluster regroup object of the same class, you’ll get a purity score of one and if there’s no majority class in any of the cluster, you’ll get a purity score of 0. Write a function which return the purity score of your model and test it on your predictions.

Exercise 10
The last error metrics we’ll see today is the Dunn index, which indicate if the clusters are compact and separated. You can find the mathematical definition of the Dunn index here. Load the cvalid package and use the dunn() on your model, to compute the Dunn index of your classification. Note that this function take an integer vector representing the cluster partitioning as parameter.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

New online datacamp course: Forecasting in R

By Tal Galili

(This article was first published on, and kindly contributed to R-bloggers)

Forecasting in R is taught by Rob J. Hyndman, author of the forecast package

Forecasting involves making predictions about the future. It is required in many situations such as deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning. This course provides an introduction to time series forecasting using R.

What You’ll Learn
Chapter 1: Exploring and Visualizing Time Series in R
The first thing to do in any data analysis task is to plot the data.
Chapter 2: Benchmark Methods and Forecast Accuracy
In this chapter, you will learn general tools that are useful for many different forecasting situations.
Chapter 3: Exponential Smoothing
is framework generates reliable forecasts quickly and for a wide range of time series.
Chapter 4: Forecasting with ARIMA Models
ARIMA models provide another approach to time series forecasting.
Chapter 5: Advanced Methods
In this chapter, you will look at some methods that handle more complicated seasonality.

You can start the free chapter for free of Forecasting in R.

To leave a comment for the author, please follow the link and comment on their blog: offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Drilling Into CSVs — Teaser Trailer

By hrbrmstr

(This article was first published on R –, and kindly contributed to R-bloggers)

I used reading a directory of CSVs as the foundational example in my recent post on idioms.

During my exchange with Matt, Hadley and a few others — in the crazy Twitter thread that spawned said post — I mentioned that I’d personally “just use Drill.

I’ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it.

You can get ahead of those posts by doing two things:

  1. Download, install and test your Apache Drill setup (it’s literally 10 minutes on any platform)
  2. Review the U.S. EPA annual air quality data archive (they have individual, annual CSVs that are perfect for the example)

My goals for this post are really to just to pique your interest enough in Drill and parquet files (yes, I’m ultimately trying to socially engineer you into using parquet files) to convince you to read the future post(s) and show that it’s worth your time to do Step #1 above.

Getting EPA Air Quality Data

The EPA has air quality data going back to 1990 (so, 27 files as of this post). They’re ~1-4MB ZIP compressed and ~10-30MB uncompressed.

You can use the following code to grab them all with the caveat that the libcurl method of performing simultaneous downloads caused some pretty severe issues — like R crashing — for some of my students who use Windows. There are plenty of examples for doing sequential downloads of a list of URLs out there that folks should be able to get all the files even if this succinct method does not work on your platform.



I normally shy away from this particular method since it really hammers the remote server, but this is a beefy U.S. government server, the files are relatively small in number and size and I’ve got a super-fast internet connection (no long-lived sockets) so it should be fine.

Putting all those files under the “control” of Drill is what the next post is for. For now, i’m going to show the basic code and benchmarks for reading in all those files and performing a basic query for all the distinct years. Yes, we know that information already, but it’s a nice, compact task that’s easy to walk through and illustrates the file reading and querying in all three idioms: Drill, tidyverse and data.table.

Data Setup

I’ve converted the EPA annual ZIP files into bzip2 format. ZIP is fine for storage and downloads but it’s not a great format for data analysis tasks. gzip would be slightly faster but it’s not easily splittable and — even though I’m not using the data in a Hadoop context — I think it’s wiser to not have to re-process data later on if I ever had to move raw CSV or JSON data into Hadoop. Uncompressed CSVs are the most portable, but there’s no need to waste space.

All the following files are in a regular filesystem directory accessible to both Drill and R:

> (epa_annual_fils 

Drill can directly read plain or compressed JSON, CSV and Apache web server log files plus can treat a directory tree of them as a single data source. It can also read parquet & avro files (both are used frequently in distributed “big data” setups) and access MySQL, MongoDB and other JDBC resources as well as query data stored in Amazon S3 and HDFS (I’ve already mentioned it works fine in plain ‘ol filesystems, too).

I’ve tweaked my Drill configuration to support reading column header info from .csv files (which I’ll show in the next post). In environments like Drill or even Spark, CSV columns are usually queried with some type of column index (e.g. COLUMN[0]) so having named columns makes for less verbose query code.

I turned those individual bzip2 files into parquet format with one Drill query:

CREATE TABLE dfs.pq.`/epa/annual.parquet` AS 
  SELECT * FROM dfs.csv.`/epa/annual/*.csv.bz2`

Future posts will explain the dfs... component but they are likely familiar path specifications for folks used to Spark and are pretty straightforward. The first bit (up to the back-tick) is an internal Drill shortcut to the actual storage path (which is a plain directory in this test) followed by the tail end path spec to the subdirectories and/or target files. That one statement said ‘take all the CSV files in that directory and make one big table out of them”.

The nice thing about parquet files is that they work much like R data frames in that they can be processed on the column level. We’ll see how that speeds up things in a bit.

Benchmark Setup

The tests were performed on a maxed out 2016 13″ MacBook Pro.

There are 55 columns of data in the EPA annual summary files.

To give both read_csv and fread some benchmark boosts, we’ll define the columns up-front and pass those in to each function on data ingestion and I’ll leave them out of this post for brevity (they’re just a cols() specification and colClasses vector). Drill gets no similar help for this at least when it comes to CSV processing.

I’m also disabling progress & verbose reporting in both fread and read_csv despite not stopping Drill from writing out log messages.

Now, we need some setup code to connect to drill and read in the list of files, plus we’ll setup the five benchmark functions to read in all the files and get the list of distinct years from each.


(epa_annual_fils % 
    distinct(Year) %>% 

mb_drill_parquet % 
    distinct(Year) %>% 

mb_tidyverse  tmp

mb_datatable  tmp

mb_rda  tmp

  csv = { mb_drill_csv()     },
   pq = { mb_drill_parquet() },
   df = { mb_tidyverse()     },
   dt = { mb_datatable()     },
  rda = { mb_rda()           },
  times = 5
) -> mb

Yep, it’s really as simple as:

tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")

to have Drill treat a directory tree as a single table. It’s also not necessary for all the columns to be in all the files (i.e. you get the bind_rows/map_df/rbindlist behaviour for “free”).

I’m only doing 5 evaluations here since I don’t want to make you wait if you’re going to try this at home now or after the Drill series. I’ve run it with a more robust benchmark configuration and the results are aligned with this one.

Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
  csv 15473.5576 16851.0985 18445.3905 19586.1893 20087.1620 20228.9450     5
   pq   493.7779   513.3704   616.2634   550.5374   732.6553   790.9759     5
   df 41666.1929 42361.1423 42701.2682 42661.9521 43110.3041 43706.7498     5
   dt 37500.9351 40286.2837 41509.0078 42600.9916 43105.3040 44051.5247     5
  rda  9466.6506  9551.7312 10012.8560  9562.9114  9881.8351 11601.1517     5

The R data route, which is the closest to the parquet route, is definitely better than slurping up CSVs all the time. Both parquet and R data files require pre-processing, so they’re not as flexible as having individual CSVs (that may get added hourly or daily to a directory).

Drill’s CSV slurping handily beats the other R methods even with some handicaps the others did not have.

This particular example is gamed a bit, which helped parquet to ultimately “win”. Since Drill can target the singular column (Year) that was asked for, it doesn’t need to read all the extra columns just to compute the final product (the distinct list of years).

IMO both the Drill CSV ingestion and Drill parquet access provide compelling enough use-cases to use them over the other three methods, especially since they are easily transferrable to remote Drill servers or clusters with virtually no code changes. A single node Drillbit (like R) is constrained by the memory on that individual system, so it’s not going to get you out of a memory jam, but it may make it easier to organize and streamline your core data operations before other analysis and visualization tasks.


I’m sure some member of some other tribe will come up with an example that proves superiority of their particular tribal computations. I’m hoping one of those tribes is the R/Spark tribe so that can get added into the mix (using Spark standalone is much like using Drill, but with more stats/ML functions directly available).

I’m hopeful that this post has showcased enough of Drill’s utility to general R users that you’ll give it a go and consider adding it to your R data analysis toolbox. It can be beneficial having both a precision tools as well as a Swiss Army knife — which is what Drill really is — handy.

You can find the sergeant package on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

GSoC 2017 : Parser for Biodiversity Checklists

By vijaybarve

Guest post by Qingyue Xu

Compiling taxonomic checklists from varied sources of data is a common task that biodiversity informaticians encounter. In the GSoC 2017 project Parser for Biodiversity checklists, my overall goal is to extract taxonomic names from given text into a tabular format so that easy aggregation of biodiversity data in a structured format that can be used for further processing can be facilitated.

I mainly plan to build three major functions which serve different purposes and take various sources of text into account.

However, before building functions, we need to first identify and cover as many different formats of scientific names as possible. The inconsistencies of scientific names make things complicated for us. The common rules for scientific names follow the order of:

genus, [species], [subspecies], [author, year], [location]

Many components are optional, and some components like author and location can be one or even more. Therefore, when we’re parsing the text, we need to analyze the structure of the text and match it with all possible patterns of scientific names and identify the most likely one. To resolve the problem more accurately, we can even draw help from NLTK (Natural Language Toolkit) packages to help us identify “PERSON” and “LOCATION” so that we can analyze the components of scientific names more efficiently.

Function: find_taxoname (url/input_file, output_file)

  • Objective: This is a function to search scientific names with supplied texts, especially applied to the situation when the text is not well structured.
  • Parameters: The first parameter is the URL of a web page (HTML based) or the file path of a PDF/TXT file, which is our source text to search for the biology taxonomic names. The second parameter is the file path of the output file and it will be in a tabular format including columns of the genus, species, subspecies, author, year.
  • Approach: Since this function is intended for the unstructured text, we can’t find a certain pattern to parse the taxonomic names. What we can do is utilizing existing standard dictionaries of scientific names to locate the genus, species, subspecies. By analyzing the surrounding structure and patterns, we can find corresponding genus, species, subspecies, author, year, if they exist, and output the findings in a tabular format.

Function: parse_taxolist(input_file, filetype, sep, output_file, location)

  • Objective: This is a function to parse and extract taxonomic names from a given structured text file and each row must contain exactly one entry of the scientific names. If the location information is given, the function can also list and return the exact location (latitude and longitude) of the species. The output is also in a tabular format including columns of genus, species, subspecies, author(s), year(s), location(s), latitude(s), longitude(s).
  • Parameters: The first parameter is the file path of the input file and the second parameter is the file type, which supports txt, PDF, CSV types of files. The third parameter ‘sep’ should indicate the separators used in the input file to separate every word in the same row. The fourth parameter is the intended file path of the output file. The last parameter is a Boolean, indicating whether the input file contains the location information. If ‘true’, then the output will contain detailed location information.
  • Approach: The function will parse the input file based on rows and the given separators into a well-organized tabular format. An extended function is to point the exact location of the species if the related information is given. With the location info such as “Mirik, West Bengal, India”, the function will return the exact latitude and longitude of this location as 26°53’7.07″N and 88°10’58.06″E. It can be realized through crawling the web page of or utilizing the API of Google Map. This is also a possible solution to help us identify whether the content of the text represents a location. If it cannot get exact latitude and longitude, then it’s not a location. If a scientific name doesn’t contain location information, the function will return NULL value for the location part. If it contains multiple locations, the function will return multiple values as a list as well as the latitudes and longitudes.

Function: recursive_crawler(url, htmlnode_taxo, htmlnode_next, num, output_file, location)

  • Objective: This function is intended to crawl the web pages containing information about taxonomic names recursively. The start URL must be given and the html_node of the scientific names should also be indicated. Also, if the text contains location info, the output will also include the detailed latitude and longitude.
  • Parameters: The first parameter is the start URL of the web page and the following web pages must follow the same structure as the first web page. The second parameter is the html_node of the taxonomic names, such as “.SP .SN > li”. (There’re a lot of tools for the users to identify the HTML nodes code for certain contexts). The third parameter is the html_node of the next page, which can lead us to the next page of another genus. The fourth parameter ‘num’ is the intended number of web pages the user indicates. If ‘num’ is not given, the function will automatically crawl and stop until the htmlnode_next cannot return a valid URL. The next two parameters are the same with the above two functions.
  • Approach: For the parsing part and getting the location parts, the approach is the same as the above functions. For the crawling part, for a series of structured web pages, we can parse and get valid scientific names based on the given HTML nodes. The HTML nodes for the next pages should also be given, and we can always get the URL of the next page by extracting it from the source code. For example, the following screenshot from the web page we used provides a link which leads us to the next page. By recursively fetching the info from the current page and jump to the following pages, we can output a well-organized tabular file including all the following web pages.
  • Other possible functionalities to be realized

Since inconsistencies might exist in the format of scientific names, I also need to construct a function to normalize the names. The complication always lies in the author part, and there can be two approaches to address the problem. The first one is still analyzing the structure of the scientific name and we can try to capture as many exceptions as possible, such as author names which have multiple parts or there’re two authors. The second approach is to draw help from the NLTK package to identify possible PERSON names. However, when it gets too complicated, the parsing result won’t be very accurate all the time. Therefore, we can add a parameter to suggest how reliable our result is and indicate the need for further manual process if the parser cannot work reliably.

Source:: R News

Marketing Multi-Channel Attribution model with R (part 2: practical issues)

By Sergey Bryl’ ( blog)

(This article was first published on R language – AnalyzeCore – data is beautiful, data is a story, and kindly contributed to R-bloggers)

This is the second post about the Marketing Multi-channel Attribution Model with Markov chains (here is the first one). Even though the concept of the first-order Markov chains is pretty simple, you can face other issues and challenges when implementing the approach in practice. In this article, we will review some of them. I tried to organize this article in a way that you can use it as a framework or can help you to create your own.

The main steps that we will review are the following:

  • splitting paths depending on purchases counts
  • replacing some channels/touch points
  • a unique channel/touchpoint case
  • consequent duplicated channels in the path and higher order Markov chains
  • paths that haven’t led to a conversion
  • customer journey duration
  • attributing revenue and costs comparisons

As usually, we start by simulating the data sample for experiments that includes customer ids, date stamp of contact with a marketing channel, marketing channel and conversion mark (0/1).

click to expand R code


##### simulating the "real" data #####
df_raw %
 group_by(customer_id) %>%
 mutate(conversion = sample(c(0, 1), n(), prob = c(0.975, 0.025), replace = TRUE)) %>%
 ungroup() %>%
 dmap_at(c(1, 3), as.character) %>%
 arrange(customer_id, date)

df_raw %
 mutate(channel = ifelse(channel == 'channel_2', NA, channel))

In addition, I’ve replaced channel_2 with NA values. The initial data sample looks like:

1. Splitting paths depending on purchases counts

It makes sense to attribute paths of the first purchase and of the n-th purchase separately.
We assume that each subsequent purchase moves the customer deeper in her lifecycle with a company. We can expect a huge difference between the first customer journey and, for instance, the tenth. Therefore, marketing channels work differently for customers on different phases of their lifecycles. I recommend splitting paths by purchase and compute the model for first-time buyers and n-times buyers separately. For rational splitting, the concept of Life-Cycle Grids can be very helpful.
For instance, if the customer’s path of her lifetime with the company looks like this:
C1 -> C4 -> C2 -> C3 -> conversion (first purchase) -> C2 -> C3 -> conversion (second purchase) -> C3 -> conversion (third purchase) -> C5.
We can split it like this:
a) C1 -> C4 -> C2 -> C3 -> conversion (first purchase),
b) C2 -> C3 -> conversion (second purchase),
c) C3 -> conversion (third purchase),
d) C5.
After this, compute the attribution models for the path a), and b) and c) separately. Path d) can be used for the generic probabilistic model (see point #5).
We can add the serial number of the path by using, for example, the lagged cumulative sum of conversion binary marks with the following simple code:
click to expand R code

##### splitting paths #####
df_paths %
 group_by(customer_id) %>%
 mutate(path_no = ifelse(, 0, lag(cumsum(conversion))) + 1) %>%
You can see now that, for example, customer id18055 has 4 paths:
1) channel_1 -> conversion #1
2) channel_4 -> channel_0 -> channel _6 -> conversion #2
3) channel_4 -> conversion #3
4) channel_6 -> NA -> channel_5
Using the Life-Cycle Grids concept (or a different one) we can split the data set into different sets and compute attribution separately for different customer segments. For simplicity, we will compute attribution for first-purchasers only. Therefore, the code is the following:
click to expand R code

df_paths_1 %
 filter(path_no == 1) %>%

2. Replacing/removing touchpoints

It makes sense to replace or remove some channels when:
1) the marketing channel is unknown (NA value) due to variety of reasons
2) there is a specific channel in the path that we don’t want to attribute such as Direct channel.
There are two main methods for these cases: either to remove NA/Direct channels or to replace them with the previous channel in the path or we can combine both methods: remove NAs and replace Direct channel.
Note: by using replacing with the first-order Markov chains, we will obtain the same outcomes as those comparing with the removing method because duplicated touchpoints don’t affect the outcomes (see point #4).
The following are examples of typical transformations from the combined approach:
1) initial path “C1 -> C2 -> Direct -> C3 -> conversion” we transform to “C1 -> C2 -> С2 -> C3 -> conversion”
2) initial path “Direct -> C3 -> C1 -> C2 -> conversion” we transform to “C3 -> C1 -> C2 -> conversion”
3) initial path “C1 -> C2 -> NA -> C3 -> conversion” we transform to “C1 -> C2 -> C3 -> conversion”
4) initial path “C3 -> NA -> Direct -> C2 -> conversion” we transform to “C3 -> C3 -> C2 -> conversion”.
Let’s assume that we want to replace channel_6 with the previous non-channel_6 and skip unknown (NA) touch points. Note the following assumptions:
1) we will remove paths NA -> conversion. My point is that it doesn’t make sense to attribute an unknown channel even if it brings a conversion. We would not use this information.
2) we will remove Direct (channel_6) if it is the first in the path because we don’t have a previous channel to be replaced with. Again, in the case Direct -> conversion path we will remove conversion and my idea is the same as with the NA case.
On the other hand, you can adapt the code to your own point of view. We will use the following code:
click to expand R code

##### replace some channels #####
df_path_1_clean %
 # removing NAs
 filter(! %>%
 # adding order of channels in the path
 group_by(customer_id) %>%
 mutate(ord = c(1:n()),
 is_non_direct = ifelse(channel == 'channel_6', 0, 1),
 is_non_direct_cum = cumsum(is_non_direct)) %>%
 # removing Direct (channel_6) when it is the first in the path
 filter(is_non_direct_cum != 0) %>%
 # replacing Direct (channel_6) with the previous touch point
 mutate(channel = ifelse(channel == 'channel_6', channel[which(channel != 'channel_6')][is_non_direct_cum], channel)) %>%
 ungroup() %>%
 select(-ord, -is_non_direct, -is_non_direct_cum)

3. One- and multi-channel paths issue

It makes sense to split a unique channel and multi-channel paths.
As I mentioned in the previous article, when using the Removal Effect, you should calculate the weighted importance for each channel/touchpoint because the sum of the Removal Effects doesn’t equal to 1.
In case we have a path with a unique channel, the Removal Effect and importance of this channel for that exact path is 1. However, weighting with other multi-channel paths will decrease the importance of one-channel occurrences. That means that, in case we have a channel that occurs in one-channel paths, usually it will be underestimated if attributed with multi-channel paths.
There is also a pretty straight logic behind splitting – for one-channel paths, we definitely know the channel that brought a conversion and we don’t need to distribute that value into other channels.
We can do this with a simple algorithm:
  1. Split data for paths with one or more unique channels
  2. Calculate total conversions for one-channel paths and compute the Markov model for multi-channel paths
  3. Summarize results for each channel.
Let’s check the difference between when we don’t split the data and when we do split it with the following code:
click to expand R code

##### one- and multi-channel paths #####
df_path_1_clean %
 group_by(customer_id) %>%
 mutate(uniq_channel_tag = ifelse(length(unique(channel)) == 1, TRUE, FALSE)) %>%

df_path_1_clean_uniq %
 filter(uniq_channel_tag == TRUE) %>%

df_path_1_clean_multi %
 filter(uniq_channel_tag == FALSE) %>%

### experiment ###
# attribution model for all paths
df_all_paths %
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 conversion = sum(conversion)) %>%
 ungroup() %>%
 filter(conversion == 1)

mod_attrib %
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 conversion = sum(conversion)) %>%
 ungroup() %>%
 filter(conversion == 1)

mod_attrib_alt %
 filter(conversion == 1) %>%
 group_by(channel) %>%
 summarise(conversions = sum(conversion)) %>%

d_multi %
 mutate(result = total_conversions + conversions)

As you can see, the total sum is equal but attribution is different because for the reason that I mentioned earlier.

4. Higher order of Markov chains and consequent duplicated channels in the path

It doesn’t matter to skip or not duplicates for the first-order Markov chains.
The ChannelAttribution package allows us to change the order of Markov chains via the “order” parameter. This means we can compute transition probabilities based on the previous two, three or more channels.
When using one-order Markov chains, a subsequence of the same channels in a path (duplicates) can be reduced to one channel (for example C2 in the path C1 → C2 → C2 → C2 → C3 can be reduced to C1 → C2 → C3). Because, mathematically, it doesn’t matter how many times each C2 goes through the loop with itself in the transition matrix, it will be in the C3 state finally. Therefore, we will obtain different transition matrices but the same Removal Effect for channels with or without subsequent duplicates.
However, increasing the order of the Markov graph will affect the results if we skip C2. This is due to computing probabilities based on, for example, two subsequent channels as one segment of the path in the case of the second-order Markov graph. Therefore, removing consequent duplicated channels can have a huge effect for higher order Markov chains.
In order to check the effect of skipping duplicates in the first-order Markov chain, we will use my script for “manual” calculation because the package skips duplicates automatically. The following is the code for computing the attribution “manually”:
click to expand R code

##### Higher order of Markov chains and consequent duplicated channels in the path #####

# computing transition matrix - 'manual' way
df_multi_paths_m %
 mutate(path = paste0('(start) > ', path, ' > (conversion)'))
m ')) + 1 # maximum path length

df_multi_paths_cols  ', names = c(1:m))
colnames(df_multi_paths_cols) %
 select(num_range("ord_", c(i, i+1))) %>%
 na.omit() %>%
 group_by_(.dots = c(paste0("ord_", c(i, i+1)))) %>%
 summarise(n = n()) %>%
 colnames(df_cache)[c(1, 2)] %
 group_by(channel_from, channel_to) %>%
 summarise(n = sum(n)) %>%
 ungroup() %>%
 group_by(channel_from) %>%
 mutate(tot_n = sum(n),
 perc = n / tot_n) %>%

df_dummy %
 filter(conversion == 1) %>%
channels_list % select(channel_from, channel_to)
df_attrib %
 mutate(channel_from = ifelse(channel_from == channel, NA, channel_from),
 channel_to = ifelse(channel_to == channel, '(null)', channel_to)) %>%
 df_res_tot1 %
 group_by(channel_from, channel_to) %>%
 summarise(n = sum(n)) %>%
 ungroup() %>%
 group_by(channel_from) %>%
 mutate(tot_n = sum(n),
 perc = n / tot_n) %>%
 df_res_tot1 %
 mutate(tot_conversions = sum(df_multi_paths_m$conversion),
 impact = (tot_conversions - conversions) / tot_conversions,
 tot_impact = sum(impact),
 weighted_impact = impact / tot_impact,
 attrib_model_conversions = round(tot_conversions * weighted_impact)
 ) %>%
 select(channel_name, attrib_model_conversions)
As you can see, even when we’ve obtained different transition matrices (trans_matrix_prob_m vs. trans_matrix_prob), the removal effects and attribution results are the same (df_attrib vs. mod_attrib_alt$result) for the package (that skipped duplicated subsequent channels) as with “manual” calculations (with duplicates).

5. Deal with paths that haven’t led to conversion

If you can track both user journeys that finished or not in conversions, doing this will help you obtain extra value from advanced methods.
We need paths with conversions only for attributing marketing channels. However, if you collect both customer journeys that finished or not in conversions, that will give you some advanced opportunities: a generic probabilistic model that has both positive (conversion) and negative (null) outcomes. The model can allow you to look at the complete “picture” of the business, not just valuable transitions.
Additionally, the complete probabilistic model can be used as a predictive model:
  1. At the least, you can model:
  • customers’ flow through marketing channels,
  • a projection of how many customers will touch the exact marketing channel after N touches or how many conversions you can obtain if you attract traffic (N visits) from this or that channel,
  • what channels have a higher probability of user churn and thus know whether or not we should change the acquisition through them.
  1. an advanced approach that allows to use transition probabilities and other features like visits recency, frequency, durations, demographics, etc. in order to predict conversions with machine learning algorithms.
These are rather complex questions that cover such topics as channels capacity (when a linear increase in the costs of acquisition doesn’t lead to a linear increase in conversions), marketing budgets allocation, a possible reduction in customers life-time value from attracting less loyal customers or one-time buyers, etc. These aspects are not a topic of this article. I just want you to pay attention to the fact that the real world is more complex than in the Markov chains model. Nonetheless, using the approach with the right amount of attention and understanding of the business domain can bring about good results.
Ok, let’s compute complete probabilistic model for the first paths we have with the following code:
click to expand R code

##### Generic Probabilistic Model #####
df_all_paths_compl %
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 conversion = sum(conversion)) %>%
 ungroup() %>%
 mutate(null_conversion = ifelse(conversion == 1, 0, 1))

mod_attrib_complete %
 dmap_at(c(1, 2), as.character)

##### viz #####
edges %
 distinct(id) %>%
 arrange(id) %>%
 label = id,
 color = ifelse(
 label %in% c('(start)', '(conversion)'),
 ifelse(label == '(null)', '#ce472e', '#ffd73e')
 shadow = TRUE,
 shape = "box"

 height = "2000px",
 width = "100%",
 main = "Generic Probabilistic model's Transition Matrix") %>%
 visIgraphLayout(randomSeed = 123) %>%
 visNodes(size = 5) %>%
 visOptions(highlightNearest = TRUE)
This time I used the really cool R package visNetwork for Markov graph visualization. It looks a bit messy with nine nodes but it is interactive and you can move nodes, highlight them as well as edges and do other transformation thanks to the flexibility this package provides.
The following is an example of how you can model customers transition through marketing channels and project the number of conversions. Let’s assume that we are going to attract 1000 visits from channel_5 and we want to model how many conversions we will obtain or see what channels customers will make contact with in several steps. The script involves manipulating the matrix of transition probabilities associated with the Markov chain.
Once we have the transition matrix computed, we can project in what state (channel) that customer contact will be, for example, in 5 steps (5th degree) or how many conversion we can obtain (we use 100,000 degrees to makes sure that all customers transited through the transition matrix).
click to expand R code

##### modeling states and conversions #####
# transition matrix preprocessing
trans_matrix_complete %
 mutate(transition_probability = perc) %>%
 select(channel_from, channel_to, transition_probability))

6. Customer journey duration

Usually, we compute the model regularly for the company’s standard reporting period.
When speaking about the Attribution model, it is quite simple to deal with durations. Once we have a conversion, we extract retrospective contacts with marketing channels and we can limit a retrospective period of for example 30 or 90 days based on our knowledge about the typical duration of customer lifetime cycle and omit contacts that happened earlier (or even compute the model without limitations). For defining retrospective periods, we can look at the distribution of durations from the first touch to conversion and choose e.g. 95% of occurrences.
In R we can obtain these analytics with the following code:
click to expand R code

##### Customer journey duration #####
# computing time lapses from the first contact to conversion/last contact
df_multi_paths_tl %
 group_by(customer_id) %>%
 summarise(path = paste(channel, collapse = ' > '),
 first_touch_date = min(date),
 last_touch_date = max(date),
 tot_time_lapse = round(as.numeric(last_touch_date - first_touch_date)),
 conversion = sum(conversion)) %>%

# distribution plot
ggplot(df_multi_paths_tl %>% filter(conversion == 1), aes(x = tot_time_lapse)) +
 theme_minimal() +
 geom_histogram(fill = '#4e79a7', binwidth = 1)

# cumulative distribution plot
ggplot(df_multi_paths_tl %>% filter(conversion == 1), aes(x = tot_time_lapse)) +
 theme_minimal() +
 stat_ecdf(geom = 'step', color = '#4e79a7', size = 2, alpha = 0.7) +
 geom_hline(yintercept = 0.95, color = '#e15759', size = 1.5) +
 geom_vline(xintercept = 23, color = '#e15759', size = 1.5, linetype = 2)
You can see that 23 days period covers 95% of paths.
It is much more complex to deal with duration for the generic probabilistic model. We can use the same approach for paths that finished in a conversion as of reporting date but, in addition, we need to manage paths that did not. We can’t be sure which of them we should accept as fruitless but which of them has a higher chance to bring us a conversion in the next reporting date/period.
Let’s visualize this issue for better understanding with the following code assuming reporting date as of January 10, 2016:
click to expand R code

### for generic probabilistic model ###
df_multi_paths_tl_1 % select(customer_id, first_touch_date, last_touch_date, conversion),
 id.vars = c('customer_id', 'conversion'), = 'touch_date') %>%
You can see that id10072 finished the path in a conversion so we can add its retrospective touchpoints into the model. On the other hand, id10010‘s, id1001‘s and id10005‘s paths are fruitless as of reporting date but customer id10010 will purchase on January 19, 2016, customer id1001 will contact with a marketing channel on January 15, 2016, but won’t purchase and customer id10005 won’t have any new contacts with marketing channels, it is fruitless.
Therefore, our task is to compute generic model trying to identify which paths are completed as of reporting date both in a conversion or not. For example, we should use paths of id10072 and id10005 customers for computing the model, because we don’t expect new contacts or first purchases from them anymore.
We need to develop some criteria for identifying if an exact path has a higher chance to finish in a conversion or not. Again, we can analyze stats that characterize successful paths and make the assumption that if a path violates common stats towards success then it is fruitless with high probability.
There are at least two values I can suggest to start with:
1) time lapse from the first contact,
2) time lapse between the conversion date and a previous contact.
These values can be used as a combination of rules. From the above analysis we know that 95% of purchases were done in 23 days and we can compute time lapses between the conversion date and the previous contact with the following code:
click to expand R code

df_multi_paths_tl_2 %
 group_by(customer_id) %>%
 mutate(prev_touch_date = lag(date)) %>%
 ungroup() %>%
 filter(conversion == 1) %>%
 mutate(prev_time_lapse = round(as.numeric(date - prev_touch_date)))
# distribution
ggplot(df_multi_paths_tl_2, aes(x = prev_time_lapse)) +
 theme_minimal() +
 geom_histogram(fill = '#4e79a7', binwidth = 1)

# cumulative distribution
ggplot(df_multi_paths_tl_2, aes(x = prev_time_lapse)) +
 theme_minimal() +
 stat_ecdf(geom = 'step', color = '#4e79a7', size = 2, alpha = 0.7) +
 geom_hline(yintercept = 0.95, color = '#e15759', size = 1.5) +
 geom_vline(xintercept = 12, color = '#e15759', size = 1.5, linetype = 2)
We can see that 95% of customers have made a purchase during 12 days from the previous contact. Therefore, we can assume that if a customer made contact with a marketing channel the first time for more than 23 days and/or hasn’t made contact with a marketing channel for the last 12 days, then it is a fruitless path. Hint: for more accurate assumptions, these rules can be computed depending on the exact marketing channels.
We can extract a data for the generic probabilistic model when both rules are true with the following code:
click to expand R code

# extracting data for generic model
df_multi_paths_tl_3 %
 group_by(customer_id) %>%
 mutate(prev_time_lapse = round(as.numeric(date - lag(date)))) %>%
 summarise(path = paste(channel, collapse = ' > '),
 tot_time_lapse = round(as.numeric(max(date) - min(date))),
 prev_touch_tl = prev_time_lapse[which(max(date) == date)],
 conversion = sum(conversion)) %>%
 ungroup() %>%
 mutate(is_fruitless = ifelse(conversion == 0 & tot_time_lapse > 20 & prev_touch_tl > 10, TRUE, FALSE)) %>%
 filter(conversion == 1 | is_fruitless == TRUE)

7. Attributing revenue and costs comparison

Once we have the marketing channels attributed, we can compute their effectiveness. The first way is by comparing cost per action or CPA (conversion in our case) for different channels. For this, we need to divide the total cost spent on the exact channel by the modeled number of conversions and compare. For instance:
Additionally, the ChannelAttribution package allows to distribute revenue through channels. For this, we need to add the parameter “var_value” with a column of revenues into the markov_model() function. Therefore, it is possible to compare channels’ gross margin.


As you can see, using the Markov chains approach provides an effective way for marketing channels attribution. There are quite a lot of other uses of the approach:
  • the Markov model can be effectively used for LTV prediction
  • we can include additional types of interactions into the model (for example, banner impressions, tv ads, calls to callcenter and so on) and evaluate their influence
  • develop “likely to buy” predictive model using ML algorithms
and so on.
I’m going to share another method for attributing marketing channels that is a mix of probabilistic and heuristic models based on sales funnel. It is very interesting, don’t miss it!

The post Marketing Multi-Channel Attribution model with R (part 2: practical issues) appeared first on AnalyzeCore – data is beautiful, data is a story.

To leave a comment for the author, please follow the link and comment on their blog: R language – AnalyzeCore – data is beautiful, data is a story. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

simmer 3.6.2

By Iñaki Úcar


(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The second update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN, thus inaugurating a bi-monthly release cycle. I must thank Duncan Garmonsway (@nacnudus) for creating and now maintaining “The Bank Tutorial: Part I” vignette, Franz Fuchs for finding an important and weird memory bug (here) that prevented simmer from freeing the allocated memory (all 3.x.x versions are affected up to this release), and the Rcpp people for enduring me while I was helplessly searching for a solution to this.

My special thanks to Kevin Ushey (@kevinushey), who finally found the bug. As it happens, the bug was not in simmer or Rcpp but in magrittr, and the problem is that the pipe operator, in its inscrutable magic, creates a new environment for unnamed functions (instead of the current one, as it should be), and there it stores a reference to the first object in the pipe. More or less. Further details here.

Anyway, if somebody faces the same problem, know that there is a workaround: you just need to delete that hidden reference, as simmer does in this release to get rid of the memory issues. Happy simmering!

Minor changes and fixes:

  • Update “The Bank Tutorial: Part I” vignette (@nacnudus in #90).
  • Fix trap()‘s handler cloning and associated test (#91).
  • Apply select()‘s policy also when resources is a function (#92).
  • Accept dynamic timeouts in batches (#93).
  • Change rollback()‘s default behaviour to times=Inf, i.e., infinite loop (#95).
  • Stop and throw an error when timeout() returns a missing value (#96 and #97).
  • Fix memory management: resetting the environment was clearing but not deallocating memory (#98, fixed in #99).
  • Fix object destruction: workaround for tidyverse/magrittr#146 (#98, fixed in effcb6b).

Article originally published in simmer 3.6.2.

To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News