Live Presentation: My Talk at the CDC

By Ari Lamstein

Last Thursday I had the honor of giving a talk about my open source work to the CDC’s R Users Group. A big thank you to the CDC for having me!

As the creator of a software package that maps open data, it was wonderful to speak at the CDC: not only do they publish interesting data, they also frequently need to map that data!

My presentation was an hour long. This allowed me to cover a lot of the material that appears in my various courses. I also got to discuss several features that I don’t often have a chance to write about.

I thought that other people might enjoy this presentation as well. So this Thursday at 9am PT I will be running the presentation as a live webinar that is open to the public. You can attend by registering below.

Note: if you are interested in watching the talk but unable to attend live, I encourage you to register anyway: I will be sending out a link to the recording afterwards.


The post Live Presentation: My Talk at the CDC appeared first on

Source:: R News

Announcing community R workshops

By Steph

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

A big part of why I’ve launched Locke Data is so that I can give back more to my communities. I want to give more time and more support to others. One of the first steps is doing some activities that give financial support to community groups without damaging my startup cashflow! Community R workshops that fund local user groups is the first activity I’ll be trialling.

Here’s what’s involved, and what you might want to consider if you’d like to be a part of this endeavour:

The guinea pig

My first community R workshop will be in Exeter, UK next month. The Logistic Regressions in R workshop will be an afternoon of practical hands-on training; building models from data selection to evaluation. After the workshop, I’ll then be doing a talk on data.table at the Data Science Exeter User Group.


After covering the room costs and my travel, all the funds will be going to the user group to cover costs for as long as the money will stretch.

  • The training is low-cost for attendees. For Exeter, that means ~£50 for the afternoon per person, with a group discount available.
  • I’m not sure how many of these I can do in a month, but I’m currently looking at just one a month for now.
  • If you can get a free room, that will also lower costs considerably.
  • It’s also probably the most practical in the UK, where large travel costs won’t be incurred so more money can go to the group.

Potential workshops

Right now, I’m looking to make these “cookie-cutter” workshops so that I’ll be able to ramp up how many can be performed later. Workshop topics that can currently be selected are:

  • SQL fundamentals
  • Introduction to R
  • Logistic regression in R
  • Reproducible analysis in R
  • Improving R data manipulation skills
  • R package development

Give feedback

I’d be really interested to hear what people think about this endeavour, especially any past experiences trialling anything similar!

Book one

If you’re an R, data science, Microsoft Data Platform, or developer group who would like to do one of these paid workshops and user group talk combos, then please fill out this form to kick off the process.

The post Announcing community R workshops appeared first on Locke Data.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

March R Course Finder Update: Web Analysis, Data Mining and Advanced R

By Onno Dijt


A few months ago we launched R Course Finder, an online directory that helps you to find the right R course quickly. With so many R courses available online, we thought it was a good idea to offer a tool that helps people to compare these courses, before they decide where to spend their valuable time and (sometimes) money.

If you haven’t looked at it yet, go to the R Course Finder now by clicking here.

Last month we added 17 courses to the Course Finder. Currently we are at 166 courses on 16 different online platforms, and 2 offline Learning Institutes.

New added websites include:

Highlighted Courses

A complete journey to web analytics using R tool

If your an aspiring data scientist, or just interested in consumer behavior and marketing, A complete journey to web analytics using R tool is the course for you. In todays online webshops, web analytics is central in the business strategy, and this course teaches you everything! learn how to extract data from google analytics, turn it into real strategic business indicators. In section 7 of this course you will find out about differentiating new users, analyzing data from mobile browsers and optimizing your traffic by time of the day.

Data Mining-Unsupervised Learning Using R

Without a doubt one of the hottest issues in data science remains how to combine fresh sources of data to optimize the solution to your problem. The course, Data Mining-Unsupervised Learning Using R will teach you in over 4.5 hours of lectures and Quizzes anything related to Network analysis, dimensions reduction and association rules.

Advanced R

Do you have 6 months to 1 year of R experience? Do you feel that all courses are aimed at beginners and can’t help you take the next step in your progress? Advanced R actually does what it says, this high pace course dives into advanced topics that will allow you to optimize your workflow in R. How do we speed up code? Embed code from different languages in R and quickly move on to regex text parsing. In just 4.5 hours the author covers a whole lot of topics that, as an intermediate R user will feel like a fresh breath of air to motivate you forward.

Besides these courses, we also added:

Learning Path: R Programming
R Data Mining Projects
Object-Oriented Programming in R: S3 and R6
ARIMA Modeling with R
Understanding SQL and R Training Video
Exploratory Multivariate Data Analysis

How you can help to make R Course Finder better

  • If you miss a course that is not included yet, please post a reminder in the comments and we’ll add it.
  • If you miss an important filter or search functionality, please let us know in the comments below.
  • If you already took one of the courses, please let all of us know about your experiences in the review section, an example is available here.

And, last but not least: If you like R Course Finder, please share this announcement with friends and colleagues using the buttons below.

Source:: R News


By Jason C Fisher


(This article was first published on jfisher-usgs, and kindly contributed to R-bloggers)

Early in my science career I believed that any good piece of software included a graphical user interface (GUI).
That was many years ago, and during that time I delved deep into the art of Tcl/Tk programming;
using the tcltk package to create GUI’s in R.
Countless hours were spent struggling over the R-Tcl/Tk interface and educating myself on the best practices and principles for GUI design.
After a few years of toil, I finally started feeling proficient with GUI construction.
It was around this time that I also became aware of the importance of reproducible research;
thanks to the many discussions on the topic within the R community.
Rather quickly my GUI development ceased, and all of the GUI-based applications
I relied so heavily on for my research (such as Excel, ArcGIS, and Illustrator) were no longer relevant.
Scripting has become the essential ingredient for making my research reproducible,
and I believe my work has benefited immensely from taking this approach.

More recently I’ve become a bit of a zealot about preaching the importance of reproducibility to my fellow scientists.
These are career researchers in the earth sciences,
typically with little to no programming experience—smart individuals on a strictly GUI-based diet.
I’m disappointed to report that my efforts have resulted in very few converts.
My colleagues have gone through a couple of R-training workshops,
been made aware of countless online materials for learning to code, and are still having difficulties crossing over.
The biggest complaint I here is that the learning curve is just too steep.
And I get it, learning your first programming language can be extremely difficult,
especially when you’re trying to meet publication deadlines.

This may come as blasphemy to some, but perhaps the best way to reach out to these folks is through R-GUI applications
(such as R Commander).
Start them with the familiar and let these gateway applications slowly get them hooked on R.
Reasoning such as this have spurred me to dust off one of my old R GUI’s,
RSurvey, a cross-platform geographic information system (GIS) application.
Perhaps not as feature rich as other GIS applications (such as QGIS),
but completely written in R and fairly simple to use.
The package repository is located on GitHub.

To install the stable version of RSurvey from CRAN,
open an R session and type the following command


If you’re running into difficulties with package installation, see the package
README file for possible solutions.
To load the package and launch the main GUI, use the command


If any of the suggested packages are missing, you will be prompted to install them when it first starts up.
The main GUI should look something like this

No manual currently exists for the RSurvey package; its absence is mainly attributed to the authors laziness;
but also from a sense that the software should reach a certain level of maturity before embarking on such an endeavor,
thus avoiding endless document revisions.
That said, please keep in mind that considerable effort was put into making the user interface easy enough to use without instruction.

In the remainder of this post, windows are shown for a typical RSurvey session that includes
data import, data wrangling, and data visualization.
This example session illustrates only a small fraction of the package’s functionality.
You’ll need to explore the package on your own to discover its full potential.
Any comments/guidance would be greatly appreciated, you can contact me via email.
Bug reports and feature requests can be submitted on the GitHub
Issues page.







And let’s not forget the R session information.

## Session info ------------------------------------
##  setting  value
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, mingw32
##  ui       Rgui
##  language (EN)
##  collate  English_United States.1252
##  tz       America/Los_Angeles
##  date     2017-02-24
## Packages ----------------------------------------
##  package     * version  date       source
##  devtools      1.12.0   2016-06-24 CRAN (R 3.3.2)
##  digest        0.6.12   2017-01-27 CRAN (R 3.3.2)
##  htmltools     0.3.5    2016-03-21 CRAN (R 3.3.2)
##  htmlwidgets   0.8      2016-11-09 CRAN (R 3.3.2)
##  httpuv        1.3.3    2015-08-04 CRAN (R 3.3.2)
##  jsonlite      1.2      2016-12-31 CRAN (R 3.3.2)
##  knitr         1.15.1   2016-11-22 CRAN (R 3.3.2)
##  magrittr      1.5      2014-11-22 CRAN (R 3.3.2)
##  memoise       1.0.0    2016-01-29 CRAN (R 3.3.2)
##  mime          0.5      2016-07-07 CRAN (R 3.3.2)
##  R6            2.2.0    2016-10-05 CRAN (R 3.3.2)
##  Rcpp          0.12.9   2017-01-14 CRAN (R 3.3.2)
##  rgl           0.97.0   2017-01-10 CRAN (R 3.3.2)
##  RSurvey     * 0.9.1    2017-02-24 CRAN (R 3.3.2)
##  shiny         1.0.0    2017-01-12 CRAN (R 3.3.2)
##  withr         1.0.2    2016-06-20 CRAN (R 3.3.2)
##  XML           3.98-1.5 2016-11-10 CRAN (R 3.3.2)
##  xtable        1.8-2    2016-02-05 CRAN (R 3.3.2)
To leave a comment for the author, please follow the link and comment on their blog: jfisher-usgs. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Building deep neural nets with h2o and rsparkling that predict arrhythmia of the heart

By Shirin's playgRound

Last week, I introduced how to run machine learning applications on Spark from within R, using the sparklyr package. This week, I am showing how to build feed-forward deep neural networks or multilayer perceptrons. The models in this example are built to classify ECG data into being either from healthy hearts or from someone suffering from arrhythmia. I will show how to prepare a dataset for modeling, setting weights and other modeling parameters and finally, how to evaluate model performance with the h2o package via rsparkling.

Deep learning with neural networks

Deep learning with neural networks is arguably one of the most rapidly growing applications of machine learning and AI today. They allow building complex models that consist of multiple hidden layers within artifiical networks and are able to find non-linear patterns in unstructured data. Deep neural networks are usually feed-forward, which means that each layer feeds its output to subsequent layers, but recurrent or feed-back neural networks can also be built. Feed-forward neural networks are also called multilayer perceptrons (MLPs).

H2O and Sparkling Water

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O can be integrated with Apache Spark (Sparkling Water) and therefore allows the implementation of complex or big models in a fast and scalable manner. H2O distributes a wide range of common machine learning algorithms for classification, regression and deep learning.

Sparkling Water can be accessed from R with the rsparkling extension package to sparklyr and h2o. Check the documentation for rsparkling to find out which H2O, Sparkling Water and Spark versions are compatible.

Preparing the R session

First, we need to load the packages and connect to the Spark instance (for demonstration purposes, I am using a local instance).

options(rsparkling.sparklingwater.version = "2.0.3")


sc <- spark_connect(master = "local", version = "2.0.0")

I am also preparing my custom plotting theme.


my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "darkgrey", color = "grey", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)

Arrhythmia data

The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because they are too many and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at for building the models. For a description of each feature, see The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 being all remaining arrhythmia cases that didn’t fit into distinct classes.

arrhythmia <- read.table("", sep = ",")

# making sure, that all feature columns are numeric
arrhythmia[-280] <- lapply(arrhythmia[-280], as.numeric)

#  renaming output column and converting to factor
colnames(arrhythmia)[280] <- "class"
arrhythmia$class <- as.factor(arrhythmia$class)

As usual, I want to get acquainted with the data and explore it’s properties before I am building any model. So, I am first going to look at the distribution of classes and of healthy and arrhythmia samples.

p1 <- ggplot(arrhythmia, aes(x = class)) +
  geom_bar(fill = "navy", alpha = 0.7) +

Because I am interested in distinguishing healthy from arrhythmia ECGs, I am converting the output to binary format by combining all arrhythmia cases into one class.

arrhythmia$diagnosis <- ifelse(arrhythmia$class == 1, "healthy", "arrhythmia")
arrhythmia$diagnosis <- as.factor(arrhythmia$diagnosis)
p2 <- ggplot(arrhythmia, aes(x = diagnosis)) +
  geom_bar(fill = "navy", alpha = 0.7) +

grid.arrange(p1, p2, ncol = 2)

With binary classification, we have almost the same numbers of healthy and arrhythmia cases in our dataset.

I am also interested in how much the normal and arrhythmia cases cluster in a Principal Component Analysis (PCA). I am first preparing the PCA plotting function and then run it on the feature data.


pca_func <- function(pcaOutput2, group_name){
    centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
    conf.rgn  <-, lapply(unique(pcaOutput2$groups), function(t)
          data.frame(groups = as.character(t),
                     ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                           centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                           level = 0.95),
                     stringsAsFactors = FALSE)))
    plot <- ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
      geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
      geom_point(size = 2, alpha = 0.5) + 
      labs(color = paste(group_name),
           fill = paste(group_name),
           x = paste0("PC1: ", round(pcaOutput$pov[1], digits = 2) * 100, "% variance"),
           y = paste0("PC2: ", round(pcaOutput$pov[2], digits = 2) * 100, "% variance")) +
pcaOutput <- pca(t(arrhythmia[-c(280, 281)]), printDropped = FALSE, scale = TRUE, center = TRUE)
pcaOutput2 <-$scores)

pcaOutput2$groups <- arrhythmia$class
p1 <- pca_func(pcaOutput2, group_name = "class")

pcaOutput2$groups <- arrhythmia$diagnosis
p2 <- pca_func(pcaOutput2, group_name = "diagnosis")

grid.arrange(p1, p2, ncol = 2)

The PCA shows that there is a big overlap between healthy and arrhythmia samples, i.e. there does not seem to be major global differences in all features. The class that is most distinct from all others seems to be class 9. I want to give the arrhythmia cases that are very different from the rest a stronger weight in the neural network, so I define a weight column where every sample outside the central PCA cluster will get a “2”, they will in effect be used twice in the model.

weights <- ifelse(pcaOutput2$PC1 < -5 & abs(pcaOutput2$PC2) > 10, 2, 1)

I also want to know what the variance is within features.


colvars <- data.frame(feature = colnames(arrhythmia[-c(280, 281)]),
                      variance = colVars(as.matrix(arrhythmia[-c(280, 281)])))

subset(colvars, variance > 50) %>%
  mutate(feature = factor(feature, levels = colnames(arrhythmia[-c(280, 281)]))) %>%
  ggplot(aes(x = feature, y = variance)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Features with low variance are less likely to strongly contribute to a differentiation between healthy and arrhythmia cases, so I am going to remove them. I am also concatenating the weights column:

arrhythmia_subset <- cbind(weights, arrhythmia[, c(281, 280, which(colvars$variance > 50))])

Working with rsparkling and h2o

Now that I have my final data frame for modeling, I copy it to the Spark instance. For working with h2o functions, the data needs to be converted from a Spark DataFrame to an H2O Frame. This is done with the as_h2o_frame() function.

arrhythmia_sc <- copy_to(sc, arrhythmia_subset)
arrhythmia_hf <- as_h2o_frame(sc, arrhythmia_sc, strict_version_check = FALSE)

We can now access all functions from the h2o package that are built to work on H2O Frames. A useful such function is h2o.describe(). It is similar to base R’s summary() function but outputs many more descriptive measures for our data. To get a good overview about these measures, I am going to plot them.

library(tidyr) # for gathering
h2o.describe(arrhythmia_hf[, -1]) %>% # excluding the weights column
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% # separating them into facets makes them easier to see
  mutate(Label = factor(Label, levels = colnames(arrhythmia_hf[, -1]))) %>%
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")

I am also interested in the correlation between features and the output. We can use the h2o.cor() function to calculate the correlation matrix. It is again much easier to understand the data when we visualize it, so I am going to create another plot.

library(reshape2) # for melting

arrhythmia_hf[, 2] <- h2o.asfactor(arrhythmia_hf[, 2]) # diagnosis is now a characer column and we need to convert it again
arrhythmia_hf[, 3] <- h2o.asfactor(arrhythmia_hf[, 3]) # same for class

cor <- h2o.cor(arrhythmia_hf[, -c(1, 3)])
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")

Training, test and validation data

Now we can use the h2o.splitFrame() function to split the data into training, validation and test data. Here, I am using 70% for training and 15% each for validation and testing. We could also just split the data into two sections, a training and test set but when we have sufficient samples, it is a good idea to evaluate model performance on an independent test set on top of training with a validation set. Because we can easily overfit a model, we want to get an idea about how generalizable it is – this we can only assess by looking at how well it works on previously unknown data.

I am also defining response, feature and weight column names now.

splits <- h2o.splitFrame(arrhythmia_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "diagnosis"
weights <- "weights"
features <- setdiff(colnames(train), c(response, weights, "class"))
summary(train$diagnosis, exact_quantiles = TRUE)
##  diagnosis      
##  healthy   :163 
##  arrhythmia:155
summary(valid$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :43 
##  arrhythmia:25
summary(test$diagnosis, exact_quantiles = TRUE)
##  diagnosis     
##  healthy   :39 
##  arrhythmia:27

If we had more categorical features, we could use the h2o.interaction() function to define interaction terms, but since we only have numeric features here, we don’t need this.

We can also run a PCA on the training data, using the h2o.prcomp() function to calculate the singular value decomposition of the Gram matrix with the power method.

pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           k = 3,
           seed = 42)

## Model Details:
## ==============
## H2ODimReductionModel: pca
## Model ID:  PCA_model_R_1488113493291_1 
## Importance of components: 
##                             pc1      pc2      pc3
## Standard deviation     0.598791 0.516364 0.424850
## Proportion of Variance 0.162284 0.120680 0.081695
## Cumulative Proportion  0.162284 0.282965 0.364660
## H2ODimReductionMetrics: pca
## No model metrics available for PCA
## H2ODimReductionMetrics: pca
## No model metrics available for PCA
eigenvec <-$eigenvectors)
eigenvec$label <- features

ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel() +


Now, we can build a deep neural network model. We can specify quite a few parameters, like

  • Cross-validation: Cross validation can tell us the training and validation errors for each model. The final model will be overwritten with the best model, if we don’t specify otherwise.

  • Adaptive learning rate: For deep learning with h2o, we by default use stochastic gradient descent optimization with an an adaptive learning rate. The two corresponding parameters rho and epsilon help us find global (or near enough) optima.

  • Activation function: The activation function defines the node output relative to a given set of inputs. We want our activation function to be non-linear and continuously differentiable.

  • Hidden nodes: Defines the number of hidden layers and the number of nodes per layer.

  • Epochs: Increasing the number of epochs (one full training cycle on all training samples) can increase model performance, but we also run the risk of overfitting. To determine the optimal number of epochs, we need to use early stopping.

  • Early stopping: By default, early stopping is enabled. This means that training will be stopped when we reach a certain validation error to prevent overfitting.

Of course, you need quite a bit of experience and intuition to hit on a good combination of parameters. That’s why it usually makes sense to do a grid search for hyper-parameter tuning. Here, I want to focus on building and evaluating deep learning models, though. I will cover grid search in next week’s post.

dl_model <- h2o.deeplearning(x = features,
                             y = response,
                             weights_column = weights,
                             model_id = "dl_model",
                             training_frame = train,
                             validation_frame = valid,
                             nfolds = 15,                                   # 10x cross validation
                             keep_cross_validation_fold_assignment = TRUE,
                             fold_assignment = "Stratified",
                             activation = "RectifierWithDropout",
                             score_each_iteration = TRUE,
                             hidden = c(200, 200, 200, 200, 200),           # 5 hidden layers, each of 200 neurons
                             epochs = 100,
                             variable_importances = TRUE,
                             export_weights_and_biases = TRUE,
                             seed = 42)

Because training can take a while, depending on how many samples, features, nodes and hidden layers you are training on, it is a good idea to save your model.

h2o.saveModel(dl_model, path = "dl_model")

We can then re-load the model again any time to check the model quality and make predictions on new data.

dl_model <- h2o.loadModel("dl_model/dl_model")

Model performance

We now want to know how our model performed on the validation data. The summary() function will give us a detailed overview of our model. I am not showing the output here, because it is quite extensive.


One performance metric we are usually interested in is the mean per class error for training and validation data.

h2o.mean_per_class_error(dl_model, train = TRUE, valid = TRUE, xval = TRUE)
##           train          valid            xval 
## 0.0304878 0.2232558 0.2257781

The confusion matrix tells us, how many classes have been predicted correctly and how many predictions were accurate. Here, we see the errors in predictions on validation data

h2o.confusionMatrix(dl_model, valid = TRUE)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.00425904880659062:
##            arrhythmia healthy    Error    Rate
## arrhythmia         15      10 0.400000  =10/25
## healthy             2      41 0.046512   =2/43
## Totals             17      51 0.176471  =12/68

We can also plot the classification error over all epochs or samples.

     timestep = "epochs",
     metric = "classification_error")

     timestep = "samples",
     metric = "classification_error")

Next to the classification error, we are usually interested in the logistic loss (negative log-likelihood or log loss). It describes the sum of errors for each sample in the training or validation data or the negative logarithm of the likelihood of error for a given prediction/ classification. Simply put, the lower the loss, the better the model (if we ignore potential overfitting).

     timestep = "epochs",
     metric = "logloss")

We can also plot the mean squared error (MSE). The MSE tells us the average of the prediction errors squared, i.e. the estimator’s variance and bias. The closer to zero, the better a model.

     timestep = "epochs",
     metric = "rmse")

Next, we want to know the area under the curve (AUC). AUC is an important metric for measuring binary classification model performances. It gives the area under the curve, i.e. the integral, of true positive vs false positive rates. The closer to 1, the better a model. AUC is especially useful, when we have unbalanced datasets (meaning datasets where one class is much more common than the other), because it is independent of class labels.

h2o.auc(dl_model, train = TRUE)
## [1] 0.9864582
h2o.auc(dl_model, valid = TRUE)
## [1] 0.852093
h2o.auc(dl_model, xval = TRUE)
## [1] 0.8577735

The weights for connecting two adjacent layers and per-neuron biases that we specified the model to save, can be accessed with:

w <- h2o.weights(dl_model, matrix_id = 1)
b <- h2o.biases(dl_model, vector_id = 1)

Variable importance can be extracted as well (but keep in mind, that variable importance in deep neural networks is difficult to assess and should be considered only as rough estimates).

## Variable Importances: 
##   variable relative_importance scaled_importance percentage
## 1     V169            1.000000          1.000000   0.013925
## 2     V239            0.987290          0.987290   0.013748
## 3     V103            0.913953          0.913953   0.012727
## 4      V15            0.907422          0.907422   0.012636
## 5      V91            0.904267          0.904267   0.012592
## ---
##    variable relative_importance scaled_importance percentage
## 85      V88            0.717914          0.717914   0.009997
## 86     V269            0.715800          0.715800   0.009968
## 87     V137            0.712923          0.712923   0.009928
## 88     V168            0.711402          0.711402   0.009906
## 89      V33            0.707356          0.707356   0.009850
## 90     V219            0.696149          0.696149   0.009694

Test data

Now that we have a good idea about model performance on validation data, we want to know how it performed on unseen test data. A good model should find an optimal balance between accuracy on training and test data. A model that has 0% error on the training data but 40% error on the test data is in effect useless. It overfit on the training data and is thus not able to generalize to unknown data.

perf <- h2o.performance(dl_model, test)
## H2OBinomialMetrics: deeplearning
## MSE:  0.2154912
## RMSE:  0.4642103
## LogLoss:  1.378809
## Mean Per-Class Error:  0.2250712
## AUC:  0.7796771
## Gini:  0.5593542
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            arrhythmia healthy    Error      Rate
## arrhythmia         19       8 0.296296    =8/27
## healthy                6      33 0.153846   =6/39
## Totals                 25      41 0.212121   =14/66
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.132564 0.825000  40
## 2                       max f2  0.000002 0.894495  61
## 3                 max f0point5  0.132564 0.812808  40
## 4                 max accuracy  0.132564 0.787879  40
## 5                max precision  0.982938 1.000000   0
## 6                   max recall  0.000002 1.000000  61
## 7              max specificity  0.982938 1.000000   0
## 8             max absolute_mcc  0.132564 0.557317  40
## 9   max min_per_class_accuracy  0.837616 0.743590  34
## 10 max mean_per_class_accuracy  0.132564 0.774929  40
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Plotting the test performance’s AUC plot shows us approximately how good the predictions are.


We also want to know the log loss, MSE and AUC values, as well as other model metrics for the test data:

## [1] 1.378809
## [1] 0.2154912
## [1] 0.7796771
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  0.982938 0.050000 0.031847 0.116279 0.424242  1.000000 0.025641
## 2  0.982811 0.097561 0.063291 0.212766 0.439394  1.000000 0.051282
## 3  0.982460 0.095238 0.062893 0.196078 0.424242  0.666667 0.051282
## 4  0.982256 0.139535 0.093750 0.272727 0.439394  0.750000 0.076923
## 5  0.982215 0.181818 0.124224 0.338983 0.454545  0.800000 0.102564
## 6  0.981170 0.177778 0.123457 0.317460 0.439394  0.666667 0.102564
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.103203               0.025641                0.512821
## 2    1.000000     0.147087               0.051282                0.525641
## 3    0.962963     0.033624               0.051282                0.507123
## 4    0.962963     0.082188               0.076923                0.519943
## 5    0.962963     0.121754               0.102564                0.532764
## 6    0.925926     0.048725               0.102564                0.514245
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1  27  38   0   1 1.000000 0.974359 0.000000 0.025641   0
## 2  27  37   0   2 1.000000 0.948718 0.000000 0.051282   1
## 3  26  37   1   2 0.962963 0.948718 0.037037 0.051282   2
## 4  26  36   1   3 0.962963 0.923077 0.037037 0.076923   3
## 5  26  35   1   4 0.962963 0.897436 0.037037 0.102564   4
## 6  25  35   2   4 0.925926 0.897436 0.074074 0.102564   5

The confusion matrix alone can be seen with the h2o.confusionMatrix() function, but is is also part of the performance summary.

h2o.confusionMatrix(dl_model, test)

The final predictions with probabilities can be extracted with the h2o.predict() function. Beware though, that the number of correct and wrong classifications can be slightly different from the confusion matrix above. Here, I combine the predictions with the actual test diagnoses and classes into a data frame. For plotting I also want to have a column, that tells me whether the predictions were correct. By default, a prediction probability above 0.5 will get scored as a prediction for the respective category. I find it often makes sense to be more stringent with this, though and set a higher threshold. Therefore, I am creating another column with stringent predictions, where I only count predictions that were made with more than 80% probability. Everything that does not fall within this range gets scored as “uncertain”. For these stringent predictions, I am also creating a column that tells me whether they were accurate.

finalRf_predictions <- data.frame(class = as.vector(test$class), actual = as.vector(test$diagnosis), = dl_model, newdata = test)))

finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$arrhythmia > 0.8, "arrhythmia", 
                                                ifelse(finalRf_predictions$healthy > 0.8, "healthy", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no"))
finalRf_predictions %>%
  group_by(actual, predict) %>%
  summarise(n = n())
## Source: local data frame [4 x 3]
## Groups: actual [?]
##       actual    predict     n
##       <fctr>     <fctr> <int>
## 1 arrhythmia arrhythmia    16
## 2 arrhythmia    healthy    11
## 3    healthy arrhythmia     6
## 4    healthy    healthy    33
finalRf_predictions %>%
  group_by(actual, predict_stringent) %>%
  summarise(n = n())
## Source: local data frame [6 x 3]
## Groups: actual [?]
##       actual predict_stringent     n
##       <fctr>             <chr> <int>
## 1 arrhythmia        arrhythmia    19
## 2 arrhythmia           healthy     6
## 3 arrhythmia         uncertain     2
## 4    healthy        arrhythmia     8
## 5    healthy           healthy    29
## 6    healthy         uncertain     2

To get a better overview, I am going to plot the predictions (default and stringent):

p1 <- finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Werenpredictionsnaccurate?",
         title = "Default predictions")

p2 <- finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    labs(fill = "Werenpredictionsnaccurate?",
         title = "Stringent predictions")

grid.arrange(p1, p2, ncol = 2)

Being more stringent with the prediction threshold slightly reduced the number of errors but not by much.

I also want to know whether there are certain classes of arrhythmia that are especially prone to being misclassified:

p1 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Default predictions",
         x = "predicted to be")

p2 <- subset(finalRf_predictions, actual == "arrhythmia") %>%
  ggplot(aes(x = predict_stringent, fill = class)) +
    geom_bar(position = "dodge") +
    my_theme() +
    labs(title = "Prediction accuracy of arrhythmia cases",
         subtitle = "Stringent predictions",
         x = "predicted to be")

grid.arrange(p1, p2, ncol = 2)

There are no obvious biases towards some classes but with the small number of samples for most classes, this is difficult to assess.

Final conclusions: How useful is the model?

Most samples were classified correctly, but the total error was not particularly good. Moreover, when evaluating the usefulness of a specific model, we need to keep in mind what we want to achieve with it and which questions we want to answer. If we wanted to deploy this model in a clinical setting, it should assist with diagnosing patients. So, we need to think about what the consequences of wrong classifications would be. Would it be better to optimize for high sensitivity, in this example as many arrhythmia cases as possible get detected – with the drawback that we probably also diagnose a few healthy people? Or do we want to maximize precision, meaning that we could be confident that a patient who got predicted to have arrhythmia does indeed have it, while accepting that a few arrhythmia cases would remain undiagnosed? When we consider stringent predictions, this model correctly classified 19 out of 27 arrhythmia cases, but 6 were misdiagnosed. This would mean that some patients who were actually sick, wouldn’t have gotten the correct treatment (if decided solely based on this model). For real-life application, this is obviously not sufficient!

Next week, I’ll be trying to improve the model by doing a grid search for hyper-parameter tuning.

So, stay tuned… (sorry, couldn’t resist ;-))

Other machine learning topics I have covered include

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## other attached packages:
##  [1] reshape2_1.4.2       tidyr_0.6.1          matrixStats_0.51.0  
##  [4] pcaGoPromoter_1.18.0 Biostrings_2.42.1    XVector_0.14.0      
##  [7] IRanges_2.8.1        S4Vectors_0.12.1     BiocGenerics_0.20.0 
## [10] ellipse_0.3-8        gridExtra_2.2.1      ggrepel_0.6.5       
## [13] ggplot2_2.2.1        sparklyr_0.5.2       dplyr_0.5.0         
## [16] h2o_3.10.3.6         rsparkling_0.1.0    
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9          RColorBrewer_1.1-2   plyr_1.8.4          
##  [4] zlibbioc_1.20.0      bitops_1.0-6         base64enc_0.1-3     
##  [7] tools_3.3.2          digest_0.6.12        memoise_1.0.0       
## [10] RSQLite_1.1-2        jsonlite_1.2         evaluate_0.10       
## [13] tibble_1.2           gtable_0.2.0         DBI_0.5-1           
## [16] yaml_2.1.14          withr_1.0.2          httr_1.2.1          
## [19] stringr_1.2.0        knitr_1.15.1         rappdirs_0.3.1      
## [22] rprojroot_1.2        Biobase_2.34.0       R6_2.2.0            
## [25] AnnotationDbi_1.36.0 rmarkdown_1.3        magrittr_1.5        
## [28] backports_1.0.5      scales_0.4.1         htmltools_0.3.5     
## [31] assertthat_0.1       colorspace_1.3-2     labeling_0.3        
## [34] config_0.2           stringi_1.1.2        RCurl_1.95-4.8      
## [37] lazyeval_0.2.0       munsell_0.4.3

Source:: R News

Reinforcement Learning in R

By blogisr

Reinforcement learning has gained considerable traction as it mines real experiences with the help of trial-and-error learning to model decision-making. Thus, this approach attempts to imitate the fundamental method used by humans of learning optimal behavior without the requirement of an explicit model of the environment. In contrast to many other approaches from the domain of machine learning, reinforcement learning works well with learning tasks of arbitrary length and can be used to learn complex strategies for many scenarios, such as robotics and game playing.

Our slide deck is positioned at the intersection of teaching the basic idea of reinforcement learning and providing practical insights into R. While existing packages, such as MDPtoolbox, are well suited to tasks that can be formulated as a Markov decision process, we also provide practical guidance regarding how to set up reinforcement learning in more vague environments. Therefore, each algorithm comes with an easy-to-understand explanation of how to use it in R.

We hope that the slide deck enables practitioners to quickly adopt reinforcement learning for their applications in R. Moreover, the materials might lay the groundwork for courses on human decision-making and machine learning.

Download the slides here

Download the exercise sheet here (solutions are available on request)

Source:: R News

Iteration and closures in R

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I recently read an interesting thread on unexpected behavior in R when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty to try and re-state and slow down the discussion of the problem (and fix) for clarity.

The issue is: are references or values captured during iteration?

Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). It is confusing (especially in R, which pushes so far in the direction of value oriented semantics) and best demonstrated with concrete examples.

Please read on for a some of the history and future of this issue.

for loops

Consider the following code run in R version 3.3.2 (2016-10-31):

 functionsFor <- vector(2, mode='list')
 for(x in 1:2) { 
   functionsFor[[x]] <- function() return(x)


 # [1] 2

In real applications the functions would take additional arguments and perform calculations involving both the “partially applied” x and these future arguments. Obviously if we just wanted values we would not use functions. However, this trivial example is much simpler (except for the feeling it is silly) than a substantial application. The notation gets confusing even as we stand. But partial application (binding values into functions) is a common functional programming pattern (which happens to not always interact well with iteration).

Notice the answer printed is 2 (not 1).

This is because all the functions created in the loop captured a closure or reference to the same variable x (which is 2 at the end of the loop). The functions did not capture the value x had when the functions were created. We can confirm this by moving x around by hand, as we show below.

 x <- 4

 # [1] 4

This is a well know language design issue.

Trying to work-around it

The more complicated examples referenced in the thread are variations of the standard work-around: build a function factory so each function has a different closure (the new closures being the execution environments of each factory invocation). That code looks like the following:

 functionsFor2 <- vector(2, mode='list')
 for(x in 1:2) {
   functionsFor2[[x]] <- (function(x) {
     return(function() return(x))


 # [1] 2

The outer function (which gets called) is called the factory and is trivial (we are only using it to get new environments). The inner function is our example, which in the real world would take additional arguments and perform calculations involving these arguemnts in addition to x.

Notice the “fix” did not work. There is more than one problem lurking, and this is why so many experienced functional programmers are surprised by the behavior (despite probably having experience in many of the other functional languages we have mentioned). R “functions” are different than many current languages in that they have semantics closer to what Lisp called an fexpr. In particular arguments are subject to “lazy evaluation” (a feature R implements by a bookeeping process called “promises“).

So in addition to the (probably expected) unwanted shared closure issue, we have a lazy evaluation issue. The complete fix involves both introducing new closures (by the using the function factory’s execution closure) and forcing evaluation in these new environments. We show the code below:

 functionsFor3 <- vector(2, mode='list')
 for(x in 1:2) {
   functionsFor3[[x]] <- (function(x) {
     return(function() return(x))

 # [1] 1

Lazy evaluation is a fairly rare language feature (most famously used in Haskell), so it is not always everybody’s mind. R has lazy evaluation a number of places (function arguments and dplyr pipelines and data-structures being some of the most prominent uses).

lapply and purrr::map

I’ve taught this issue for years in our advanced R-programming workshops.

One thing I didn’t know is: R fixed this issue for base::lapply(). Consider the following code:

 functionsL <- lapply(1:2, 
   function(x) { function() return(x) })


 # [1] 1

Apparently lapply used to have the problem and was fixed by the time we got to R 3.2.

Coming back to the original thread, the current CRAN release of purrr (0.2.2) also has the reference behavior, as we can see below:

 functionsM <- purrr::map(1:2, 
   function(x) { function() return(x) })


 # [1] 2

Apparently this is scheduled for a fix.

Though, there is no way purrr::map() can behave the same as both for(){} and lapply() as the two currently have different behavior.


Lazy evaluation can increase complexity as it makes it less obvious to the programmer when something will be executed and increases the number of possible interactions the programmer can experience (as it is not determined when code will run, so one can not always know the state of the world it will run in).

My opinion is: lazy evaluation should be used sparingly in R, and only where it is trading non-determinism for some benefit. I would also point out that lazy evaluation is not the only possible way to capture specifications of calculations for future interpretation even in R. For example, formula-like interfaces also provide this capability.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

hrbrthemes 0.1.0 is now on CRAN

By hrbrmstr

I’m pleased to announce the inaugural release of my hrbrthemes (0.1.0) package on CRAN

The primary goal of said package is to provide opinionated typographical and other aesthetic defaults for ggplot2 charts.

Two core themes are included:

The Roboto Condensed Google Font comes with the package along with an installer for said font (it’s an R installer, you still need to install it on your OS manually).

Other niceties include:

  • scale_[xy]_comma() — shortcut for expand=c(0,0), labels=scales::comma
  • scale_[xy]_percent() — shortcut for expand=c(0,0), labels=scales::percent
  • scale_[color|fill]_ipsum() — discrete scale with 9 colors
  • gg_check() — pass-through spell checker for ggplot2 label elements

Source version is tracked on GitHub.

Critiques, bug reports and enhancement requests are most welcome as GitHub issues.

Source:: R News

Make your R simulation models 20 times faster

By Bluecology blog

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

Make your R simulation models 20 times faster

R can be frustratingly slow if you use its loops. However, you can speed
it up significantly (e.g. 20 times!) using the Rcpp package. That
could turn a day long simulation into an hour long simulation.

I had heard years ago about the Rcpp package. Rcpp lets you write
and use C ++ functions in R. However, I had never bothered to learn how
to write C ++. Instead if my simulation model got too slow, I just
redefined the problem (e.g. by using smaller spatial domain) so I could
continue with R.

Output of one of the population model runs showing solutions from an R function and an Rcpp function. The C++ version gives an identical results and was up to 20 times faster.

I persisted with R, rather than use another language, because of its
powerful graphics and the convenience of using a functional language
like R to perform sensitivity analyses. More on this later.

The other day I was browsing Wickhams Advanced
book and realised it is actually pretty easy
to write basic C++ loops.

Then I wondered if it would still be faster if you had to make repeated
calls to the same C++ function, for instance if you wanted to run a
sensitivity analysis, varying some model parameters. I like to use R for
this task because the purrr package makes it incredibly easy to run
arbitrary combinations of parameters through a function. Then it is
straightforward to summarize and plot the results with ggplot2.

Turns out you can get a massive improvement, even for repeated calls to
the same function. Here is a test.

A population model in R

First up, let’s write a simple model for simulating population size over
time, according to the logistic function. The below function just takes
your standard r (intrinsic population growth) and k (carrying
capacity) parameters and simulates population size starting at yinit
over t years.

Further, to I have included a stochastic process, whose variation is
controlled by thetasd, to illustrate Rcpp random number generator.

logmodr <- function(t, yinit, r, k, thetasd){
    y <- numeric(t)
    y[1] <- yinit
    theta <- rnorm(t, 0, thetasd)
    for(i in 2:t){
        y[i] <- y[i-1]*(r - r*(y[i-1]/k)) * exp(theta[i])

Note that I also ran these models without the stochastic component. The
speedup was even greater when you compared C++ to R without the
stochastic step (about 20 times).

A population model in C++

Now let’s write the equivalent C++ function. You will need to install
the Rcpp package. Note that it has some other software dependencies,
so I recommend you read the guide on

We write the function definition as a string and pass it to
cppFunction from Rcpp:

    cppFunction("NumericVector logmodc(int t, double yinit, double r,
double k, double thetasd){
            NumericVector y(t);
            y[0] = yinit;
      NumericVector theta = rnorm(t, 0, thetasd);
            for (int i = 1; i < t; ++i){
                y[i] = y[i-1]*(r - r*(y[i-1] / k)) * exp(theta[i]);
            return y;

Hopefully you can understand this, even if you are not familiar with
C++. The syntax is reasonably similar to R. If you learned to program in
R you may notice a few discrepencies.

First, C++ requires that you specify the type of each variable when its
created. You can’t just create new variables without assigning them a
type, and you can’t just change the type. This makes C++ more efficient
than R, because the computer knows exactly how much memory to allocate a
variable and doesn’t have to watch for changes.

Second, notice I start the iterator at time-step 1, whereas in the R
code we started at time-step 2. In C++ vectors are indexed starting at

Finally, don’t forget to end lines with ; (you can use ; to end
lines in R, but it is not essential).

Running a single simulation

First up, we need to define the model parameters:

t <- 100
yinit <- 1
k <- 20
thetasd <- 0.1
r <- 0.2

Now we can run our model. I am just going to plug the models straight
into microbenchmark, so I can compare their times.

mb1 <- microbenchmark(
    logmodc(t, yinit, 1.4, k, thetasd),
    logmodr(t, yinit, 1.4, k, thetasd)

## Unit: microseconds
##                                expr     min       lq      mean   median
##  logmodc(t, yinit, 1.4, k, thetasd)  10.051  11.1100  12.70373  11.7415
##  logmodr(t, yinit, 1.4, k, thetasd) 179.053 198.8315 251.48889 224.3450
##        uq      max neval cld
##   12.8825   67.801   100  a
##  296.1400 1098.436   100   b

So the C++ version is 19 times faster.

Running multiple simulations

So C++ is faster for a single call to a function (that contains a loop).
No surprises there. What if we want to make repeated calls to the same
function, is C++ still faster than R? We might want to make repeated
calls if we want to run different values of r through our model to do
a sensitivty analysis.

We could increase the scope of the C++ code to include a loop over
different values of r. However, then we would lose some of the
convenience of R, which is good at manipulating data. We also wouldn’t
be able to use purrr package to make sensitivity analysis easy.

First, up let’s create a sequence of r values:

rseq <- seq(1.1, 2.2, length.out = 10)

Now we can run our two models. I will use purrr::map (the :: just
means map is in the package purrr and avoids another call to
library()). We will also use set.seed() to make sure both algorithms
generate the same series of random numbers, that way we can check
whether the results are identical.

yc <- purrr::map(rseq, ~logmodc(t, yinit, .x, k, thetasd))
yr <- purrr::map(rseq, ~logmodr(t, yinit, .x, k, thetasd))

map iteratively steps through rseq replacing the .x in the
function call with each value of r in turn. Note that we also have to
turn the function call into a formula (with ~) to iterate in this way.

map returns a list, where each element is a time-series of population
sizes for a given value of r.

Let’s plot the result, for the second value of r:

plot(yr[[2]], type = "l", col = "DarkBlue", lwd = 2)
points(yc[[2]], pch = 16, col = "Tomato", cex = 0.8)
legend('topleft', legend = c("R solution","C solution"),
       pch = 16, col = c("DarkBlue", "Tomato"))

They look identical, excellent.

Now, let’s compare the time. Remember I had wondered whether repeated
calls to a C++ function might lose some of the performance gain:

mb2 <- microbenchmark(
    purrr::map(rseq, ~logmodc(t, yinit, .x, k, thetasd)),
    purrr::map(rseq, ~logmodr(t, yinit, .x, k, thetasd))

## Unit: microseconds
##                                                  expr      min        lq
##  purrr::map(rseq, ~logmodc(t, yinit, .x, k, thetasd))  151.421  166.4165
##  purrr::map(rseq, ~logmodr(t, yinit, .x, k, thetasd)) 1890.549 2047.6310
##       mean    median       uq      max neval cld
##   199.9101  179.5795  221.885  371.192   100  a
##  2543.3459 2233.7455 2534.350 9173.440   100   b

Turns out we still gain a 12 times improvement when using C++.

I don’t believe I have been wasting so many hours waiting for
simulations to run all these years. Learning a bit of C++ is well worth
the investment.

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Analysis of International T20 matches with yorkr templates

By Tinniam V Ganesh

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)


In this post I create yorkr templates for International T20 matches that are available on Cricsheet. With these templates you can convert all T20 data which is in yaml format to R dataframes. Further I create data and the necessary templates for analyzing. All of these templates can be accessed from Github at yorkrT20Template. The templates are

  1. Template for conversion and setup – T20Template.Rmd
  2. Any T20 match – T20Matchtemplate.Rmd
  3. T20 matches between 2 nations – T20Matches2TeamTemplate.Rmd
  4. A T20 nations performance against all other T20 nations – T20AllMatchesAllOppnTemplate.Rmd
  5. Analysis of T20 batsmen and bowlers of all T20 nations – T20BatsmanBowlerTemplate.Rmd

Besides the templates the repository also includes the converted data for all T20 matches I downloaded from Cricsheet in Dec 2016, You can recreate the files as more matches are added to Cricsheet site. This post contains all the steps needed for T20 analysis, as more matches are played around the World and more data is added to Cricsheet. This will also be my reference in future if I decide to analyze T20 in future!

Feel free to download/clone these templates from Github yorkrT20Template and perform your own analysis

There will be 5 folders at the root

  1. T20data – Match files as yaml from Cricsheet
  2. T20Matches – Yaml match files converted to dataframes
  3. T20MatchesBetween2Teams – All Matches between any 2 T20 teams
  4. allMatchesAllOpposition – A T20 countries match data against all other teams
  5. BattingBowlingDetails – Batting and bowling details of all countries

The first few steps take care of the data setup. This needs to be done before any of the analysis of T20 batsmen, bowlers, any T20 match, matches between any 2 T20 countries or analysis of a teams performance against all other countries

There will be 5 folders at the root

  1. T20data
  2. T20Matches
  3. T20MatchesBetween2Teams
  4. allMatchesAllOpposition
  5. BattingBowlingDetails

The source YAML files will be in T20Data folder

1.Create directory T20Matches

Some files may give conversions errors. You could try to debug the problem or just remove it from the T20data folder. At most 2-4 file will have conversion problems and I usally remove then from the files to be converted.

Also take a look at my Inswinger shiny app which was created after performing the same conversion on the Dec 16 data .


2.Save all matches between all combinations of T20 nations

This function will create the set of all matches between every T20 country against every other T20 country. This uses the data that was created in T20Matches, with the convertAllYaml2RDataframesT20() function.


3.Save all matches against all opposition

This will create a consolidated dataframe of all matches played by every T20 playing nation against all other nattions. This also uses the data that was created in T20Matches, with the convertAllYaml2RDataframesT20() function.


4. Create batting and bowling details for each T20 country

These are the current T20 playing nations. You can add to this vector as more countries start playing T20. You will get to know all T20 nations by also look at the directory created above namely allMatchesAllOpposition. his also uses the data that was created in T20Matches, with the convertAllYaml2RDataframesT20() function.

teams <-c("Australia","India","Pakistan","West Indies", 'Sri Lanka',
          "England", "Bangladesh","Netherlands","Scotland", "Afghanistan",
          "Zimbabwe","Ireland","New Zealand","South Africa","Canada",
          "Bermuda","Kenya","Hong Kong","Nepal","Oman","Papua New Guinea",
          "United Arab Emirates")

for(i in seq_along(teams)){
    val <- paste(teams[i],"-details",sep="")
    val <- getTeamBattingDetails(teams[i],dir="../T20Matches", save=TRUE)


for(i in seq_along(teams)){
    val <- paste(teams[i],"-details",sep="")
    val <- getTeamBowlingDetails(teams[i],dir="../T20Matches", save=TRUE)


5. Get the list of batsmen for a particular country

For e.g. if you wanted to get the batsmen of Canada you would do the following. By replacing Canada for any other country you can get the batsmen of that country. These batsmen names can then be used in the batsmen analysis

teamData <- paste(country,"-BattingDetails.RData",sep="")
countryDF <- battingDetails
bmen <- countryDF %>% distinct(batsman) 
bmen <- as.character(bmen$batsman)
batsmen <- sort(bmen)

6. Get the list of bowlers for a particular country

The method below can get the list of bowler names for any T20 nation. These names can then be used in the bowler analysis below

teamData <- paste(country,"-BowlingDetails.RData",sep="")
countryDF <- bowlingDetails
bwlr <- countryDF %>% distinct(bowler) 
bwlr <- as.character(bwlr$bowler)
bowler <- sort(bwlr)

Now we are all set

A) International T20 Match Analysis

Load any match data from the ./T20Matches folder for e.g. Afganistan-England-2016-03-23.RData

afg_eng<- overs
#The steps are
country1_country2 <- overs

All analysis for this match can be done now

2. Scorecard


3.Batting Partnerships


4. Batsmen vs Bowler Plot


5. Team bowling scorecard


6. Team bowling Wicket kind match

m <-teamBowlingWicketKindMatch(country1_country2,"Country1","Country2",plot=FALSE)

7. Team Bowling Wicket Runs Match

m <-teamBowlingWicketRunsMatch(country1_country2,"Country1","Country2",plot=FALSE)

8. Team Bowling Wicket Match

m <-teamBowlingWicketMatch(country1_country2,"Country1","Country2",plot=FALSE)

9. Team Bowler vs Batsmen

m <- teamBowlersVsBatsmenMatch(country1_country2,"Country1","Country2",plot=FALSE)

10. Match Worm chart


B) International T20 Matches between 2 teams

Load match data between any 2 teams from ./T20MatchesBetween2Teams for e.g.Australia-India-allMatches

aus_ind_matches <- matches
#Replace below with your own countries
country2 <- "South Africa"
country1VsCountry2 <- paste(country1,"-",country2,"-allMatches.RData",sep="")
country1_country2_matches <- matches

2.Batsmen partnerships

m<- teamBatsmenPartnershiOppnAllMatches(country1_country2_matches,"country1",report="summary")
m<- teamBatsmenPartnershiOppnAllMatches(country1_country2_matches,"country2",report="summary")
m<- teamBatsmenPartnershiOppnAllMatches(country1_country2_matches,"country1",report="detailed")

3. Team batsmen vs bowlers


4. Bowling scorecard

a <-teamBattingScorecardOppnAllMatches(country1_country2_matches,main="country1",opposition="country2")

5. Team bowling performance


6. Team bowler wickets

m <-teamBowlersWicketsOppnAllMatches(country1_country2_matches,main="country1",opposition="country2",plot=FALSE)

7. Team bowler vs batsmen


8. Team bowler wicket kind

m <- teamBowlersWicketKindOppnAllMatches(country1_country2_matches,"country1","country2",plot=FALSE)

9. Team bowler wicket runs


10. Plot wins and losses


C) International T20 Matches for a team against all other teams

Load the data between for a T20 team against all other countries ./allMatchesAllOpposition for e.g all matches of India

india_matches <- matches
allMatches <- paste("allMatchesAllOposition-",country,".RData",sep="")
country1AllMatches <- matches

2. Team’s batting scorecard all Matches

m <-teamBattingScorecardAllOppnAllMatches(country1AllMatches,theTeam="country1")

3. Batting scorecard of opposing team

m <-teamBattingScorecardAllOppnAllMatches(matches=country1AllMatches,theTeam="country2")

4. Team batting partnerships

m <- teamBatsmenPartnershipAllOppnAllMatches(country1AllMatches,theTeam="country1")
m <- teamBatsmenPartnershipAllOppnAllMatches(country1AllMatches,theTeam='country1',report="detailed")
m <- teamBatsmenPartnershipAllOppnAllMatches(country1AllMatches,theTeam='country1',report="summary")

5. Team batting partnerships plot


6, Team batsmen vs bowlers report

m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=0)
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=1,dispRows=30)
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(matches=country1AllMatches,theTeam="country2",rank=1,dispRows=25)

7. Team batsmen vs bowler plot

d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=1,dispRows=50)
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=2,dispRows=50)

8. Team bowling scorecard


9. Team bowler vs batsmen


10. Team Bowler vs bastmen

df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(country1AllMatches,theTeam="country1",rank=1)

11. Team bowler wicket kind




D) Batsman functions

Get the batsman’s details for a batsman

kohli <- getBatsmanDetails(team="India",name="Kohli",dir=".")
batsmanDF <- getBatsmanDetails(team="country1",name="batsmanName",dir=".")

2. Runs vs deliveries


3. Batsman 4s & 6s

batsman46 <- select(batsmanDF,batsman,ballsPlayed,fours,sixes,runs)
p1 <- batsmanFoursSixes(batsman46,"batsmanName")

4. Batsman dismissals


5. Runs vs Strike rate


6. Batsman Moving Average


7. Batsman cumulative average


8. Batsman cumulative strike rate


9. Batsman runs against oppositions


10. Batsman runs vs venue


11. Batsman runs predict


12. Bowler functions

For example to get Ravicahnder Ashwin’s bowling details

ashwin <- getBowlerWicketDetails(team="India",name="Ashwin",dir=".")
bowlerDF <- getBatsmanDetails(team="country1",name="bowlerName",dir=".")

13. Bowler Mean Economy rate


14. Bowler mean runs conceded


15. Bowler Moving Average


16. Bowler cumulative average wickets


17. Bowler cumulative Economy Rate (ER)


18. Bowler wicket plot


19. Bowler wicket against opposition


20. Bowler wicket at cricket grounds


21. Predict number of deliveries to wickets

bowlerDF1 <- getDeliveryWickets(team="country1",dir=".",name="bowlerName",save=FALSE)

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News