New R Course: Introduction to the Tidyverse!

By gabriel

(This article was first published on, and kindly contributed to R-bloggers)

Hi! Big announcement today: we just launched the Introduction to the Tidyverse R course by David Robinson!

This is an introduction to the programming language R, focused on a powerful set of tools known as the “tidyverse”. In the course you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.

Take me to chapter 1!

Introduction to the Tidyverse features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a Tidyverse expert!

What you’ll learn

1. Data wrangling

In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.
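As a minimal sketch of these three verbs (using R's built-in `mtcars` data in place of the course's country dataset):

```r
library(dplyr)

four_cyl <- mtcars %>%
  filter(cyl == 4) %>%          # keep only particular observations
  arrange(desc(mpg)) %>%        # sort in a desired order
  mutate(kpl = mpg * 0.4251)    # add a new column (km per litre)
head(four_cyl, 3)
```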

2. Data visualization

You’ve already been able to answer some questions about the data through dplyr, but you’ve engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.
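For instance, a dplyr pipeline can feed straight into ggplot2 (a sketch with built-in data, not the course's dataset):

```r
library(dplyr)
library(ggplot2)

# dplyr step manipulates; ggplot2 step visualizes the result
p <- mtcars %>%
  filter(cyl %in% c(4, 6, 8)) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()
p
```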

3. Grouping and summarizing

So far you’ve been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group_by() and summarize() verbs, which collapse large datasets into manageable summaries.
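A sketch of that pattern (averaging mpg per cylinder count stands in for averaging life expectancy per year):

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                              # one group per cylinder count
  summarize(mean_mpg = mean(mpg), n_cars = n())  # collapse each group to a summary row
by_cyl
```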

4. Types of visualizations

You’ve learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.
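The four plot types map onto four geoms; a sketch with ggplot2's bundled datasets:

```r
library(ggplot2)

p_line <- ggplot(economics, aes(date, unemploy)) + geom_line()
p_bar  <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p_hist <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)
p_box  <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
```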

Master the Tidyverse with our course Introduction to the Tidyverse

To leave a comment for the author, please follow the link and comment on their blog: offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Be a BigBallR in #rstats : Stayeth or Switcheth

By MikeJackTzen

(This article was first published on rstats – MikeJackTzen, and kindly contributed to R-bloggers)

If you’re a Big Baller, you know when to stayeth in your lane but also when to switcheth lanes.

@LangstonKerman four score and seven years ago our fathers brought forth on this continent, a new lane to stayeth in

MikeJackTzen (@MKJCKTZN) November 19, 2017

The Big Baller Brand brothers just inked a deal to play professional basketball in Europe. They switched into a different lane to achieve their original goal.

It’s not about the money for the Ball Brothers. They have a passion to play Basketball and to experience playing as……

Big Baller Brand (@bigballerbrand) December 12, 2017

The knee-jerk reaction from the non-globally minded is that this spells doom for any NBA hoop dreams. Not so.

@bigballerbrand @MELOD1P @LiAngeloBall shoutouts to @bigballerbrand
for pursuing other ways 2 achieve goals.

MikeJackTzen (@MKJCKTZN) December 12, 2017

Four score and ten years ago, your only shot of making it to the NBA was usually through the NCAA. The NBA has since adopted a global perspective. You see professionals from different continental leagues finding a way into the NBA. The 2016-to-2017 diagram is interesting: we’re seeing a darker edge from the NBA to the G-League. The NBA Players Association worked hard on installing ‘2-way contracts’ letting NBA players play in (and get paid in) both the NBA and the G-League. This was a great move, a no-brainer.

Options are great, but keep in mind that the NBA is the crème de la crème. You see the lightest edges enter the NBA node while you see many dark edges enter the ‘none’ node.

When I started making the above network plot of basketball league-to-league migrations, I looked into off-the-shelf R packages. There’s a fragmented handful of R packages to do this. The kicker: none of them do directed edges the way I wanted.

If you’re an #rstats user and want to adopt the BigBallR philosophy, find multiple ways to achieve your goal.

I went back to the basics and hand-rolled my own network plot with vanilla ggplot2. At the end of the day, a network plot needs two things, nodes and edges, e.g. points (stored in dat_nodes) and lines (stored in dat_edges). Once you have your data in that format, you can make the network migration plot above with the code snippet below.

library(ggplot2)

# dat_nodes: node coordinates (X1, X2) plus a league label
# dat_edges: sender (X1_send, X2_send) and receiver (X1_rec, X2_rec) coordinates
ggplot(dat_nodes, aes(x = X1, y = X2)) +
 geom_point(size = 10, show.legend = FALSE) +
 geom_text(aes(label = league)) +
 geom_curve(data = dat_edges,
            aes(x = X1_send, y = X2_send,
                xend = X1_rec, yend = X2_rec),
            arrow = arrow(angle = 30, length = unit(0.2, "inches")))

I hand-rolled the network diagram cuz the other packages didn’t have the custom features I needed. In my hand-rolled plot there is still one thing missing. I want to place the ‘arrow head’ along some other part of the curve (say the mid-point), other than the end-point. This is probably hard to do, since arrow() looks like it just needs the receiving coordinate and plops the arrow head there. For what I want, ggplot2 needs to know the desired placement-coordinates output from geom_curve(). So somehow, the internal curve calculations need to be returned in order to pass into arrow().

To leave a comment for the author, please follow the link and comment on their blog: rstats – MikeJackTzen.


When a Tweet Turns Into an R Package

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Reproduced with the kind permission of our Head of Data Engineering, Mark Sellors and first published on his blog

When a Tweet Turns Into an R Package

Boy, that escalated quickly

I just wanted to write up a brief post about the power of R, its community, and tell the story of how actually putting stuff out into the world can have amazing consequences.

About 24 hours ago I was going to tweet something like this:

Hey Mac #rstats users – system('say "hello rstats user"')

I’d been playing with the MacOS command line tool, ‘say’, with the kids, and just figured it would be funny to make R say stuff.
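For the curious, a minimal sketch of such a wrapper (the `-v` voice flag is part of the real macOS `say` tool; the function names here are hypothetical, not rsay's API):

```r
# build the shell command separately so it can be inspected on any OS
say_cmd <- function(text, voice = "Alex") {
  sprintf("say -v %s %s", shQuote(voice), shQuote(text))
}

# actually speak (macOS only)
say <- function(text, voice = "Alex") {
  invisible(system(say_cmd(text, voice)))
}

say_cmd("hello rstats user")
#> [1] "say -v 'Alex' 'hello rstats user'"
```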

Before I tweeted though, I thought I’d better check that it worked as intended. While I was doing that I decided it would be fun to expose the say command’s different voices and figured I’d make a gist on github instead, as it was getting too long for a tweet.

So, I started playing around with the different voices and thinking of a nice way to expose that and by this point I’m thinking I might as well just make it into a package. Making a package in RStudio is so easy, it seemed like the easiest way to share what I was doing. And so now we have the rsay package.

Bear in mind that I didn’t start out to create a package. It has one function, which is essentially the thinnest wrapper around an existing command line tool and it only works on MacOS and it’s pretty silly really, so I didn’t expect there to be a great deal of interest.

In the end, instead of tweeting that one command, I tweeted about how I accidentally made this R package.

That in turn got noticed by the awesome Mara Averick, which resulted in this tweet:

To illustrate the incredible power of @sellorm‘s new 📦, “rsay” #rstats

— Mara Averick (@dataandme) December 9, 2017

Then Mark Edmondson sees Mara’s tweet, and mentions how it might be fun to build it into a sort of Babelfish using some of his existing work around the Google APIs. (Mark knows a lot about Google’s APIs and R!)

A few hours after initially mentioning this, the following tweet appears in my timeline:

Using @sellorm ‘s new rsay package that talks to you on MacOS, enhanced the googleLanguageR demo Shiny app so that it now can transcribe speech to text, translate it, and talk back to you

— Mark Edmondson (@HoloMarkeD) December 9, 2017

He’d only gone and done it – How cool is that?!?!

To recap: silly idea, and R’s easy-to-use packaging tools lead to an accidental (but still pretty silly) R package. Package is popularised by prominent developer advocate. Package is then integrated into a shiny app for automated machine translation, by well-known R/Google API developer.

It’s been a fun 24 hours, but the main thing to take away, is just to get your stuff out there, even if you don’t think it’s that great/interesting/useful. When you share an idea, even if it seems frivolous or trivial, that idea takes on a life of its own and you never know where that may lead.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Point Pattern Analysis using Ecological Methods in R

By James

(This article was first published on R –, and kindly contributed to R-bloggers)

Here is a quick example of how to get started with some of the more sophisticated point pattern analysis tools that have been developed for ecologists – principally the adehabitatHR package – but that are very useful for human data. Ecologists deploy point pattern analysis to establish the “home range” of a particular animal based on the known locations where it has been sighted (either directly or remotely via camera traps). Essentially it is where the animal spends most of its time. In the case of human datasets the analogy can be extended to identify areas where most crimes are committed – hotspots – or to identify the activity spaces of individuals or the catchment areas of services such as schools and hospitals.

This tutorial offers a rough analysis of crime data in London so the maps should not be taken as definitive – I’ve just used them as a starting point here.

Police crime data have been taken from here. This tutorial uses data from London’s Metropolitan Police (September 2017).

For the files used below click here.

#Load in the library and the csv file.



At the moment input is a basic data frame. We need to convert the data frame into a spatial object. Note we have first specified our epsg code as 4326 since the coordinates are in WGS84. We then use spTransform to reproject the data into British National Grid – so the coordinate values are in meters.
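A minimal sketch of that conversion (a tiny stand-in data frame replaces the crime csv so the snippet runs on its own; the Longitude/Latitude column names are assumptions):

```r
library(sp)

# stand-in for `input`, the data frame read from the crime csv
input <- data.frame(Crime.type = c("Burglary", "Drugs"),
                    Longitude = c(-0.12, -0.10),
                    Latitude  = c(51.51, 51.50))

coordinates(input) <- c("Longitude", "Latitude")   # promote to a spatial object
proj4string(input) <- CRS("+init=epsg:4326")       # coordinates are WGS84
Crime.Spatial <- spTransform(input, CRS("+init=epsg:27700"))  # British National Grid (metres)
```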


The plot reveals that we have crimes across the UK, not just in London. So we need an outline of London to help limit the view. Here we load in a shapefile of the Greater London Authority Boundary.


Thinking ahead we may wish to compare a number of density estimates, so they need to be performed across a consistently sized grid. Here we create an empty grid in advance to feed into the kernelUD function.
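A sketch of building such an empty grid with sp (synthetic points in British National Grid coordinates stand in for the crime data; the resolution is arbitrary):

```r
library(sp)

# stand-in for the projected crime points
pts <- SpatialPoints(cbind(easting  = runif(100, 503000, 560000),
                           northing = runif(100, 155000, 200000)))

# empty grid spanning the points, to share across all density estimates
ext <- bbox(pts)
xy <- expand.grid(x = seq(ext[1, 1], ext[1, 2], length.out = 100),
                  y = seq(ext[2, 1], ext[2, 2], length.out = 100))
coordinates(xy) <- ~ x + y
gridded(xy) <- TRUE   # now a SpatialPixels object, usable as kernelUD's grid=
```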


OK, now run the density estimation. Note we use grid = xy to utilise the grid we just created. This is for all crime in London.
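A sketch of that step (kernelUD() comes from adehabitatHR; tiny synthetic points and grid stand in for the crime data so the snippet runs on its own):

```r
library(sp)
library(adehabitatHR)

# stand-ins for the crime points and the shared grid
pts <- SpatialPoints(cbind(runif(200, 503000, 560000),
                           runif(200, 155000, 200000)))
xy <- expand.grid(x = seq(500000, 563000, length.out = 100),
                  y = seq(150000, 205000, length.out = 100))
coordinates(xy) <- ~ x + y
gridded(xy) <- TRUE

kde_all <- kernelUD(pts, grid = xy)   # kernel density over the shared grid
image(kde_all)
```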


Unsurprisingly we have a hotspot over the centre of London. Are there differences for specific crime types? We may, for example, wish to look at the density of burglaries.

plot(Crime.Spatial[Crime.Spatial$Crime.type=="Burglary",]) # quick plot of burglary points


There’s a slight difference, but it’s still tricky to see whether there are areas where burglaries concentrate more compared to the distribution of all crimes. A very rough way to do this is to divide one density grid by the other.
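A sketch of that division (object names kde_all/kde_burg are mine; synthetic points on a shared grid make the snippet self-contained):

```r
library(sp)
library(adehabitatHR)

# shared grid plus stand-ins for all crimes and burglaries only
xy <- expand.grid(x = seq(0, 10, length.out = 80),
                  y = seq(0, 10, length.out = 80))
coordinates(xy) <- ~ x + y
gridded(xy) <- TRUE

all_pts  <- SpatialPoints(cbind(runif(300, 2, 8), runif(300, 2, 8)))
burg_pts <- SpatialPoints(cbind(runif(80, 2, 8), runif(80, 2, 8)))

kde_all  <- kernelUD(all_pts,  grid = xy)
kde_burg <- kernelUD(burg_pts, grid = xy)

# ratio of the two density surfaces (an estUD is a SpatialPixelsDataFrame)
both <- kde_burg
both$ud <- kde_burg$ud / kde_all$ud
both$ud[!is.finite(both$ud)] <- NA   # guard against division by ~0 at the edges
image(both)
```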


This hasn’t worked particularly well since there are edge effects on the density grid that are causing issues due to a few stray points at the edge of the grid. We can solve this by capping the values we map – in this case we are only showing values between 0 and 1. Some more interesting structures emerge, with burglary occurring in more residential areas, as expected.

both2 <- both
both2[both2 > 1] <- 1   # cap the mapped values at 1 (assumes `both` holds the ratio grid)

There are many more sophisticated approaches to this kind of analysis – I’d encourage you to look at the adehabitatHR vignette on the package page.

To leave a comment for the author, please follow the link and comment on their blog: R –.


A Not So Simple Bar Plot Example Using ggplot2

By INWT-Blog-RBloggers

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

This is a reproduction of the (simple) bar plot of chapter 6.1.1 in Datendesign mit R with ggplot2. To download the data you can use the following lines:

dir.create("data")
writeLines("*", "data/.gitignore")
download.file("", "data/")
unzip("data/", exdir = "data")

And to download the original script and the base R version of this plot:

download.file("", "data/")
unzip("data/", exdir = "data")

After downloading check out the original pdf version of this plot in data/beispielcode/pdf/balkendiagramm_einfach.pdf.

Preparing Data

Here are some steps to modify the data such that it can be easily used with ggplot2.

# remember to adjust the path
ipsos <- openxlsx::read.xlsx("../data/alle_daten/ipsos.xlsx")

ipsos <- ipsos[order(ipsos$Wert), ]
ipsos$Land <- ordered(ipsos$Land, ipsos$Land)
ipsos$textFamily <- ifelse(ipsos$Land %in% c("Deutschland", "Brasilien"),
                           "Lato Black", "Lato Light")
ipsos$labels <- paste0(ipsos$Land, ifelse(ipsos$Wert < 10, "     ", "  "),
                       ipsos$Wert)
rect <- data.frame(
  ymin = seq(0, 80, 20),
  ymax = seq(20, 100, 20),
  xmin = 0.5, xmax = 16.5,
  colour = rep(c(grDevices::rgb(191, 239, 255, 80, maxColorValue = 255),
                 grDevices::rgb(191, 239, 255, 120, maxColorValue = 255)),
               length.out = 5))

The Basic Plot

First we add the geoms, then modify the scales and flip the coordinate system. The remaining code just adjusts the appearance.

library("ggplot2")
ggBar <- ggplot(ipsos) +
  geom_bar(aes(x = Land, y = Wert), stat = "identity", fill = "grey") +
  geom_bar(aes(x = Land,
               y = ifelse(Land %in% c("Brasilien", "Deutschland"), Wert, NA)),
           stat = "identity", fill = rgb(255, 0, 210, maxColorValue = 255)) +
  geom_rect(data = rect,
            mapping = aes(ymin = ymin, ymax = ymax,
                          xmin = xmin, xmax = xmax),
            fill = rect$colour) +
  geom_hline(aes(yintercept = 45), colour = "skyblue3") +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100),
                     expand = c(0, 0)) +
  scale_x_discrete(labels = ipsos$labels) +
  coord_flip() +
  labs(y = NULL, x = NULL, title = NULL) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_text(family = ipsos$textFamily),
        text = element_text(family = "Lato Light"))
ggBar

Annotations and Layout

Of course you can simply add the title and text annotations to the plot using ggplot2, but I didn’t find a way to do the exact placement comparable to the original version without the package grid.

library("grid")
vp_make <- function(x, y, w, h)
  viewport(x = x, y = y, width = w, height = h, just = c("left", "bottom"))

main        <- vp_make(0.05, 0.05, 0.9, 0.8)
title       <- vp_make(0, 0.9, 0.6, 0.1)
subtitle    <- vp_make(0, 0.85, 0.4, 0.05)
footnote    <- vp_make(0.55, 0, 0.4, 0.05)
annotation1 <- vp_make(0.7, 0.85, 0.225, 0.05)
annotation2 <- vp_make(0.4, 0.85, 0.13, 0.05)

# To see which space these viewports will use:
grid.rect(gp = gpar(lty = "dashed"))
grid.rect(gp = gpar(col = "grey"), vp = main)
grid.rect(gp = gpar(col = "grey"), vp = title)
grid.rect(gp = gpar(col = "grey"), vp = subtitle)
grid.rect(gp = gpar(col = "grey"), vp = footnote)
grid.rect(gp = gpar(col = "grey"), vp = annotation1)
grid.rect(gp = gpar(col = "grey"), vp = annotation2)

And now we can add the final annotations to the plot:

# pdf_datei <- "balkendiagramme_einfach.pdf"
# cairo_pdf(bg = "grey98", pdf_datei, width = 9, height = 6.5)
grid.newpage()
print(ggBar, vp = main)

grid.text("'Ich glaube fest an Gott oder ein höheres Wesen'",
          gp = gpar(fontfamily = "Lato Black", fontsize = 14),
          just = "left", x = 0.05, vp = title)
grid.text("...sagten 2010 in:",
          gp = gpar(fontfamily = "Lato Light", fontsize = 12),
          just = "left", x = 0.05, vp = subtitle)
grid.text("Quelle:, Design: Stefan Fichtel, ixtract",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 0.95, vp = footnote)
grid.text("Alle Angaben in Prozent",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 1, y = 0.55, vp = annotation1)
grid.text("Durchschnitt: 45",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right", x = 0.95, y = 0.55, vp = annotation2)


To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.


Stock-Recruitment Graphing Questions

By fishR Blog


(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

A fishR user recently asked me

In the book that you published, I frequently use the stock-recruit curve code. The interface that shows both the Ricker/Beverton-Holt figure with the recruit per spawner to spawner figure (i.e., the dynamic plot for srStarts()) has not been working for quite some time. Additionally, I can get the recruits versus spawner plot for the Beverton-Holt or Ricker curve with confidence bounds around the curve, but how do you do the same for the recruit per spawner to spawner curve?

Below I answer both questions. First, however, I load required packages …

library(FSA)      # srStarts, srFuns
library(FSAdata) # PSalmonAK data
library(dplyr) # %>%, filter, mutate, select
library(nlstools) # nlsBoot

… and obtain the same data (in PSalmonAK.csv available from here) used in the Stock-Recruitment chapter of my Introductory Fisheries Analyses with R (IFAR) book. Note that I created two new variables here: retperesc is the “recruits per spawner” and logretperesc is the natural log of the recruits per spawner variable.

pinks <- read.csv("data/PSalmonAK.csv") %>%
  filter(!, ! %>%    # filter columns are assumptions
  mutate(retperesc = ret/esc, logretperesc = log(retperesc))
headtail(pinks)
##    year   esc    ret   logret retperesc logretperesc
## 1 1960 1.418 2.446 0.894454 1.724965 0.5452066
## 2 1961 2.835 14.934 2.703640 5.267725 1.6615986
## 3 1962 1.957 10.031 2.305680 5.125703 1.6342676
## 28 1987 4.289 18.215 2.902245 4.246911 1.4461918
## 29 1988 2.892 9.461 2.247178 3.271438 1.1852298
## 30 1989 4.577 23.359 3.150982 5.103561 1.6299386

Dynamic Plot Issue

The first question about the dynamic plot is due to a change in functionality in the FSA package since IFAR was published. The dynamicPlot= argument was removed because the code for that argument relied on the tcltk package, which I found difficult to reliably support. A similar, though more manual, approach is accomplished with the new fixed= and plot= arguments. For example, using plot=TRUE (without fixed=) will produce a plot of “recruits” versus “stock” with the chosen stock-recruitment model evaluated at the automatically chosen parameter starting values superimposed.

svR <- srStarts(ret~esc,data=pinks,type="Ricker",plot=TRUE)

The user, however, can show the stock-recruitment model evaluated at manually chosen parameter starting values by including those starting values in a list that is supplied to fixed=. These values can be iteratively changed in subsequent calls to srStarts() to manually find starting values that provide a model that reasonably fits (by eye) the stock-recruit data.

svR <- srStarts(ret~esc,data=pinks,type="Ricker",plot=TRUE,
                fixed=list(a=3,b=0.05))  # hypothetical starting values

plot of chunk srStarts2

Note however that srStarts() no longer supports the simultaneous plotting of spawners versus recruits and recruits per spawner versus spawners.

Plot of Recruits per Spawner versus Spawners

The first way that I can imagine plotting recruits per spawners versus spawners with the fitted curve and confidence bands is to first follow the code for fitting the stock-recruit function to the stock and recruit data as described in IFAR. In this case, the stock-recruit function is fit on the log scale to adjust for a multiplicative error structure (as described in IFAR).

rckr <- srFuns("Ricker")
srR <- nls(logret~log(rckr(esc,a,b)),data=pinks,start=svR)
bootR <- nlsBoot(srR)
cbind(estimates=coef(srR),confint(bootR))
##    estimates     95% LCI   95% UCI
## a 2.84924206 1.74768271 4.8435727
## b 0.05516673 -0.08443862 0.2158164

As described in the book, the plot of spawners versus recruits is made by (i) constructing a sequence of “x” values that span the range of observed numbers of spawners, (ii) predicting the number of recruits at each spawner value using the best-fit stock-recruitment model, (iii) constructing lower and upper confidence bounds for the predicted number of recruits at each spawner value with the bootstrap results, (iv) making a schematic plot on which to put (v) a polygon for the confidence band, (vi) the raw data points, and (vii) the best-fit curve. The code below follows these steps and reproduces Figure 12.4 in the book.

x <- seq(0,9,length.out=199)        # many S for prediction
pR <- rckr(x,a=coef(srR))           # predicted mean R
LCI <- UCI <- numeric(length(x))

for(i in 1:length(x)) {             # CIs for mean R @ each S
  tmp <- apply(bootR$coefboot,MARGIN=1,FUN=rckr,S=x[i])
  LCI[i] <- quantile(tmp,0.025)
  UCI[i] <- quantile(tmp,0.975)
}
ylmts <- range(c(pR,LCI,UCI,pinks$ret))
xlmts <- range(c(x,pinks$esc))

# schematic plot, then band, raw points, and best-fit curve
plot(ret~esc,data=pinks,type="n",xlim=xlmts,ylim=ylmts,
     ylab="Returners (millions)",
     xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI,rev(UCI)),col="gray80",border=NA)
points(ret~esc,data=pinks,pch=19)
lines(pR~x,lwd=2)

plot of chunk RickerFit1

These results can be modified to plot recruits per spawner versus spawners by replacing the “recruits” in the code above with “recruits per spawner.” This is simple for the actual data as ret is simply replaced with retperesc. However, the predicted number of recruits (in pR) and the confidence bounds (in LCI and UCI) from above must be divided by the number of spawners (in x). As the / symbol has a special meaning in R formulas, this division must be contained within I() as when it appears in a formula (see the lines() code below). Of course, the y-axis scale range must also be adjusted. Thus, a plot of recruits per spawner versus spawners is produced from the previous results with the following code.

ylmts <- c(0.7,7)
plot(retperesc~esc,data=pinks,type="n",xlim=xlmts,ylim=ylmts,
     ylab="Returners/Escapement",   # y label assumed
     xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI/x,rev(UCI/x)),col="gray80",border=NA)
points(retperesc~esc,data=pinks,pch=19)
lines(I(pR/x)~x,lwd=2)

plot of chunk RickerFit2

Alternatively, the Ricker stock-recruitment model could be reparameterized by dividing each side of the function by “spawners” such that the right-hand side becomes “recruits per spawner” (this is a fairly typical reparameterization of the model). This model can be put into an R function, with model parameters then estimated with nonlinear regression similar to above. The results below show that the parameter point estimates are identical and the bootstrapped confidence intervals are similar to what was obtained above.

rckr2 <- function(S,a,b=NULL) {
  if (length(a)>1) { b <- a[[2]]; a <- a[[1]] }
  a*exp(-b*S)                       # Ricker recruits-per-spawner
}
srR2 <- nls(logretperesc~log(rckr2(esc,a,b)),data=pinks,start=svR)
bootR2 <- nlsBoot(srR2)
cbind(estimates=coef(srR2),confint(bootR2))
##    estimates     95% LCI   95% UCI
## a 2.84924202 1.67734916 4.8613811
## b 0.05516673 -0.08776123 0.2040402

With this, a second method for plotting recruits per spawner versus spawners is the same as how the main plot from the book was constructed but modified to use the results from this “new” model.

x <- seq(0,9,length.out=199)        # many S for prediction
pRperS <- rckr2(x,a=coef(srR2))     # predicted mean RperS
LCI2 <- UCI2 <- numeric(length(x))

for(i in 1:length(x)) {             # CIs for mean RperS @ each S
  tmp <- apply(bootR2$coefboot,MARGIN=1,FUN=rckr2,S=x[i])
  LCI2[i] <- quantile(tmp,0.025)
  UCI2[i] <- quantile(tmp,0.975)
}
ylmts <- range(c(pRperS,LCI2,UCI2,pinks$retperesc))
xlmts <- range(c(x,pinks$esc))

plot(retperesc~esc,data=pinks,type="n",xlim=xlmts,ylim=ylmts,
     ylab="Returners/Escapement",
     xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI2,rev(UCI2)),col="gray80",border=NA)
points(retperesc~esc,data=pinks,pch=19)
lines(pRperS~x,lwd=2)

plot of chunk RickerFit3

The two methods described above for plotting recruits per spawner versus spawners are identical for the best-fit curve and nearly identical for the confidence bounds (slight differences are likely due to the randomness inherent in bootstrapping). Thus, the two methods produce nearly the same visual.

plot(retperesc~esc,data=pinks,type="n",xlim=xlmts,ylim=ylmts,
     ylab="Returners/Escapement",
     xlab="Escapement (millions)")
lines(I(pR/x)~x,lwd=6,col="gray70")   # method one
lines(pRperS~x,lwd=2)                 # method two

plot of chunk RickerFit4

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.


Upcoming R conferences (2018)

By R on The Jumping Rivers Blog

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

It’s that time of year when we need to start thinking about what R Conferences we would like to (and can!) attend. To help plan your (ahem) work trips, we thought it would be useful to list the upcoming main attractions.

We maintain a list of upcoming rstats conferences. To keep up to date, just follow our twitter bot.

rstudio::conf (San Diego, USA)

rstudio::conf is about all things R and RStudio

The next RStudio conference is held in San Diego and boasts top-quality speakers from around the R world. With excellent tutorials and great talks, this is certainly one of the top events. The estimated cost is around $200 per day, so not the cheapest option, but worthwhile.

SatRday (Cape Town, South Africa)

An opportunity to hear from and network with top Researchers, Data Scientists and Developers from the R community in South Africa and beyond.

This workshop combines an amazing location with great speakers at an amazing price (only $70 per day). The keynote speakers are Maëlle Salmon and Stephanie Kovalchik. This year’s SatRday has two tutorials, one on package building and the other on sports modelling.

  • March 17th: SatRday. Cape Town, South Africa.


eRum 2018 (Budapest, Hungary)

The European R Users Meeting, eRum, is an international conference that aims at integrating users of the R language living in Europe. This conference is similar to useR!.

R/Finance 2018 (Chicago, USA)

From the inaugural conference in 2009, the annual R/Finance conference in Chicago has become the primary meeting for academics and practitioners interested in using R in Finance. Participants from academia and industry mingle for two days to exchange ideas about current research, best practices and applications. A single-track program permits continued focus on a series of refereed submissions. We hear a lively social program rounds out the event.

useR! 2018 (Brisbane, Australia)

With useR! 2017 spanning 5 days and boasting some of the biggest names in data science, the next installment of the useR! series is sure to be even better. Last year the program was packed with speakers, programmes and tutorials from industry and academia. Each day usually contains numerous tutorials and ends in a keynote speech followed by dinner(!). Registration opens in January 2018.

Noreast’R Conference (North East USA)

Born on Twitter, the Noreast’R Conference is a grass roots effort to organize a regional #rstats conference in the Northeastern United States. Work is currently under way and further details will follow on the Noreast’R website and twitter page (linked below).

To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.


Explaining Predictions of Machine Learning Models with LIME – Münster Data Science Meetup

By Dr. Shirin Glander

(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

Slides from Münster Data Science Meetup

These are my slides from the Münster Data Science Meetup on December 12th, 2017.

My sketchnotes were collected from these two podcasts:

Sketchnotes: TWiML Talk #7 with Carlos Guestrin – Explaining the Predictions of Machine Learning Models & Data Skeptic Podcast – Trusting Machine Learning Models with Lime

Example Code

  • the following libraries were loaded:
library(tidyverse)  # for tidy data analysis
library(farff)      # for reading arff file
library(missForest) # for imputing missing values
library(dummies)    # for creating dummy variables
library(caret)      # for modeling
library(lime)       # for explaining predictions


The Chronic Kidney Disease dataset was downloaded from UC Irvine’s Machine Learning repository:

  • load data with the farff package


  • age – age
  • bp – blood pressure
  • sg – specific gravity
  • al – albumin
  • su – sugar
  • rbc – red blood cells
  • pc – pus cell
  • pcc – pus cell clumps
  • ba – bacteria
  • bgr – blood glucose random
  • bu – blood urea
  • sc – serum creatinine
  • sod – sodium
  • pot – potassium
  • hemo – hemoglobin
  • pcv – packed cell volume
  • wc – white blood cell count
  • rc – red blood cell count
  • htn – hypertension
  • dm – diabetes mellitus
  • cad – coronary artery disease
  • appet – appetite
  • pe – pedal edema
  • ane – anemia
  • class – class

Missing data

  • impute missing data with Nonparametric Missing Value Imputation using Random Forest (missForest package)
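A sketch of that step, demonstrated on iris with artificially deleted values (prodNA() ships with missForest; the kidney-disease object names are not shown here):

```r
library(missForest)

set.seed(42)
iris_mis <- prodNA(iris, noNA = 0.1)   # knock out 10% of values at random
imp <- missForest(iris_mis)            # random-forest imputation per variable
iris_imp <- imp$ximp                   # $ximp holds the completed data frame
```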

One-hot encoding

  • create dummy variables (dummies package)
  • scale and center


# training and test set
## Random Forest 
## 360 samples
##  48 predictor
##   2 classes: 'ckd', 'notckd' 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 324, 324, 324, 324, 325, 324, ... 
## Resampling results across tuning parameters:
##   mtry  Accuracy   Kappa    
##    2    0.9922647  0.9838466
##   25    0.9917392  0.9826070
##   48    0.9872930  0.9729881
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
# predictions (model_rf and test_data are the fitted model and test set)
pred <- data.frame(actual = test_data$class,
                   predict(model_rf, test_data, type = "prob")) %>%
  mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)],
         correct = ifelse(actual == prediction, "correct", "wrong"))

confusionMatrix(pred$actual, pred$prediction)
## Confusion Matrix and Statistics
##           Reference
## Prediction ckd notckd
##     ckd     23      2
##     notckd   0     15
##                Accuracy : 0.95            
##                  95% CI : (0.8308, 0.9939)
##     No Information Rate : 0.575           
##     P-Value [Acc > NIR] : 1.113e-07       
##                   Kappa : 0.8961          
##  Mcnemar's Test P-Value : 0.4795          
##             Sensitivity : 1.0000          
##             Specificity : 0.8824          
##          Pos Pred Value : 0.9200          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5750          
##          Detection Rate : 0.5750          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.9412          
##        'Positive' Class : ckd             


  • LIME needs data without response variable
  • build explainer
  • run explain() function
  • model reliability
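The first three steps might look like this sketch (the object names `train_data`, `test_data`, and `model_rf` are assumptions from the modeling step); `explanation_df` then feeds the reliability and feature plots below:

```r
library(lime)
library(dplyr)

# LIME needs the data without the response variable
train_x <- select(train_data, -class)
test_x  <- select(test_data, -class)

# build the explainer from the training data and the caret model
explainer <- lime(train_x, model_rf)

# run explain() on the test observations
explanation_df <- explain(test_x, explainer,
                          n_labels = 1, n_features = 8)
```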
explanation_df %>%
  ggplot(aes(x = model_r2, fill = label)) +
    geom_density(alpha = 0.5)

  • plot explanations
plot_features(explanation_df[1:24, ], ncol = 1)

Session Info

## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.2 (2017-09-28)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  de_DE.UTF-8                 
##  tz                               
##  date     2017-12-12
## Packages -----------------------------------------------------------------
##  package      * version  date       source        
##  assertthat     0.2.0    2017-04-11 CRAN (R 3.4.0)
##  backports      1.1.1    2017-09-25 CRAN (R 3.4.2)
##  base         * 3.4.2    2017-10-04 local         
##  BBmisc         1.11     2017-03-10 CRAN (R 3.4.0)
##  bindr          0.1      2016-11-13 CRAN (R 3.4.0)
##  bindrcpp     * 0.2      2017-06-17 CRAN (R 3.4.0)
##  blogdown       0.3      2017-11-13 CRAN (R 3.4.2)
##  bookdown       0.5      2017-08-20 CRAN (R 3.4.1)
##  broom          0.4.3    2017-11-20 CRAN (R 3.4.2)
##  caret        * 6.0-77   2017-09-07 CRAN (R 3.4.1)
##  cellranger     1.1.0    2016-07-27 CRAN (R 3.4.0)
##  checkmate      1.8.5    2017-10-24 CRAN (R 3.4.2)
##  class          7.3-14   2015-08-30 CRAN (R 3.4.2)
##  cli            1.0.0    2017-11-05 CRAN (R 3.4.2)
##  codetools      0.2-15   2016-10-05 CRAN (R 3.4.2)
##  colorspace     1.3-2    2016-12-14 CRAN (R 3.4.0)
##  compiler       3.4.2    2017-10-04 local         
##  crayon         1.3.4    2017-09-16 cran (@1.3.4) 
##  CVST           0.2-1    2013-12-10 CRAN (R 3.4.0)
##  datasets     * 3.4.2    2017-10-04 local         
##  ddalpha        1.3.1    2017-09-27 CRAN (R 3.4.2)
##  DEoptimR       1.0-8    2016-11-19 CRAN (R 3.4.0)
##  devtools       1.13.4   2017-11-09 CRAN (R 3.4.2)
##  digest         0.6.12   2017-01-27 CRAN (R 3.4.0)
##  dimRed         0.1.0    2017-05-04 CRAN (R 3.4.0)
##  dplyr        * 0.7.4    2017-09-28 CRAN (R 3.4.2)
##  DRR            0.0.2    2016-09-15 CRAN (R 3.4.0)
##  dummies      * 1.5.6    2012-06-14 CRAN (R 3.4.0)
##  e1071          1.6-8    2017-02-02 CRAN (R 3.4.0)
##  evaluate       0.10.1   2017-06-24 CRAN (R 3.4.0)
##  farff        * 1.0      2016-09-11 CRAN (R 3.4.0)
##  forcats      * 0.2.0    2017-01-23 CRAN (R 3.4.0)
##  foreach      * 1.4.3    2015-10-13 CRAN (R 3.4.0)
##  foreign        0.8-69   2017-06-22 CRAN (R 3.4.1)
##  ggplot2      * 2.2.1    2016-12-30 CRAN (R 3.4.0)
##  glmnet         2.0-13   2017-09-22 CRAN (R 3.4.2)
##  glue           1.2.0    2017-10-29 CRAN (R 3.4.2)
##  gower          0.1.2    2017-02-23 CRAN (R 3.4.0)
##  graphics     * 3.4.2    2017-10-04 local         
##  grDevices    * 3.4.2    2017-10-04 local         
##  grid           3.4.2    2017-10-04 local         
##  gtable         0.2.0    2016-02-26 CRAN (R 3.4.0)
##  haven          1.1.0    2017-07-09 CRAN (R 3.4.0)
##  hms            0.4.0    2017-11-23 CRAN (R 3.4.3)
##  htmltools      0.3.6    2017-04-28 CRAN (R 3.4.0)
##  htmlwidgets    0.9      2017-07-10 CRAN (R 3.4.1)
##  httpuv         1.3.5    2017-07-04 CRAN (R 3.4.1)
##  httr           1.3.1    2017-08-20 CRAN (R 3.4.1)
##  ipred          0.9-6    2017-03-01 CRAN (R 3.4.0)
##  iterators    * 1.0.8    2015-10-13 CRAN (R 3.4.0)
##  itertools    * 0.1-3    2014-03-12 CRAN (R 3.4.0)
##  jsonlite       1.5      2017-06-01 CRAN (R 3.4.0)
##  kernlab        0.9-25   2016-10-03 CRAN (R 3.4.0)
##  knitr          1.17     2017-08-10 CRAN (R 3.4.1)
##  labeling       0.3      2014-08-23 CRAN (R 3.4.0)
##  lattice      * 0.20-35  2017-03-25 CRAN (R 3.4.2)
##  lava           1.5.1    2017-09-27 CRAN (R 3.4.1)
##  lazyeval       0.2.1    2017-10-29 CRAN (R 3.4.2)
##  lime         * 0.3.1    2017-11-24 CRAN (R 3.4.3)
##  lubridate      1.7.1    2017-11-03 CRAN (R 3.4.2)
##  magrittr       1.5      2014-11-22 CRAN (R 3.4.0)
##  MASS           7.3-47   2017-02-26 CRAN (R 3.4.2)
##  Matrix         1.2-12   2017-11-15 CRAN (R 3.4.2)
##  memoise        1.1.0    2017-04-21 CRAN (R 3.4.0)
##  methods      * 3.4.2    2017-10-04 local         
##  mime           0.5      2016-07-07 CRAN (R 3.4.0)
##  missForest   * 1.4      2013-12-31 CRAN (R 3.4.0)
##  mnormt         1.5-5    2016-10-15 CRAN (R 3.4.0)
##  ModelMetrics   1.1.0    2016-08-26 CRAN (R 3.4.0)
##  modelr         0.1.1    2017-07-24 CRAN (R 3.4.1)
##  munsell        0.4.3    2016-02-13 CRAN (R 3.4.0)
##  nlme           3.1-131  2017-02-06 CRAN (R 3.4.2)
##  nnet           7.3-12   2016-02-02 CRAN (R 3.4.2)
##  parallel       3.4.2    2017-10-04 local         
##  pkgconfig      2.0.1    2017-03-21 CRAN (R 3.4.0)
##  plyr           1.8.4    2016-06-08 CRAN (R 3.4.0)
##  prodlim        1.6.1    2017-03-06 CRAN (R 3.4.0)
##  psych          1.7.8    2017-09-09 CRAN (R 3.4.1)
##  purrr        * 0.2.4    2017-10-18 CRAN (R 3.4.2)
##  R6             2.2.2    2017-06-17 CRAN (R 3.4.0)
##  randomForest * 4.6-12   2015-10-07 CRAN (R 3.4.0)
##  Rcpp           0.12.14  2017-11-23 CRAN (R 3.4.3)
##  RcppRoll       0.2.2    2015-04-05 CRAN (R 3.4.0)
##  readr        * 1.1.1    2017-05-16 CRAN (R 3.4.0)
##  readxl         1.0.0    2017-04-18 CRAN (R 3.4.0)
##  recipes        0.1.1    2017-11-20 CRAN (R 3.4.3)
##  reshape2       1.4.2    2016-10-22 CRAN (R 3.4.0)
##  rlang          0.1.4    2017-11-05 CRAN (R 3.4.2)
##  rmarkdown      1.8      2017-11-17 CRAN (R 3.4.2)
##  robustbase     0.92-8   2017-11-01 CRAN (R 3.4.2)
##  rpart          4.1-11   2017-03-13 CRAN (R 3.4.2)
##  rprojroot      1.2      2017-01-16 CRAN (R 3.4.0)
##  rstudioapi     0.7      2017-09-07 CRAN (R 3.4.1)
##  rvest          0.3.2    2016-06-17 CRAN (R 3.4.0)
##  scales         0.5.0    2017-08-24 CRAN (R 3.4.1)
##  sfsmisc        1.1-1    2017-06-08 CRAN (R 3.4.0)
##  shiny          1.0.5    2017-08-23 CRAN (R 3.4.1)
##  shinythemes    1.1.1    2016-10-12 CRAN (R 3.4.0)
##  splines        3.4.2    2017-10-04 local         
##  stats        * 3.4.2    2017-10-04 local         
##  stats4         3.4.2    2017-10-04 local         
##  stringdist  2017-07-31 CRAN (R 3.4.1)
##  stringi        1.1.6    2017-11-17 CRAN (R 3.4.2)
##  stringr      * 1.2.0    2017-02-18 CRAN (R 3.4.0)
##  survival       2.41-3   2017-04-04 CRAN (R 3.4.0)
##  tibble       * 1.3.4    2017-08-22 CRAN (R 3.4.1)
##  tidyr        * 0.7.2    2017-10-16 CRAN (R 3.4.2)
##  tidyselect     0.2.3    2017-11-06 CRAN (R 3.4.2)
##  tidyverse    * 1.2.1    2017-11-14 CRAN (R 3.4.2)
##  timeDate       3042.101 2017-11-16 CRAN (R 3.4.2)
##  tools          3.4.2    2017-10-04 local         
##  utils        * 3.4.2    2017-10-04 local         
##  withr          2.1.0    2017-11-01 CRAN (R 3.4.2)
##  xml2           1.1.1    2017-01-24 CRAN (R 3.4.0)
##  xtable         1.8-2    2016-02-05 CRAN (R 3.4.0)
##  yaml           2.1.15   2017-12-01 CRAN (R 3.4.3)
To leave a comment for the author, please follow the link and comment on their blog: Shirin’s playgRound. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more…

Source:: R News

Random walking

By dan

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)


I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:

1 up (North)
2 diagonal to the upper right (Northeast)
3 diagonal to the lower right (Southeast)
4 down (South)
5 diagonal to the lower left (Southwest)
6 diagonal to the upper left (Northwest)

Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”

After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.

We did this about 50 times. The results are pictured above, along with kid hands for scale.

I sent the picture to my friend and serial co-author Jake Hofman because he likes a good kid’s science project and has a mental association for everything in applied math. He wrote “time for some Brownian motion?” and sent a talk he’d given a decade ago at a high school, which taught me all kinds of things I hadn’t realized, connecting random walks to Brownian motion, Einstein, the existence of atoms, and binomial pricing trees in finance. (I was especially embarrassed not to have mentally connected random walks to binomial pricing, because I had a slide on that in my job talk years ago and because it is the method we used in an early distribution builder paper.)

Back at home Jake did some simulations on random walks in one dimension (in which you just go forward or backward with equal probability) and sent them to me. Next, I did the same with hexagonal random walks (code at the end of this post). Here’s an image of one random walk on a hexagonal field.

I simulated 5000 random walks of 250 steps, starting at the point 0,0. The average X and Y position is 0 at each step, as shown here.

This might seem strange at first. But think about many walks of just one step. The number of one-step journeys in which your X position is increased a certain amount will be matched, in expectation, by an equal number of one-step journeys in which your X position is decreased by the same amount. Your average X position is thus 0 at the first step. Same is true for Y. The logic scales when you take two or more steps and that’s why we see the flat lines we do.
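As a hypothetical illustration of that argument (not Jake’s actual code), averaging many one-dimensional walks shows the mean position hugging zero at every step, even as individual walks wander far away:

```r
# many 1-D random walks: each step is +1 or -1 with equal probability
set.seed(1)  # seed is an assumption
n_walks <- 5000
n_steps <- 250

# matrix with one row per step and one column per walk
walks <- replicate(n_walks, cumsum(sample(c(-1, 1), n_steps, replace = TRUE)))

# average position across walks at each step stays near zero...
mean_pos <- rowMeans(walks)

# ...while the RMS distance from the start keeps growing like sqrt(step)
rms_dist <- sqrt(rowMeans(walks^2))
```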

If you think about this wrongheadedly you’d think you weren’t getting anywhere. But of course you are. Let’s look at your average distance from the starting point at each step (below).

The longer you walk, the more distant from the starting point you tend to be. Because distances are positive, the average of those distances is positive. We say you “tend to” move away from the origin at each step, because that is what happens on average over many trips. At any given step on any given trip, you could move towards or away from the starting point with equal probability. This is deep stuff.

Speaking of deep stuff, you might notice that the relationship is pretty. Let’s zoom in.

The dashed line is the square root of the number of steps. It’s interesting to note that this square root relationship happens in a one-dimensional random walk as well. There’s a good explanation of it in this document. As Jake put it, it’s as if the average walk is covered by a circular plate whose area grows linearly with the number of steps. (Why linearly? Because area of a circle is proportional to its radius squared. Since the radius grows as the square root of the number of steps, the radius squared is linear in the number of steps)

(*) As a sidenote, I was at first seeing something that grew more slowly than the square root and couldn’t figure out what the relationship was. It turns out that the square root relationship holds for the root mean squared distance (the square root of the mean of the squared distances), and I had been looking at the mean Euclidean distance. It’s a useful reminder that the term “average” has quite a few definitions. “Average” is a useful term for getting the gist across, but it can lead to some confusion.

Speaking of gists, here’s the R code. Thanks to @hadleywickham for creating the tidyverse and making everything awesome.
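The hexagonal-walk simulation described above can be sketched as follows (the direction angles, seed, and variable names are my assumptions, not the original gist):

```r
library(tidyverse)

# six unit-step directions on the hexagonal grid, matching the die:
# 1 N, 2 NE, 3 SE, 4 S, 5 SW, 6 NW (60 degrees apart)
angles <- c(90, 30, -30, -90, -150, 150) * pi / 180
dx <- cos(angles)
dy <- sin(angles)

set.seed(2017)  # seed is an assumption
n_walks <- 5000
n_steps <- 250

# simulate each walk as a cumulative sum of die-roll direction vectors
walks <- map_dfr(seq_len(n_walks), function(w) {
  rolls <- sample(1:6, n_steps, replace = TRUE)
  tibble(walk = w,
         step = seq_len(n_steps),
         x = cumsum(dx[rolls]),
         y = cumsum(dy[rolls]))
})

# mean position and root mean squared distance at each step
summary_df <- walks %>%
  group_by(step) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            rms_dist = sqrt(mean(x^2 + y^2)))

# the RMS distance tracks the square root of the number of steps (dashed)
ggplot(summary_df, aes(step, rms_dist)) +
  geom_line() +
  geom_line(aes(y = sqrt(step)), linetype = "dashed")
```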


The post Random walking appeared first on Decision Science News.

To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

Source:: R News

Personal note on joining the Microsoft Cloud Advocates team

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A quick personal note: today is my first day as a member of the Cloud Developer Advocates team at Microsoft! I’ll still be blogging and attending events related to R, and supporting the R community, but now I’ll be doing it as a member of a team dedicated to community outreach.

As a bit of background, when I joined Microsoft back in 2015 via the acquisition of Revolution Analytics, I was thrilled to be able to continue my role supporting the R community. Since then, Microsoft as a whole has continued to ramp up its support of open source projects and to interact directly with developers of all stripes (including data scientists!) through various initiatives across the company. (Aside: I knew Microsoft was a big company before I joined, but even then it took me a while to appreciate the scale of the different divisions, groups, and geographies. For me, it was a bit like moving to a new city, in the sense that it takes a while to learn what the neighborhoods are and how to find your way around.)

I learned of the Cloud Developer Advocates group after reading a RedMonk blog post about how some prominent techies (many of whom I’ve long admired and respected on Twitter) had been recruited into Microsoft to engage directly with developers. So when I learned that there was an entire group at Microsoft devoted to community outreach, and with such a fantastic roster already on board (and still growing!), I knew I had to be a part of it. I’ll be working with a dedicated team (including Paige, Seth and Vadim) focused on data science, machine learning, and AI. As I mentioned above, I’ll still be covering R, but will also be branching out into some other areas as well.

Stay tuned for more, and please hit me up on Twitter or via email if you’d like to chat, want to connect at an event, or just let me know how I can help. Let’s keep the conversation going!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Source:: R News