This is an introduction to the programming language R, focused on a powerful set of tools known as the “tidyverse”. In the course you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.
Introduction to the Tidyverse features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a Tidyverse expert!
What you’ll learn
1. Data wrangling
In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.
2. Data visualization
You’ve already been able to answer some questions about the data through dplyr, but you’ve engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.
3. Grouping and summarizing
So far you’ve been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group by and summarize verbs, which collapse large datasets into manageable summaries.
4. Types of visualizations
You’ve learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.
Four score and ten years ago, your only shot of making it to the NBA was usually through the NCAA. The NBA has adopted a global perspective. You see professionals from different continental leagues finding a way into the NBA. The 2016 to 2017 diagram is interesting, we’re seeing a darker edge from the NBA to the G-League. The NBA Players Association worked hard on installing ‘2 way contracts’ letting NBA players play in (and get paid in) both the NBA and the G-League, this was a great move, a no-brainer.
Options are great, but keep in mind that the NBA is the crème de la crème. You see the lightest edges enter the NBA node while you see many dark edges enter the ‘none’ node.
When I started making the above network plot of basketball league to league migrations, I looked into off the shelf R packages. There’s a fragmented handfull of R packages to do this. Kicker, none of them do directed edges the way I wanted.
If you’re an #rstats user and want to adopt the BigBallR philosophy, find multiple ways to achieve your goal.
I went back to the basics and hand rolled my own network plot with vanilla ggplot2. At the end of the day, a network plot needs two things, nodes and edges, eg points (stored in dat_nodes) and lines (stored in dat_edges). Once you have your data in that format, you can make the network migration plot above, with the code snippet below
I hand-rolled the network diagram cuz the other packages didn’t have the custom features I needed. In my hand-rolled plot there is still one thing missing. I want to place the ‘arrow head’ along some other part of the curve (say the mid-point), other than the end-point. This is probably hard to do, since arrow() looks like it just needs the receiving coordinate and plops the arrow head there. For what I want, ggplot2 needs to know the desired placement-coordinates output from geom_curve(). So somehow, the internal curve calculations need to be returned in order to pass into arrow().
Reproduced with the kind permission of our Head of Data Engineering, Mark Sellors and first published on his blog
When a Tweet Turns Into an R Package
I just wanted to write up a brief post about the power of R, its community, and tell the story of how actually putting stuff out into the world can have amazing consequences.
About 24 hours ago I was going to tweet something like this:
Hey Mac #rstats users – system(‘say “hello rstats user”’)
I’d been playing with the MacOS command line tool, ‘say’, with the kids, and just figured it would be funny to make R say stuff.
Before I tweeted though, I thought I’d better check that it worked as intended. While I was doing that I decided it would be fun to expose the say command’s different voices and figured I’d make a gist on github instead, as it was getting too long for a tweet.
So, I started playing around with the different voices and thinking of a nice way to expose that and by this point I’m thinking I might as well just make it into a package. Making a package in RStudio is so easy, it seemed like the easiest way to share what I was doing. And so now we have the rsay package.
Bear in mind that I didn’t start out to create a package. It has one function, which is essentially the thinnest wrapper around an existing command line tool and it only works on MacOS and it’s pretty silly really, so I didn’t expect there to be a great deal of interest.
In the end, instead of tweeting that one command, I tweeted about how I accidentally made this R package.
That in turn got noticed by the awesome Mara Averick, which resulted in this tweet:
Then Mark Edmondson sees Mara’s tweet, and mentions how it might be fun to build it into a sort of Babelfish using some of his existing work around the Google APIs. (Mark knows a lot about Google’s APIs and R!)
A few hours after initially mentioning this, the following tweet appears in my timeline:
To recap; silly idea, and R’s easy to use packaging tools lead to an accidental (but still pretty silly) R package. Package is popularised by prominent developer advocate. Package is then integrated into a shiny app for automated machine translation, by well known R/Google API developer.
It’s been a fun 24 hours, but the main thing to take away, is just to get your stuff out there, even if you don’t think it’s that great/interesting/useful. When you share an idea, even if it seems frivolous or trivial, that idea takes on a life of its own and you never know where that may lead.
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.
Here is a quick example for how to get started with some of the more sophisticated point pattern analysis tools that have been developed for ecologists – principally the adehabitathr package – but that are very useful for human data. Ecologists deploy point pattern analysis to establish the “home range” of a particular animal based on the know locations it has been sighted (either directly or remotely via camera traps). Essentially it is where the animal spends most of its time. In the case of human datasets the analogy can be extended to identify areas where most crimes are committed – hotspots – or to identify the activity spaces of individuals or the catchment areas of services such as schools and hospitals.
This tutorial offers a rough analysis of crime data in London so the maps should not be taken as definitive – I’ve just used them as a starting point here.
Police crime data have been taken from here https://data.police.uk/data/. This tutorial London uses data from London’s Metropolitan Police (September 2017).
#Load in the library and the csv file.
At the moment input is a basic data frame. We need to convert the data frame into a spatial object. Note we have first specified our epsg code as 4326 since the coordinates are in WGS84. We then use spTransform to reproject the data into British National Grid – so the coordinate values are in meters.
The plot reveals that we have crimes across the UK, not just in London. So we need an outline of London to help limit the view. Here we load in a shapefile of the Greater London Authority Boundary.
Thinking ahead we may wish to compare a number of density estimates, so they need to be performed across a consistently sized grid. Here we create an empty grid in advance to feed into the kernelUD function.
OK now run the density estimation note we use grid= xy utlise the grid we just created. This is for all crime in London.
Unsurprisingly we have a hotpot over the centre of London. Are there differences for specific crime types? We may, for example, wish to look at the density of burglaries.
plot(Crime.Spatial[Crime.Spatial$Crime.type=="Burglary",]) # quick plot of burglary points
There’s a slight difference but still it’s tricky to see if there are areas where burglaries concentrate more compared to the distribution of all crimes. A very rough way to do this is to divide one density grid by the other.
This hasn’t worked particularly well since there are edge effects on the density grid that are causing issues due to a few stray points at the edge of the grid. We can solve this by capping the values we map – in this we are only showing values of between 0 and 1. Some more interesting structures emerge with burglary occuring in more residential areas, as expected.
both2 = 1]
There’s many more sophisticated approaches to this kind of analysis – I’d encourgae you to look at the adehabitatHR vignette on the package page.
To leave a comment for the author, please follow the link and comment on their blog: R – Spatial.ly.
In the book that you published, I frequently use the stock-recruit curve code. The interface that shows both the Ricker/Beverton-Holt figure with the recruit per spawner to spawner figure (i.e., the dynamic plot for srStarts()) has not been working for quite some time. Additionally, I can get the recruits versus spawner plot for the Beverton-Holt or Ricker curve with confidence bounds around the curve, but how do you do the same for the recruit per spawner to spawner curve?
Below I answer both questions. First, however, I load required packages …
… and obtain the same data (in PSalmonAK.csv available from here) used in the Stock-Recruitment chapter of my Introductory Fisheries Analyses with R (IFAR) book. Note that I created two new variables here: retperesc is the “recruits per spawner” and logretperesc is the natural log of the recruits per spawner variable.
Dynamic Plot Issue
The first question about the dynamic plot is due to a change in functionality in the FSA package since IFAR was published. The dynamicPlot= argument was removed because the code for that argument relied on the tcltk package, which I found difficult to reliably support. A similar, though more manual, approach is accomplished with the new fixed= and plot= arguments. For example, using plot=TRUE (without fixed=) will produces a plot of “recruits” versus “stock” with the chosen stock-recruitment model evaluated at the automatically chosen parameter starting values superimposed.
The user, however, can show the stock-recruitment model evaluated at manually chosen parameter starting values by including those starting values in a list that is supplied to fixed=. These values can be iteratively changed in subsequent calls to srStarts() to manually find starting values that provide a model that reasonably fits (by eye) the stock-recruit data.
Note however that srStarts() no longer supports the simultaneously plotting of spawners versus recruits and recruits per spawner versus recruits.
Plot of Recruits per Spawner versus Spawners
The first way that I can imagine plotting recruits per spawners versus spawners with the fitted curve and confidence bands is to first follow the code for fitting the stock-recruit function to the stock and recruit data as described in IFAR. In this case, the stock-recruit function is fit on the log scale to adjust for a multiplicative error structure (as described in IFAR).
As described in the book, the plot of spawners versus recruits is made by (i) constructing a sequence of “x” values that span the range of observed numbers of spawners, (ii) predicting the number of recruits at each spawner value using the best-fit stock-recruitment model, (iii) constructing lower and upper confidence bounds for the predicted number of recruits at each spawner value with the bootstrap results, (iv) making a schematic plot on which to put (v) a polygon for the confidence band, (vi) the raw data points, and (vii) the best-fit curve. The code below follows these steps and reproduces Figure 12.4 in the book.
These results can be modified to plot recruits per spawner versus spawners by replacing the “recruits” in the code above with “recruits per spawner.” This is simple for the actual data as ret is simply replaced with retperesc. However, the predicted number of recruits (in pR) and the confidence bounds (in LCI and UCI) from above must be divided by the number of spawners (in x). As the / symbol has a special meaning in R formulas, this division must be contained within I() as when it appears in a formula (see the lines() code below). Of course, the y-axis scale range must also be adjusted. Thus, a plot of recruits per spawner versus spawners is produced from the previous results with the following code.
Alternatively, the Ricker stock-recruitment model could be reparameterized by dividing each side of the function by “spawners” such that the right-hand-side becomes “recruits per spawner” (this is a fairly typical reparameterization of the model). This model can be put into an R function, with model parameters then estimated with nonlinear regression similar to above. The results below show that the paramter point estimates are identical and the bootsrapped confidence intervals are similar to what was obtained above.
With this, a second method for plotting recruits per spawner versus spawners is the same as how the main plot from the book was constructed but modified to use the results from this “new” model.
The two methods described above for plotting recruits per spawner versuse spawners are identical for the best-fit curve and nearly identical for the confidence bounds (slight differences likely due to the randomness inherent in bootstrapping). Thus, the two methods produce nearly the same visual.
To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.
Sketchnotes: TWiML Talk #7 with Carlos Guestrin – Explaining the Predictions of Machine Learning Models & Data Skeptic Podcast – Trusting Machine Learning Models with Lime
the following libraries were loaded:
library(tidyverse) # for tidy data analysis
library(farff) # for reading arff file
library(missForest) # for imputing missing values
library(dummies) # for creating dummy variables
library(caret) # for modeling
library(lime) # for explaining predictions
## Random Forest
## 360 samples
## 48 predictor
## 2 classes: 'ckd', 'notckd'
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 324, 324, 324, 324, 325, 324, ...
## Resampling results across tuning parameters:
## mtry Accuracy Kappa
## 2 0.9922647 0.9838466
## 25 0.9917392 0.9826070
## 48 0.9872930 0.9729881
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
It’s that time of year when we need to start thinking about what R Conferences we would like to (and can!) attend. To help plan your (ahem) work trips, we thought it would be useful to list the upcoming main attractions.
The next RStudio conference is held in San Diego, and boosts top quality speakers from around the R world. With excellent tutorials and great talks, this is certainly one of the top events. The estimated cost is around $200 per day, so not the cheapest options, but worthwhile.
An opportunity to hear from and network with top Researchers, Data Scientists and Developers from the R community in South Africa and beyond.
This workshop combines an amazing location, with great speakers and at an amazing price (only $70 per day). The key note speakers are Maëlle Salmon and Stephanie Kovalchik. This years SatRday has two tutorials, one on package building the other on sports modelling.
From the inaugural conference in 2009, the annual R/Finance conference in Chicago has become the primary meeting for academics and practioners interested in using R in Finance. Participants from academia and industry mingle for two days to exchange ideas about current research, best practices and applications. A single-track program permits continued focus on a series of refereed submissions. We hear there’s a lively social program rounds out the event.
With useR! 2017 spanning over 5 days and boasting some of the biggest names in data science, the next installment of the useR! series is sure to be even better. Last year the program was packed with speakers, programmes and tutorials from industry and academia. Each day usually containing numerous tutorials and usually ends in a keynote speech followed by dinner(!). Registration is open in January 2018.
Born on Twitter, the Noreast’R Conference is a grass roots effort to organize a regional #rstats conference in the Northeastern United States. Work is currently under way and further details will follow on the Noreast’R website and twitter page (linked below).
I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:
1 up (North)
2 diagonal to the upper right (Northeast)
3 diagonal to the lower right (Southeast)
4 down (South)
5 diagonal to the lower left (Southwest)
6 diagonal to the upper left (Northwest)
Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”
After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.
We did this about 50 times. The results are pictured above, along with kid hands for scale.
I sent the picture to my friend and serial co-author Jake Hofman because he likes a good kid’s science project and has a mental association for everything in applied math. He wrote “time for some Brownian motion?” and sent a talk he’d given a decade ago at a high school which taught me all kind of stuff I didn’t realize connecting random walks to Brownian motion, Einstein, the existence of atoms and binomial pricing trees in finance. (I was especially embarrassed not to have mentally connected random walks to binomial pricing because I had a slide on that in my job talk years ago and because it is the method we used in an early distribution builder paper.)
Back at home Jake did some simulations on random walks in one dimension (in which you just go forward or backward with equal probability) and sent them to me. Next, I did the same with hexagonal random walks (code at the end of this post). Here’s an image of one random walk on a hexagonal field.
I simulated 5000 random walks of 250 steps, starting at the point 0,0. The average X and Y position is 0 at each step, as shown here.
This might seem strange at first. But think about many walks of just one step. The number of one-step journeys in which your X position is increased a certain amount will be matched, in expectation, by an equal number of one-step journeys in which your X position is decreased by the same amount. Your average X position is thus 0 at the first step. Same is true for Y. The logic scales when you take two or more steps and that’s why we see the flat lines we do.
If you think about this wrongheadedly you’d think you weren’t getting anywhere. But of course you are. Let’s look at your average distance from the starting point at each step (below).
The longer you walk, the more distant from the starting point you tend to be. Because distances are positive, the average of those distances is positive. We say you “tend to” move away from the origin at each step, because that is what happens on average over many trips. At any given step on any given trip, you could move towards or away from the starting point with equal probability. This is deep stuff.
Speaking of deep stuff, you might notice that the relationship is pretty. Let’s zoom in.
The dashed line is the square root of the number of steps. It’s interesting to note that this square root relationship happens in a one-dimensional random walk as well. There’s a good explanation of it in this document. As Jake put it, it’s as if the average walk is covered by a circular plate whose area grows linearly with the number of steps. (Why linearly? Because area of a circle is proportional to its radius squared. Since the radius grows as the square root of the number of steps, the radius squared is linear in the number of steps)
(*) As a sidenote, I was at first seeing something that grew more slowly than the square root and couldn’t figure out what the relationship was. It turns out that the square root relationship holds for the root mean squared distance (the mean of the squared distances) and I had been looking at the mean Euclidean distance. It’s a useful reminder that the term “average” has quite a few definitions. “Average” is a useful term for getting the gist across, but can lead to some confusion.
Speaking of gists, here’s the R code. Thanks to @hadleywickham for creating the tidyverse and making everything awesome.
A quick personal note: today is my first day as a member of the Cloud Developer Advocates team at Microsoft! I’ll still be blogging and events related to R, and supporting the R community, but now I’ll be doing it as a member of a team dedicated to community outreach.
As a bit of background, when I joined Microsoft back in 2015 via the acquisition of Revolution Analytics, I was thrilled to be able to continue my role supporting the R community. Since then, Microsoft as a whole has continue to ramp up its support of open source projects and to interact directly with developers of all stripes (including data scientists!) through various initiatives across the company. (Aside: I knew Microsoft was a big company before I joined, but even then took me a while to appreciate the scale of the different divisions, groups, and geographies. For me, it was a bit like moving into a new city in the sense that it takes a while to learn what the neighborhoods are and how to find your way around.)
I learned of the Cloud Developer Advocates group after reading a RedMonk blog post about how some prominent techies (many of whom I’ve long admired and respected on Twitter) had been recruited into Microsoft to engage directly with developers. So when I learned that there was an entire group at Microsoft devoted to community outreach, and with such a fantastic roster already on board (and still growing!), I knew I had to be a part of it. I’ll be working with a dedicated team (including Paige, Seth and Vadim) focused on data science, machine learning, and AI. As I mentioned above, I’ll still be covering R, but will also be branching out into some other areas as well.
Stay tuned for more, and please hit me up on Twitter or via email if you’d like to chat, want to connect at an event, or just let me know how I can help. Let’s keep the conversation going!
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.