Why Backtesting On Individual Legs In A Spread Is A BAD Idea

By Ilya Kipnis

(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

So after reading the last post, the author of quantstrat gave mostly critical feedback, aimed largely at the philosophy that prompted the post in the first place. Basically, the reason I wrote it, as I stated before, is that I’ve seen many retail users of quantstrat constantly ask “how do I model individual spread instruments”, and otherwise try to look sophisticated by trading spreads.

The truth is that real professionals use industrial-strength tools to determine their intraday hedge ratios (such a tool is called a spreader). The purpose of quantstrat is not to be an execution modeling system, but to be a *strategy* modeling system. Basically, the purpose of your backtest isn’t to look at individual instruments: in the last post, the aggregate trade statistics told us absolutely nothing about how our actual spread trading strategy performed. The backtest was a mess as far as the analytics were concerned, which rendered it more or less useless. So this post, by request of the author of quantstrat, is about how to do the analysis better by looking at what matters more: the actual performance of the strategy on the actual relationship being traded, namely the *spread*, rather than the two components.

So, without further ado, let’s look at the revised code:

require(quantmod)
require(quantstrat)
require(IKTrading)

getSymbols("UNG", from="1990-01-01")
getSymbols("DGAZ", from="1990-01-01")
getSymbols("UGAZ", from="1990-01-01")
UNG <- UNG["2012-02-22::"]
UGAZ <- UGAZ["2012-02-22::"]

spread <- 3*OHLC(UNG) - OHLC(UGAZ)

initDate='1990-01-01'
currency('USD')
Sys.setenv(TZ="UTC")
symbols <- c("spread")
stock(symbols, currency="USD", multiplier=1)

strategy.st <- portfolio.st <- account.st <- "spread_strategy_done_better"
rm.strat(portfolio.st)
rm.strat(strategy.st)
initPortf(portfolio.st, symbols=symbols, initDate=initDate, currency='USD')
initAcct(account.st, portfolios=portfolio.st, initDate=initDate, currency='USD')
initOrders(portfolio.st, initDate=initDate)
strategy(strategy.st, store=TRUE)

#### parameters

nEMA = 20

### indicator

add.indicator(strategy.st, name="EMA",
              arguments=list(x=quote(Cl(mktdata)), n=nEMA),
              label="ema")

### signals

add.signal(strategy.st, name="sigCrossover",
           arguments=list(columns=c("Close", "EMA.ema"), relationship="gt"),
           label="longEntry")

add.signal(strategy.st, name="sigCrossover",
           arguments=list(columns=c("Close", "EMA.ema"), relationship="lt"),
           label="longExit")

### rules

add.rule(strategy.st, name="ruleSignal", 
         arguments=list(sigcol="longEntry", sigval=TRUE, ordertype="market", 
                        orderside="long", replace=FALSE, prefer="Open", orderqty=1), 
         type="enter", path.dep=TRUE)

add.rule(strategy.st, name="ruleSignal", 
         arguments=list(sigcol="longExit", sigval=TRUE, orderqty="all", ordertype="market", 
                        orderside="long", replace=FALSE, prefer="Open"), 
         type="exit", path.dep=TRUE)

#apply strategy
t1 <- Sys.time()
out <- applyStrategy(strategy=strategy.st,portfolios=portfolio.st)
t2 <- Sys.time()
print(t2-t1)

In this case, things are a LOT simpler. Rather than jumping through the hoops of pre-computing an indicator, along with the shenanigans of separate rules for both the long and the short end, we simply have a spread as it’s theoretically supposed to work–three of an unleveraged ETF against the 3x leveraged ETF, and we can go long the spread, or short the spread. In this case, the dynamic seems to be on the up, and we want to capture that.

So how did we do?

#set up analytics
updatePortf(portfolio.st)
dateRange <- time(getPortfolio(portfolio.st)$summary)[-1]
updateAcct(account.st, dateRange)
updateEndEq(account.st)

#trade statistics
tStats <- tradeStats(Portfolios = portfolio.st, use="trades", inclZeroDays=FALSE)
tStats[,4:ncol(tStats)] <- round(tStats[,4:ncol(tStats)], 2)
print(data.frame(t(tStats[,-c(1,2)])))
(aggPF <- sum(tStats$Gross.Profits)/-sum(tStats$Gross.Losses))
(aggCorrect <- mean(tStats$Percent.Positive))
(numTrades <- sum(tStats$Num.Trades))
(meanAvgWLR <- mean(tStats$Avg.WinLoss.Ratio[tStats$Avg.WinLoss.Ratio < Inf], na.rm=TRUE))

And here’s the output:

> print(data.frame(t(tStats[,-c(1,2)])))
                   spread
Num.Txns            76.00
Num.Trades          38.00
Net.Trading.PL       9.87
Avg.Trade.PL         0.26
Med.Trade.PL        -0.10
Largest.Winner       7.76
Largest.Loser       -1.06
Gross.Profits       21.16
Gross.Losses       -11.29
Std.Dev.Trade.PL     1.68
Percent.Positive    39.47
Percent.Negative    60.53
Profit.Factor        1.87
Avg.Win.Trade        1.41
Med.Win.Trade        0.36
Avg.Losing.Trade    -0.49
Med.Losing.Trade    -0.46
Avg.Daily.PL         0.26
Med.Daily.PL        -0.10
Std.Dev.Daily.PL     1.68
Ann.Sharpe           2.45
Max.Drawdown        -4.02
Profit.To.Max.Draw   2.46
Avg.WinLoss.Ratio    2.87
Med.WinLoss.Ratio    0.78
Max.Equity          13.47
Min.Equity          -1.96
End.Equity           9.87
> (aggPF <- sum(tStats$Gross.Profits)/-sum(tStats$Gross.Losses))
[1] 1.874225
> (aggCorrect <- mean(tStats$Percent.Positive))
[1] 39.47
> (numTrades <- sum(tStats$Num.Trades))
[1] 38
> (meanAvgWLR <- mean(tStats$Avg.WinLoss.Ratio[tStats$Avg.WinLoss.Ratio < Inf], na.rm=TRUE))
[1] 2.87

In other words, this is the typical profile of a trend follower, rather than the uninformative analytics from the last post. Furthermore, the position sizing and equity curve charts actually make sense now. Here they are.
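(The charts themselves appear in the original post. For completeness, here is a minimal sketch of how such charts are usually produced with quantstrat/blotter; the plotting code and the EMA overlay are my assumptions, not taken from the post.)

#position chart for the synthetic spread instrument, with the EMA overlaid (overlay assumed)
chart.Posn(portfolio.st, Symbol="spread", TA="add_EMA(n=20)")

#account equity curve
equityCurve <- getAccount(account.st)$summary$End.Eq
plot(equityCurve, main="Spread Strategy Equity Curve")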

To conclude, while it’s possible to model spreads using individual legs, it makes far more sense in terms of analytics to actually examine the performance of the strategy on the actual relationship being traded, which is the spread itself. Furthermore, after constructing the spread as a synthetic instrument, it can be treated like any other regular instrument in the context of analysis in quantstrat.

Thanks for reading.

NOTE: I am a freelance consultant in quantitative analysis on topics related to this blog. If you have contract or full time roles available for proprietary research that could benefit from my skills, please contact me through my LinkedIn


R in Nature, Mashable

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R was recently the subject of a feature article in the prestigious science magazine Nature: Programming tools: Adventures with R.

Besides being free, R is popular partly because it presents different faces to different users. It is, first and foremost, a programming language — requiring input through a command line, which may seem forbidding to non-coders. But beginners can surf over the complexities and call up preset software packages, which come ready-made with commands for statistical analysis and data visualization. These packages create a welcoming middle ground between the comfort of commercial ‘black-box’ solutions and the expert world of code.

The article highlights many of the packages you can use for scientific analysis with R, and also mentions several scientific projects based on R, including BioConductor and ROpenSci. The article also noted that the use of R has increased rapidly in a number of scientific disciplines, as measured by the rate at which R is cited in published articles.

The article also includes quotes from R’s co-creator Robert Gentleman (“I can write software that would be good for somebody doing astronomy, but it’s a lot better if someone doing astronomy writes software for other people doing astronomy”) and Bob Muenchen, who tracks the popularity of statistical software (“Most likely, R became the top statistics package used during the summer of this year.”).

Mashable isn’t in the same authoritative league as Nature, but it’s read by a lot of people. So it’s great that R also got a mention in the recent article, So you wanna be a data scientist?.

“On an average day, I manage a series of dashboards that tell our company about our business — what the users are doing,” says Jon Greenberg, a data scientist at Playstudios, a gaming firm. Greenberg is a manager now, so he’s programming less than he used to, but he still does his fair share. Usually, he pulls data out of Apache Hadoop storage and runs it through Revolution R, an analytics platform and comes up with some kind of visualization. “It may be how one segment of the population is interacting with a new feature,” he explains.

The article also describes the experiences of other data scientists and gives some salary statistics on “2015’s Hottest Profession“: Data Science.


Interactive Simple Networks

By jlebeau

(This article was first published on More or Less Numbers, and kindly contributed to R-bloggers)
This post isn’t anything new in terms of analysis, but just a cooler look at a previous post. I looked at board members of large companies in a previous


digest 0.6.8

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Release 0.6.8 of the digest package is now on CRAN and will get to Debian shortly.

This release opens the door to also providing the digest functionality at the C level to other R packages. Wush Wu is going to use the murmurHash C implementation in his recently-created FeatureHashing package.

We plan to export the other hashing function as well. Another small change attempts to overcome a build limitation on that other largely-irrelevant-but-still-checked-by-CRAN OS.
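As a quick illustration at the R level (this snippet is mine, not from the release announcement, and assumes the murmur32 algorithm is available in your installed version of digest):

library(digest)
# hash the same object with two different algorithms
digest(mtcars, algo="md5")
digest(mtcars, algo="murmur32")  # murmurHash, the algorithm FeatureHashing uses at the C level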

CRANberries provides the usual summary of changes to the previous version.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


DataVis with Plot.ly (@plotlygraphs) – Meetup Summary

By Eduardo Ariño de la Rubia

(This article was first published on Data Science Los Angeles » R, and kindly contributed to R-bloggers)

It pains me to admit it, but even though I had visited their site, created an account, and played around with their tools, I didn’t really get the value proposition behind Plot.ly. I already use ggplot, bokeh, and d3.js; I already use knitr and IPython notebooks. So why do I need a new way of posting my plots on a web page?

On my own and before the meetup, it made no sense. Why was this company getting such positive press? How on earth could Plot.ly have attracted a customer list including some of the Who’s Who of scientific and research institutions? After the meetup presentation, which covered topics ranging from rich visualization to collaboration architectures, I spoke to the Plot.ly team, and I brought this up… how I had missed the value proposition completely, how their website did a poor job of explaining why I should care. Their response was great…

“We hear this all the time, and we’ve ended up traveling around telling people what we do instead of clearly showing them on the site. Traveling to everyone’s meetups isn’t very scalable, so we have a website redesign in the works.”

It was a very funny, very human moment which conveyed so much about the truth of running a Data Science / Visualization startup in 2014. Our field is complicated, and sometimes effectively telling a story is hard – even if you’re in the business of helping people tell stories.

For me, the Plot.ly meetup was one of the most impactful and interesting meetups of 2014 for DataScience.LA, and it’s a shame that so many of our members missed it. Out of nearly 120 RSVPs, only 30 members made it in person (with nearly 10 still on a waiting list). Sure, it was rainy and cold and on the brink of the holiday season, but I guarantee you that every single person that missed it would have considered the trip worthwhile.

Those in attendance had the privilege of speaking one-on-one with a Plot.ly team that is genuinely breaking new ground in the world of inter-language visualization collaboration. I learned that Plot.ly isn’t just about beautiful, interactive, web based graphics. Plot.ly is about collaboration across data science teams using disparate programming languages. I can build a graph in R using ggplot2, push it to Plot.ly, and my collaborator can download it for use in Python using matplotlib. The folks at Plot.ly have built a generalized vocabulary to describe data visualization across these tools, functioning as universal translators between the major programming languages used in Data Science.

Enjoy the video of Plot.ly’s wonderful, educational, and insightful presentation. If you live in the area, make it a New Year’s resolution to make it to one of our DSLA meetups in 2015… and if you RSVP’d and won’t be able to make it, please go ahead and cancel. One day you may be the one on a meetup waiting list eagerly awaiting your opportunity to attend!


SAS is #1…In Plans to Discontinue Use

By Bob Muenchen

(Figure: SAS attrition plot)

(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

I’ve been tracking The Popularity of Data Analysis Software for many years now, and a clear trend is the decline of the market share of the bigger analytics firms, notably SAS and SPSS. Many people have interpreted my comments as implying the decline in the revenue of those companies. But the fields involved in analytics (statistics, data mining, analytics, data science, etc.) have been exploding in popularity, so having a smaller slice of a much bigger pie still leaves billions in revenue for the big players.

Each year, the Gartner Group, “the world’s leading information technology research and advisory company”, collects data in a survey of the customers of 42 business intelligence firms. They recently released the data on the customers’ plans to discontinue use of their current software in one to three years. The results are shown in the figure below. Over 16% of the SAS Institute customers surveyed reported considering discontinuing their use of the software, the highest of any of the vendors shown. It will be interesting to see if this will actually lead to an eventual decline in revenue. Although I have helped quite a few organizations migrate from SAS to R, I would be surprised to see SAS Institute’s revenue decline. They offer excellent software and service which I still use, though not anywhere near as much as R.

The full Gartner report is available here.


Plot with ggplot2 and plotly within knitr reports

By Plotly

(This article was first published on Plotly, and kindly contributed to R-bloggers)

Plotly is a platform for making, editing, and sharing graphs. If you are used to making plots with ggplot2, you can call ggplotly() to make your plots interactive, web-based, and collaborative. For example, see plot.ly/~marianne2/166, shown below. Notice the hover text!

The “plotly” R package lets you use plotly with R. Want to try it out? Just copy and paste this R code and you can make a web-based, interactive plot with “ggplot2”.

install.packages("devtools")  # so we can install from GitHub
devtools::install_github("ropensci/plotly")  # plotly is part of rOpenSci
 
library(plotly)
 
py <- plotly(username="r_user_guide", key="mw5isa4yqp")  # open plotly connection
 
gg <- ggplot(cars) + geom_point(aes(speed, dist))
 
py$ggplotly(gg)

Plotly lets you embed your interactive plots in iframes, web pages, and RPubs using knitr. In this post, we’ll show you three examples. There are two ways you can embed a Plotly graph in your web reports from R:

  • You can publish your code, data, and interactive plot all in one place;
  • You can make a plot, edit it with your team in the plotly GUI, and embed it in an iframe.

“knitr” is an R package that lets you generate reports dynamically. It is very popular in the R community. If you have never heard of it before, you can check out the official website and/or this minimal tutorial. If you use RStudio, you can create a new R Markdown document directly from the “New File” menu.

Select “Document” and give it a title. Click the “Knit HTML” button above your document and… boom! You get a gorgeous-looking report that combines text, code, and plots.

Edit your .Rmd file as you please. In order to use “plotly” functions in a given code chunk, you will need to add the plotly=TRUE parameter. To embed the plotly plot, you also want to set parameter session="knitr" in the ggplotly() call (the default behaviour is that of an interactive session, which opens a new tab/window in your web browser with the corresponding plotly graph).
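For example, a chunk along these lines should produce an embedded plot in the knitted report (a minimal sketch based on the description above; the chunk label and the credentials are placeholders):

```{r plotly-example, plotly=TRUE}
library(plotly)
library(ggplot2)  # attached explicitly; it may already come in via plotly's dependencies
py <- plotly(username="your_username", key="your_api_key")  # replace with your own credentials
gg <- ggplot(cars) + geom_point(aes(speed, dist))
# session="knitr" embeds the plotly graph in the report instead of opening a browser tab
py$ggplotly(gg, session="knitr")
```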

Click here to view (and copy-paste) our .Rmd file.

Clicking the “Knit HTML” button in RStudio will generate your updated “knitr” report (R code that has not changed is not recomputed; it is cached). Click “Open in Browser” in order to view your report in the web browser and see the embedded Plotly plot.

If you want to make your report public, you can do so via RPubs (click “Publish”/”Republish”). Our example report is published here: http://rpubs.com/plotly/knitr_plotly_example.

The link that says “Play with this data!” will open your plot on plotly’s website (https://plot.ly/~marianne2/164) where you can invite collaborators, and style your plot with our GUI. Viewers will be able to access your data, find code for your graph in Python, MATLAB and other languages, and comment on your work.


Widgets For Christmas

By klr

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

For Christmas, I generally want electronic widgets, but after six months of development, all I wanted this Christmas was htmlwidgets, and Santa RStudio (jj, joe, yihui) and Santa Ramnath delivered early with this RStudio tweet on December 17th:

“htmlwidgets: Bring the best of JavaScript data visualization to R http://t.co/a16qlLxuLz #rstats” — RStudio (@rstudio), December 17, 2014

The major benefit of


The 6th Spanish R Users Conference

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Emilio L. Cano

The VI Spanish R Users Conference took place on October 23 and 24 in Santiago de Compostela (Spain). It was a two-day event with a variety of talks and workshops about the R statistical software and programming language and its applications.

First of all, let me thank all the local organizers, the melisa[1] association and the R-Hispano[2] association, also supported by cntg[3] and cixug[4]. The organizing committee did a great job before and during the event. In my opinion, the conference was a great success, not only from the scientific and technological points of view, but also because of the personal interactions among all the participants.

We also wish to thank the conference sponsors. In addition to Revolution Analytics, which supports http://r-es.org/ activities on a regular basis, the conference was sponsored by melisa (‘Terra de Melide’ free software users association), amtega (agency for technological modernization, Xunta de Galicia), GPUL (Group of Linux Programmers and Users), and AGASOL (Galician free software businesses association).

There were a total of 272 registered participants, 109 of whom were able to follow the conference on-line thanks to the streaming service provided by the organization. Moreover, the recordings of the sessions, including some of the workshops, are already publicly available on the conference program webpage, along with the slides and other resources such as the workshop materials. An abstracts booklet with all the contributions can also be downloaded.

In addition to the typical invited and regular talks one would find at a major R conference, workshops also played a prominent role. Thus, three parallel sessions of workshops were held, covering the following topics: R basics, Bayesian inference, raster data visualization, predictive models, time series visualization, and color perception and visualization with R.

As for the presentations: plenary talks, 15-minute regular talks, and 5-minute express talks were given. In the opening plenary talk, Miguel Á. Rodríguez Muíños gave an overview of graphical user interfaces with R, showing for example how they have deployed a Cardiovascular Risk Calculator at the Galician Health Department that can easily be used by physicians. In the afternoon plenary talks, Carlos Gil Bellosta made us think about models as pets and herds, while Rafael Rodríguez Gayoso showed how they are spreading the use of R, among other free software, in Galicia, for example through the translation of Rcommander into Galician or the hosting of a CRAN mirror.

The remainder of the presentations encompassed topics such as demographics, biostatistics, spatial analysis, and statistical methodology, among others. Some of the speakers illustrated their advances using published packages, others just practical code, and several of them showcased their results with impressive shiny applications. See for example the one by Noema Afonso Casalderrey and Salvador Naya Fernández for their middle-town Galician study, or the two by Luis Mariano Esteban for the HUMS nomogram for organ-confined disease and for growth curves.

Just to mention some of the Spanish R people contributing with published packages at CRAN and other repositories, you can see the slides by Rubén Fernández-Casal (npsp), Inés Garmendia (micromatch), María José Nueda (maSigPro), Manuel Fontenla (optrees), Manuel Oviedo de la Fuente (fda.usc), Emilio Torres-Manzaneda (freqweights), Oscar Perpiñán Lamigueiro (meteoForecast), and Emilio L. Cano (SixSigma).

As you may imagine, all attendees also enjoyed the social events in such a beautiful city, the kind people, and the nice food.

Even though we are still catching our breath, we are already excited about the VII edition, which will take place next year in Salamanca, another historical and charming Spanish city. More details will be provided in due course both in the R-Hispano website and twitter account (@R_Hisp).


Notes:

[1]’Terra de Melide’ free software users association
[2]Spanish association of R users
[3]Galician new technologies center
[4]Galician universities’ free software office


Cluster Analysis of the NFL’s Top Wide Receivers

By Cory Lesmeister

(This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers)

“The time has come to get deeply into football. It is the only thing we have left that ain’t fixed.”
Hunter S. Thompson, Hey Rube Column, November 9, 2004

I have to confess that I haven’t been following the NFL this year as much as planned or hoped. On only 3 or 4 occasions this year have I been able to achieve a catatonic state while watching NFL RedZone. Nonetheless, it is easy to envision how it is all going to end. Manning will throw four picks on a cold snowy day in Foxboro in the AFC Championship game and the Seahawk defense will curb-stomp Aaron Rodgers and capture a consecutive NFC crown. As for the Super Bowl, well who cares other than the fact that we must cheer against the evil Patriot empire, rooting for their humiliating demise. One can simultaneously hate and admire that team. I prefer to do the former publicly and the latter in private.

We all seem to have a handle on the good and bad quarterbacks out there, but what about their wide receivers? With the playoffs at the doorstep and given my ignorance of the situation, I had to bring myself up to speed. This is a great opportunity to do a cluster analysis of the top wide receivers and see who is worth keeping an eye on in the upcoming spectacle.

A good source for interesting statistics and articles on NFL players and teams is http://www.advancedfootballanalytics.com/ . Here we can download the rankings for the top 40 wide receivers based on Win Probability Added or “WPA”. To understand the calculation you should head over to the site and read the details. The site provides a rationale on WPA by saying “WPA has a number of applications. For starters, we can tell which plays were truly critical in each game. From a fan’s perspective, we can call a play the ‘play of the week’ or the ‘play of the year.’ And although we still can’t separate an individual player’s performance from that of his teammates’, we add up the total WPA for plays in which individual players took part. This can help us see who really made the difference when it matters most. It can help tell us who is, or at least appears to be, “clutch.” It can also help inform us who really deserves the player of the week award, the selection to the Pro Bowl, or even induction into the Hall of Fame.”

I put the website’s wide receiver data table in a .csv and we can start the analysis, reading in the file and examining its structure.

> receivers <- read.csv(file.choose())
> str(receivers)
'data.frame':   40 obs. of  19 variables:
 $ Rank     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Player   : Factor w/ 40 levels "10-E.Sanders",..: 16 33 1 23 37 24 36 2 4 13 ...
 $ Team     : Factor w/ 28 levels "ARZ","ATL","BLT",..: 11 23 10 22 9 12 12 28 17 19 ...
 $ G        : int  16 16 16 16 16 16 16 14 14 12 ...
 $ WPA      : num  2.43 2.4 2.33 2.33 2.3 2.27 2.19 1.91 1.89 1.76 ...
 $ EPA      : num  59 95.7 81.3 56.8 78.8 86.2 97.3 54.3 63.6 64.6 ...
 $ WPA_G    : num  0.15 0.15 0.15 0.15 0.14 0.14 0.14 0.14 0.14 0.15 ...
 $ EPA_P    : num  0.38 0.48 0.5 0.38 0.54 0.6 0.58 0.54 0.41 0.43 ...
 $ SR_PERC  : num  55.4 62.8 61.3 52 58.5 59.4 60.5 54.5 58.1 55 ...
 $ YPR      : num  13.4 13.2 13.9 15.5 15 14.1 15.5 20.2 10.6 14.3 ...
 $ Rec      : int  99 129 101 86 88 91 98 52 92 91 ...
 $ Yds      : int  1331 1698 1404 1329 1320 1287 1519 1049 972 1305 ...
 $ RecTD    : int  4 13 9 10 16 12 13 5 4 12 ...
 $ Tgts     : int  143 181 141 144 136 127 151 88 134 130 ...
 $ PER_Tgt  : num  24.2 30.2 23.4 23.5 28.9 23.9 28.4 16.2 22.3 21.7 ...
 $ YPT      : num  9.3 9.4 10 9.2 9.7 10.1 10.1 11.9 7.3 10 ...
 $ C_PERC   : num  69.2 71.3 71.6 59.7 64.7 71.7 64.9 59.1 68.7 70 ...
 $ PERC_DEEP: num  19.6 26.5 36.2 30.6 30.9 23.6 31.8 33 16.4 28.5 ...
 $ playoffs : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 1 2 1 ...

> head(receivers$Player)
[1] 15-G.Tate 84-A.Brown 10-E.Sanders 18-J.Maclin 88-D.Bryant
[6] 18-R.Cobb

Based on WPA, Golden Tate of Detroit is the highest ranked wide receiver; in contrast his highly-regarded teammate is ranked 13th.

We talked about WPA so here is a quick synopsis on the other variables; again, please go to the website for detailed explanations:

  • EPA – Expected Points Added
  • WPA_G – WPA per game
  • EPA_P – Expected Points Added per Play
  • SR_PERC – Success rate: the percentage of plays the receiver was involved in that are considered successful
  • YPR – Yards Per Reception
  • Rec – Total Receptions
  • Yds – Total Reception Yards
  • RecTD – Receiving Touchdowns
  • Tgts – The times a receiver was targeted in the passing game
  • PER_Tgt – Percentage of a team’s passes that were targeted to the receiver
  • YPT – Yards per times targeted by a pass
  • C_PERC – Completion percentage
  • PERC_DEEP – Percent of passes targeted deep
  • playoffs – A factor I coded on whether the receiver’s team is in the playoffs or not

To do hierarchical clustering with this data, we can use the hclust() function available in base R. In preparation for that, we should scale the data and we must create a distance matrix.

> r.df <- receivers[,c(4:18)]
> rownames(r.df) <- receivers[,2]
> scaled <- scale(r.df)
> d <- dist(scaled)

With the data prepared, produce the cluster object and plot it.

> hc <- hclust(d, method="ward.D")
> plot(hc, hang=-1, xlab="", sub="")

This is the standard dendrogram produced with hclust. We now need to select the proper number of clusters and produce a dendrogram that is easier to examine. For this, I found some interesting code to adapt on Gaston Sanchez’s blog: http://gastonsanchez.com/blog/how-to/2012/10/03/Dendrograms.html . Since I am leaning towards 5 clusters, let’s first create a vector of colors. (Note: you can find/search for color codes on colorhexa.com)

> labelColors = c("#FF0000", "#800080", "#0000ff", "#ff8c00", "#013220")

Then use the cutree() function to specify 5 clusters

> clusMember = cutree(hc, 5)

Now, we create a function (courtesy of Gaston) to apply colors to the clusters in the dendrogram.

> colLab <- function(n) {
+   if (is.leaf(n)) {
+     a <- attributes(n)
+     labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
+     attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
+   }
+   n
+ }

Finally, we turn “hc” into a dendrogram object and plot the new results.

> hcd <- as.dendrogram(hc)
> clusDendro = dendrapply(hcd, colLab)
> plot(clusDendro, main = "NFL Receiver Clusters", type = "triangle")

That is much better. For more in-depth analysis you can put the clusters back into the original dataframe.

> receivers$cluster <- as.factor(cutree(hc, 5))

It is now rather interesting to plot the variables by cluster to examine the differences. In the interest of time and space, I present just a boxplot of WPA by cluster.

> boxplot(WPA~cluster, data=receivers, main="Receiver Rank by Cluster")
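To look at the differences across all of the variables at once, a per-cluster table of means is also handy (my suggestion, not from the original post):

#mean of each statistic by cluster
clusterMeans <- aggregate(r.df, by=list(cluster=receivers$cluster), FUN=mean)
print(clusterMeans, digits=3)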

Before moving on, I present this simple table of the clusters of the receivers by playoff qualification. It is interesting to note that cluster 1, with the high WPA, also has 10 of its 13 receivers in the playoffs. One of the things that I think would be worth a look is to adjust wide receiver WPA by some weight based on QB quality. Note that Randall Cobb and Jordy Nelson of the Packers have high WPA, ranked 6th and 7th respectively, but have the privilege of having Rodgers as their QB. Remember, per the quote above, WPA does not have the ability to separate an individual’s success from a teammate’s success. This raises some interesting questions for me that require further inquiry; a rough sketch of one such adjustment follows the table below.

> table(receivers$cluster, receivers$playoff)

      n  y
  1   3 10
  2   4  2
  3   7  4
  4   4  0
  5   3  3
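As a sketch of the QB-quality adjustment idea mentioned above (purely hypothetical: the qbEPA column does not exist in this data set and would have to be merged in from a separate quarterback table):

#hypothetical: suppose each receiver's row carried his quarterback's EPA per play as receivers$qbEPA
#one crude adjustment would scale WPA down for receivers who play with above-average QBs
receivers$adjWPA <- receivers$WPA * mean(receivers$qbEPA) / receivers$qbEPA
boxplot(adjWPA~cluster, data=receivers, main="QB-Adjusted WPA by Cluster")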

In closing the final blog of the year, I must make some predictions for the College Football playoffs. I hate to say it, but I think Alabama will roll over Ohio State. In the Rose Bowl, FSU comes from behind to win…again! I’d really like to see the Ducks win it all, but I just don’t see their defense being of the quality to stop Winston when it counts, which will be in the fourth quarter. ‘Bama has that defense, well, the defensive line and backers anyway. Therefore, I have to give the Crimson Tide the nod in the championship. The news is not all bad. Nebraska finally let go of Bo Pelini. I was ecstatic about his hire, but the paucity of top-notch recruits finally manifested itself with perpetual high-level mediocrity. His best years were with Callahan’s recruits, Ndamukong Suh among many others. They should have hired Paul Johnson from Georgia Tech; at least it would have been fun and somewhat nostalgic to watch Husker football again.

Mahalo,

Cory
