The R Project: 2015 in Review

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s been a banner year for the R project in 2015, with frequent new releases, ever-growing popularity, a flourishing ecosystem, and accolades from both users and press. Here’s a roundup of the big events for R from 2015.

R continues to advance under the new leadership of the R Foundation. There were five updates in 2015: R 3.1.3 in March, R 3.2.0 in April, R 3.2.1 in June, R 3.2.2 in August, and R 3.2.3 in December. That’s an impressive release rate, especially for a project that’s been in active development for 18 years!

R’s popularity continued unabated in 2015. R is the most popular language for data scientists according to the 2015 Rexer survey, and the most popular Predictive Analytics / Data Mining / Data Science software in the KDnuggets software poll. While R’s popularity amongst data scientists is no surprise, R ranked highly even amongst general-purpose programming languages. In July, R placed #6 in the IEEE list of top programming languages, rising 3 places from its 2014 ranking. It also continues to rank highly amongst StackOverflow users, where it is the 8th most popular language by activity, and the fastest-growing language by number of questions. R was also a top-ranked language on GitHub in 2015.

The R Consortium, a trade group dedicated to the support and growth of the R community, was founded in June. Already, the group has published best practices for secure use of R, and formed the Infrastructure Steering Committee to fund and oversee community projects. Its first project (a hub for R package developers) was funded in November, and proposals are being accepted for future projects.

2015 was the year that Microsoft put its weight behind R, beginning with the acquisition of Revolution Analytics in April and prominent R announcements at the BUILD Conference in May. Microsoft continues the steady pace of open-source R project releases, with regular updates to Revolution R Open, DeployR Open and the foreach and checkpoint packages. Revolution R Enterprise saw updates, and new releases of several Microsoft platforms have integrated R, including SQL Server 2016, Cortana Analytics, PowerBI, Azure and the Data Science Virtual Machine.

Activity within local R user groups accelerated in 2015, with 18 new groups founded for a total of 174. Microsoft expanded its R user group sponsorship with the Microsoft Data Science User Group Program. Community conferences also boasted record attendance, including at useR! 2015, R/Finance, EARL Boston, and EARL London. Meanwhile, companies including Betterment, Zillow, Buzzfeed, the New York Times and many others shared how they benefit from R.

R also got some great coverage in the media this year, with features in Priceonomics, TechCrunch, Nature, Inside BigData, Mashable, The Economist, opensource.com and many other publications.

That’s a pretty big year … and we expect even more from R in 2016. A big thanks go out to everyone in the R community, and especially the R Core group, for making R the standout success it is today. Happy New Year!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Wind Resource Assessment

By Fabio Veronesi

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

This is an article we recently published in “Renewable and Sustainable Energy Reviews”. It starts with a thorough review of the methods used for wind resource assessment: from algorithms based on physical laws to others based on statistics, plus mixed methods.
In the second part of the manuscript we present a method for wind resource assessment based on the application of Random Forest, coded completely in R.
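To give a flavour of the second part, the core of a random forest wind model in R takes only a few lines. This is a minimal illustrative sketch with made-up variable and data names, not the actual model from the paper:

library(randomForest)

# stations: measured mean wind speed plus environmental covariates at each station
rf <- randomForest(wind_speed ~ elevation + roughness + dist_coast,
                   data = stations, ntree = 500)

# predict wind speed continuously over a grid covering the study area
grid$wind_speed_pred <- predict(rf, newdata = grid)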

Elsevier allows you to download the full paper for free until the 12th of February, so if you are interested please download a copy.
This is the link: http://authors.elsevier.com/a/1SG5a4s9HvhNZ6

Below is the abstract.

Abstract

Wind resource assessment is fundamental when selecting a site for wind energy projects. Wind is influenced by several environmental factors and understanding its spatial variability is key in determining the economic viability of a site. Numerical wind flow models, which solve physical equations that govern air flows, are the industry standard for wind resource assessment. These methods have been proven over the years to be able to estimate the wind resource with a relatively high accuracy. However, measuring stations, which provide the starting data for every wind estimation, are often located at some distance from each other, in some cases tens of kilometres or more. This adds an unavoidable amount of uncertainty to the estimations, which can be difficult and time consuming to calculate with numerical wind flow models. For this reason, even though there are ways of computing the overall error of the estimations, methods based on physics fail to provide planners with detailed spatial representations of the uncertainty pattern. In this paper we introduce a statistical method for estimating the wind resource, based on statistical learning. In particular, we present an approach based on ensembles of regression trees, to estimate the wind speed and direction distributions continuously over the United Kingdom (UK), and provide planners with a detailed account of the spatial pattern of the wind map uncertainty.

To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

7 new R jobs from around the world (2015-12-31)

By Tal Galili

This is the bi-monthly R-bloggers post (for 2015-12-31) for new R Jobs.

To post your R job on the next post

Just visit this link and post a new R job to the R community (it’s free and quick).

New R jobs

  1. Part-Time
    Content Development Intern ($20/hour) @ Cambridge, Massachusetts, U.S.
    DataCamp – Posted by nickc123
    Cambridge, Massachusetts, United States
    22 Dec 2015
  2. Full-Time
    Data Scientist @ Billerica, Massachusetts, U.S.
    MilliporeSigma, Inc. – Posted by andreaduda
    Billerica, Massachusetts, United States
    22 Dec 2015
  3. Freelance
    Data Science Course Mentor – Remote/Flexible
    Springboard – Posted by Parul Gupta
    Anywhere
    21 Dec 2015
  4. Full-Time
    Computational Analyst / Bioinformatician @ Cambridge, MA, U.S.
    Boston Children’s Hospital, Dana-Farber Cancer Institute, Broad Institute – Posted by julirsch
    Cambridge, Massachusetts, United States
    20 Dec 2015
  5. Freelance
    Consultant/Tutor for R and SQL
    Logistics Capital & Strategy – Posted by Edoody
    Anywhere
    19 Dec 2015
  6. Full-Time
    Data Scientist – Predictive Analyst @ Harrisburg, Pennsylvania, US
    Manada Technology LLC – Posted by manadatechnology
    Harrisburg, Pennsylvania, United States
    18 Dec 2015
  7. Freelance
    Big Data in Digital Health (5-10 hours per week)
    MedStar Institute for Innovation – Posted by Praxiteles
    Anywhere
    18 Dec 2015

Job seekers: please follow the links above to learn more and apply for your job of interest.

(On R-users.com you may see all the R jobs that are currently available.)

(You may also look at previous R jobs posts.)

Source:: R News

RcmdrPlugin.KMggplot2 0.2-3 now on CRAN

By triadsou

(This article was first published on Triad sou., and kindly contributed to R-bloggers)

New version (v0.2-3) of RcmdrPlugin.KMggplot2 (an Rcmdr plug-in; a GUI for ggplot2) has been released.

NEWS

Changes in version 0.2-3 (2015-12-30)
  • New geom_stepribbon().
  • Pointwise confidence intervals of Kaplan-Meier plots with band (Thanks to Dr. M. Felix Freshwater).
  • Notched box plots (Thanks to Dr. M. Felix Freshwater).
  • Stratified (colored) box plots (Thanks to Dr. M. Felix Freshwater).
Changes in version 0.2-2 (2015-12-22)
  • New ggplot2 themes (theme_linedraw, theme_light).
  • New ggthemes themes (theme_base, theme_par).
  • Fixed a bug caused by the new ggplot2 themes.
  • Fixed a bug related to ggthemes 3.0.0.
  • Fixed a bug related to Windows fonts.
  • Fixed a bug related to file saving.
  • Added a vignette on dataset requirements for making a Kaplan-Meier plot.
  • Added a vignette for extrafont.

geom_stepribbon()

geom_stepribbon() is an extension of geom_ribbon(), optimized for Kaplan-Meier plots with pointwise confidence intervals or a confidence band.

Usage
geom_stepribbon(mapping = NULL, data = NULL, stat = "identity",
  position = "identity", na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE, kmplot = FALSE, ...)

The additional argument

  • kmplot
    • If TRUE, missing values are replaced by the previous values. This option is needed to make Kaplan-Meier plots if the last observation has an event, in which case the upper and lower values of the last observation are missing. This processing is optimized for results from the survfit function.

Other arguments are the same as for geom_ribbon().

Example
require("ggplot2")
data(dataKm, package = "RcmdrPlugin.KMggplot2")

.df <- na.omit(data.frame(x = dataKm$time, y = dataKm$event, z = dataKm$trt))
.df <- .df[do.call(order, .df[, c("z", "x"), drop = FALSE]), , drop = FALSE]
.fit <- survival::survfit(
  survival::Surv(time = x, event = y, type = "right") ~ z, .df)
.fit <- data.frame(x = .fit$time, y = .fit$surv, nrisk = .fit$n.risk, 
  nevent = .fit$n.event, ncensor= .fit$n.censor, upper = .fit$upper,
  lower = .fit$lower)
.df <- .df[!duplicated(.df[,c("x", "z")]), ]
.df <- .fit <- data.frame(.fit, .df[, c("z"), drop = FALSE])
.df <- .fit <- rbind(unique(data.frame(x = 0, y = 1, nrisk = NA, nevent = NA,
  ncensor = NA, upper = 1, lower = 1, .df[, c("z"), drop = FALSE])), .fit)
.cens <- subset(.fit, ncensor == 1)

ggplot(data = .fit, aes(x = x, y = y, colour = z)) + 
  RcmdrPlugin.KMggplot2::geom_stepribbon(data = .fit,
    aes(x = x, ymin = lower, ymax = upper, fill = z), alpha = 0.25,
    colour = "transparent", show.legend = FALSE, kmplot = TRUE) +
  geom_step(size = 1.5) +
  geom_linerange(data = .cens, aes(x = x, ymin = y, ymax = y + 0.02),
    size = 1.5) + 
  scale_x_continuous(breaks = seq(0, 21, by = 7), limits = c(0, 21)) + 
  scale_y_continuous(limits = c(0, 1), expand = c(0.01, 0)) + 
  scale_colour_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  xlab("Time from entry") +
  ylab("Proportion of survival") +
  labs(colour = "trt") +
  theme_bw(base_size = 14, base_family = "sans") + 
  theme(legend.position = "right")

Pointwise confidence intervals of Kaplan-Meier plots with band

(figure: Kaplan-Meier plot with pointwise confidence intervals drawn as a band)

Notched box plots and stratified (colored) box plots

(figures: notched box plot and stratified, colored box plot examples)

To leave a comment for the author, please follow the link and comment on their blog: Triad sou..

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using R to analyse MAN AHL Trend

By Thomas Huben

(This article was first published on Copula, and kindly contributed to R-bloggers)

Let’s use the great PerformanceAnalytics package to get some insight into the risk profile of the MAN AHL Trend fund. It’s a program with a long track record – dating back, I believe, to the late ’80s. The UCITS fund NAV data can be downloaded from the fund’s webpage as an .xls file, starting in 2009.

First let’s import the data into R. I’m using a small function to import the .csv file, which returns an .xts object named ahl.

# Monthly NAV of MAN AHL Trend
library(xts)

loadahl <- function() {
  a <- read.table("ahl_trend.csv", sep = ",", dec = ",")
  # rebuild the date string as dd-mm-yyyy
  a$date <- paste(substr(a$V1, 1, 2), substr(a$V1, 4, 5), substr(a$V1, 7, 10), sep = "-")
  ahl <- cbind(a$date, substr(a$V2, 1, 5))
  dates <- as.POSIXct(ahl[, 1], format = "%d-%m-%Y")
  ahl <- as.xts(as.numeric(ahl[, 2]), order.by = dates)
  return(ahl)
}

Next we would like to have the monthly returns. quantmod’s monthlyReturn() has the signature

monthlyReturn(x, subset = NULL, type = 'arithmetic',
  leading = TRUE, ...)

and we store its result in retahl:

library(quantmod)               # monthlyReturn()
library(PerformanceAnalytics)   # chart.Drawdown(), table.* functions
retahl <- monthlyReturn(ahl, type = "log")

Next, I usually plot chart.Drawdown to get a visual idea of whether the product is designed for my risk appetite.

chart.Drawdown(retahl)

table.AnnualizedReturns(retahl)

                          monthly.returns
Annualized Return                  0.0212
Annualized Std Dev                 0.1246
Annualized Sharpe (Rf=0%)          0.1702

table.DownsideRisk(retahl)

                             monthly.returns
Semi Deviation                        0.0254
Gain Deviation                        0.0222
Loss Deviation                        0.0222
Downside Deviation (MAR=10%)          0.0289
Downside Deviation (Rf=0%)            0.0241
Downside Deviation (0%)               0.0241
Maximum Drawdown                      0.2478
Historical VaR (95%)                 -0.0521
Historical ES (95%)                  -0.0748
Modified VaR (95%)                   -0.0573
Modified ES (95%)                    -0.0730

To leave a comment for the author, please follow the link and comment on their blog: Copula.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

bayes.js: A Small Library for Doing MCMC in the Browser

By Rasmus Bååth

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

Bayesian data analysis is cool, Markov chain Monte Carlo is the cool technique that makes Bayesian data analysis possible, and wouldn’t it be cool if you could do all of this in the browser? That was what I thought, at least, and I’ve now made bayes.js: a small JavaScript library that implements an adaptive MCMC sampler and a couple of probability distributions, and that makes it relatively easy to implement simple Bayesian models in JavaScript.

Here is a motivating example: Say that you have the heights of the last ten American presidents…

// The heights of the last ten American presidents in cm, from Kennedy to Obama 
var heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185];

… and that you would like to fit a Bayesian model assuming a Normal distribution to this data. Well, you can do that right now by clicking “Start sampling” below! This will run an MCMC sampler in your browser implemented in JavaScript.



The model is defined in two parts: a list of parameter definitions and a function that computes the log posterior.

var params = {
  mu:    {type: "real"},
  sigma: {type: "real", lower: 0}};

var log_post = function(state, heights) {
  var log_post = 0;
  // Priors (here sloppy and vague...)
  log_post += ld.norm(state.mu, 0, 1000);
  log_post += ld.unif(state.sigma, 0, 1000);
  // Likelihood
  for(var i = 0; i < heights.length; i++) {
    log_post += ld.norm(heights[i], state.mu, state.sigma);
  }
  return log_post;
};
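With the data, parameter definitions and log posterior in place, running the sampler takes a few more lines. This is a sketch following the bayes.js README, which exposes the adaptive sampler as mcmc.AmwgSampler (adaptive Metropolis-within-Gibbs):

// create a sampler for the model defined above
var sampler = new mcmc.AmwgSampler(params, log_post, heights);
sampler.burn(1000);                  // warm-up: adapt the proposals, discard the draws
var samples = sampler.sample(5000);  // e.g. {mu: [...], sigma: [...]}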

Source:: R News

Write in-line equations in your Shiny application with MathJax

By Ken Kleinman

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

I’ve been working on a Shiny app and wanted to display some math equations. It’s possible to use LaTeX to show math using MathJax, as shown in this example from the makers of Shiny. However, by default, MathJax does not allow in-line equations, because the dollar sign is used so frequently. But I needed to use in-line math in my application. Fortunately, the folks who make MathJax show how to enable the in-line equation mode, and the Shiny documentation shows how to write raw HTML. Here’s how to do it.

R

Here I replicated the code from the official Shiny example linked above. The magic code is inserted into ui.R, just below withMathJax().
## ui.R

library(shiny)

shinyUI(fluidPage(
  title = 'MathJax Examples with in-line equations',
  withMathJax(),
  # section below allows in-line LaTeX via $ in mathjax.
  tags$div(HTML("<script type='text/x-mathjax-config'>
                MathJax.Hub.Config({
                  tex2jax: {inlineMath: [['$','$']]}
                });
                </script>
                ")),
  helpText('An irrational number $\\sqrt{2}$
           and a fraction $1-\\frac{1}{2}$'),
  helpText('and a fact about $\\pi$: $\\frac2\\pi = \\frac{\\sqrt2}2 \\cdot
           \\frac{\\sqrt{2+\\sqrt2}}2 \\cdot
           \\frac{\\sqrt{2+\\sqrt{2+\\sqrt2}}}2 \\cdots$'),
  uiOutput('ex1'),
  uiOutput('ex2'),
  uiOutput('ex3'),
  uiOutput('ex4'),
  checkboxInput('ex5_visible', 'Show Example 5', FALSE),
  uiOutput('ex5')
))



## server.R

library(shiny)

shinyServer(function(input, output, session) {
  output$ex1 <- renderUI({
    withMathJax(helpText('Dynamic output 1: $\\alpha^2$'))
  })
  output$ex2 <- renderUI({
    withMathJax(
      helpText('and output 2 $3^2+4^2=5^2$'),
      helpText('and output 3 $\\sin^2(\\theta)+\\cos^2(\\theta)=1$')
    )
  })
  output$ex3 <- renderUI({
    withMathJax(
      helpText('The busy Cauchy distribution
               $\\frac{1}{\\pi\\gamma\\,\\left[1 +
               \\left(\\frac{x-x_0}{\\gamma}\\right)^2\\right]}\\!$'))
  })
  output$ex4 <- renderUI({
    invalidateLater(5000, session)
    x <- round(rcauchy(1), 3)
    withMathJax(sprintf("If $X$ is a Cauchy random variable, then
                        $P(X \\leq %.03f) = %.03f$", x, pcauchy(x)))
  })
  output$ex5 <- renderUI({
    if (!input$ex5_visible) return()
    withMathJax(
      helpText('You do not see me initially: $e^{i\\pi} + 1 = 0$')
    )
  })
})

Give it a try (or check out the Shiny app at https://r.amherst.edu/apps/nhorton/mathjax/)! One caveat is that the other means of in-line display, as shown in the official example, doesn’t work when the MathJax HTML is inserted as above.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on their blog: SAS and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News


Our R package roundup

By Christoph Safferling

gif

(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

A year in review

It’s the time of the year again when one eats too much and gets into a reflective mood! 2015 is nearly over, and we bloggers here at opiateforthemass.es thought it would be nice to argue endlessly about which R package was the best/neatest/most fun/most useful/most whatever of this year!

Since we are in a festive mood, we decided we would not fight it out but rather present our top five new R packages: a purely subjective list of packages we (and Chuck Norris) approve of.

But do not despair, dear reader! We have also pulled hard data on R package popularity from CRAN, and will present this first.

Top Popular CRAN packages

Let’s start with some factual data before we go into our personal favourites of 2015. We’ll pull the titles of the new 2015 R packages from cranberries, and parse the CRAN downloads per day using the cranlogs package.

Using downloads per day as a ranking metric could have the problem that earlier package releases have had more time to create a buzz and shift up the average downloads per day, skewing the data in favour of older releases. Or, it could have the complication that younger package releases are still on the early “hump” part of the downloads (let’s assume they’ll follow a log-normal (exponential decay) distribution, which most of these things do), thus skewing the data in favour of younger releases. I don’t know, and this is an interesting question I think we’ll tackle in a later blog post…

For now, let’s just assume that average downloads per day is a relatively stable metric to gauge package success with. We’ll grab the packages released using rvest:

library(rvest)  # read_html(), html_nodes(); also re-exports the %>% pipe

berries <- read_html("http://dirk.eddelbuettel.com/cranberries/2015/")
titles <- berries %>% html_nodes("b") %>% html_text
new <- titles[grepl("^New package", titles)] %>% 
  gsub("^New package (.*) with initial .*", "\\1", .) %>% unique

and then lapply() over these titles, querying CRAN for each package’s average downloads per day:

library(cranlogs)  # cran_downloads()
library(pbapply)   # pblapply(), an lapply() with a progress bar

logs <- pblapply(new, function(x) {
  down <- cran_downloads(x, from = "2015-01-01")$count
  if (sum(down) > 0) {
    # drop the leading zero-download days before the package went public
    public <- down[which(down > 0)[1]:length(down)]
  } else {
    public <- 0
  }
  return(data.frame(package = x, sum = sum(down), avg = mean(public)))
})

logs <- do.call(rbind, logs)

With some quick dplyr and ggplot magic, these are the top 20 new CRAN packages from 2015, by average number of daily downloads:

top 20 new CRAN packages in 2015

The full code is available on github, of course.
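A minimal sketch of the ranking and plotting step, assuming the logs data frame built above:

library(dplyr)
library(ggplot2)

# keep the 20 packages with the highest average daily downloads
top20 <- logs %>%
  arrange(desc(avg)) %>%
  slice(1:20)

# horizontal bar chart, most downloaded package on top
ggplot(top20, aes(x = reorder(package, avg), y = avg)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "average downloads per day")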

As we can see, the main bias does not come from our choice of ranking metric, but from the fact that some packages are more “under the hood” and are pulled in by many packages as dependencies, thus inflating their download statistics.

The top four packages (rversions, xml2, git2r, praise) are all technical packages. Although I have to say I did not know of praise before, and it looks like a very fun package indeed: you can automatically add randomly generated praise to your output! Fun times ahead, I’d say.

Excluding these, the clear winners among “frontline” packages are readxl and readr, both packages by Hadley Wickham dealing with importing data into R. Well-deserved, in our opinion. These are packages nearly everybody working with data will need on a daily basis. Although one hopes that contact with Excel sheets is kept to a minimum to ensure one’s sanity, so that readxl is needed less often in daily life!

The next two packages (DiagrammeR and visNetwork) relate to network diagrams, something that seems to be en vogue currently. R is getting some much-needed features in this area, it seems.

plotly is the R package for the recently open-sourced, popular plot.ly JavaScript library for interactive charts. A well-deserved top-ranking entry! We also see packages that build on and improve the ever-popular shiny package (DT and shinydashboard), leaflet dealing with interactive mapping, and packages around Stan, the Bayesian statistical inference language (rstan, StanHeaders).

But now, this blog’s authors’ personal top five of new R packages for 2015:

readr

(safferli’s pick)

readr is our package pick that also made it into the top downloads metric above. Small wonder, as it’s written by Hadley and aims to make importing data easier and, above all, more consistent. It is thus immediately useful for most, if not all, R users out there, and it also received a tremendous “fame kickstart” from Hadley’s reputation within the R community. For extremely large datasets I still like to use data.table’s fread() function, but for anything else the new read_* functions make your life considerably easier. They’re faster than base R, and no longer having to worry about stringsAsFactors alone is a godsend.

Since the package is written by Hadley, it is not only great but also comes with fantastic documentation. If you’re not using readr yet, you should head over to the package readme and check it out.
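A minimal sketch of what this buys you (the file and column names here are made up):

library(readr)

# read_csv() is fast, never converts strings to factors,
# and reports the column types it guessed
flights <- read_csv("flights.csv")

# column types can also be declared explicitly
flights <- read_csv("flights.csv",
                    col_types = cols(carrier   = col_character(),
                                     dep_delay = col_integer()))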

infuser

(Yuki’s pick)

R already has many template engines, but this one is simple yet quite useful if you do data exploration, visualization, and statistics in R and then deploy your findings in Python, reusing the same SQL queries with syntax that stays as similar as possible.

With infuser, the transition of code from R to Python is now quick and easy, like this:

# R
library(infuser)
template <- "SELECT {{var}} FROM {{table}} WHERE month = {{month}}"
query <- infuse(template,var="apple",table="fruits",month=12)
cat(query)
# SELECT apple FROM fruits WHERE month = 12
# Python
template = "SELECT {var} FROM {table} WHERE month = {month}"
query = template.format(var="apple",table="fruits",month=12)
print(query)
# SELECT apple FROM fruits WHERE month = 12

googlesheets

(Kirill’s pick)

googlesheets by Jennifer Bryan finally allows me to write output directly to Google Sheets, instead of writing it to xlsx format and then pushing it (mostly manually) to Google Drive. At our company we use Google Drive as a data communication and storage tool for the management, so outputting data science results to Google Sheets is important. We even have some small reports stored in Google Sheets. The package allows for easy creating, finding, filling, and reading of Google Sheets, with incredible simplicity of use.
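A rough sketch of the workflow, following the function names in the package readme (the sheet title here is made up):

library(googlesheets)

gs_ls()                                              # list the sheets your Google account can see
report <- gs_new("ds_report", input = head(mtcars))  # create a new sheet and fill it
report <- gs_title("ds_report")                      # ...or register an existing sheet by title
gs_read(report)                                      # read it back into R as a data frame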

AnomalyDetection

(Kirill’s second pick. He gets to pick two since he is so indecisive)

AnomalyDetection was developed by Twitter’s data scientists and introduced to the open-source community in the first week of the year. A very handy, beautiful, well-developed tool for finding anomalies in data. It is very important for a data scientist to be able to find anomalies in the data quickly and reliably, before real damage occurs. The package lets you get a good first impression of what is going on in your KPIs (Key Performance Indicators) and react quickly. Building alerts with it is a no-brainer if you want to monitor your data and assure data quality.
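A minimal sketch, following the package readme (raw_data is the example time series shipped with the package):

# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

data(raw_data)  # example data: a timestamp column and a count column
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = "both", plot = TRUE)
res$anoms  # data frame of the detected anomalies
res$plot   # plot of the series with anomalies highlighted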

emoGG

(Jess’s pick)

emoGG definitely falls into the “most whatever” category of R packages this year. What this package does is fairly simple: it allows you to display emojis in your ggplot2 plots, either as plotting symbols or as a background. Under the hood, it adds a geom_emoji layer to your ggplot2 plots, in which you have to specify one or more emoji codes corresponding to the emojis you wish to plot. emoGG can be used to make visualisations more compelling and help plots convey more meaning, no doubt. But before anything else, it’s fun and a must-have for an avid emoji fan like me.
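As a taste, a minimal sketch following the package readme (“1f337” is the emoji code for a tulip):

# devtools::install_github("dill/emoGG")
library(ggplot2)
library(emoGG)

# scatter plot of the iris data, with each point drawn as a tulip
ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_emoji(emoji = "1f337")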

Our R package roundup was originally published by Kirill Pomogajko at Opiate for the masses on December 30, 2015.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using segmented regression to analyse world record running times

By Andrie de Vries

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

A week ago my high school friend, @XLRunner, sent me a link to the article “How Zach Bitter Ran 100 Miles in Less Than 12 Hours“. Zach’s effort was rewarded with the American record for the 100 mile event.

Zach Bitter holds the American record for the 100 mile

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Association of Athletics Federations (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). The distances longer than 42km do not fall in the IAAF event list, but they are tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times also at Wikipedia.

I extracted only the men’s running events from these lists, and used R to plot the average running speeds for these records:

You can immediately see that the speed declines very rapidly after the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events – where “standard” means the type of distance you would see regularly at a world championship or the Olympic Games. Thus the mile is “standard”, but the 2,000m race is not.

(figure: average world record speeds on a logarithmic distance scale)

Now our data points are in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

  • The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (and the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
  • And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the speed for the 100m is actually slower than for the 200m. This reflects the transition effect of a standing start – clearly it plays a large role in the very short sprint distances.

Subsetting the data

For the analysis below, I excluded the data for:

  • The 100m sprint (transition effects play too large a role).
  • The ultramarathon distances (they get raced less frequently, so something strange seems to be happening in the data, for the 50km race in particular).

Using the segmented package

To fit a regression line with kinks, more properly known as a segmented regression (or sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The segmented() function allows you to modify a fitted object of class lm or glm, specifying which of the independent variables should have segments (kinks). In my case, I fitted a linear model with a single variable (log of distance), and allowed segmented() to find a single kink point.
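In code, the whole fit takes only a couple of lines. This is a sketch, assuming the records sit in a (hypothetical) data frame records with columns logDistance and logSpeed:

library(segmented)

# ordinary linear model of log speed on log distance
lfit <- lm(logSpeed ~ logDistance, data = records)

# let segmented() estimate a single break-point in logDistance
sfit <- segmented(lfit, seg.Z = ~logDistance)
summary(sfit)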

My analysis indicates that there is a kink point at 1.13km (10^0.055 = 1.13), i.e. between the 800m event and the 1,000m event.

> summary(sfit)

	***Regression Model with Segmented Relationship(s)***

Call:
segmented.lm(obj = lfit, seg.Z = ~logDistance)

Estimated Break-Point(s):
   Est. St.Err
  0.055  0.021

Meaningful coefficients of the linear terms:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      27.2064     0.1755  155.04  < 2e-16 ***
logDistance     -15.1305     0.4332  -34.93 1.94e-13 ***
U1.logDistance   11.2046     0.4536   24.70       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2373 on 12 degrees of freedom
Multiple R-Squared: 0.9981,  Adjusted R-squared: 0.9976

Convergence attained in 4 iterations with relative change -4.922372e-16

The final plot shows the same data, but this time with the segmented regression line also displayed.

(figure: world record speeds with the fitted segmented regression line)

Conclusion

I conclude:

  1. It is really easy to fit a segmented linear regression model using the segmented package
  2. There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression finds this kink point between the 800m event and the 1,000m event
  3. The ultramarathon distances have a completely different dynamic. However, it’s not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these “non-standard” events.
  4. The 50km world record seems too “slow”. Perhaps the competition for this event is less intense than for the marathon?

Dennis Kimetto holds the world record for the marathon

The code

Here is my code for the analysis:

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News