## likelihood-free inference in high-dimensional models

By xi’an

(This article was first published on Xi’an’s Og » R, and kindly contributed to R-bloggers)

“…for a general linear model (GLM), a single linear function is a sufficient statistic for each associated parameter…”

The recently arXived paper “Likelihood-free inference in high-dimensional models“, by Kousathanas et al. (July 2015), proposes an ABC resolution of the dimensionality curse [when the dimension of the parameter and of the corresponding summary statistics] by turning Gibbs-like and by using a component-by-component ABC-MCMC update that allows for low dimensional statistics. In the (rare) event there exists a conditional sufficient statistic for each component of the parameter vector, the approach is just as justified as when using a generic ABC-Gibbs method based on the whole data. Otherwise, that is, when using a non-sufficient estimator of the corresponding component (as, e.g., in a generalised [not general!] linear model), the approach is less coherent as there is no joint target associated with the Gibbs moves. One may therefore wonder at the convergence properties of the resulting algorithm. The only safe case [in dimension 2] is when one of the restricted conditionals does not depend on the other parameter. Note also that each Gibbs step a priori requires the simulation of a new pseudo-dataset, which may be a major imposition on computing time. And that setting the tolerance for each parameter is a delicate calibration issue because in principle the tolerance should depend on the other component values. I ran a comparative experiment on a toy normal target, using either empirical mean and variance (blue), or empirical median and mad (brown), with little difference in the (above) output. Especially when considering that I set the tolerance somewhat arbitrarily. This could be due to the fact that the pairs are quite similar in terms of their estimation properties. However, I then realised that the empirical variance is not sufficient for the variance conditional on the mean parameter. I looked at the limiting case (with zero tolerance), which amounts to simulating σ first and then μ given σ, and ran a (Gibbs and iid) simulation. The difference, as displayed below (red standing for the exact ABC case), is not enormous, even though it produces a fatter tail in μ. Note the interesting feature that I cannot produce the posterior distribution of the parameters given the median and mad statistics. Which is a great introduction to ABC!

R code follows:

```N=10
data=rnorm(N,mean=3,sd=.5)

#ABC with insufficient statistics
medata=median(data)

for (i in 1:100)

T=1e6
mu=rep(median(data),T)
for (t in 2:T){
mu[t]=rnorm(1)
psudo=rnorm(N,mean=mu[t],sd=sig[t-1])
if (abs(medata-median(psudo))>tol[1])
mu[t]=mu[t-1]

sig[t]=1/rexp(1)
psudo=rnorm(N,mean=mu[t],sd=sig[t])
sig[t]=sig[t-1]
}
#ABC with more sufficient statistics
meaata=mean(data)
sddata=sd(data)

for (i in 1:100)

for (t in 2:T){
mu[t]=rnorm(1)
psudo=rnorm(N,mean=mu[t],sd=sig[t-1])
if (abs(meaata-mean(psudo))>tol[1])
mu[t]=mu[t-1]

sig[t]=1/rexp(1)
psudo=rnorm(N,mean=mu[t],sd=sig[t])
if (abs(sddata-sd(psudo))>tol[2])
sig[t]=sig[t-1]
}

#MCMC with false sufficient
sig=1/sqrt(rgamma(T,shape=.5*N,rate=1+.5*var(data)))
mu=rnorm(T,mean(data)/(1+sig^2/N),sd=1/sqrt(1+N/sig^2)))
```

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## A better interactive neuroimage plotter in R

(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

In a previous post, I described how you can interactively explore a 3D `nifti` object in R. I used the `manipulate` package, but the overall results were sluggish and not really usable.

I was introduced to a a good neuroimaging viewer called Mango, by a friend or two and use it somewhat inconsistently. One major advantage of this software is that it has been converted to a pure JavaScript library called Papaya. As such, you can create simple HTML files that have an embedded interactive image viewer.

# Papayar

That’s all fine and good, but I like my things in `R`. I understand many can easily write bash scripts that perform the same operations and use the HTML builder provided by papaya.

I want the operation in R for the same reasons I make many things for R:

1. I use R
2. Many statisticians like imaging but need tools they understand and they understand R
3. I like writing pipelines and scripts in one language

My answer: the `papayar` package.

### Install papayar

To install `papayar`, you can simply install from GitHub using the `devtools` package.

```require(devtools)
devtools::install_github("muschellij2/papayar")
```

## Papayar functions

The main function is `papaya`. Let’s look at the arguments:

```library(papayar)
formalArgs(papaya)
```
```[1] "images" "outdir" "..."
```

The `images` argument can be a list of `nifti` objects or a vector of character filenames. The `outdir` is where the HTML file is written. The default is to a temporary directory that will be trashed after the session is over. The additional arguments are passed to the lower-level function `pass_papaya`, which in turn are passed to functions `httd` and `daemon_stop` in the `servr` package. The `pass_papaya` function is useful, however, to open a blank papaya session by just invoking `pass_papaya()`

## Papayar Example

As the `httd` function starts a server, the images can be rendered (and will be by default) in the RStudio viewer! In the terminal, it opens your default web browser. Here’s a basic example:

```library(oro.nifti)
x = nifti(img = array(rnorm(100^3), dim= rep(100, 3)), dim=rep(100, 3), datatype=16)
y = nifti(img = array(rbinom(100^3, prob = 0.5, size = 10), dim= rep(100, 3)), dim=rep(100, 3), datatype=16)
index.file = papaya(list(x, y))
```

The first 3 lines make some random arrays, from a normal and binomial distribution and puts them into a `nifti` object. The list of these `nifti` objects is passed in. The first image is displayed in grayscale and the second image is overlaid using red-hot colors and the opacity of this image can be changed. The object `index.file` will be a character filename where the HTML file is stored. The data and this HTML file is written to `outdir` (which, again, is a temporary directory by default).

### Output

Below is a series of screen shots I took from the code above. You should be able to see this in RStudio or your browser:

The main reason to use this is that you can click different areas for the crosshairs and move to a different point in axial, coronal, and sagittal space. Thus, this is truly interactive.

Here we can show there are limited (but useful) controls for the overlay. You can change the mapping of the values in the image and the overlay and the opacity of the overlay.

## Brain Example

The above data has been used since everyone could test it, but it’s just random numbers and doesn’t look very compelling. Here I will show you the hyperintense voxels overlaid on the MNI 152 1mm T1 brain image click here for description, which correspond mainly to the white matter:

Hopefully you can see how this can be useful for exploring data and results.

# ITK-SNAP and itksnapr

Some of my colleagues are more partial to using ITK-SNAP for viewing images interactively. I have bundled the executables for ITK-SNAP into the R package `itksnapr`. The main function is `itksnap`, which you can specify images to different options to ITK-SNAP.

### Install itksnapr

To install `itksnapr`, you can simply install from GitHub using the `devtools` package.

```require(devtools)
devtools::install_github("muschellij2/itsnapr")
```
```library(itksnapr)
itksnap(grayscale = x, overlay = y)
```

I haven’t used ITK-SNAP much, but hear many good things about it. There are many better resources than this blog on how to use it and how to use it well. If interested in a good image viewer, I implore you to google around for some of these resources. If anyone has intense interest of image viewers and wants added functionality, don’t hesitate to file an inssue on GitHub.

# FSLView

Although it was included in my `fslr` package by default and I never discussed it in detail, FSLView is included with the distribution of FSL and is a viewer I use heavily. The `fslr` function is `fslview`. One specific advantage of using FSLView is that it can pass through X11 forwarding, so you can remotely view image from a cluster, though it may be slow.

# Conclusion

Although I use the `orthographic`,`image.nifti` and `overlay` functions from `oro.nifti` for many of my figures, for interactive exploring of results, these can be somewhat slow for getting a large-scale view, and not a specific slice view. Therefore, a fully interactive neuroimaging plotter is necessary. Here are 3 options that can be accessed “within” `R`.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## RTutor: How Soap Operas Reduced Fertility in Brazil

(This article was first published on Economics and R – R posts, and kindly contributed to R-bloggers)

What is the real world impact of tv series? Did Brazilian women get fewer children because they watched soap operas that portray happy, rich families that have few children?

Clara Ulmer has written a very nice RTutor problem set that allows you interactively explore this question in R. It is based on the paper

Soap Operas and Fertility: Evidence from Brazil by Eliana La Ferrara, Alberto Chong and Suzanne Duryea, American Economic Journal: Applied Economics (2012)

While correlations between two time series can be quickly found, the challenge in many econometric studies is to convincingly establish causal effects. The main analysis is based on a micro data panel data set that contains births and background characteristics for Brazilian woman over several years. The analysis exploits the fact that the relevant TV broadcast company entered at different points of time in different regions.

The regression analysis starts with Exercise 4. Subsequently you can explore many robustness checks (e.g. placebo regressions), that make actually a quite convincing case for the presence of a causal effect of soap operas on fertility.

The panel data analysis in the problem set extensively uses the R package lfe. In my view `lfe` is quite a nice package, that allows to quickly run multiway fixed effect regressions on considerably large data sets. In addition, many functionality that empirical economists commonly use, like instrumental variables or clustered robust standard errors can be very conveniently accessed. For economists that currently use Stata but are thinking of moving to R, the `lfe` package may be removing some considerable obstacles. You can check it out in Clara’s problem set.

Here is a screenshot:

To install and try out the problem set locally, follow the instructions here:

https://github.com/ClaraUlmer/RTutorSoapOperas

There is also an online version hosted by shinyapps.io that allows you explore the problem set without any local installation. (The online version is capped at 30 hours total usage time per month. So it may be greyed out when you click at it.)

https://claraulmer.shinyapps.io/RTutorSoapOperas

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## R courses on basic R, advanced R, statistical machine learning with R, text mining with R, spatial modelling with R and R package building

(This article was first published on BNOSAC – Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

Waw, our course list for teaching R is getting bigger and bigger. We have now courses on basic, R, advanced R, R package building, statistical machine learning with R, text mining with R and spatial analysis with R. All face-to-face courses given in Belgium and scheduled in the coming months.

Some courses are given at the European Data Innovation Hub (Brussels, Belgium), other courses are given through the Leuven Statistics Research Center (Leuven, Belgium). From today on, you can register for the following courses regarding the use of R.
Prices are set to 300€ per course day + taxes.
For detailed information on the course content, have a look at the pdf which can be found here.

Courses given at the European Data Innovation Hub (Brussels, Belgium) – http://www.datainnovationhub.eu
07-08/09/2015 Introduction to R programming (2 days) (register at www.eventbrite.com/e/common-data-manipulation-for-r-programmers-tickets-17938118395)
14/09/2015 Common data manipulation for R programmers (1 day) (register at www.eventbrite.com/e/common-data-manipulation-for-r-programmers-tickets-17938118395)
28-29/09/2016 Statistical Machine Learning with R (2 days) (register at www.eventbrite.com/e/statistical-machine-learning-with-r-tickets-17938785390)
05/10/2015 Text mining with R (1 day) (register at www.eventbrite.com/e/text-mining-with-r-tickets-17939078266)
02/11/2015 Reporting with R (1 day) (register at www.eventbrite.com/e/reporting-with-r-tickets-17938500538)
03/11/2015 Creating R packages and R repositories (1 day) (register at www.eventbrite.com/e/training-creating-r-packages-and-r-repositories-tickets-18168431267)

Courses given at LStat (Leuven, Belgium) – http://lstat.kuleuven.be/training/index.htm
28-29/10/2015 Statistical Machine Learning with R (2 days) (register at lstat.kuleuven.be/training/coursedescriptions/statistical-machine-learning-with-r)
17/11/2015 Text mining with R (1 day) (register at lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r)
13-14/04/2016 Applied Spatial Modelling with R (1.5 days) (register at lstat.kuleuven.be/training/applied-spatial-modelling-with-r)

Hope to see you soon.

PS. Thanks Brandon for allowing us to use your wonderfull logo

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Doh! I Could Have Had Just Used V8!

By hrbrmstr

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

An R user recently had the need to split a “full, human name” into component parts to retrieve first & last names. The full names could be anything from something simple like “David Regan” to more complex & diverse such as “John Smith Jr.”, “Izaque Iuzuru Nagata” or “Christian Schmit de la Breli”. Despite the face that I’m pretty good at searching GitHub & CRAN for R stuff, my quest came up empty (though a teensy part of me swears I saw this type of thing in a package somewhere). I did manage to find Python & node.js modules that carved up human names but really didn’t have the time to re-implement their functionality from scratch in R (or, preferably, Rcpp).

Rather than rely on the Python bridge to R (yuck) I decided to use @opencpu‘s V8 package to wrap a part of the node.js humanparser module. If you’re not familiar with V8, it provides the ability to run JavaScript code within R and makes it possible to pass variables into JavaScript functions and get data back in return. All the magic happens via a JSON data passing & Rcpp wrappers (and, of course, the super-awesome code Jeroen writes).

Working with JavaScript in R is as simple as creating an instance of the JavaScript V8 interpreter, loading up the JavaScript code that makes the functions work:

 ```library(V8) ct <- new_context() ct\$source(system.file("js/underscore.js", package="V8")) ct\$call("_.filter", mtcars, JS("function(x){return x.mpg < 15}")) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 #> Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 #> Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 #> Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 #> Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4```

There are many more examples in the V8 vignette.

For `humanparser` I needed to use Underscore.js (it comes with V8) and a function from `humanparser` that I carved out to work the way I wanted it to. You can look at the innards of the package on github—specifically, this file (it’s really small)— and, to use the two new functions the package exposes it’s as simple as doing:

 ```devtools::install_github("hrbrmstr/humanparser") library(humanparser) parse_name("John Smith Jr.") #> \$firstName #> [1] "John" #> #> \$suffix #> [1] "Jr." #> #> \$lastName #> [1] "Smith" #> #> \$fullName #> [1] "John Smith Jr."```

or the following to convert a bunch of ’em:

 ```full_names <- c("David Regan", "Izaque Iuzuru Nagata", "Christian Schmit de la Breli", "Peter Doyle", "Hans R.Bruetsch", "Marcus Reichel", "Per-Axel Koch", "Louis Van der Walt", "Mario Adamek", "Ugur Tozsekerli", "Judit Ludvai" ) parse_names(full_names) #> Source: local data frame [11 x 4] #> #> firstName lastName fullName middleName #> 1 David Regan David Regan NA #> 2 Izaque Nagata Izaque Iuzuru Nagata Iuzuru #> 3 Christian de la Breli Christian Schmit de la Breli Schmit #> 4 Peter Doyle Peter Doyle NA #> 5 Hans R.Bruetsch Hans R.Bruetsch NA #> 6 Marcus Reichel Marcus Reichel NA #> 7 Per-Axel Koch Per-Axel Koch NA #> 8 Louis Van der Walt Louis Van der Walt NA #> 9 Mario Adamek Mario Adamek NA #> 10 Ugur Tozsekerli Ugur Tozsekerli NA #> 11 Judit Ludvai Judit Ludvai NA```

Now, the functions in this package won’t win any land-speed records since we’re going from R to C[++] to JavaScript and back, passing JSON-converted data back & forth, so I pwnd @quominus into making a full Rcpp-based human, full-name parser. And, he’s nearly done! So, keep an eye on humaniformat since it will no doubt be in CRAN soon.

The real point of this post is that there are tons of JavaScript modules that will work well with the V8 package and let you get immediate functionality for something that might not be in R yet. You can prototype quickly (it took almost no time to make that package and you don’t even need to go that far), then optimize later. So, next time—if you can’t find some functionality directly in R—see if you can get by with a JavaScript shim, then convert to full R/Rcpp when/if you need to go into production.

If you’ve done any creative V8 hacks, drop a note in the comments!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## functional enrichment analysis with NGS data

By ygc

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

I found that there is a Bioconductor package, seq2pathway, that can apply functional analysis to NGS data. It consists of two components, seq2gene and gene2pathway. seq2gene converts genomic coordination to genes while gene2pathway performs functional analysis at gene level.

I think it would be interesting to incorporate seq2gene with clusterProfiler. But I found it fail to run due to it call absolute path of python installed in the author’s computer.

THIS should never be happened in a Bioconductor package since Bioconductor requires every package to have a least one vignette and runnable examples in man files. It turns out that there is no runnable example in either the vignette or man files. This package was compiled successfully in all platforms but it actually can only be run in windows platform with python installed exactly in the same path of the author. I don’t know why Bioconductor core team didn’t detect this issue when reviewing this package and why reviewers of Bioinformatics didn’t test the package.

I modified ‘C:/Python27/python’, to ‘python’. But it still throw error:

This is due to the author used ‘’ as folder separator which is Windows platform specific.

```command
AND I found that there are many '' in the source code.
So I gave up to try this package and reinvented the wheel.
It's not difficult to implement a seq2gene function using internal functions already available in my package ChIPseeker. It only took me half hour and was available since version 1.5.5.
seq2gene provided by ChIPseeker considers host gene (exon/intron), promoter region and flanking gene from intergenomic region.
With ChIPseeker, user can annotates genomic coordination by nearest gene, host gene, flanking genes or consider all of them by seq2gene and performs GO/KEGG/DO/Reactome analysis with clusterProfiler/DOSE/ReactomePA at gene level. A scenario can be found in Bioconductor support site, that used ChIPseeker to retrieve 5 genes flanking of each enhancer followed by GO analysis.
Note:
seq2pathway, https://github.com/Bioconductor-mirror/seq2pathway, version 1.1.2, access date 21 Aug, 2015
Related Posts
```

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Deploying a car price model using R and AzureML

(This article was first published on Longhow Lam’s Blog » R, and kindly contributed to R-bloggers)

Recently Microsoft released the AzureML R package, it allows R users to publish their R models (or any R function) as a web service on the Microsoft Azure Machine Learning platform. Of course, I wanted to test the new package, so I performed the following steps.

To make use of this service you will need to sign up for an account. There is a free tier account, go to the Azure machine learning website.

### Get the AzureML package

The azureml package can be installed from GitHub, follow the instructions here.

### Create a model in R to deploy

In a previous blog post I described how to estimate the value of a car. I scraped the Dutch site www.autotrader.nl where used cars are sold. For more than 100.000 used cars I have their make, model, sales price, mileage (kilometers driven) and age. Let’s focus again on Renault, and instead of only using mileage I am also going to use the age, and the car model in my predictive splines model. For those who are interested the Renault data (4MB, Rda format, with many more variables) is available here.

```
library(splines)
RenaultPriceModel = lm(Price ~ ns(Kilometers, df=5)*model + ns(Age,df=3) , data = RenaultCars)

```

Adding interaction between Kilometers and car model increases the predictive power. Suppose I am satisfied with the model, it has a nice R-squared (0.92) on a hold-out set, now I am going to create a scoring function from the model object

```

RenaultScorer = function(Kilometers, Age, model)
{
require(splines)
return(
predict.lm(
RenaultPriceModel,
data.frame("Kilometers" = Kilometers, "Age" = Age, "model" = model)
)
)
}

# predict the price for a 1 year old Renault Espace that has driven 50000 KM
RenaultScorer(50000,365,"ESPACE")
33616.91

```

To see how estimated car prices vary over the age and kilometers, I created some contour plots. It seems that Clios loose value over years while Espaces loose value over kilometers. Sell your five year old Espace with 100K kilometers for 20.000 Euro, so that you can buy a new Clio (0 years and 0 kilometers)..

Before you can publish R functions on Azure you need to obtain the security credentials to your Azure Machine Learning workspace. See the instructions here. The function, RenaultScorer, can now be published as a web service on the AzureML platform using the publishWebservice function.

```
library(AzureML)

myWsID = "123456789101112131415"
myAuth = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# publish the R function to AzureML
RenaultWebService = publishWebService(
"RenaultScorer",
"RenaultPricerOnline",
list("Kilometers" = "float", "Age" = "float", "model" = "string"),
list("prijs" = "float"),
myWsID,
myAuth
)

```

### Test the web service

In R, the AzureML package contains the consumeList function to call the web service we have just created. The key and the location of the web service that we want to call are contained in the return object of the publishWebservice.

```
### predict the price of a one year old Renault Espace using the web service
RenaultPrice = consumeLists(
RenaultWebService[[2]][[1]]["PrimaryKey"],
RenaultWebService[[2]][[1]]\$ApiLocation,
list("Kilometers" = 50000, "Age" = 365, "model" = "ESPACE")
)
RenaultPrice

prijs
1 33616.90625
```

In the Azure machine learning studio you can also test the web service.

Azure ML studio

My web service (called RenaultPricerOnline) is visible now in Azure machine Learning studio, click on it and it will launch a dialog box to enter kilometers, age and model.

The call results again in a price of 33616.91. The call takes some time, but that’s because the performance is throttled for free tier accounts.

### Conclusion

In R I have created a car price prediction model, but more likely than not, the predictive model will be consumed by another application than R, It is not difficult to publish an R model as a web service on the Azure machine learning platform. Once you have done that, AzureML will generate the api key and the link to the web service, so that you can provide that to anyone who wants to consume the predictive model in their applications.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Data Manipulation with dplyr

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

`dplyr` is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation. Here, I will provide a basic overview of some of the most useful functions contained in the package.

For this article, I will be using the `airquality` dataset from the `datasets` package. The `airquality` dataset contains information about air quality measurements in New York from May 1973 – September 1973.

The head of the dataset looks like this:

```head(airquality)
Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
```

Before we dive into the functions, let’s load up the two packages:

```library(datasets)
library(dplyr)
```

Okay, now let’s get to the functions.

Filter
The filter function will return all the rows that satisfy a following condition. For example below will return all the rows where Temp is larger than 70.

```filter(airquality, Temp > 70)
Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72     5   2
2    12     149 12.6   74     5   3
3     7      NA  6.9   74     5  11
4    11     320 16.6   73     5  22
5    45     252 14.9   81     5  29
6   115     223  5.7   79     5  30
...
```

Another example of filter is to return all the rows where Temp is larger than 80 and Month is after May.

```filter(airquality, Temp > 80 & Month > 5)
Ozone Solar.R Wind Temp Month Day
1    NA     286  8.6   78     6   1
2    NA     287  9.7   74     6   2
3    NA     186  9.2   84     6   4
4    NA     220  8.6   85     6   5
5    NA     264 14.3   79     6   6
...
```

Mutate
Mutate is used to add new variables to the data. For example lets adds a new column that displays the temperature in Celsius.

```mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
Ozone Solar.R Wind Temp Month Day  TempInC
1    41     190  7.4   67     5   1 19.44444
2    36     118  8.0   72     5   2 22.22222
3    12     149 12.6   74     5   3 23.33333
4    18     313 11.5   62     5   4 16.66667
5    NA      NA 14.3   56     5   5 13.33333
...
```

Summarise
The summarise function is used to summarise multiple values into a single value. It is very powerful when used in conjunction with the other functions in the `dplyr` package, as demonstrated below. `na.rm = TRUE` will remove all `NA` values while calculating the mean, so that it doesn’t produce spurious results.

```summarise(airquality, mean(Temp, na.rm = TRUE))
mean(Temp)
1   77.88235
```

Group By
The `group_by` function is used to group data by one or more variables. Will group the data together based on the Month, and then the summarise function is used to calculate the mean temperature in each month.

```summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
Month mean(Temp)
1     5   65.54839
2     6   79.10000
3     7   83.90323
4     8   83.96774
5     9   76.90000
```

Sample
The sample function is used to select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line of code randomly selects 15 rows (10% of the original 153 rows) from the dataset.

```sample_n(airquality, size = 10)
sample_frac(airquality, size = 0.1)
```

Count
The count function tallies observations based on a group. It is slightly similar to the table function in the base package. For example:

```count(airquality, Month)
Month  n
1     5 31
2     6 30
3     7 31
4     8 31
5     9 30
```

This means that there are 31 rows with Month = 5, 30 rows with Month = 6, and so on.

Arrange
The arrange function is used to arrange rows by variables. Currently, the airquality dataset is arranged based on Month, and then Day. We can use the arrange function to arrange the rows in the descending order of Month, and then in the ascending order of Day.

```arrange(airquality, desc(Month), Day)
Ozone Solar.R Wind Temp Month Day
1    96     167  6.9   91     9   1
2    78     197  5.1   92     9   2
3    73     183  2.8   93     9   3
4    91     189  4.6   93     9   4
5    47      95  7.4   87     9   5
6    32      92 15.5   84     9   6
```

Pipe
The pipe operator in R, represented by `%>%` can be used to chain code together. It is very useful when you are performing several operations on data, and don’t want to save the output at each intermediate step.

For example, let’s say we want to remove all the data corresponding to Month = 5, group the data by month, and then find the mean of the temperature each month. The conventional way to write the code for this would be:

```filteredData <- filter(airquality, Month != 5)
groupedData <- group_by(filteredData, Month)
summarise(groupedData, mean(Temp, na.rm = TRUE))
```

With piping, the above code can be rewritten as:

```airquality %>%
filter(Month != 5) %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm = TRUE))
```

This is a very basic example, and the usefulness may not be very apparent, but as the number of operations/functions perfomed on the data increase, the pipe operator becomes more and more useful!

That brings us to the end of this article. I hope you enjoyed it and found it useful. If you have questions, feel free to leave a comment or reach out to me on Twitter.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Track Hurricane Danny (Interactively) with R + leaflet

By hrbrmstr

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

Danny became the first hurricane of the 2015 Season, so it’s a good time to revisit how one might be able to track them with R.

We’ll pull track data from Unisys and just look at Danny, but it should be easy to extrapolate from the code.

For this visualization, we’ll use leaflet since it’s all the rage and makes the plots interactive without any real work (thanks to the very real work by the HTML Widgets folks and the Leaflet.JS folks).

Let’s get the library calls out of the way:

 ```library(leaflet) library(stringi) library(htmltools) library(RColorBrewer)```

Now, we’ll get the tracks:

 `danny <- readLines("http://weather.unisys.com/hurricane/atlantic/2015/DANNY/track.dat")`

Why aren’t we using `read.csv` or `read.table` directly, you ask? Well, the data is in a really ugly format thanks to the spaces in the `STATUS` column and two prefix lines:

 ```Date: 18-20 AUG 2015 Hurricane-1 DANNY ADV LAT LON TIME WIND PR STAT 1 10.60 -36.50 08/18/15Z 30 1009 TROPICAL DEPRESSION 2 10.90 -37.50 08/18/21Z - - TROPICAL DEPRESSION 3 11.20 -38.80 08/19/03Z - - TROPICAL DEPRESSION 4 11.30 -40.20 08/19/09Z - - TROPICAL DEPRESSION 5 11.20 -41.10 08/19/15Z - - TROPICAL DEPRESSION 6 11.50 -42.00 08/19/21Z - - TROPICAL DEPRESSION 7 12.10 -42.70 08/20/03Z - - TROPICAL DEPRESSION 8 12.20 -43.70 08/20/09Z - - TROPICAL DEPRESSION 9 12.50 -44.80 08/20/15Z - - TROPICAL DEPRESSION +12 13.10 -46.00 08/21/00Z 70 - HURRICANE-1 +24 14.00 -47.60 08/21/12Z 75 - HURRICANE-1 +36 14.70 -49.40 08/22/00Z 75 - HURRICANE-1 +48 15.20 -51.50 08/22/12Z 70 - HURRICANE-1 +72 16.00 -56.40 08/23/12Z 65 - HURRICANE-1 +96 16.90 -61.70 08/24/12Z 65 - HURRICANE-1 +120 18.00 -66.60 08/25/12Z 55 - TROPICAL STORM```

But, we can put that into shape pretty easily, using `gsub` to make it easier to read everything with `read.table` and we just skip over the first two lines (we’d use them if we were doing other things with more of the data).

 ```danny_dat <- read.table(textConnection(gsub("TROPICAL ", "TROPICAL_", danny[3:length(danny)])), header=TRUE, stringsAsFactors=FALSE)```

Now, let’s make the data a bit prettier to work with:

 ```# make storm type names prettier danny_dat\$STAT <- stri_trans_totitle(gsub("_", " ", danny_dat\$STAT)) # make column names prettier colnames(danny_dat) <- c("advisory", "lat", "lon", "time", "wind_speed", "pressure", "status")```

Those steps weren’t absolutely necessary, but why do something half-baked (unless it’s chocolate chip cookies)?

Let’s pick better colors than Unisys did. We’ll use a color-blind safe palette from Color Brewer:

 ```danny_dat\$color <- as.character(factor(danny_dat\$status, levels=c("Tropical Depression", "Tropical Storm", "Hurricane-1", "Hurricane-2", "Hurricane-3", "Hurricane-4", "Hurricane-5"), labels=rev(brewer.pal(7, "YlOrBr"))))```

And, now for the map! We’ll make lines for the path that was already traced by Danny, then make interactive points for the forecast locations from the advisory data:

 ```leaflet() %>% addTiles() %>% addPolylines(data=danny_dat[danny_dat\$advisory<=9,], ~lon, ~lat, color=~color) %>% addCircles(data=danny_dat[danny_dat\$advisory>9,], ~lon, ~lat, color=~color, fill=~color, radius=25000, popup=~sprintf("Advisory forecast for +%dh (%s)
Position: %3.2f, %3.2f
Expected strength: %s
Forecast wind: %s (knots)
Forecast pressure: %s", htmlEscape(advisory), htmlEscape(time), htmlEscape(lon), htmlEscape(lat), htmlEscape(color), htmlEscape(status), htmlEscape(wind_speed), htmlEscape(pressure)))```

Click on one of the circles to see the popup.

The entire source code is in this gist and, provided you have the proper packages installed, you can run this at any time with:

 `devtools::source_gist("e3253ddd353f1a489bb4", sha1="b3e1f13e368d178804405ab6d0bf98a185126a9a")`

The use of the `sha1` hash parameter will help ensure you aren’t being asked to run a potentially modified & harmful gist, but you should visit the gist first to make sure I’m not messing with you (which, I’m not).

If you riff off of this or have suggestions for improvement, drop a note here or in the gist comments.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## 5 New R Packages for Data Scientists

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

One great beauty of the R ecosystem, and perhaps the primary reason for R’s phenomenal growth, is the system for contributing new packages. This, coupled to the rock solid stability of CRAN, R’s primary package repository, gives R a great advantage. However, anyone with enough technical knowhow to formulate a proper submission can contribute a package to CRAN. Just being on CRAN is no great indicator of merit: a fact that newcomers to R, and open source, often find troubling. It takes some time and effort working with R in a disciplined way to appreciate how the organic metracracy of the package system leads to high quality, integrated software. Nevertheless, even for relative newcomers it is not difficult to discover the bedrock packages that support the growth of the R language. Those packages that reliably add value to the R language and they are readily apparent in plots of CRAN’s package dependency network.

Finding new packages that may ultimately prove to be useful is another matter. In the spirit of discovery; here are 5, relatively new packages that I think may ultimately prove to be interesting to data scientists. None of these have been on CRAN long enough to be battle tested. So please, explore them with cooperation in mind.

AzureML V0.1.1

Cloud computing is, or will be, important to every practicing data scientist. Microsoft’s Azure ML is a particularly rich machine learning environment for R (and Python) programmers. If your are not yet an Azure user this new package goes a long way to overcoming the inertia involved in getting started. It provides functions to push R code from your local environment up to the Azure cloud and publish functions and models as web services. The vignette walks you step by step from getting a trial account and the necessary credentials to publishing your first simple examples.

distcomp V0.25.1

Distributed computing with large data sets is always tricky, especially in environments where it is difficult or impossible to share data among collaborators. A clever partial likelihood algorithm implemented in the distcomp package (See the paper by Narasimham et al.) makes it possible to build sophisticated statistical models on unaggregated data sets. Have a look at this previous blog post for more detail.

rotationForest V0.1

The forests algorithm is the “go to” ensemble method for many data scientists as it consistently performs well on diverse data sets. This new variation based on performing Principal Component Analysis on random subsets of the feature space shows great promise. See the paper by Rodriguez et. al. for an explanation of how the PCA amounts to rotating the feature space and a comparison of the rotation forest algorithm with standard random forests and the Adaboost algorithm.

rpca V0.2.3

Given a matrix that is a superposition of a low rank component and a sparse component, rcpa uses a robust PCA method to recover these components. Netflix data scientists publicized this algorithm, which is based on a paper by Candes et al, Robust Principal Component Analysis, earlier this year when they reported spectacular success using robust PCA in an outlier detection problem.

SwarmSVM V0.1

The support vector machine is also a mainstay machine learning algorithm. SwarmSVM, which is based on a clustering approach as described in a paper by Gu and Han provides three ensemble methods for training support vector machines. The vignette that accompanies the package provides a practical introduction to the method.