See R in action at the BUILD conference

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Build 2015, the Microsoft conference which brings around 5,000 developers to the Moscone Center in San Francisco, begins tomorrow. The conference is sold out, but you can livestream the keynote presentations from buildwindows.com to catch all the big announcements. You can also follow along on Twitter at the #Build2015 hashtag.

There will be a major keynote presentation featuring CEO Satya Nadella tomorrow (Wednesday) from 8:30AM to 11:00AM Pacific Time. But R users should also be sure to tune in to Thursday's keynote, also from 8:30AM to 11:00AM. I can reveal that R will be a significant component of the presentation by Joseph Sirosh, which will include, at around 9:30AM, a live demo of Revolution R running in the Azure cloud. I can't reveal all the details yet, but there will be a demo of distributed big-data computing with R on Azure HDInsight Hadoop using Bioconductor and Shiny, plus several other very cool machine learning applications to see. Don't miss it!

Microsoft Build 2015: Agenda

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Finance-YahooQuote 0.25 hotfix

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A hotfix release for the Finance-YahooQuote Perl module on CPAN is now available. Yahoo! Finance decided to change the base URL for quote downloads. My thanks to Nicola Chiapolini, who not only noticed but also sent me the one-line patch fixing this:

--- YahooQuote.pm~      2010-03-27 01:44:10.000000000 +0100
+++ YahooQuote.pm       2015-04-29 11:31:20.407926674 +0200
@@ -34,7 +34,7 @@
 $VERSION = '0.24';
 
 ## these variables govern what type of quote the modules is retrieving
-$QURLbase = "http://download.finance.yahoo.com/d/quotes.csvr?e=.csv&f=";
+$QURLbase = "http://download.finance.yahoo.com/d/quotes.csv?e=.csv&f=";
 $QURLformat = "snl1d1t1c1p2va2bapomwerr1dyj1x";        # default up to 0.19
 $QURLextended = "s7t8e7e8e9r6r7r5b4p6p5j4m3m4";        # new in 0.20
 $QURLrealtime = "b2b3k2k1c6m2j3"; # also new in 0.20

If need be, edit your file YahooQuote.pm by hand.
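
Alternatively, you can sanity-check the corrected endpoint straight from R before patching (a minimal sketch: the f= format codes follow the module's $QURLformat convention, while the s= symbol parameter and the IBM ticker are illustrative assumptions):

# fetch one delayed quote from the corrected base URL
URL <- paste0("http://download.finance.yahoo.com/d/quotes.csv",
              "?e=.csv&f=snl1d1t1&s=IBM")
quotes <- read.csv(URL, header = FALSE,
                   col.names = c("symbol", "name", "last", "date", "time"))
quotes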

This change in Finance-YahooQuote will also affect Beancounter and smtm, both of which use this module.

The fix has been pushed to Debian for the corresponding package and to PAUSE for the CPAN package.

Having maintained this package in RCS since 2002, I have also just created a GitHub repo for it, where development and maintenance will now happen.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


choroplethrZip v1.3.0: easier demographics, national maps

By Ari Lamstein

[Map: total_population]

(This article was first published on Just an R Blog » R, and kindly contributed to R-bloggers)

Introduction

choroplethrZip v1.3.0 is now available on GitHub. You can get it by typing:

# install.packages("devtools")
library(devtools)
install_github('arilamstein/choroplethrZip@v1.3.0')

Version 1.3.0 has two new features:

  1. The data frame df_zip_demographics contains eight demographic statistics for each ZIP Code Tabulated Area (ZCTA) in the US. The data come from the 2013 5-year American Community Survey (ACS).
  2. The function get_zip_demographics() returns a data.frame with those same statistics from an arbitrary ACS (see the sketch below).
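
For example, a minimal sketch (assuming that endyear and span follow the same ACS convention as choroplethr's other get_*_demographics functions):

library(choroplethrZip)

# fetch the same eight statistics from the 2012 5-year ACS instead
# (endyear/span assumed to mirror get_state_demographics et al.)
df_zip_2012 = get_zip_demographics(endyear = 2012, span = 5)
head(df_zip_2012)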

Data

Here is how to access the data:

library(choroplethrZip)
data(df_zip_demographics)

?df_zip_demographics
colnames(df_zip_demographics)
 [1] "region" "total_population" "percent_white" 
 [4] "percent_black" "percent_asian" "percent_hispanic" 
 [7] "per_capita_income" "median_rent" "median_age"

summary(df_zip_demographics[, "total_population"])
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
 0    721     2802   9517 13000   114700

Mapping the Data

Here is a program which will create national maps of the data:

# for each column in the data.frame
for (i in 2:ncol(df_zip_demographics))
{
 # set the value and title
 df_zip_demographics$value = df_zip_demographics[,i]
 title = paste0("2013 ZCTA Demographics:\n",
                colnames(df_zip_demographics[i]))

 # print the map
 choro = zip_choropleth(df_zip_demographics, title=title)
 print(choro)
}

Note that national zip maps can take a few minutes to render. Here is the output.

[Maps: percent_black, percent_asian, percent_hispanic, per_capita_income, median_rent, median_age]
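
If a full national map is more than you need, zooming to a single state renders much faster (a sketch, assuming the state_zoom argument accepts a lower-case state name as elsewhere in the package):

# map one statistic for New York ZCTAs only
df_zip_demographics$value = df_zip_demographics$per_capita_income
zip_choropleth(df_zip_demographics,
               state_zoom = "new york",
               title      = "2013 New York ZCTA Per Capita Income")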

New Vignette

Additionally, I have created a new vignette about these features.

To leave a comment for the author, please follow the link and comment on his blog: Just an R Blog » R.


RStudio v0.99 Preview: Code Diagnostics

By jjallaire

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

In RStudio v0.99 we've made a major investment in R source code analysis. This work resulted in significant improvements in code completion and, in the latest preview release, enables a new inline code diagnostics feature that highlights various issues in your R code as you edit.

For example, here we're getting a diagnostic that notes an extra parenthesis.

Here the diagnostic indicates that we’ve forgotten a comma within a shiny UI definition:

[Screenshot: missing-comma diagnostic]

This diagnostic flags an unknown parameter to a function call:

[Screenshot: unknown-parameter diagnostic]

This diagnostic indicates that we’ve referenced a variable that doesn’t exist and suggests a fix based on another variable in scope:

[Screenshot: variable-name fix suggestion]

A wide variety of diagnostics are supported, including optional diagnostics for code style issues (e.g. the inclusion of unnecessary whitespace). Diagnostics are also available for several other languages including C/C++, JavaScript, HTML, and CSS.

Configuring Diagnostics

By default, code in the current source file is checked whenever it is saved, as well as if the keyboard is idle for a period of time. You can tweak this behavior using the Code -> Diagnostics options:

[Screenshot: diagnostics options]

Note that several of the available diagnostics are disabled by default. This is because we're in the process of refining their behavior to eliminate "false positives", where correct code is flagged as having a problem. We'll continue to improve these diagnostics and enable them by default when we feel they are ready.

Trying it Out

You can try out the new code diagnostics by downloading the latest preview release of RStudio. This feature is a work in progress and we’re particularly interested in feedback on how well it works. Please also let us know if there are common coding problems which you think we should add new diagnostics for. We hope you try out the preview and let us know how we can make it better.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.


Winning streaks in baseball

By dan

[Plot: longest winning streak per season]

(This article was first published on Decision Science News » R, and kindly contributed to R-bloggers)

HOW RARE ARE STREAKS?

The New York Mets recently won 11 games in a row, which got a lot of attention.

How likely is it that a given baseball team will win 11 games in a row by chance, if its probability of winning a single game is 50%?

The plot below shows that if a baseball team plays 100 seasons of 162 games, they'll have a streak of 11 wins in a row about 7 to 8 times a century (about every 13 years on average). If they're a really good team that wins 60% of the time in the long run, they'll have an 11-game winning streak 55 times per century (about every 2 years).

Streaks aren't weird; they're expected. The graph up top shows that for a team that wins 50% of the time, the most likely outcome is a six-game winning streak in a typical 162-game season. There's an 8% chance their longest streak in a season will be 10 wins or more.

For the gifted team that wins 60% of the time, an eight-game winning streak is the most likely outcome in a season, and there's a 32% chance they'll have a streak of 10 wins or more.

Fans of R and ggplot2 can reproduce the plots with the code in the original post.
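
In the meantime, here is a minimal simulation sketch (not the original post's code) of the streak probabilities discussed above:

# simulate a 162-game season at win probability p and return the
# longest winning streak (a stand-in sketch, not the author's code)
longest_streak <- function(p, games = 162) {
  wins <- runif(games) < p
  runs <- rle(wins)
  max(c(0, runs$lengths[runs$values]))
}

set.seed(1)
# share of simulated seasons whose longest streak reaches 11 wins
mean(replicate(10000, longest_streak(0.5)) >= 11)
mean(replicate(10000, longest_streak(0.6)) >= 11)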

The post Winning streaks in baseball appeared first on Decision Science News.

To leave a comment for the author, please follow the link and comment on his blog: Decision Science News » R.


Situational Baseball: Analyzing Runs Potential Statistics

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Mark Malter

After reading the book Analyzing Baseball Data with R, by Max Marchi and Jim Albert, I decided to expand on some of their ideas relating to runs created and put them into an R Shiny app.

The Server and UI code are linked at the bottom of the Introduction tab.

I downloaded the Retrosheet play-by-play data for every game played in the 2011-2014 seasons in every park, and aggregated every plate appearance into one of the 24 bases/outs states (ranging from nobody on/nobody out to bases loaded/two outs). With the Retrosheet data, I wrote code to track the batter, bases, outs, runs scored over the remainder of the inning, current game score, and inning. I also used the R Lahman package and databases for individual player information. Below is a brief explanation of the function of each tab in the app.

Potential Runs by bases/outs state: A matrix of all 24 possible bases/outs states, showing both the expected runs over the remainder of the inning and the probability of scoring at least one run over the remainder of the inning (useful for late innings of close games). I used this table to analyze several types of plays, as shown below. Notice that, assuming average hitters, the analysis below shows why sacrifice bunts are always a bad idea. The Runs Created stat for a plate appearance is defined as:

end state – start state + runs created on play.
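
As a toy illustration (a sketch: the two expectancy values are the ones quoted in the sacrifice-bunt discussion below, not the app's full 24-state matrix):

# runs created for one plate appearance, from a runs-expectancy lookup
# (two illustrative states only; the app uses all 24 bases/outs states)
run_exp <- c("runner on 1st, 0 out" = 0.85,
             "runner on 2nd, 1 out" = 0.66)

runs_created <- function(start, end, runs_on_play) {
  unname(run_exp[end] - run_exp[start] + runs_on_play)
}

# a "successful" sacrifice bunt scores nobody, trading a base for an out
runs_created("runner on 1st, 0 out", "runner on 2nd, 1 out", 0)  # -0.19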

I first became serious about this after watching the last inning of the 2014 World Series. Down 3-2 with two outs and nobody on base, Alex Gordon singled to center and advanced to third on a two-base error. As Gordon was heading into third base, Giants shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon been sent home, Crawford would likely have thrown him out at the plate. However, the runs matrix shows only a 26% chance of scoring a run with a man on third and two outs, and with Madison Bumgarner on the mound, it was even less likely that on-deck hitter Salvador Perez would be able to drive in Gordon. So even though sending Gordon would likely have ended the game (and the series), it still may have been the optimal play. This would be similar to hitting 16 vs. a dealer's ten in blackjack: you'll probably lose, but you're making the optimal play. For equivalency, see the Tag from Third analysis below, as this play would have been equivalent to tagging from third after a catch for the second out.

Runs Created All Regular MLB Players: I filtered out all players with fewer than 400 plate appearances and created an interactive chart (using rCharts) showing each player's runs potential by runs created. I placed the following filters in the UI: year, inning (1-3, 4-6, 7-extras), run differential at the time of the at bat (0-1, 2-3, 4+), position, team, bats, age range, and weight. Hovering over a point shows the player and his salary. For example, Mike Trout created 58 runs out of a potential of 332 in 2014. Filtering 2013 for second basemen under the age of 30 and weighing less than 200 pounds, we see Jason Kipnis created 27 runs out of a potential of 300.

Player Runs Table: Same as above, but this shows each player (> 400 plate appearances for the selected season), broken down by each of the eight bases states. For example, in 2014 Jose Abreu created 43.5 runs on a potential of 291, and was most efficient with a runner on second base, where he created 10.3 runs on a potential of only 36.

The following tabs show the runs expectancies of various offensive plays, from the start state to the expected end state, based on the Baserunning Success rate selected in the UI. For each play, I created a graphical tab as well as a table tab. The graphical tabs include a UI control to switch between views of expected runs and scoring probability.

Stolen Bases Graphic/Table: For each of fifteen different base-stealing situations, I show the start state, the end state (based on the UI-selected success rate), and the break-even success rate for the given situation. We see that rather than one generic rule of thumb for breaking even, the situational break-evens vary widely, ranging from 91% with a runner on second and two outs, to 54% for a double steal with runners on first and second and one out (I assume that any out is the lead runner). Notice, though, that if only the runner on second attempts to steal, the break-even jumps from 54% to 72%.
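
The break-even rate in each cell is the success probability p that makes the attempt a wash, i.e. that solves p * RE(success) + (1 - p) * RE(fail) = RE(start). A sketch (all three expectancy values below are illustrative):

# break-even success rate for a steal attempt, given runs expectancies
# for the start state, a successful steal, and a caught-stealing out
breakeven <- function(re_start, re_success, re_fail) {
  (re_start - re_fail) / (re_success - re_fail)
}

breakeven(re_start = 0.85, re_success = 1.10, re_fail = 0.25)  # ~0.71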

Tag from Third Graphic/Table: I broke down every situation where a fly ball was caught with a runner on third, where the catch was either the first or second out. I tracked the attempt frequency and success rate for each situation, based on the outs and whether there were trailing runners. Surprisingly, I found that almost every success rate is well over 95%, meaning runners only tag when they're almost certain to score. However, the break-evens range from 40% with runners on first and third and two outs (after the catch) to 77% with runners on second and third and one out. I believe this shows a gray area between the break-even and success rates where runners are being far too cautious.

The following tabs show whether a base runner should attempt to advance two bases on a single. Again, of course it depends on the situation.

First to Third Graphic/Table: Here we see that the attempt frequencies are very low and, as expected, lowest on balls hit to left field. However, as with the tag plays above, runners are almost always safe, showing another gray area between attempt rates and break-evens. For example, on a single to right field with one out, runners only attempt to advance to third base 42.1% of the time, and are safe 97.3% of the time. If we place the UI Success Rate slider on 0.85, we see that the attempt increases the runs expectancy from 0.87 to 0.99.

Second to Home Graphic/Table: Here we see the old adage, “don’t make the first or second out at the plate”, is not necessarily true. Attempting to score from second on a single depends not only on the outs, but also whether there is a trailing runner. The break evens range from 93% with no outs and no trailing runner on first, to 40% with two outs and no runner on first. Once again, the success rates are almost always higher than the break even rate, showing too much caution.

Sacrifice Bunt Graphic/Table: These tabs show that unless the hitter is far below average, the sacrifice should never be attempted. For example, in going from a runner on first and no outs to a runner on second with one out, or from a runner on second with no outs to a runner on third with one out, we drop from 0.85 runs to 0.66 runs and from 1.10 runs to 0.94 runs respectively. Worse, I'm assuming that the bunt is always successful, with the lead runner never being thrown out. The only situation where the bunt might be wise is in a late inning when the team is playing for one run after a leadoff double. Getting the runner from second with no outs to third with one out increases the probability of scoring from 0.61 to 0.65, IF the bunt is successful. Even then, it is a poor play if the success rate is less than 90%. The graphic tab lets the user see how the expected end state changes as the UI success rate slider is altered.

UI code: https://github.com/malter61/retrosheets/blob/master/ui.R
Server code: https://github.com/malter61/retrosheets/blob/master/server.R

Mark Malter is a data scientist currently working for Houghton Mifflin Harcourt, as well as the consulting firm Channel Pricing, specializing in building predictive models, cluster analysis, and visualizing data. He is also a sixteen-year veteran stock options market-maker at the Chicago Board Options Exchange. He has a BS in electrical engineering, an MBA, and is currently working on an MS in Predictive Analytics. Mark also spent 14 years as a director and coach of his local youth baseball league.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


RcppTOML 0.0.3: A New Approach to Configuration Files

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A small project I worked on during the last few weeks has now come together in the new package RcppTOML, which arrived on CRAN yesterday.

It provides R with a reader for TOML files. TOML stands for Tom's Obvious, Minimal Language. And before you roll your eyes, glance at the TOML site. It really is different, and has a number of rather wonderful features:

  • free-format indentation as you please
  • comments anywhere, even on the same line
  • actual types such as string, integer, float, bool and datetime (!!) which are all native
  • vectors, of course, of the above
  • arbitrary nesting of tables

Here is a simple illustration where we parse the TOML example file derived from the one in the main TOML README:

R> p <- parseTOML(system.file("toml", "example.toml", package="RcppTOML"))
R> summary(p)
toml object with top-level slots:
   clients, database, owner, servers, title 
read from /usr/local/lib/R/site-library/RcppTOML/toml/example.toml 
R> p
List of 5
 $ clients :List of 2
  ..$ data :List of 2
  .. ..$ : chr [1:2] "gamma" "delta"
  .. ..$ : int [1:2] 1 2
  ..$ hosts: chr [1:2] "alpha" "omega"
 $ database:List of 4
  ..$ connection_max: int 5000
  ..$ enabled       : logi TRUE
  ..$ ports         : int [1:3] 8001 8001 8002
  ..$ server        : chr "192.168.1.1"
 $ owner   :List of 4
  ..$ bio         : chr "GitHub Cofounder & CEO\nLikes tater tots and beer."
  ..$ dob         : POSIXct[1:1], format: "1979-05-27 07:32:00"
  ..$ name        : chr "Tom Preston-Werner"
  ..$ organization: chr "GitHub"
 $ servers :List of 2
  ..$ alpha:List of 2
  .. ..$ dc: chr "eqdc10"
  .. ..$ ip: chr "10.0.0.1"
  ..$ beta :List of 2
  .. ..$ dc: chr "eqdc10"
  .. ..$ ip: chr "10.0.0.2"
 $ title   : chr "TOML Example"
NULL
R> 

See much more at the TOML site. I converted a first project at work to this and it really rocks. Point to a file, get a list back, and index all components by their names.
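
You do not even need a config file on disk to experiment; a minimal sketch writing a small TOML document to a temporary file and parsing it back:

library(RcppTOML)

# a tiny TOML document exercising native types, a vector and a table
toml <- '
title = "demo"

[server]
host = "10.0.0.1"
ports = [ 8001, 8001, 8002 ]
enabled = true
'
tf <- tempfile(fileext = ".toml")
writeLines(toml, tf)
cfg <- parseTOML(tf)
cfg$server$ports    # int [1:3] 8001 8001 8002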

We also added a really simple S3 class so that the default print() method uses str() for a more compact presentation of what is (in R) of course a nested list type.

Internally, the RcppTOML package uses the splendid cpptoml parser by Chase Geigle. This brings in modern C++11, which means that CRAN simply cannot build a binary for R on Windows, as the g++ version (still, as of April 2015) in Rtools is too old. There is word of an update to Rtools, at which point we should be able to support Windows as well. Until then, no mas.

A bit more information is on the package page here as well as at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


Downloading and Visualizing Seismic Events from USGS

By Fabio Veronesi

(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

The unlucky events that took place in Nepal have flooded the web with visualizations of the earthquakes recorded by USGS. These normally show the earthquakes with a colour scale that depends on the age of the event and a marker size that depends on magnitude. I remembered that some time ago I tested ways of downloading and visualizing USGS data in the same way in R. So I decided to take those tests back, clean them up and publish them. I hope this will not offend anyone; I do not want to disrespect the tragedy, just share my work.

The USGS provides access to csv files for seismic events recorded in several time frames: the past hour, the past day, the past week, and the past 30 days. For each of these, several choices of significance are provided: users can download all the events in the time frame, or limit their request to events with magnitude higher than 1.0, 2.5 or 4.5, or to significant events only. The data are provided in csv files with standard names, so they are always accessible and are updated every 15 minutes with new data.
For example, in this case we are downloading all the data of the last month, so the csv file's name is all_month.csv. If we wanted to download only the earthquakes of the last day with a magnitude above 4.5, we would use the file name 4.5_day.csv. The links to all the csv files provided by USGS are available here: http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php
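
Since the file names are standardized, a small helper can build any of the feed URLs (a sketch following the naming scheme just described):

# build a USGS feed URL from significance ("all", "1.0", "2.5", "4.5",
# "significant") and time frame ("hour", "day", "week", "month")
usgs.csv.url <- function(significance="all", period="month"){
paste0("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/",
       significance, "_", period, ".csv")
}
usgs.csv.url("4.5", "day")   # magnitude 4.5+ events of the past day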

For this experiment we need the following packages: sp, plotrix, and raster.
In R we can easily import the data by calling the read.table function and reading the csv file directly from the server:

library(sp)      # spatial objects and coordinates()
library(plotrix) # colour scales and legends
library(raster)  # projection() and shapefile()

URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
Earthquake_30Days <- read.table(URL, sep = ",", header = T)

This will download all the seismic events in the past 30 days.
Now we can transform the data, which are stored in a data.frame, into a spatial object using the following two lines:

coordinates(Earthquake_30Days)=~longitude+latitude
projection(Earthquake_30Days)=CRS("+init=epsg:4326")

The first line transforms the object Earthquake_30Days into a SpatialPointsDataFrame. The second gives it its proper projection, which is a geographical projection like Google Maps.
At this point I want to download the borders of all the countries in the world so that I can plot the seismic events with some geographical references:

download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")

These three lines can download the border shapefile from the web, unzip it into the working directory and load it.
In this example I will visualize the earthquakes using the same technique used by the USGS, with a colour that varies with the age of the event and a size that depends on magnitude. So the first thing to do is take care of the time. If we check the format of the time variable in the USGS file we see that it is a bit uncommon:

Earthquake_30Days$time[1]
[1] 2015-04-28T12:20:43.410Z

For this reason I created a function to transform this format into something we can use:

# convert "2015-04-28T12:20:43.410Z" into a standard date-time string
conv.time <- function(vector){
split1 <- strsplit(paste(vector),"T")
split2 <- strsplit(split1[[1]][2],"Z")
fin <- paste(split1[[1]][1],split2[[1]][1])
paste(as.POSIXlt(fin, format="%Y-%m-%d %H:%M:%OS"))
}

If I apply this function to the previous example I obtain the following:

conv.time(Earthquake_30Days$time[1])
[1] "2015-04-28 12:20:43"

Now I can create a new variable in the object with this time-stamp:

DT <- sapply(Earthquake_30Days$time,FUN=conv.time)
Earthquake_30Days$DateTime <- as.POSIXlt(DT)

Now we can start the tricky part. For plotting the events with a custom colour scale and a custom size scale, we first need to create them. Moreover, we also need to create the thresholds needed for the legend.
For the colour scale we can do all that using the following lines:

days.from.today <- round(c(Sys.time()-Earthquake_30Days$DateTime)/60,0)
colour.scale <- color.scale(days.from.today,color.spec="rgb",extremes=c("red","blue"),alpha=0.5)
colors.DF <- data.frame(days.from.today,color.scale(days.from.today,color.spec="rgb",extremes=c("red","blue")))
colors.DF <- colors.DF[with(colors.DF, order(colors.DF[,1])), ]
colors.DF$ID <- 1:nrow(colors.DF)
breaks <- seq(1,nrow(colors.DF),length.out=10)

In the first line I calculate the age of the event as the difference between the system time and the event time-stamp. In the second line I create the colour scale with the color.scale function in plotrix, from red to blue and with some transparency.
Then I need to create the thresholds for the legend. I first create a data.frame with age and colours, then I order it by age and insert an ID column. At this point I can create the thresholds by simply using the seq function.
I do the same thing with the size thresholds:

size.DF <- data.frame(Earthquake_30Days$mag,Earthquake_30Days$mag/5)
size.DF <- size.DF[with(size.DF, order(size.DF[,1])), ]
size.DF$ID <- 1:nrow(size.DF)
breaks.size <- seq(0,max(Earthquake_30Days$mag/5),length.out=5)

Then I can plot the whole thing:

tiff(filename="Earthquake_Map.tif",width=7000,height=4000, res=300)

#Plot
plot(polygons)
plot(Earthquake_30Days, col= colour.scale, cex=Earthquake_30Days$mag/5, pch=16, add=T)

#Title and Legend
title("Earthquakes in the last 30 days",cex.main=3)
legend.pos <- list(x=-28.52392,y=-20.59119)
rect(xleft=legend.pos$x-5, ybottom=legend.pos$y-30, xright=legend.pos$x+30, ytop=legend.pos$y+10, col="white", border=NA)
legendg(legend.pos,legend=c(round(colors.DF[colors.DF$ID %in% round(breaks,0),1],2)),fill=paste(colors.DF[colors.DF$ID %in% round(breaks,0),2]),bty="n",bg=c("white"),y.intersp=0.75,title="Age",cex=0.8)
text(x=legend.pos$x+5,y=legend.pos$y+5,"Legend:")
legend(x=legend.pos$x+15,y=legend.pos$y,legend=breaks.size[2:5]*5,pch=points(rep(legend.pos$x+15,4),c(legend.pos$y-6,legend.pos$y-9,legend.pos$y-12,legend.pos$y-15),pch=16,cex=breaks.size[2:5]),cex=0.8,bty="n",bg=c("white"),y.intersp=1.1,title="Magnitude")

dev.off()

I divided the magnitude by 5 so that the bubbles are not too big. The position of the legends depends on the image: if you decrease the area plotted on the map, their location will shift, and you can adjust it using geographical coordinates.
The result is the following image:

Download the image here: Earthquake_Map.jpg

The full script is available here: Script

To leave a comment for the author, please follow the link and comment on his blog: R Video tutorial for Spatial Statistics.


R in Insurance 2015 Conference Programme

By Markus Gesmann

(This article was first published on mages’ blog, and kindly contributed to R-bloggers)

The conference programme for R in Insurance 2015 is now available.
Special thanks to our sponsors, without whom the conference wouldn't be possible:
CYBAEA, RStudio, APPLIED AI, Milliman, QBE Re, AEGON, Delta Lloyd Amsterdam, Deloitte.

You can find impressions from the previous events on www.rininsurance.com.

We hope to see you in Amsterdam!

This post was originally published on mages’ blog.

To leave a comment for the author, please follow the link and comment on his blog: mages’ blog.


Some basics of biomaRt

By nsaunders

(This article was first published on What You’re Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?

1. What marts and datasets are available?
The function listMarts() returns a data frame with mart names and descriptions.

library(biomaRt)

marts <- listMarts()
head(marts)

              biomart                             version
1             ensembl        ENSEMBL GENES 79 (SANGER UK)
2                 snp    ENSEMBL VARIATION 79 (SANGER UK)
3          regulation   ENSEMBL REGULATION 79 (SANGER UK)
4                vega                VEGA 59  (SANGER UK)
5       fungi_mart_26           ENSEMBL FUNGI 26 (EBI UK)
6 fungi_variations_26 ENSEMBL FUNGI VARIATION 26 (EBI UK)

The first 4 of those are what you would see if you visited Ensembl BioMart on the Web and clicked the “choose database” drop-down box.

Given a mart object, listDatasets() returns the available datasets. We get a mart object using useMart().

datasets <- listDatasets(useMart("ensembl"))
head(datasets)

                         dataset                                description version
1         oanatinus_gene_ensembl     Ornithorhynchus anatinus genes (OANA5)   OANA5
2        cporcellus_gene_ensembl            Cavia porcellus genes (cavPor3) cavPor3
3        gaculeatus_gene_ensembl     Gasterosteus aculeatus genes (BROADS1) BROADS1
4         lafricana_gene_ensembl         Loxodonta africana genes (loxAfr3) loxAfr3
5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2
6        choffmanni_gene_ensembl        Choloepus hoffmanni genes (choHof1) choHof1

2. How to specify a different genome build version?
Assuming that you want human genes in the most recent Ensembl build, you supply mart and dataset information from the previous step to useMart() like this:

mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl")

What if you want an older genome build? First, visit the Ensembl Archives page for more information. Next, use the URL information there to supply a host = argument. For example, to see what marts are available for Ensembl version 72 (genome build GRCh37.p11):

listMarts(host = "jun2013.archive.ensembl.org")

               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 72
2     ENSEMBL_MART_SNP  Ensembl Variation 72
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 72
4    ENSEMBL_MART_VEGA               Vega 52
5                pride        PRIDE (EBI UK)

Available datasets can be found as before with the addition of the host = argument to useMart(). To create the version 72 mart object:

mart72.hs <- useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl", host = "jun2013.archive.ensembl.org")

3. What filters and attributes can I use?
Let’s review the BioMart terminology.
Attributes are the identifiers that you want to retrieve. For example HGNC gene ID, chromosome name, Ensembl transcript ID.
Filters are the identifiers that you supply in a query. Some but not all of the filter names may be the same as the attribute names.
Values are the filter identifiers themselves. For example, the values of the filter "HGNC symbol" could be the 3 genes "TP53", "SRY" and "KIAA1199".

But how do you know what attributes and filters are available? If you guessed listAttributes() and listFilters(), you’d be right.

attributes <- listAttributes(mart.hs)
head(attributes)

                   name           description
1       ensembl_gene_id       Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3    ensembl_peptide_id    Ensembl Protein ID
4       ensembl_exon_id       Ensembl Exon ID
5           description           Description
6       chromosome_name       Chromosome Name

filters <- listFilters(mart.hs)
head(filters)

             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

You can search for specific attributes by running grep() on the name. For example, if you’re looking for Affymetrix microarray probeset IDs:

attributes[grep("affy", attributes$name),]

                   name                  description
102        affy_hc_g110        Affy HC G110 probeset
103       affy_hg_focus       Affy HG FOCUS probeset
104 affy_hg_u133_plus_2 Affy HG U133-PLUS-2 probeset
105     affy_hg_u133a_2     Affy HG U133A_2 probeset
106       affy_hg_u133a       Affy HG U133A probeset
107       affy_hg_u133b       Affy HG U133B probeset
108      affy_hg_u95av2      Affy HG U95AV2 probeset
109        affy_hg_u95b        Affy HG U95B probeset
110        affy_hg_u95c        Affy HG U95C probeset
111        affy_hg_u95d        Affy HG U95D probeset
112        affy_hg_u95e        Affy HG U95E probeset
113        affy_hg_u95a        Affy HG U95A probeset
114       affy_hugenefl      Affy HuGene FL probeset
115      affy_primeview               Affy primeview
116       affy_u133_x3p       Affy U133 X3P probeset
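
Putting attributes, filters and values together answers the question from the introduction. A minimal sketch using getBM(), the query function covered in the worked example below (attribute and filter names as returned by listAttributes() and listFilters()):

# map HGNC symbols to Ensembl gene IDs: attributes are what we want
# back, the filter is what we supply, values are the symbols themselves
getBM(attributes = c("hgnc_symbol", "ensembl_gene_id"),
      filters    = "hgnc_symbol",
      values     = c("TP53", "SRY", "KIAA1199"),
      mart       = mart.hs)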

4. A worked example
As an example, let's find SNPs on the Y chromosome whose start positions are located within exons. First, we create two mart objects, one for the genes dataset and one for the SNP dataset. We query the first for exons on chromosome Y and the second for SNPs on the same chromosome. Finally we employ the incredibly useful findOverlaps() function from the GenomicRanges package to get the exonic SNPs.

There are various ways to run BioMart queries; I find the simplest is to use getBM().

library(biomaRt)
library(GenomicRanges)

mart.hs     <- useMart("ensembl", "hsapiens_gene_ensembl")
mart.hs_snp <- useMart("snp", "hsapiens_snp")

exons <- getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end", "strand"), filters = "chromosome_name", values = "Y", mart = mart.hs)
snps  <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start"), filters = "chr_name", values = "Y", mart = mart.hs_snp)

gr.exons <- GRanges(seqnames = "chrY", ranges = IRanges(start = exons$exon_chrom_start, end = exons$exon_chrom_end, names = exons$ensembl_exon_id))
gr.snps  <- GRanges(seqnames = "chrY", ranges = IRanges(start = snps$chrom_start, end = snps$chrom_start, names = snps$refsnp_id))
ov <- findOverlaps(gr.snps, gr.exons, type = "within")

# ov is a Hits object: queryHits are row numbers into snps,
# subjectHits are row numbers into exons
exons_snps <- cbind(exons[ov@subjectHits, ], snps[ov@queryHits, ])

head(exons_snps)

     ensembl_gene_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end strand refsnp_id chr_name chrom_start
1967 ENSG00000231341 ENSE00001732601               Y          5207215        5208069     -1    rs4913        Y     5207285
1792 ENSG00000226092 ENSE00001743398               Y         21927521       21927819     -1    rs3802        Y    21927601
588  ENSG00000224657 ENSE00001783563               Y         22658581       22658878      1    rs3802        Y    22658799
2231 ENSG00000215540 ENSE00001693911               Y         23537308       23537606     -1    rs3802        Y    23537388
74   ENSG00000215507 ENSE00001733694               Y         26133010       26133308      1    rs3802        Y    26133228
275  ENSG00000215414 ENSE00001542224               Y         13286638       13287378      1   rs15434        Y    13287334


To leave a comment for the author, please follow the link and comment on his blog: What You’re Doing Is Rather Desperate » R.
