rud.is » R 2015-03-30 13:32:08

By hrbrmstr


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

Over on The DO Loop, @RickWicklin does a nice job visualizing the causes of airline crashes in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well. I also thought I’d take the opportunity to:

  • Give @jennybc‘s new googlesheets a spin
  • Show some dplyr & tidyr data wrangling (never can have too many examples)
  • Crank out some ggplot zero-based streamgraph-y area charts for the data with some extra ggplot wrangling for good measure

I also decided to use the colors in the original David McCandless/Kashan visualization.

Getting The Data

As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site makes the data available, so I copied it to my Google Drive and gave her package a go:

library(googlesheets)
library(ggplot2) # we'll need the rest of the libraries later
library(dplyr)   # but just getting them out of the way
library(tidyr)
 
# this will prompt for authentication the first time
my_sheets <- list_sheets()
 
# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)
 
## [1] "Copy of Flight Risk JSON" "Flight Risk JSON" 
 
# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")
 
# take a quick look
glimpse(flights_csv)
 
## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
 
# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal
 
# and we convert some columns while we're at it
flights_csv %>%
  mutate(date=as.Date(date),
         fat=as.numeric(fat),
         px=as.numeric(px)) -> flights_csv

A Bit of Cleanup

Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.

The area chart also needs full values for each category combo per year, so we use expand from tidyr with left_join and mutate to fill in the gaps.

Finally, we make proper, ordered labels:

flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
  mutate(phase=tolower(phase),
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>% 
  mutate(n=ifelse(is.na(n), 0, n)) %>% 
  filter(!phase %in% c("grounded", "unknown")) %>%
  mutate(phase=factor(phase, 
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat

I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece, just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgeable in cybersecurity.

The Final Plot

I’m a big fan of an incremental, additive build idiom for ggplot graphics. By using the gg <- gg + … style one can move lines around, comment them out, etc. without dealing with errant + signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build order as follows:

  • main ggplot call with mappings if the graph is short, otherwise add the mappings to the geoms
  • all geom_ or stat_ layers in the order I want them, using line breaks to logically separate elements (like aes) or to wrap long lines for easier readability
  • all scale_ elements in order from axes to line to shape to color to fill to alpha; I’m not as consistent as I’d like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
  • facet call (if any)
  • label setting, always with labs unless I really have a need for using ggtitle
  • base theme_ call
  • all other theme elements, one per gg <- gg + line

I know that’s not everyone’s cup of tea, but it’s just how I roll ggplot-style.

For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text font. Substitute your own fav font if you don’t have Futura Medium.
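One wrinkle: flights_palette isn’t actually defined anywhere in this excerpt. Here’s a stand-in so the code below runs; the five hex values are placeholders of my own choosing (one per “meta” reason category), not the original McCandless/Kashan colors:

# placeholder palette: five colors, one per reason category; swap in the
# actual McCandless/Kashan hex values if you have them
flights_palette <- c("#702023", "#A34296", "#B06F31", "#939598", "#3297B0")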

gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta)) 
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x)) 
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette, 
                             labels=c("Criminal", "Human Error",
                                      "Mechanical", "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg

That ultimately produces the final faceted chart, with the facets ordered by the Takeoff, En Route, Approach and Landing phases. Overall, things have gotten way better, though I haven’t had time to look into the bump between 2005 and 2010 for landing crashes.

As an aside, Boeing has a really nice PDF on some of this data with quite a bit more detail.


Data Visualization cheatsheet, plus Spanish translations

By Garrett.Grolemund

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.

Download the cheatsheet here.

Bonus – Frans van Dunné has translated several of our cheatsheets into Spanish.


Need for Processing Speed: data.table

By tobias

[Figure: runtime with data.frame]

(This article was first published on Open Analytics – Blog, and kindly contributed to R-bloggers)
Monday 30 March 2015 – 15:05

The first time I discovered data.table it felt like magic. I was waiting on a process that was projected to take the better part of an afternoon. In the meantime, I followed the data.table tutorial, rewrote my code using the data.table structure, and fully executed said code, all while the data.frame equivalent was wheezing along.
In the last year, data.table has gotten even faster.

data.table’s Automatic Indexing

For the uninitiated, data.table is a data structure that extends upon data.frame and streamlines the functionality for the age of “big data”. There are some syntactical differences, most of which are covered in this handy cheat sheet.
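As a quick taste of that syntax (a minimal sketch with made-up column names, not taken from the post), the DT[i, j, by] form expresses filtering, computing, and grouping in a single bracket call:

library(data.table)
DT <- data.table(id = rep(1:3, each = 4), val = rnorm(12))
# i = filter rows, j = compute, by = group: all in one call
DT[val > 0, .(mean_val = mean(val)), by = id]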

As of data.table package version 1.9.4 operations of the form DT[x == .] and DT[x %in% .] have been optimised to, whenever possible, build an index automatically on the first run, so that successive runs can make use of binary search for fast subsets instead of vector scans. Note that this query optimization is taken care of internally, and requires no user-end specification.

Let’s consider a toy example. Suppose we have a data.frame with a column x of 10 million values, consisting of integers randomly sampled (with replacement) between 1 and 10,000. Further suppose we save said object as both a data.frame and a data.table.

library(data.table)
DF = data.frame(x = sample(x = 1e4, size = 1e7, replace = TRUE), y = 1L)
DT = as.data.table(DF)

Next, we perform simple operations on each. For example: “return all rows where x takes the value 100L”. The use of L indicates we are looking exclusively for integers, which gives some small gains in computation time.

For the data.frame, each such subset is a full vector scan; the runtime figure at the top of this post shows how slow that gets.

For data.table, the first run will be fast, but much of its computation time is devoted to automatically building the index. Subsequent runs use binary search on that index and are therefore even faster.
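As a rough sketch of how to see this yourself (timings vary by machine, so treat any numbers as illustrative only):

# vector scan on the data.frame
system.time(DF[DF$x == 100L, ])
# first data.table run: builds the index as a side effect
system.time(DT[x == 100L])
# second run: reuses the index via binary search, near-instant
system.time(DT[x == 100L])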

[Figure: runtimes with data.table]

The first run is dramatically faster than the data.frame equivalent, and after indexing, the second run takes virtually no time at all! Note, however, that because the first data.table run includes the time to create the index and perform a binary search, there may be instances where that first call is slightly slower than the equivalent vector scan. Successive calls (after indexing) will be significantly faster for data.table, as evidenced above.

Additionally, further data.table query optimizations are in the works.

Installing the latest version of data.table

Version 1.9.4 of data.table is available on CRAN, but the latest and greatest is accessible via a GitHub install. We recommend the following installation steps.

setRepositories()
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)

The setRepositories() call will prompt you to select repositories, at which point you can select the following. For additional instructions, please see the installation guide on GitHub.

[Figure: setting R package repositories]

Enjoy!


R & Google Maps and R & Robotics (ROS)

By BNOSAC – Belgium Network of Open Source Analytical Consultants

(This article was first published on BNOSAC – Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

The next RBelgium meetup will be about R & Google Maps and R & Robotics (ROS).
BNOSAC will be hosting the event this time.

This is the schedule:
• 17h30-18h: open questions
• 18h-19h: R and Google Maps
• 19h-20h: R and Robotics (ROS)

For practical information: http://www.meetup.com/RBelgium/events/220918123


Sampling Distributions and Central Limit Theorem in R

By Nicole Radziwill


(This article was first published on Quality and Innovation » R, and kindly contributed to R-bloggers)

The Central Limit Theorem (CLT), and the concept of the sampling distribution, are critical for understanding why statistical inference works. There are at least a handful of problems that require you to invoke the Central Limit Theorem on every ASQ Certified Six Sigma Black Belt (CSSBB) exam. The CLT says that if you take many repeated samples from a population and calculate the average (or sum) of each one, the collection of those averages (or sums) will be approximately normally distributed… and it doesn’t matter what the shape of the source distribution is!

I wrote some R code to help illustrate this principle for my students. This code allows you to choose a sample size (n), a source distribution, and parameters for that source distribution, and generate a plot of the sampling distributions of the mean, sum, and variance. (Note: the sampling distribution for the variance is a Chi-square distribution!)

sdm.sim <- function(n, src.dist=NULL, param1=NULL, param2=NULL) {
   r <- 10000  # Number of replications/samples - DO NOT ADJUST
   # This produces a matrix of observations with
   # n columns and r rows. Each row is one sample:
   my.samples <- switch(src.dist,
      "E" = matrix(rexp(n*r, param1), r),
      "N" = matrix(rnorm(n*r, param1, param2), r),
      "U" = matrix(runif(n*r, param1, param2), r),
      "P" = matrix(rpois(n*r, param1), r),
      "C" = matrix(rcauchy(n*r, param1, param2), r),
      "B" = matrix(rbinom(n*r, param1, param2), r),
      "G" = matrix(rgamma(n*r, param1, param2), r),
      "X" = matrix(rchisq(n*r, param1), r),
      "T" = matrix(rt(n*r, param1), r))
   all.sample.sums <- apply(my.samples, 1, sum)
   all.sample.means <- apply(my.samples, 1, mean)
   all.sample.vars <- apply(my.samples, 1, var)
   par(mfrow=c(2,2))
   hist(my.samples[1,], col="gray", main="Distribution of One Sample")
   hist(all.sample.sums, col="gray", main="Sampling Distribution\nof the Sum")
   hist(all.sample.means, col="gray", main="Sampling Distribution\nof the Mean")
   hist(all.sample.vars, col="gray", main="Sampling Distribution\nof the Variance")
}

There are 9 population distributions to choose from: exponential (E), normal (N), uniform (U), Poisson (P), Cauchy (C), binomial (B), gamma (G), Chi-square (X), and the Student’s t distribution (T). Note also that you have to provide either one or two parameters, depending upon what distribution you are selecting. For example, a normal distribution requires that you specify the mean and standard deviation to describe where it’s centered, and how fat or thin it is (that’s two parameters). A Chi-square distribution requires that you specify the degrees of freedom (that’s only one parameter). You can find out exactly which distributions require which parameters here: http://en.wikibooks.org/wiki/R_Programming/Probability_Distributions.

Here is an example that draws from an exponential distribution with a mean of 1/1 = 1 (param1 is the rate; the mean is its reciprocal, so you specify the number you want in the denominator of the mean):

sdm.sim(50,src.dist="E",param1=1)

The code above produces a 2×2 panel of plots: the distribution of one sample, plus the sampling distributions of the sum, the mean, and the variance.

You aren’t allowed to change the number of replications in this simulation because of the nature of the sampling distribution: it’s a theoretical model that describes the distribution of statistics from an infinite number of samples. As a result, if you increase the number of replications, you’ll see the mean of the sampling distribution bounce around until it converges on the mean of the population. This is just an artifact of the simulation process: it’s not a characteristic of the sampling distribution, because to be a sampling distribution, you’ve got to have an infinite number of samples. Watkins et al. have a great description of this effect that all statistics instructors should be aware of. I chose 10,000 for the number of replications because 1) it’s close enough to infinity to ensure that the mean of the sampling distribution is the same as the mean of the population, but 2) it’s far enough away from infinity to not crash your computer, even if you only have 4GB or 8GB of memory.

Here are some more examples to try. You can see that as you increase your sample size (n), the shapes of the sampling distributions become more and more normal, and the variance decreases, constraining your estimates of the population parameters more and more.

sdm.sim(10,src.dist="E",1)
sdm.sim(50,src.dist="E",1)
sdm.sim(100,src.dist="E",1)
sdm.sim(10,src.dist="X",14)
sdm.sim(50,src.dist="X",14)
sdm.sim(100,src.dist="X",14)
sdm.sim(10,src.dist="N",param1=20,param2=3)
sdm.sim(50,src.dist="N",param1=20,param2=3)
sdm.sim(100,src.dist="N",param1=20,param2=3)
sdm.sim(10,src.dist="G",param1=5,param2=5)
sdm.sim(50,src.dist="G",param1=5,param2=5)
sdm.sim(100,src.dist="G",param1=5,param2=5)


Autoregressive Conditional Poisson Model – I

By statcompute


(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

Modeling a time series of count outcomes is of interest in operational risk when forecasting the frequency of losses. Below is an example showing how to estimate a simple ACP(1, 1) model, i.e. an Autoregressive Conditional Poisson model without covariates, with the acp package.
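For reference, ACP(1, 1) means the count y_t is conditionally Poisson with an intensity that follows the recursion λ_t = a + b·y_(t−1) + c·λ_(t−1); the manual prediction step at the end of the code below reproduces exactly this recursion from the fitted coefficients.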

library(acp)
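# NOTE: 'cnt' (a data frame holding the count series y) is assumed to be
# loaded already; the post does not show how it was created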

### acp(1, 1) without covariates ###
mdl <- acp(y ~ -1, data = cnt)
summary(mdl)
# acp.formula(formula = y ~ -1, data = cnt)
#
#   Estimate   StdErr t.value   p.value    
# a 0.632670 0.169027  3.7430 0.0002507 ***
# b 0.349642 0.067414  5.1865 6.213e-07 ***
# c 0.184509 0.134154  1.3753 0.1708881    

### generate predictions ###
f <- predict(mdl)
pred <- data.frame(yhat = f, cnt)
tail(pred, 5)
#          yhat y
# 164 1.5396921 1
# 165 1.2663993 0
# 166 0.8663321 1
# 167 1.1421586 3
# 168 1.8923355 6

### calculate predictions manually ###
pv167 <- mdl$coef[1] + mdl$coef[2] * pred$y[166] + mdl$coef[3] * pred$yhat[166] 
# [1] 1.142159

pv168 <- mdl$coef[1] + mdl$coef[2] * pred$y[167] + mdl$coef[3] * pred$yhat[167] 
# [1] 1.892336

plot.ts(pred, main = "Predictions")


intuition beyond a Beta property

By xi’an


(This article was first published on Xi’an’s Og » R, and kindly contributed to R-bloggers)

A self-study question on X validated exposed an interesting property of the Beta distribution:

If X is B(n, m) and Y is B(n+½, m), independently, then √(XY) is B(2n, 2m)

While this can presumably be established by a mere change of variables, I could not carry the derivation till the end and used instead the moments E[(XY)^(s/2)], since they naturally lead to ratios of B(a,b) functions and to nice cancellations thanks to the ½ in some Gamma functions [and this was the solution proposed on X validated]. However, I wonder at a more fundamental derivation of the property that would stem from a statistical reasoning… Trying with the ratio of Gamma random variables did not work. And the connection with order statistics does not apply because of the ½. Any idea?
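(A quick simulation check of the property, a sketch of my own with arbitrary parameter values:)

set.seed(101)
n <- 3; m <- 5                # arbitrary shape parameters
x <- rbeta(1e6, n, m)         # X ~ B(n, m)
y <- rbeta(1e6, n + 0.5, m)   # Y ~ B(n + 1/2, m), independent of X
# Q-Q plot of sqrt(XY) against draws from the claimed B(2n, 2m)
qqplot(sqrt(x * y), rbeta(1e6, 2 * n, 2 * m),
       xlab = "sqrt(xy)", ylab = "B(2n, 2m) draws")
abline(0, 1, col = "red")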

Filed under: Books, Kids, R, Statistics, University life Tagged: beta distribution, cross validated, moment generating function, Stack Exchange


Iteratively Populating Templated Sentences With Inline R in knitr/Rmd

By Tony Hirst


(This article was first published on OUseful.Info, the blog… » Rstats, and kindly contributed to R-bloggers)

As part of the Wrangling F1 Data With R project, I want to be able to generate sentences iteratively from a templated base.

A recipe based on keeping the templated sentences in an external file works for this.

What I’d really like to be able to do is put the Rmd template into a chunk something like this…:

```{rmd stintPara, eval=FALSE, echo=FALSE}
`r name` completed `r sum(abs(stints[stints['name']==name,]['l']))` laps over `r nrow(stints[stints['name']==name,])` stints, with a longest run of `r max(abs(stints[stints['name']==name,]['l']))` laps.
```

and then do something like:

```{r results='asis'}
stints['name']=factor(stints$name)
for (name in levels(stints$name)){
  cat(paste0(knit_child('stintPara', quiet=TRUE), '\n\n'))
}
```

Is anything like that possible?

PS via the knitr Google Group, h/t Jeff Newmiller for pointing me to the knit_child() text argument…
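For the record, here’s a hedged sketch of how that text argument might be used (untested here; it assumes knitr is loaded and that it runs from a chunk with results='asis'):

template <- paste0(
  "`r name` completed `r sum(abs(stints[stints['name']==name,]['l']))` laps ",
  "over `r nrow(stints[stints['name']==name,])` stints.")
for (name in levels(factor(stints$name))) {
  cat(knit_child(text = template, quiet = TRUE), '\n\n')
}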


Segmenting F1 Qualifying Session Laptimes

By Tony Hirst


(This article was first published on OUseful.Info, the blog… » Rstats, and kindly contributed to R-bloggers)

I’ve started scraping some FIA timing sheets again, including practice and qualifying session laptimes. One of the things I’d like to do is explore various ways of looking at the qualifying session laptimes, which means identifying which qualifying session each laptime falls into.

For looking at session utilisation charts I’ve been making use of accumulated time into session to help display the data, as a session utilisation chart (including green and purple laptimes) shows.

The horizontal x-axis is time into session from a basetime of the first time-of-day timestamp recorded on the timing sheets for the session.
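(As a hedged sketch of that basetime calculation; the tod column name is my own stand-in for whatever the scraped time-of-day field is actually called:)

# hypothetical: 'tod' holds the time-of-day timestamps from the timing sheets
f12015test$cuml <- as.numeric(f12015test$tod - min(f12015test$tod))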

If we look at the distribution of qualifying session laptimes for the 2015 Malaysian Grand Prix, we get something like this:

[Figure: simpleSessionTimes]

We can see a big rain delay gap, and also a tighter gap between the first and second sessions.

If we try to run a k-means clustering algorithm on the data, using 3 means for the three sessions, we see that in this case it isn’t managing to cluster the laptimes into actual sessions:

# Attempt to identify qualifying session using k-means cluster analysis with 3 means
clusters <- kmeans(f12015test['cuml'], 3)

f12015test = data.frame(f12015test, clusters$cluster)

ggplot(f12015test) +
  geom_text(aes(x=cuml, y=stime, label=code,
                colour=factor(clusters.cluster)), angle=45, size=3)

[Figure: qsession-kmeans]

In particular, some of the Q1 laptimes are being grouped with the Q2 laptimes.

However, we know that there is at least a two minute gap between sessions. So if we assume that the only two minute gaps between recorded laptimes across the whole of qualifying occur in the periods between the sessions, we can generate a flag on those gaps, and then generate session number counts by counting on those flags.

# Look for a two minute gap
f12015test = arrange(f12015test, cuml)
f12015test['gap'] = c(0, diff(f12015test[,'cuml']))
f12015test['gapflag'] = (f12015test['gap'] >= 120)
f12015test['qsession'] = 1 + cumsum(f12015test[,'gapflag'])

ggplot(f12015test) +
  geom_text(aes(x=cuml, y=stime, label=code), angle=45, size=3) +
  facet_wrap(~qsession, scale="free")

[Figure: qsession_facets]

(To tighten this up, we might also try to factor in the number of cars in the pits at any particular point in time…)

This chart clearly shows how the first qualifying session saw cars trialling evenly throughout the session, whereas in Q2 and Q3 they were far more bunched up (which perhaps creates more opportunities for cars to get in each other’s way on flying laps…)

One of the issues with this chart is that we don’t get to zoom in to actual flying laps. If all the flying lap times were about the same time, we could simply generate y-axis limits based on purple laptimes:

minl=min(f12015test$purple)*0.95
maxl=min(f12015test$purple)*1.3

#Use these values in ylim()...

However, where the laptimes differ significantly across sessions as they do in this case due to a dramatic change in weather conditions, we probably need to filter the data for each session separately.
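For example, something along these lines (a sketch; qsession comes from the gap-flag step above, and stime is the laptime column used earlier):

# zoom in on one session's flying laps by filtering before plotting
q2 <- filter(f12015test, qsession == 2)
ggplot(q2) +
  geom_text(aes(x=cuml, y=stime, label=code), angle=45, size=3) +
  ylim(min(q2$stime, na.rm=TRUE)*0.95, min(q2$stime, na.rm=TRUE)*1.3)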

Another crib we might use is to identify PIT lap and out-laps (laps immediately following a PIT event) and filter those out of the laptime traces.
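(A hedged sketch of that filter; the pit column is my assumption about what the scraped data might hold, and dplyr is assumed to be loaded:)

# hypothetical 'pit' logical column marking PIT laps
f12015test %>%
  group_by(code) %>%
  mutate(outlap = lag(pit, default = FALSE)) %>%  # lap immediately after a PIT event
  ungroup() %>%
  filter(!pit, !outlap) -> flying_laps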

Versions of these recipes will soon be added to the Wrangling F1 Data With R book. Once you buy the book, you get all future updates to it at no additional cost, even if the minimum book price increases over time.


Space Launch Sites over Time

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

Continuing from a previous post.

Code

library(dplyr)
library(ggplot2)
r1 <- readLines('launchlog.txt')
# NOTE: the first expression of this pipe was lost to HTML mangling in the
# source; it computed the column start positions from the header line r1[1].
# 'column_starts' below stands in for that lost expression.
colwidth <- column_starts %>%
  c(., max(nchar(r1))) %>%
  diff(.)

cols <- read.fwf(textConnection(r1[1]),
    widths=colwidth,
    header=FALSE,
    comment.char='@') %>%
  unlist(.) %>%
  make.names(.) %>%
  gsub('\\.+$', '', .)  # strip trailing dots; the backslashes were lost in the source

r2 <- read.fwf('launchlog.txt',
widths=colwidth,
col.names=cols)
r3 <- filter(r2,!is.na(Suc))
Sys.setlocale(category = "LC_TIME", locale = "C")
r3$Launch.Date..UTC[1:3]
r3$Date <- as.Date(r3$Launch.Date..UTC,format='%Y %b %d')

xtabs(~ Site , r3) %>%
as.data.frame(.) %>%
arrange(.,-Freq) %>%
filter(.,Freq>100)
#LC=launch complex
ll <- levels(r3$Site)
r3$loc1 <- factor(gsub('( |,)[[:print:]]+$','',ll)[r3$Site])
xtabs(~ loc1 , r3) %>%
as.data.frame(.) %>%
arrange(.,-Freq) %>%
filter(.,Freq>100)
#http://planet4589.org/space/log/sites.txt
#NIIP-5=GIK-5=Baykonur
#NIIP-53=GNIIP=GIK-1=Plesetsk
#CC=Cape Canaveral=KSC=John F. Kennedy Space Center, Florida
#V=Vandenberg
#CSG=Centre Spatial Guyanais, Kourou, Guyane Francaise
#JQ=Jiuquan Space Center, Nei Monggol Zizhiqu, China
#XSC=Xichang Space Center, Sichuan, China
#GTsP-4, Kapustin Yar, Volgograd, Rossiya
#L1011 = modified Lockheed L-1011 TriStar aircraft
#Odyssey = Sea launch from platform

locbycountry <- read.csv(text='
CC , USA
CSG , France
GIK-1 , Russia
GIK-5 , Russia
GNIIPV ,Russia
GTsP-4 ,Russia
JQ ,China
KASC ,Japan
KSC ,USA
L-1011 ,Air
MARS ,USA
NIIP-5 ,Russia
NIIP-53 ,Russia
ODYSSEY ,Sea platform
PA ,USA
SHAR ,India
TNSC ,Japan
TYSC ,China
V ,USA
WI ,USA
XSC , China',
skip=1, header=FALSE,
col.names=c('loc1','Geography'),
stringsAsFactors=FALSE,
strip.white=TRUE) %>%
arrange(., Geography)

# the head of this pipe was also mangled in the source; reconstructed here by
# analogy with the xtabs() summaries above
r4 <- xtabs(~ loc1, r3) %>%
  as.data.frame(.) %>%
  # arrange(., -Freq) %>%
  filter(., Freq > 10) %>%
  select(., loc1) %>%
  merge(., r3) %>%
  mutate(., loc1 = as.character(loc1)) %>%
  merge(., locbycountry) %>%
  mutate(., loc1 = factor(loc1, levels = rev(locbycountry$loc1)))

ggplot(r4, aes(loc1, Date, col=Geography)) +
  geom_point() +
  coord_flip() +
  geom_jitter(position = position_jitter(width = .25)) +
  xlab('Geography') +
  theme(legend.position="bottom") +
  guides(col=guide_legend(nrow=2))
