More Airline Crashes via the Hadleyverse

By hrbrmstr

(This article was first published on » R, and kindly contributed to R-bloggers)

I saw a fly-by #rstats mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a GitHub repo by @philjette. It seems there’s a web site (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:

Wrote some R code for looking at historical crash data
#data #r #DataMining

— PhilJ (@philjette) March 26, 2015

The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette‘s example in modern Hadleyverse style (i.e. dplyr, rvest, etc). I even submitted a pull request to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny ggplot2 graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.

Just to get the setup bits out of the way, here’s all the packages I’ll be using:


Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.

The basic setup is to:

  • grab the HTML for the page of a given year
  • extract and format the crash dates
  • extract location & operator information, which is made slightly annoying since the site uses a
    and includes spurious newlines within a single


  • extract aircraft type and registration (same issues as previous column)
  • extract accident details, which are embedded in a highly formatted column that requires str_match_all to handle (well)

Some things worth mentioning:

  • data_frame is super-helpful in not-creating factors from the character vectors
  • bind_rows and bind_cols are a nice alternative to using data.table functions
  • I think stringr needs a more pipe-friendly replacement for gsub and, perhaps, even ifesle (yes, I guess I could submit a PR). The . just feels wrong in pipes to me, still
  • if you’re not using pbapply functions (free progress bars for everyone!) you should be, especially for long scraping operations
  • sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary

Here’s the new get_data function:

#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  # get crash date
  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    as.Date(, format="%d %b %Y") %>%
    data_frame(date=.) -> date
  # get location and operator
  loc_op <- bind_rows(lapply(1:length(date), function(i) {
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$)", "", .) %>%
      gsub("^(Near|Off) ", "", .) -> loc
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$|n)", "", .) -> op
    data_frame(location=loc, operator=op)
  # get type & registration
  type_reg <- bind_rows(lapply(1:length(date), function(i) {
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$|n)", "", .) %>%
      ifelse(.=="?", NA, .) -> typ
    pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      gsub("(^[[:space:]]*|[[:space:]]*$|n)", "", .) %>%
      ifelse(.=="?", NA, .) -> reg
    data_frame(type=typ, registration=reg)
  # get fatalities
  pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)(([[:digit:]]+))") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]), fatalties=as.numeric(x[3]), ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>% tail(-1) -> afg
  bind_cols(date, loc_op, type_reg, afg)

While that gets one year, it’s super-simple to get all crashes since 1950:

crashes <- bind_rows(pblapply(1950:2015, get_data))

Yep. That’s it. Now crashes contains a data.frame (well, tbl_df) of all the crashes since 1950, ready for further analysis.

For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.

If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.

To leave a comment for the author, please follow the link and comment on his blog: » R. offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News