Hurricane Irma’s rains, visualized with R

By David Smith


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The USGS has followed up their visualization of Hurricane Harvey rainfalls with an updated version of the animation, this time showing the rain and flooding from Hurricane Irma in Florida:

Another #rstats #dataviz! Precip and #flooding from #HurricaneIrma #opensource code: https://t.co/rpocPQe7zR #openscience pic.twitter.com/rGX1SNiYEM

— USGS R Community (@USGS_R) September 15, 2017

This iteration improves the Harvey version by displaying rainfall in a fine grid over the state rather than on a county-by-county basis. Once again you can find the R code and data on Github, and an interactive version of the chart is available at the link below.

USGS Vizlab: Hurricane Irma’s Water Footprint

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Time Series Analysis in R Part 1: The Time Series Object

By Troy Walters

Many of the methods used in time series analysis and forecasting have been around for quite some time but have taken a back seat to machine learning techniques in recent years. Nevertheless, time series analysis and forecasting are useful tools in any data scientist’s toolkit. Several time series-based competitions have recently appeared on Kaggle, such as one hosted by Wikipedia where competitors are asked to forecast web traffic to various pages of the site. As an economist, I have been working with time series data for many years; however, I was largely unfamiliar with (and a bit overwhelmed by) R’s functions and packages for working with them. From the base ts object to a whole host of other packages like xts, zoo, TTR, forecast, quantmod and tidyquant, R has a large infrastructure supporting time series analysis. I decided to put together a guide for myself in Rmarkdown. I plan on sharing this as I go in a series of blog posts. In part 1, I’ll discuss the fundamental object in R – the ts object.

The Time Series Object

In order to begin working with time series data and forecasting in R, you must first acquaint yourself with R’s ts object. The ts object is a part of base R. Other packages such as xts and zoo provide other APIs for manipulating time series objects. I’ll cover those in a later part of this guide.

Here we create a vector of simulated data that could potentially represent some real-world time-based data generation process. It is simply a sequence from 1 to 100 scaled up by 10 to avoid negatives and with some random normal noise added to it. We can use R’s base plot() function to see what it looks like:

set.seed(123)
# (this line was truncated in the excerpt; reconstructed from the description
#  above -- a 1:100 sequence shifted up by 10 with Gaussian noise; the sd is a guess)
t <- seq(from = 1, to = 100, by = 1) + 10 + rnorm(100, mean = 0, sd = 10)
plot(t)

Gives this plot:

This could potentially represent some time series, with time represented along the x-axis. However, it's hard to tell. The x-axis is simply an index from 1 to 100 in this case.

A vector object such as t above can easily be converted to a time series object using the ts() function. The ts() function takes several arguments, the first of which, data, is the data itself. We can look at all of the arguments of ts() using the args() function:

args(ts)
function (data = NA, start = 1, end = numeric(), frequency = 1, 
    deltat = 1, ts.eps = getOption("ts.eps"), class = if (nseries > 
        1) c("mts", "ts", "matrix") else "ts", names = if (!is.null(dimnames(data))) colnames(data) else paste("Series", 
        seq(nseries))) 
NULL

To begin, we will focus on the first four arguments – data, start, end and frequency. The data argument is the data itself (a vector or matrix). The start and end arguments allow us to provide a start date and end date for the series. Finally the frequency argument lets us specify the number of observations per unit of time. For example, if we had monthly data, we would use 12 for the frequency argument, indicating that there are 12 months in the year.
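For example (a quick illustration that isn't part of the original exercise), a monthly series beginning in January 2016 could be declared like this:

# hypothetical monthly data: 24 observations starting in January 2016
m <- ts(rnorm(24), start = c(2016, 1), frequency = 12)
m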

Let’s assume our generated data is quarterly data that starts in the first quarter of 2000. We would turn it into a ts object as below. We specify the start argument as a two element vector. The first element is the year and the second element is the observation of that year in which the data start. Because our data is quarterly, we use 4 for the frequency argument.

tseries <- ts(t, start = c(2000, 1), frequency = 4)
tseries
            Qtr1       Qtr2       Qtr3       Qtr4
2000   7.076670  10.388758  23.910958  14.493559
2001  15.905014  28.005455  20.226413   9.144571
2002  14.192030  16.880366  29.568573  24.518697
2003  25.805400  24.774779  21.109112  38.508392
2004  30.484953  14.233680  33.909491  26.690460
2005  23.525234  30.474176  25.817969  28.897761
2006  30.624725  24.193147  42.864509  39.073612
2007  31.033041  48.776704  43.985250  39.934500
2008  49.265880  50.146934  50.751068  50.820482
2009  50.877424  47.566618  46.858261  47.336703
2010  46.137051  50.544579  44.142226  69.182692
2011  63.455734  48.138240  54.179806  54.733413
2012  64.459756  59.416417  62.773230  61.800173
2013  62.699907  73.580216  63.419603  76.615294
2014  56.158730  72.092296  69.866980  71.511591
2015  73.657476  68.483736  70.667548  66.869972
2016  67.497461  78.124700  80.137468  78.371030
2017  85.455872  94.350593  77.562782  65.835818
2018  90.040170  79.035595  80.183940  93.179000
2019  85.006589  79.454976  90.269124  89.027760
2020  91.040349  94.696963  90.405380  98.510636
2021  93.456594  98.322474 104.677873 101.046270
2022  96.718479 108.041653 107.954527 105.838779
2023 104.671122  99.604657 114.524567 101.798183
2024 122.311331 118.728274 107.350097 102.815054
plot(tseries)

Gives this plot:

Notice that now when we plot the data, R recognizes that it is a ts object and plots the data as a line with dates along the x-axis.

Aside from creating ts objects containing a single series of data, we can also create ts objects that contain multiple series. We can do this by passing a matrix rather than a vector to the data argument of ts().

set.seed(123)
# (the original code was truncated in this excerpt; a reconstruction consistent
#  with the text -- three quarterly series in one matrix, with ts3 on a much
#  larger scale than the other two)
seqs <- cbind(ts1 = 1:100 + 10 + rnorm(100, 0, 10),
              ts2 = 1:100 + 20 + rnorm(100, 0, 10),
              ts3 = (1:100 + 10 + rnorm(100, 0, 10)) * 100)
tsm <- ts(seqs, start = c(2000, 1), frequency = 4)
plot(tsm)

Gives this plot:

Now when we plot the ts object, R automatically facets the plot.

At this point, I should mention what really happens when we call the plot() function on a ts object. R recognizes when the x argument is a ts object and actually calls the plot.ts() function under the hood. We can verify this by using it directly. Notice that it produces an identical graph.

plot.ts(tsm)

Gives this plot:

The plot.ts() function has different arguments geared towards time series objects. We can look at these again using the args() function.

args(plot.ts)
function (x, y = NULL, plot.type = c("multiple", "single"), xy.labels, 
    xy.lines, panel = lines, nc, yax.flip = FALSE, mar.multi = c(0, 
        5.1, 0, if (yax.flip) 5.1 else 2.1), oma.multi = c(6, 
        0, 5, 0), axes = TRUE, ...) 
NULL

Notice that it has an argument called plot.type that lets us indicate whether we want our plot to be faceted (multiple) or single-panel (single). Although in the case of our data above we would not want to plot all three series on the same panel given the difference in scale for ts3, it can be done quite easily.
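For instance, using the tsm object created above (a minimal sketch; the colors are just to keep the series distinguishable):

# all three series on a single panel; ts3 will dominate the y-axis scale
plot(tsm, plot.type = "single", col = 1:3)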

I am not going to go in-depth into using R’s base plotting capability. Although it is perfectly fine, I strongly prefer to use ggplot2 as well as the ggplot-based graphing functions available in Rob Hyndman’s forecast package. We will discuss these in later parts of this guide.

Convenience Functions for Time Series

There are several useful functions for use with ts objects that can make programming easier. These are window(), start(), end(), and frequency(). These are fairly self-explanatory. The window function is a quick and easy way to obtain a slice of a time series object. For example, look again at our object tseries. Assume that we wanted only the data from the first quarter of 2000 to the last quarter of 2012. We can do so using window():

tseries_sub <- window(tseries, start = c(2000, 1), end = c(2012, 4))
tseries_sub
          Qtr1      Qtr2      Qtr3      Qtr4
2000  7.076670 10.388758 23.910958 14.493559
2001 15.905014 28.005455 20.226413  9.144571
2002 14.192030 16.880366 29.568573 24.518697
2003 25.805400 24.774779 21.109112 38.508392
2004 30.484953 14.233680 33.909491 26.690460
2005 23.525234 30.474176 25.817969 28.897761
2006 30.624725 24.193147 42.864509 39.073612
2007 31.033041 48.776704 43.985250 39.934500
2008 49.265880 50.146934 50.751068 50.820482
2009 50.877424 47.566618 46.858261 47.336703
2010 46.137051 50.544579 44.142226 69.182692
2011 63.455734 48.138240 54.179806 54.733413
2012 64.459756 59.416417 62.773230 61.800173

The start() function returns the start date of a ts object, end() gives the end date, and frequency() returns the frequency of a given time series:

start(tsm)
end(tsm)
frequency(tsm)
## [1] 2000    1
## [1] 2024    4
## [1] 4

That’s all for now. In Part 2, we’ll dive into some of the many transformation functions for working with time series in R. See you then.

    Related Post

    1. Parsing Text for Emotion Terms: Analysis & Visualization Using R
    2. Using MongoDB with R
    3. Finding Optimal Number of Clusters
    4. Analyzing the first Presidential Debate
    5. GoodReads: Machine Learning (Part 3)

    Source:: R News

    Pirating Web Content Responsibly With R

    By hrbrmstr

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

    There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

    We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters.

    Scouring The Seas Web For Pirate Data

    Interestingly enough, there are many sources of pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

    (site png snapshot taken with splashr)

    I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

    library(V8)
    library(stringi)
    library(httr)
    library(rvest)
    library(robotstxt)
    library(jwatr) # github/hrbrmstr/jwatr
    library(hrbrthemes)
    library(purrrlyr)
    library(rprojroot)
    library(tidyverse)
    
    # (the construction of report_urls was truncated in this excerpt; it was a
    #  data frame of base URLs plus start/end record values, expanded into a
    #  url_list column of per-report detail URLs)
    report_urls %>%
      pull(url_list) %>%
      flatten_chr() -> target_urls
    
    head(target_urls)
    ## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
    ## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
    ## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
    ## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
    ## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
    ## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

    Time to pillage some details!

    But…Can We Really Do It?

    I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

    robotstxt::get_robotstxt("https://www.icc-ccs.org/")
    ## # If the Joomla site is installed within a folder such as at
    ## # e.g. www.example.com/joomla/ the robots.txt file MUST be
    ## # moved to the site root at e.g. www.example.com/robots.txt
    ## # AND the joomla folder name MUST be prefixed to the disallowed
    ## # path, e.g. the Disallow rule for the /administrator/ folder
    ## # MUST be changed to read Disallow: /joomla/administrator/
    ## #
    ## # For more information about the robots.txt standard, see:
    ## # http://www.robotstxt.org/orig.html
    ## #
    ## # For syntax checking, see:
    ## # http://www.sxw.org.uk/computing/robots/check.html
    ##
    ## User-agent: *
    ## Disallow: /administrator/
    ## Disallow: /cache/
    ## Disallow: /cli/
    ## Disallow: /components/
    ## Disallow: /images/
    ## Disallow: /includes/
    ## Disallow: /installation/
    ## Disallow: /language/
    ## Disallow: /libraries/
    ## Disallow: /logs/
    ## Disallow: /media/
    ## Disallow: /modules/
    ## Disallow: /plugins/
    ## Disallow: /templates/
    ## Disallow: /tmp/

    Ahoy! We’ve got a license to pillage!

    But, we don’t have a license to abuse their site.

    While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still pirate crawl responsibly and use a 5 second delay between requests:

    s_GET <- safely(GET)
    
    # (the crawl loop itself was truncated in this excerpt; roughly: walk
    #  target_urls with a 5-second pause between requests, collecting the
    #  safely()-wrapped responses)
    map(target_urls, ~{ Sys.sleep(5); s_GET(.x) }) -> httr_raw_responses
    
    write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")
    
    good_responses <- keep(httr_raw_responses, ~ !is.null(.x$result))

    There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.
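    As a small illustration of what safely() gives you (toy URL, not part of the crawl), a failed request comes back as data rather than an error:

    library(purrr)
    library(httr)
    
    s_GET <- safely(GET)
    bad <- s_GET("http://example.invalid/")  # hypothetical unreachable host
    str(bad, max.level = 1)                  # $result is NULL, $error holds the condition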

    I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here, first using raw httr response objects (including any “broken” ones) and then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.

    Digging For Treasure

    Did I mention that while the site looks like it’s easy to scrape, it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary crawlers in a pit of despair. The UX is built dynamically from on-page javascript content, a portion of which is below:

    Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

    Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:

    # make field names great again
    # (the body of mfga() was lost in extraction; a sketch of such a helper is
    #  given further below)
    
    # column-type spec applied by type_convert() at the end of the pipeline
    # (the original spec was also lost; a guessing placeholder stands in for it)
    pirate_cols <- cols(.default = col_guess())
    
    # iterate over the good responses with a progress bar
    pb <- progress_estimated(length(good_responses))
    
    map_df(good_responses, ~{
    
      pb$tick()$print()
    
      doc <- content(.x$result)
      ctx <- v8()  # JavaScript engine for this page (assumed; ctx$eval() is used below)
    
      # find the <script> tag that has our data, carve out the target lines, do some
      # data massaging and evaluate the javascript with V8
      html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
        html_text() %>%
        stri_split_lines() %>%
        .[[1]] %>%
        grep("narrations_ro", ., value=TRUE) %>%
        sprintf("var dat = %s;", .) %>%
        ctx$eval()
    
      # pull the evaluated object back out of the JavaScript context
      # (this line was truncated in the excerpt; roughly, retrieve the evaluated
      #  "dat" object as a list of field/value entries)
      p <- ctx$get("dat")
    
      p %>%
        map_df(~{
          list(
            field = mfga(.x[[3]]$label),
            value = .x[[3]]$value
          )
        }) %>%
        filter(value != "") %>%
        distinct(field, .keep_all = TRUE) %>%
        spread(field, value)
    
    }) %>%
      type_convert(col_types = pirate_cols) %>%
      filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
      filter(lubridate::year(date) > 2012) %>%
      mutate(
        attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
        attack_posn_map = stri_replace_all_regex(attack_posn_map, "[() ]", "")
      ) %>%
      separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
      mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df
    
    write_rds(pirate_df, "data/pirate_df.rds")

    The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.
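    That function’s body didn’t survive this excerpt; here’s a sketch of what an mfga()-style cleaner might look like (the exact rules are my assumption, not the original):

    # hypothetical reconstruction: lower-case the labels, squash non-alphanumerics
    # to underscores, trim stray underscores, then de-duplicate repeated names
    mfga <- function(x) {
      x <- tolower(x)
      x <- gsub("[^[:alnum:]]+", "_", x)
      x <- gsub("^_+|_+$", "", x)
      make.unique(x, sep = "_")
    }
    
    mfga(c("Attack Number", "Date:", "Date:"))
    # "attack_number" "date" "date_1"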

    Next, I set up a cols object because we’re going to be extracting data from text as text, and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code for small munging tasks. You’ll note at the end of the munging pipeline I still need to do some manual conversions.
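    Here’s the general shape of that pattern with a toy example (the spec is illustrative, not the original pirate_cols; the second field name is invented):

    library(readr)
    library(tibble)
    
    # scraped values arrive as character...
    raw <- tibble(date       = c("2017-09-01", "2017-09-03"),
                  narrations = c("Boarded.", "Approached."))  # hypothetical field
    
    # ...and get converted in one declarative pass at the end
    type_convert(raw, col_types = cols(date = col_date(format = ""),
                                       .default = col_character()))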

    Now we can iterate over the good (complete) responses.

    The purrr::safely function shoves the real httr response in result so we focus on that then “surgically” extract the target data from the javascript engine and then retrieve the data from said evaluation.

    Because ICC used the same Joomla plugin over the years, the data is uniform, but also can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.
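    A toy version of that dedupe-then-spread step (the field names are invented for the illustration):

    library(dplyr)
    library(tidyr)
    library(tibble)
    
    fields <- tribble(
      ~field,   ~value,
      "date",   "2017-09-01",
      "date",   "",             # a second, empty Date: field from the page
      "vessel", "MV Example"
    )
    
    fields %>%
      filter(value != "") %>%
      distinct(field, .keep_all = TRUE) %>%
      spread(field, value)
    # one row, with columns `date` and `vessel`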

    That whole process goes really quickly, but why not save off the clean data at the end for good measure?

    Gotta Have A Pirate Map

    Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:

    mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
      count(year_mon) %>%
      ggplot(aes(year_mon, n)) +
      geom_segment(aes(xend=year_mon, yend=0)) +
      scale_y_comma() +
      labs(x=NULL, y=NULL,
           title="(Confirmed) Piracy Incidents per Month",
           caption="Source: International Chamber of Commerce Commercial Crime Services ") +
      theme_ipsum_rc(grid="Y")

    And, finally, a map showing pirate encounters but colored by year:

    world <- map_data("world")
    
    # (the head of this pipeline was truncated in the excerpt; roughly, add a
    #  year column to pirate_df before arranging)
    mutate(pirate_df, year = lubridate::year(date)) %>%
      arrange(year) %>%
      mutate(year = factor(year)) -> plot_df
    
    ggplot() +
      geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
      geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
      ggalt::coord_proj("+proj=wintri") +
      viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
      labs(x=NULL, y=NULL,
           title="Piracy Incidents per Month (Confirmed)",
           caption="Source: International Chamber of Commerce Commercial Crime Services ") +
      theme_ipsum_rc(grid="XY") +
      theme(legend.position = "bottom")

    Taking Up The Mantle of the Dread Pirate Hrbrmstr

    Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

    There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on GitHub, and don’t hesitate to ask questions in the comments or file an issue. If you make something, blog it! Share your ideas and creations with the rest of the R (or other language) communities!

    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Boston EARL speaker announcement: Agenda now available

    By Mango Solutions

    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    Mango Solutions are delighted to announce the full EARL Boston agenda for this coming November!

    As always, we received more abstracts than we could accept and all were of excellent quality. Because of this, the decision-making process was not easy; however, we believe we’ve put together a great agenda that showcases how a range of industries are solving business problems with R.

    We’re confident that R users of all levels and from all industries will find the EARL Boston Conference useful.

    On to the speakers!

    Joining David, Mara and Tareef are speakers from companies such as DataCamp, IBM, Microsoft, Capital One, Goldman Sachs International, Novartis and GfK.

    Thursday 2 November

    • Dr Shatrunjai Singh, John Hancock Insurance
    • Charlie Thompson, VideoBlocks
    • Chris Campbell, Mango Solutions
    • Sudha Subramanian, Sparkfish
    • Joseph Ciesielski, Philadelphia Youth Network
    • Daniel Hadley, Sorenson Impact Center
    • Emily Riederer, Capital One Financial Corporation
    • Bob Engle, Biogen
    • Matteo Crimella, Goldman Sachs International
    • Dr Xiao Ni, Novartis
    • Patrick Turgeon, rally
    • Bo Wang, Pfizer Inc.
    • Dr Adam Jenkins, Biogen
    • Richard Cotton, DataCamp
    • Lou Bajuk-Yorgan, TIBCO Software
    • Jeff Allen, RStudio
    • Jonathan Keane, Crunch
    • Mark Sellors, Mango Solutions

    Friday 3 November

    • Nic Crane, Mango Solutions
    • Marlene Marchena
    • Shweta Gopaulakrishnan, Brigham and Women’s Hospital, Channing Division of Network Medicine
    • Tanya Cashorali, TCB Analytics
    • Monika Wahi, DethWench
    • Richard Pugh, Mango Solutions
    • Michael Conklin, GfK
    • Alex Albright, Harvard University
    • Jaya Matthew, Emmanuel Awa and Robert Alexander, Microsoft
    • Omayma Said, Wuzzuf
    • Sharon Machlis, International Data Group
    • David Smith, Microsoft
    • Alok Singh, IBM
    • Aidan Boland, Clavis Insight
    • Ali Zaidi, Microsoft
    • Matt Dancho, Business Science
    • Mark Hornick, Oracle
    • Francesca Lazzeri, Microsoft

    To download the agenda, click here.

    To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    How to make a line chart with ggplot2

    By Sharp Sight

    Last week’s blog post about Amazon’s search for a location for a second headquarters left me thinking about the company’s growth.

    After looking at the long term growth of the stock price, it occurred to me that visualizing the stock price data would be a great example of how to create a line chart in R using ggplot2.

    So in this blog post, I’ll show you how to make a line chart with ggplot2, step by step.

    Let’s jump in.

    First, we’ll load several packages: we’ll load readr, which we’ll use to read in the data; tidyverse, which includes ggplot2, dplyr, and several other important packages; and stringr, which will let us do some string manipulation.

    #===============
    # LOAD PACKAGES
    #===============
    
    library(readr)
    library(tidyverse)
    library(stringr)
    
    

    Now that we’ve loaded the packages that we need, we’ll read in the data.

    The data are contained in a .csv file that I’ve uploaded to the Sharp Sight webpage.

    We’ll use readr::read_csv() to read in the file. This is an extremely straightforward use of read_csv().

    #==========
    # READ DATA
    #==========
    
    stock_amzn <- read_csv("...")  # the full URL of the .csv hosted on the Sharp Sight site was elided in this excerpt
    

    Now we'll quickly inspect the data by looking at the column names and printing out the first few rows of data.

    #========
    # INSPECT
    #========
    
    stock_amzn %>% names()
    stock_amzn %>% head()
    
    

    Upon inspection, you can see that the column names are capitalized. This is a minor problem, but ideally you want your variable names to be lower case; this makes them easier to type.

    To convert the variable names to all lower case, we’ll use the str_to_lower() function from the stringr package.

    #=========================================================
    # CHANGE COLUMN NAMES: lower case
    # - in the raw form (as read in) the first letter of 
    #   each variable is capitalized. 
    # - This makes them harder to type!  Not ideal.
    # - we'll use stringr::str_to_lower() to change the column
    #   names to lower case 
    #=========================================================
    
    colnames(stock_amzn) <- colnames(stock_amzn) %>% str_to_lower()
    
    # inspect
    stock_amzn %>% names()
    
    

    Here, on the right-hand-side of the assignment operator, we’re using colnames(stock_amzn) to retrieve the column names. Then we pipe the column names into str_to_lower() which converts the names to lower case.

    The resulting output is then re-assigned to the column names of the dataframe. We do this by using colnames(stock_amzn) <- on the left-hand side. Essentially, we’re taking the result from the right-hand side and assigning that result to the column names using colnames(). To be clear, colnames() can both retrieve the column names and set the column names.
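    Here’s a tiny, self-contained illustration of colnames() as both getter and setter (a toy data frame, not the stock data; this assumes the stringr and tidyverse loads from the top of the post):

    df <- data.frame(Date = Sys.Date(), Close = 100)
    colnames(df)                                   # retrieve: "Date"  "Close"
    colnames(df) <- colnames(df) %>% str_to_lower()
    colnames(df)                                   # set: "date"  "close"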

    Now that the data are in the right form, let’s make a simple line chart.

    #======
    # PLOT 
    #======
    
    #--------------------------------------
    # FIRST ITERATION
    # - this is the quick-and-dirty version
    #--------------------------------------
    
    ggplot(data = stock_amzn, aes(x = date, y = close)) +
      geom_line()
    
    

    This is about as simple as it gets in ggplot2, but let’s break it down.

    The ggplot() function indicates that we’re going to use ggplot2 to make a plot.

    The data = parameter specifies that we’re going to be plotting data in the stock_amzn dataframe.

    Then, the aes() function allows us to specify our variable mappings. With the statement x = date, we are mapping the date variable to the x-axis. Similarly, with the statement y = close, we are mapping the close variable to the y-axis.

    Finally, geom_line() specifies that we want to draw lines.

    Again, this is just about as simple as it gets.

    Once you know more about how ggplot2 works, you can format the plot.

    Having said that, let’s take a look at a ‘polished’ version of the plot … a version that’s been heavily formatted:

    #--------------------------------------
    # POLISHED VERSION
    # - this is the 'finalized' version
    # - we arrive at this after a lot of
    #   iteration ....
    #--------------------------------------
    
    ggplot(stock_amzn, aes(x = date, close)) +
      geom_line(color = 'cyan') +
      geom_area(fill = 'cyan', alpha = .1) +
      labs(x = 'Date'
           , y = 'Closing\nPrice'
           , title = "Amazon's stock price has increased dramatically\nover the last 20 years") +
      theme(text = element_text(family = 'Gill Sans', color = "#444444")
            ,panel.background = element_rect(fill = '#444B5A')
            ,panel.grid.minor = element_line(color = '#4d5566')
            ,panel.grid.major = element_line(color = '#586174')
            ,plot.title = element_text(size = 28)
            ,axis.title = element_text(size = 18, color = '#555555')
            ,axis.title.y = element_text(vjust = 1, angle = 0)
            ,axis.title.x = element_text(hjust = 0)
            ) 
    

    And here’s the final chart:



    If you’re a beginner, don’t be intimidated: this finalized chart is not hard to do.

    Really. With a little practice, you should be able to learn to create a well-formatted chart like this very quickly. It should take you only a few hours to learn how the code works, and you should be able to memorize this syntax within a week or two.

    … and when I say memorize, I mean that you should be able to write all of this code from memory.

    Ideally, if you’re fluent in R and ggplot2, it should only take you 10 or 15 minutes to write all of this code, start to finish.

    Sign up now, and discover how to become fluent in R

    Are you still struggling with R and ggplot2?

    Becoming fluent in R is really straightforward, if you know how to practice.

    If you’re ready to master R, sign up for our email list.

    Not only will you receive tutorials (delivered to your inbox) …

    … but you’ll also get lessons on how to practice so that you can master R as quickly as possible.

    And if you sign up right now, you’ll also get access to our “Data Science Crash Course” for free.

    SIGN UP NOW

    The post How to make a line chart with ggplot2 appeared first on SHARP SIGHT LABS.

    Source:: R News

    Accessing patent data with the patentsview package

    By Chris Baker

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    Why care about patents?

    1. Patents play a critical role in incentivizing innovation, without
    which we wouldn’t have much of the technology we rely on every day

    What do your iPhone, Google’s PageRank algorithm, and a butter
    substitute called Smart Balance all have in common?



    …They all probably wouldn’t be here if not for patents. A patent
    provides its owner with the ability to make money off of something that
    they invented, without having to worry about someone else copying their
    technology. Think Apple would spend millions of dollars developing the
    iPhone if Samsung could just come along and rip it off? Probably not.

    2. Patents offer a great opportunity for data analysis

    There are two primary reasons for this:

    • Patent data is public. In return for the exclusive right to
      profit off an invention, an individual/company has to publicly
      disclose the details of their invention to the rest of the world.
      Examples of those details include the patent’s title, abstract,
      technology classification, assigned organizations, etc.
    • Patent data can answer questions that people care about.
      Companies (especially big ones like IBM and Google) have a vested
      interest in extracting insights from patents, and spend a lot of
      time/resources trying to figure out how to best manage their
      intellectual property (IP) rights. They’re plagued by questions like
      “who should I sell my underperforming patents to,” “which technology
      areas are open to new innovations,” “what’s going to be the next big
      thing in the world of buttery spreads,” etc. Patents offer a way to
      provide data-driven answers to these questions.

    Combined, these two things make patents a prime target for data
    analysis. However, until recently it was hard to get at the data inside
    these documents. One had to either collect it manually using the
    official United States Patent and Trademark Office (USPTO) search
    engine, or figure out a way to download, parse, and model huge XML data
    dumps. Enter PatentsView.

    PatentsView and the patentsview package

    PatentsView is one
    of USPTO’s new initiatives intended to increase the usability and value
    of patent data. One feature of this project is a publicly accessible API
    that makes it easy to programmatically interact with the data. A few of
    the reasons why I like the API (and PatentsView more generally):

    • The API is free (no credential required) and currently doesn’t
      impose rate limits/bandwidth throttling.
    • The project offers bulk downloads of patent data on their website (in
      a flat file format), for those who want to be closest to the data.
    • Both the API and the bulk download data contain disambiguated
      entities such as inventors, assignees, organizations, etc. In other
      words, the API will tell you whether it thinks that John Smith on
      patent X is the same person as John Smith on patent Y. 1

    The patentsview R package is a wrapper around the PatentsView API. It
    contains a function that acts as a client to the API (search_pv()) as
    well as several supporting functions. Full documentation of the package
    can be found on its website.

    Installation

    You can install the stable version of patentsview from CRAN:

    install.packages("patentsview")
    

    Or get the development version from GitHub:

    if (!require(devtools)) install.packages("devtools")
    
    devtools::install_github("ropensci/patentsview")
    

    Getting started

    The package has one main function, search_pv(), that makes it easy to
    send requests to the API. There are two parameters to search_pv() that
    you’re going to want to think about just about every time you call it –
    query and fields. You tell the API how you want to filter the patent
    data with query, and which fields you want to retrieve with
    fields. 2
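    To make that concrete before looking at each piece, here’s a minimal call that uses both parameters together (a sketch; the year and field choices are arbitrary):

    library(patentsview)
    
    res <- search_pv(
      query  = qry_funs$gt(patent_year = 2010),
      fields = c("patent_number", "patent_title", "patent_date")
    )
    head(res$data$patents)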

    query

    Your query has to use the PatentsView query language, which is a
    JSON-based syntax that is similar to the one used by Lucene. You can
    write the query directly and pass it as a string to search_pv():

    library(patentsview)
    
    qry_1 <- '{"_gt":{"patent_year":2007}}'
    
    search_pv(query = qry_1)
    #> $data
    #> #### A list with a single data frame on the patent data level:
    #>
    #> List of 1
    #>  $ patents:'data.frame': 25 obs. of  3 variables:
    #>   ..$ patent_id    : chr [1:25] "7313829" ...
    #>   ..$ patent_number: chr [1:25] "7313829" ...
    #>   ..$ patent_title : chr [1:25] "Sealing device for body suit and sealin"..
    #>
    #> $query_results
    #> #### Distinct entity counts across all downloadable pages of output:
    #>
    #> total_patent_count = 100,000
    

    …Or you can use the domain specific language (DSL) provided in the
    patentsview package to help you write the query:

    qry_2 <- qry_funs$gt(patent_year = 2007)
    qry_2
    #> {"_gt":{"patent_year":2007}}
    
    search_pv(query = qry_2)
    #> $data
    #> #### A list with a single data frame on the patent data level:
    #>
    #> List of 1
    #>  $ patents:'data.frame': 25 obs. of  3 variables:
    #>   ..$ patent_id    : chr [1:25] "7313829" ...
    #>   ..$ patent_number: chr [1:25] "7313829" ...
    #>   ..$ patent_title : chr [1:25] "Sealing device for body suit and sealin"..
    #>
    #> $query_results
    #> #### Distinct entity counts across all downloadable pages of output:
    #>
    #> total_patent_count = 100,000
    

    qry_1 and qry_2 will result in the same HTTP call to the API. Both
    queries search for patents in USPTO that were published after 2007.
    There are three gotchas to look out for when writing a query:

    1. Field is queryable. The API has 7 endpoints (the default
      endpoint is “patents”), and each endpoint has its own set of fields
      that you can filter on. The fields that you can filter on are not
      necessarily the same as the ones that you can retrieve.
      In other
      words, the fields that you can include in query (e.g.,
      patent_year) are not necessarily the same as those that you can
      include in fields. To see which fields you can query on, look in
      the fieldsdf data frame (View(patentsview::fieldsdf)) for fields
      that have a “y” indicator in their can_query column for your given
      endpoint.
    2. Correct data type for field. If you’re filtering on a field in
      your query, you have to make sure that the value you are filtering
      on is consistent with the field’s data type. For example,
      patent_year has type “integer,” so if you pass 2007 as a string
      then you’re going to get an error (patent_year = 2007 is good,
      patent_year = "2007" is no good). You can find a field’s data type
      in the fieldsdf data frame.
    3. Comparison function works with field’s data type. The comparison
      function(s) that you use (e.g., the greater-than function shown
      above, qry_funs$gt()) must be consistent with the field’s data
      type. For example, you can’t use the “contains” function on fields
      of type “integer” (qry_funs$contains(patent_year = 2007) will
      throw an error). See ?qry_funs for more details.

    In short, use the fieldsdf data frame when you write a query and you
    should be fine. Check out the writing queries vignette for more details.
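    For example, a quick way to list what’s queryable at the default endpoint (a small illustration using the columns referenced above):

    library(patentsview)
    
    # fields you can filter on at the "patents" endpoint
    queryable <- fieldsdf$field[fieldsdf$endpoint == "patents" & fieldsdf$can_query == "y"]
    head(queryable)
    
    "patent_year" %in% queryable   # check a field before using it in a query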

    fields

    Up until now we have been using the default value for fields. This
    results in the API giving us some small set of default fields. Let’s see
    about retrieving some more fields:

    search_pv(
      query = qry_funs$gt(patent_year = 2007),
      fields = c("patent_abstract", "patent_average_processing_time",
                 "inventor_first_name", "inventor_total_num_patents")
    )
    #> $data
    #> #### A list with a single data frame (with list column(s) inside) on the patent data level:
    #>
    #> List of 1
    #>  $ patents:'data.frame': 25 obs. of  3 variables:
    #>   ..$ patent_abstract               : chr [1:25] "A sealing device for a"..
    #>   ..$ patent_average_processing_time: chr [1:25] "1324" ...
    #>   ..$ inventors                     :List of 25
    #>
    #> $query_results
    #> #### Distinct entity counts across all downloadable pages of output:
    #>
    #> total_patent_count = 100,000
    

    The fields that you can retrieve depends on the endpoint that you are
    hitting. We’ve been using the “patents” endpoint thus far, so all of
    these are retrievable:
    fieldsdf[fieldsdf$endpoint == "patents", "field"]. You can also use
    get_fields() to list the retrievable fields for a given endpoint:

    search_pv(
      query = qry_funs$gt(patent_year = 2007),
      fields = get_fields(endpoint = "patents", groups = c("patents", "inventors"))
    )
    #> $data
    #> #### A list with a single data frame (with list column(s) inside) on the patent data level:
    #>
    #> List of 1
    #>  $ patents:'data.frame': 25 obs. of  31 variables:
    #>   ..$ patent_abstract                       : chr [1:25] "A sealing devi"..
    #>   ..$ patent_average_processing_time        : chr [1:25] "1324" ...
    #>   ..$ patent_date                           : chr [1:25] "2008-01-01" ...
    #>   ..$ patent_firstnamed_assignee_city       : chr [1:25] "Cambridge" ...
    #>   ..$ patent_firstnamed_assignee_country    : chr [1:25] "US" ...
    #>   ..$ patent_firstnamed_assignee_id         : chr [1:25] "b9fc6599e3d60c"..
    #>   ..$ patent_firstnamed_assignee_latitude   : chr [1:25] "42.3736" ...
    #>   ..$ patent_firstnamed_assignee_location_id: chr [1:25] "42.3736158|-71"..
    #>   ..$ patent_firstnamed_assignee_longitude  : chr [1:25] "-71.1097" ...
    #>   ..$ patent_firstnamed_assignee_state      : chr [1:25] "MA" ...
    #>   ..$ patent_firstnamed_inventor_city       : chr [1:25] "Lucca" ...
    #>   ..$ patent_firstnamed_inventor_country    : chr [1:25] "IT" ...
    #>   ..$ patent_firstnamed_inventor_id         : chr [1:25] "6416028-3" ...
    #>   ..$ patent_firstnamed_inventor_latitude   : chr [1:25] "43.8376" ...
    #>   ..$ patent_firstnamed_inventor_location_id: chr [1:25] "43.8376211|10."..
    #>   ..$ patent_firstnamed_inventor_longitude  : chr [1:25] "10.4951" ...
    #>   ..$ patent_firstnamed_inventor_state      : chr [1:25] "Tuscany" ...
    #>   ..$ patent_id                             : chr [1:25] "7313829" ...
    #>   ..$ patent_kind                           : chr [1:25] "B1" ...
    #>   ..$ patent_number                         : chr [1:25] "7313829" ...
    #>   ..$ patent_num_cited_by_us_patents        : chr [1:25] "5" ...
    #>   ..$ patent_num_claims                     : chr [1:25] "25" ...
    #>   ..$ patent_num_combined_citations         : chr [1:25] "35" ...
    #>   ..$ patent_num_foreign_citations          : chr [1:25] "0" ...
    #>   ..$ patent_num_us_application_citations   : chr [1:25] "0" ...
    #>   ..$ patent_num_us_patent_citations        : chr [1:25] "35" ...
    #>   ..$ patent_processing_time                : chr [1:25] "792" ...
    #>   ..$ patent_title                          : chr [1:25] "Sealing device"..
    #>   ..$ patent_type                           : chr [1:25] "utility" ...
    #>   ..$ patent_year                           : chr [1:25] "2008" ...
    #>   ..$ inventors                             :List of 25
    #>
    #> $query_results
    #> #### Distinct entity counts across all downloadable pages of output:
    #>
    #> total_patent_count = 100,000
    

    Example

    Let’s look at a quick example of pulling and analyzing patent data.
    We’ll look at patents from the last ten years that are classified below
    the H04L63/00 CPC code.
    Patents in this area relate to “network architectures or network
    communication protocols for separating internal from external
    traffic.” 3 CPC codes offer a quick and dirty way to find patents of
    interest, though getting a sense of their hierarchy can be tricky.

    1. Download the data

    library(patentsview)
    
    # Write a query:
    # (the original query, fields, and search_pv() call were truncated in this
    #  excerpt; in outline it filtered to roughly the last ten years of patents
    #  under the H04L63 CPC code, requested patent/assignee/CPC fields, and
    #  collected every page of results into pv_out. A rough sketch of that shape:)
    query <- qry_funs$and(
      qry_funs$gte(patent_year = 2007),              # assumed date filter
      qry_funs$contains(cpc_subgroup_id = "H04L63")  # assumed CPC filter
    )
    
    fields <- get_fields("patents", groups = c("patents", "assignees", "cpcs"))  # assumed group names
    
    pv_out <- search_pv(query, fields = fields, all_pages = TRUE)
    2. See where the patents are coming from (geographically)

    library(leaflet)
    library(htmltools)
    library(dplyr)
    library(tidyr)
    
    data <- pv_out$data$patents %>%
        unnest(assignees) %>%
        select(assignee_id, assignee_organization, patent_number,
               assignee_longitude, assignee_latitude) %>%
        group_by_at(vars(-matches("pat"))) %>%
        mutate(num_pats = n()) %>%
        ungroup() %>%
        select(-patent_number) %>%
        distinct() %>%
        # (the HTML markup inside the original popup string was lost in extraction)
        mutate(popup = paste0(htmlEscape(assignee_organization),
                              " - Patents: ", num_pats)) %>%
        mutate_at(vars(matches("_l")), as.numeric) %>%
        filter(!is.na(assignee_id))
    
    leaflet(data) %>%
      addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
      addCircleMarkers(lng = ~assignee_longitude, lat = ~assignee_latitude,
                       popup = ~popup, radius = ~sqrt(num_pats), color = "yellow")

    3. Plot the growth of the field’s topics over time

    library(ggplot2)
    library(RColorBrewer)
    
    data <- pv_out$data$patents %>%
        unnest(cpcs) %>%
        filter(cpc_subgroup_id != "H04L63/02") %>% # remove patents categorized into only top-level category of H04L63/02
        mutate(
          title = case_when(
            grepl("filtering", .$cpc_subgroup_title, ignore.case = T) ~
              "Filtering policies",
            .$cpc_subgroup_id %in% c("H04L63/0209", "H04L63/0218") ~
              "Architectural arrangements",
            grepl("Firewall traversal", .$cpc_subgroup_title, ignore.case = T) ~
              "Firewall traversal",
            TRUE ~
              .$cpc_subgroup_title
          )
        ) %>%
        mutate(title = gsub(".*(?=-)-", "", title, perl = TRUE)) %>%
        group_by(title, patent_year) %>%
        count() %>%
        ungroup() %>%
        mutate(patent_year = as.numeric(patent_year))
    
    ggplot(data = data) +
      geom_smooth(aes(x = patent_year, y = n, colour = title), se = FALSE) +
      scale_x_continuous("\nPublication year", limits = c(2007, 2016),
                         breaks = 2007:2016) +
      scale_y_continuous("Patents\n", limits = c(0, 700)) +
      scale_colour_manual("", values = brewer.pal(5, "Set2")) +
      theme_bw() + # theme inspired by https://hrbrmstr.github.io/hrbrthemes/
      theme(panel.border = element_blank(), axis.ticks = element_blank())
    

    Learning more

    For analysis examples that go into a little more depth, check out the
    data applications vignettes on the package’s website. If you’re just
    interested in search_pv(), there are examples on the site for that as
    well. To contribute to the package or report an issue, check out the
    issues page on GitHub.

    Acknowledgments

    I’d like to thank the package’s two reviewers, Paul Oldham and Verena
    Haunschmid, for taking the time to review the package and providing
    helpful feedback. I’d also like to thank Maëlle Salmon for shepherding
    the package along the rOpenSci review process, as well as Scott
    Chamberlain and Stefanie Butland for their miscellaneous help.


    1. This is both good and bad, as there are errors in the disambiguation. The algorithm that is responsible for the disambiguation was created by the winner of the PatentsView Inventor Disambiguation Technical Workshop.

    2. These two parameters end up getting translated into a MySQL query by the API’s server, which then gets sent to a back-end database. query and fields are used to create the query’s WHERE and SELECT clauses, respectively.

    3. There is a slightly more in-depth definition that says that these are patents “related to the (logical) separation of traffic/(sub-) networks to achieve protection.”

    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Talk like a pirate day 2017

    By Christoph Safferling

    pirates arr, not err!

    (This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

    International Talk Like A Pirate Day

    Q: What has 8 legs, 8 arms and 8 eyes?
    A: 8 pirates.

    Avast, ye scurvy scum! Today be September 19, the International Talk Like A Pirate Day. While it’s a silly “holiday”, it’s a great chance to switch your Facebook to Pirate English (“Arr, this be pleasing to me eye!”). Also, it’s as good a reason as any to go sailing the Internet Main, scouring for booty (data)!

    Well, any excuse, y’know…

    What is today known (more or less) as “Pirate English” (the bit with all the “arrr!”, “ahoy, ye matey!”, etc) is in fact the accent employed by Robert Newton, the actor who played Long John Silver in the first non-silent film (“talkie”) Treasure Island. He speaks with a Bristol accent, which was in fact the main trading port with the West Indies (Bristol, not the accent, of course). The accent we today associate with pirates could well have been historical!

    More useless information on pirates (such as that there is no historical evidence of a pirate ever owning a parrot as a pet – again this stereotype evolved from the Treasure Island movie) is on the wonderful as always treasure trove of superfluous information, the QI page on pirates.

    Data

    First, let’s get some data on pirates. I have some hazy memories on seeing data on historical shipping, and also piracy somewhere a while ago, but I cannot find it any more. I must be getting older. So, I’ll do what any self-respecting young whippersnapper would do and grab data from Wikipedia. There is a list of famous pirates; it’s biased, of course. I see relatively few entries on Barbary pirates, for instance. No matter, the list covers most of the Golden Age of Piracy in the Caribbean, and I’ll use it to generate a dataset on these pirates.

    We’ll grab the wiki page, linking to the static ID of the page for reproducibility.

    ## retrieve wikipedia list of pirates page
    pirates.url <- "https://en.wikipedia.org/w/index.php?title=List_of_pirates&oldid=799791558"
    
    ## read the page into R
    pirates.wiki <- read_html(pirates.url)

    We then use purrr::map_dfr to grab all tables in the HTML, and push them into a single dataframe and do some basic cleaning. Man, I hate Windows; utf-8 and Windows still in 2017 is a major pain in the stern.

    ## grab all tables on the page; tables 3-5 are the ones of main interest
    # 3: Rise of the English Sea Dogs and Dutch Corsairs: 1560-1650
    # 4: Age of the Buccaneers: 1650-1690
    # 5: Golden Age of Piracy: 1690-1730
    pirates.raw <- pirates.wiki %>% 
      # grab all data tables; they all have the class "wikitable"
      html_nodes(".wikitable") %>% 
      # extract all tables and put them into one dataframe
      purrr::map_dfr(html_table, fill = TRUE) %>% 
      # clean column names
      setNames(make.names(names(.))) %>% 
      mutate(
        # stupid wiki has capital and non-capital letters in headers...
        Years.active = if_else(is.na(Years.active), Years.Active, Years.active),
        # empty table cells should be NA
        Years.active = parse_character(Years.active),
        Life = parse_character(Life), 
        # god, I hate windows... R on Win will choke on ‒ and – characters, so replace them with ASCII
        # http://unicode-search.net/unicode-namesearch.pl?term=dash
        # U+2012 to U+2015 are dashes
        Life = stringr::str_replace_all(Life, "(\u2012|\u2013)", "-"),
        Years.active = stringr::str_replace_all(Years.active, "(\u2012|\u2013)", "-")
      ) %>% 
      # keep columns we want and re-order
      select(Name, Life, Country.of.origin, Years.active, Comments)

    With the raw data pulled, we now need to clean it a bit. Our goal is to generate a data set of the “pirating times” of each individual pirate. Wiki lists the active times, the “flourished” times, and/or the life times. We’ll split the age columns (usually yyyy-yyyy) into two, providing some padding if there is only one of the dates available. After this, we’ll clean up the columns into integer columns, ignoring the rows that did not provide accurate enough year data (e.g. “17th century”). This is a fun project, not a scientific one.

    ## generate our pirating times dataset
    pirates.dta <- pirates.raw %>% 
      mutate(
        # born and died forced into the "yyyy-yyyy" schedule for later tidyr::separate()
        Life.parsed = gsub("(b.|born) *([[:digit:]]+)", "\\2-", x = Life),
        Life.parsed = gsub("(d.|died) *([[:digit:]]+)", "-\\2", x = Life.parsed),
        Years.active = gsub("(to) *([[:digit:]]+)", "-\\2", x = Years.active),
        # move the "flourished" dates to its own column
        flourished = if_else(grepl("fl.", Life.parsed), gsub("fl. *", "", Life.parsed), as.character(NA)),
        # remove "flourished" dates from life
        Life.parsed = if_else(grepl("fl.", Life.parsed), as.character(NA), Life.parsed)
      ) %>% 
      # split the life, active, and flourished years into 2 columns (start/stop), each
      separate(Life.parsed, c("born", "died"), sep = "-", extra = "merge", fill = "right") %>% 
      separate(Years.active, c("active.start", "active.stop"), sep = "-", extra = "merge", fill = "right", remove = FALSE) %>% 
      separate(flourished, c("fl.start", "fl.stop"), sep = "-", extra = "merge", fill = "right", remove = FALSE) %>% 
      # 
      mutate_at(
        vars(born, died, active.start, active.stop, fl.start, fl.stop), 
        f.keep.last.four.digits
      ) %>% 
      mutate_at(
        vars(born, died, active.start, active.stop, fl.start, fl.stop),
        #as.integer
        parse_integer
      ) %>%
      # get the best guess of piracing time: active > flourished > estimated from life
      mutate(
        # take active if available, if not take flourished
        piracing.start = if_else(is.na(active.start), fl.start, active.start),
        piracing.stop = if_else(is.na(active.stop), fl.stop, active.stop),
        # neither active, nor flourished date, estimate from life
        piracing.start = if_else(is.na(piracing.start)&is.na(piracing.stop), f.activity.from.life(b = born, d = died), piracing.start),
        piracing.stop = if_else(is.na(piracing.start)&is.na(piracing.stop), f.activity.from.life(b = born, d = died, return.b = FALSE), piracing.stop),
        # finally, let's get single-year start/stops cleaned up
        piracing.start = if_else(is.na(piracing.start), piracing.stop, piracing.start), 
        piracing.stop = if_else(is.na(piracing.stop), piracing.start, piracing.stop)
      ) %>% 
      # reorder
      select(Name, Life, piracing.start, piracing.stop, Country.of.origin, everything()) 

    With the dataset now sufficiently cleared up, we will now change the dates to decades only – so change 1643 to 1640. We’ll graph our data by decade. We’ll use some rowwise() %>% do() magic to generate one row per pirate per activity decade (so two rows if he was active from 1648 to 1651). We will add up the pirates active in each decade to get to our visualisation of pirate activity, and some pirates were active for far longer than others!

    ## building the pirate activity, with each pirate having one row per decade in which he was active
    pirate.activity <- pirates.dta %>% 
      mutate(
        piracing.start = piracing.start %/%10L*10L,
        piracing.stop = piracing.stop %/%10L*10L,
        arrr = piracing.stop - piracing.start,
        country = if_else(Country.of.origin %in% top.n.countries(10), Country.of.origin, "other")
      ) %>% 
      filter(piracing.start > 1000) %>% 
      rowwise() %>% 
      do(
        data.frame(
          Name = .$Name,
          Life = .$Life, 
          piracing.start = .$piracing.start,
          piracing.stop = .$piracing.stop,
          # need one row per decade active 
          piracing.years = seq(.$piracing.start, .$piracing.stop, by = 10L),
          arrr = .$arrr,
          Country.of.origin = .$Country.of.origin,
          country = .$country, 
          stringsAsFactors = FALSE
        )
      )

    With the data now set up, we can set out to graph the pirate activity, and then also to build a random pirate generator!

    Pirates in Action

    Let’s take a look at the countries these famous pirates came from:

    pirates.raw %>% 
      group_by(Country.of.origin) %>% 
      tally(sort = TRUE) %>% 
      head(15)
    Country.of.origin      n
    England              108
    Netherlands           47
    France                39
    Unknown               25
    Colonial America      22
    United States         19
    Spain                 10
    China                  9
    Venezuela              8
    Germany                7
    Ireland                7
    Wales                  7
    Scotland               6
    Ottoman Empire         4
    Portugal               4

    England is over-represented by quite a large margin, especially if you group Wales/Scotland/Ireland, and potentially Colonial America. The Netherlands and France come in second and third place, respectively. No big surprises here.

    We’ll pull the data on pirate activity to build a graph of the overall pirate activity over the decades. This is a rather straightforward bar chart:

    ## pirate activity per decade
    pirate.activity %>% 
      ungroup() %>% 
      #ungroup.rowwise_df() %>% 
      group_by(piracing.years, country) %>% 
      tally() %>% 
      ggplot(aes(x=piracing.years, y=n))+
      geom_bar(aes(fill=country, colour=country), stat = "identity", position = "stack")+
      # scale_fill_brewer(type = "qual", palette = 3)+
      # scale_colour_brewer(type = "qual", palette = 3)+
      scale_fill_manual(values = my.colours)+
      scale_colour_manual(values = my.colours)+
      theme_bw()+
      theme(plot.title = element_text(lineheight=.8, face="bold"))+
      labs(
        title = "Arrr!tivity of famous pirates per decade\n \u2620\u2620\u2620", 
        x = "", 
        y = "number of pirates active"
      )

    pirate activity over the decades

    One can nicely see the increase in pirate activity during the Golden Age (roughly 1560-1700). It drops briefly around 1700, building up again shortly thereafter. The middle of the 18th century is more or less free from pirates (at least those infamous enough to end up on Wikipedia), until piracy starts again around 1800 with the English-French wars. The Somalian pirates of the 2000s are the last on the graph!

    This leads us to the next questions: Somalia was the heyday of pirates in the 2000s, but we cannot reasonably expect it to be in the 1600s. When was a country’s individual golden age of pirating?

    To answer this, we’ll build a ridgeplot (formerly also known as a joyplot, from the Joy Division band t-shirt; the name is now discouraged for obvious reasons), showcasing the density plots of pirate activity per country.

    ## piracy golden age by top n (15) countries
    pirates.dta %>% 
      filter(piracing.start > 1000) %>% 
      filter(Country.of.origin %in% top.n.countries(15)) %>% 
      ggplot()+
      geom_density_ridges(aes(x = piracing.start, y = Country.of.origin, fill = Country.of.origin))+
      theme_bw()+
      theme(plot.title = element_text(lineheight=.8, face="bold"))+
      labs(
        title = "Pirate activity by country\n \u2620\u2620\u2620", 
        x = "", 
        y = "country"
      )

    pirate activity over the decades; per country

    The “usual suspects” add up nicely – England, the Netherlands, France. It’s interesting to see Germany show up as early as around 1350-1400. These are the pirates of the Hansa, mostly. Venezuela also had a “late blooming” pirate life in the 19th century, mostly. Interesting, I had not known about that!

    A Pirate’s Life For Me!

    With the historical data on pirates and their active years, we can also try to generate a random pirate story! For this, we will restrict the data to the golden age of piracy, dropping all pirates who started their pirating career before 1500 or after 1750. We want to create a random pirate with a statistically “correct” starting year of piracy, a statistically “correct” piracy tenure, and garnish him (or her) with a random pirate name!

    Golden Age Pirate: starting piracy year

    Let’s start with the pirating starting activity year.

    pirates.fit <- pirates.dta %>% 
      filter(piracing.start > 1500) %>% 
      filter(piracing.start < 1750) %>% 
      mutate(
        arrr = piracing.stop - piracing.start
      )
      
    pirates.fit %>% 
      ggplot()+
      geom_density(aes(x=piracing.start))

    pirates' starting their trade

    Unfortunately, it’s a heavily left-skewed distribution, and after some hours of looking into candidate distributions I couldn’t find a good fit, so we’ll resort to pulling from the empirical distribution (i.e., the data itself) to generate a random start year for a pirate:

    f.gen.pirating.startyear <- function() {
      # generate a startyear by pulling from the empirical distribution (pirates.fit$piracing.start)
      base::sample(pirates.fit$piracing.start, 1)
    }

    Golden Age Pirate: piracing tenure

    We have better luck with the tenure of a pirate: it follows a nice count-data distribution; I’m guessing a negative binomial.

    pirates.fit %>% 
      ggplot()+
      geom_density(aes(x=arrr))

    pirates' tenure

    As it’s discrete data, we’ll use the negative binomial distribution to estimate the theoretical distribution of a pirate’s tenure using library(fitdistrplus).

    #library(fitdistrplus)
    # check which distribution would be a good fit: 
    #fitdistrplus::descdist(pirates.fit$arrr, boot = 1000)
    
    # negative binomial
    f1 <- fitdistrplus::fitdist(pirates.fit$arrr, "nbinom")
    fitdistrplus::plotdist(pirates.fit$arrr, "nbinom", para = list(size=f1$estimate[1], mu=f1$estimate[2]))

    The results are excellent. We can clearly see the over-representation of “round” tenures (10 years in particular), which comes from the fact that, where dates are uncertain, Wikipedia records a pirate’s activity in decades rather than actual years. Still, our fitted distribution rnbinom(x, size = 0.3876683, mu = 6.395578) works nicely.

    pirates' tenure estimation
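
    As a quick sanity check of the fit (a minimal sketch of my own, not from the original post), we can compare the observed mean tenure with the mean of draws from the fitted negative binomial:

    # compare observed tenures with draws from the fitted negative binomial
    mean(pirates.fit$arrr)
    mean(rnbinom(10000, size = f1$estimate["size"], mu = f1$estimate["mu"]))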

    We can thus draw a random tenure for a pirate by using this function:

    f.gen.pirating.tenure <- function() {
      # generate a tenure time from the estimated theoretical neg binomial distribution (pirates.fit$arrr)
      rnbinom(1, size=f1$estimate[1], mu=f1$estimate[2])
    }

    Golden Age Pirate: silly name generator

    All that is missing now is a random pirate name generator. The Fantasy Name Generators site has a nice collection of name generators, and a pirate one is included! They allow the use of the names “to name anything in any of your own works”, so let’s use them here!

    We source the names into R and create a simple function pulling a random name, allowing you to specify male or female.

    source("pirate-names.R")
    
    f.gen.pirate.name <- function(female = FALSE) {
      # generate a pirate name from a list of 
      # - first names (male/female)
      # - nick names (male/female)
      # - last names
      if(female) {
        paste(
          base::sample(names.female, 1), 
          base::sample(names.nickf, 1), 
          base::sample(names.last, 1)
        )
      } else {
        paste(
          base::sample(names.male, 1), 
          base::sample(names.nickm, 1), 
          base::sample(names.last, 1)
        )
      }
    }
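
    If you don’t have the original pirate-names.R at hand, note that the function above only expects five character vectors. A minimal stand-in (the names below are just placeholders lifted from the sample output further down) could look like this:

    # minimal placeholder for pirate-names.R; the real file holds the full scraped name lists
    names.male   <- c("Gerard", "Orton", "Ascot")
    names.female <- c("Ethyl", "Christine", "Huldah")
    names.nickm  <- c("'The Fierce'", "'Grommet'", "'Shadow'")
    names.nickf  <- c("'Temptress'", "'Daffy'", "'Twitching'")
    names.last   <- c("Eastoft", "Drachen", "Artemis")
    
    f.gen.pirate.name(female = TRUE)
    # e.g. "Christine 'Daffy' Drachen"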

    Golden Age Pirate: a pirate’s story!

    With all the pieces set up, we can now generate a random pirate:

    f.pirate.story <- function(female = FALSE) {
      # build the pirate story by joining name, startyear, and tenure time
      yy <- f.gen.pirating.startyear()
      rr <- paste0(
        "Your name is ", 
        f.gen.pirate.name(female), 
        " and you roamed the Caribbean from ", 
        yy, " to ", yy+f.gen.pirating.tenure(),
        ". Arrr! \u2620"
      )
      return(rr)
    }

    \u2620 is the Unicode escape for the skull-and-crossbones symbol. Calling the function several times then results in:

    female story
    FALSE Your name is Gerard ‘The Fierce’ Eastoft and you roamed the Caribbean from 1603 to 1605. Arrr! ☠
    FALSE Your name is Orton ‘Grommet’ Drachen and you roamed the Caribbean from 1657 to 1657. Arrr! ☠
    FALSE Your name is Ascot ‘Shadow’ Artemis and you roamed the Caribbean from 1650 to 1652. Arrr! ☠
    TRUE Your name is Ethyl ‘Temptress’ Mitchell and you roamed the Caribbean from 1610 to 1625. Arrr! ☠
    TRUE Your name is Christine ‘Daffy’ Ward and you roamed the Caribbean from 1657 to 1657. Arrr! ☠
    TRUE Your name is Huldah ‘Twitching’ Vome and you roamed the Caribbean from 1718 to 1718. Arrr! ☠

    Concluding remarks

    The code is available on GitHub, as always.

    I wish I could find those historical shipping and piracy data again – any hints or pointers are very much appreciated!

    Talk like a pirate day 2017 was originally published by Kirill Pomogajko at Opiate for the masses on September 19, 2017.

    To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Recap: Applications of R at EARL London 2017

    By David Smith

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    The fourth EARL London conference took place last week, and once again it was an enjoyable and informative showcase of practical applications of R. Kudos to the team from Mango for hosting a great event featuring interesting talks and a friendly crowd.

    As always, there were more talks on offer than I was able to attend (most of the event was in three parallel tracks), but here are a few highlights that I was able to catch:

    Jenny Bryan from RStudio gave an outstanding keynote: “Workflows: You Should Have One”. The talk was a really useful collection of good practices for data analysis with R. Note I didn’t say “best practices” there, and that’s something that resonated with me from Jenny’s talk: sometimes the best practices are the ones that are “good enough” and don’t add unnecessary complexity or fragility. Software Carpentry’s “Good Enough Practices for Scientific Computing” is a great read that I hadn’t come across before this talk.

    Rachel Kirkham from the National Audit Office uses R and Shiny to scrutinize public spending on behalf of taxpayers in the UK. (View the slides here.) One interesting beneficial aspect of R for this application is the ability to bring R to the data, thereby sidestepping regulations that forbid the data being exported from the NAO data center.

    Joy McKenney from Northumbrian Water gave a truly fascinating talk on using R to monitor and predict flows in a sewer system. (View the slides here.) One surprising application from this talk was being able to model whether an overflowing sewer line is due to rainfall or because of a blockage in the system — a tool that can be used to detect “fatbergs” clogging up the system.

    Pieter Vos from Philips Research showed how they deploy applications to doctors to evaluate things like cancer risk. (View the slides here.) The applications call out to R to make the underlying statistical calculations.

    Joe Cheng from RStudio gave a remarkable talk on a major concept for the R language itself: “promises”. (View the slides here.) With promises, you can ask for the results of a long-running computation, but get back to the R command line instantaneously. The results will be available to you, well, when they’re ready. It sounds like magic, but it’s already working (in beta) with the promises package and promises (pun intended) to bring more responsiveness to Shiny applications.
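
    The core idea is that a long-running computation hands back a promise right away, and you attach a handler that fires once the value is ready. A minimal sketch of that pattern (my own illustration using the future and promises packages, not Joe’s actual demo code):

    library(future)
    library(promises)
    plan(multisession)
    
    # kick off a slow computation; this line returns immediately
    p <- future({
      Sys.sleep(10)          # stand-in for something expensive
      mean(rnorm(1e7))
    })
    
    # attach a handler that runs once the result arrives;
    # meanwhile the R console (or a Shiny app) stays responsive
    p %...>% print()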

    Ashley Turner from Transport for London gave a talk that I was very sad I couldn’t see (it conflicted with my own session), but fortunately the slides are available. R has been used for over two years to model how Londoners get about via car, bus and Tube. It was used for an official report on London transportation, which included some fascinating data on the routes certain commuters prefer for getting from (say) Kings Cross to Waterloo.

    Matthew Upson from the Government Digital Service reported on the ongoing program to implement a reproducible data science workflow for creating official reports. (View the slides here.) He says that this process has reduced production time for some official reports by 75%, and you can learn more about the program here.

    I was also informed by the organizers that my own talk on Reproducible Data Science with R (slides here) won the “award” for loudest presentation at the conference. I think the audio desk had the volume up a little too high for this video!

    As I said, there are many more excellent talks beyond those listed above. You can explore the other talks by clicking here and then clicking through each speaker portrait — slides, where available, are linked at the bottom of each abstract.

    The next EARL conference will take place in Boston, November 1-3 and promises to include even more practical applications of R. I hope to see you there!

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Mapping Fall Foliage with sf

    By hrbrmstr

    I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet:

    Where to see the best fall foliage, based on your location: https://t.co/12pQU29ksB pic.twitter.com/JiywYVpmno

    — Vox (@voxdotcom) September 18, 2017

    Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild #rstats web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick.

    I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!

    Full RStudio project file (with pre-cached data) is on GitHub.

    library(rprojroot)
    library(sf)
    library(magick)
    library(tidyverse) # NOTE: Needs github version of ggplot2
    
    root %
      walk(~{
        sav_tmp %
      filter(!is.na(rate1)) -> foliage_sf
    
    # now, we do some munging so we have better labels and so we can
    # iterate over the weeks
    gather(foliage_sf, week, value, -id, -geometry) %>%
      mutate(value = factor(value)) %>%
      filter(week != "rate1") %>%
      mutate(week = factor(week,
                           levels=unique(week),
                           labels=format(seq(as.Date("2017-08-26"),
                                             as.Date("2017-11-11"), "1 week"),
                                         "%b %d"))) -> foliage_sf
    
    # now we make a ggplot object for each week and save it out to a png
    pb  gg
    
      ggsave(sprintf("%02d.png", .x), gg, width=5, height=3)
    
    })
    
    # we read them all back in and animate the foliage
    sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
      map(image_read) %>%
      image_join() %>%
      image_animate(1)
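
    Because the code above depends on the downloaded foliage shapefile, here is a small self-contained sketch of the same workflow (build an sf object, draw it with geom_sf, write one PNG per week, and stitch the frames together with magick). The polygons, week labels and ratings below are made up purely for illustration:

    library(sf)
    library(magick)
    library(tidyverse)
    
    # a tiny helper that builds a closed unit-square polygon at (x0, y0)
    sq <- function(x0, y0) {
      st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                            c(x0, y0 + 1), c(x0, y0))))
    }
    
    # two toy "regions" standing in for the foliage polygons
    toy <- st_sf(id = c("A", "B"), geometry = st_sfc(sq(0, 0), sq(1.2, 0)))
    
    # fake week labels, mirroring the weekly structure of the real data
    weeks <- format(seq(as.Date("2017-08-26"), by = "1 week", length.out = 4), "%b %d")
    
    # one PNG per week, with a random foliage rating per region
    walk(seq_along(weeks), ~{
      gg <- toy %>%
        mutate(value = factor(sample(1:5, n(), replace = TRUE), levels = 1:5)) %>%
        ggplot() +
        geom_sf(aes(fill = value)) +
        labs(title = weeks[.x]) +
        theme_minimal()
      ggsave(sprintf("%02d.png", .x), gg, width = 5, height = 3)
    })
    
    # read the frames back in and animate them, as in the original post
    sprintf("%02d.png", seq_along(weeks)) %>%
      map(image_read) %>%
      image_join() %>%
      image_animate(1)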

    Source:: R News

    Big Data Analytics World Championship (30th September 2017) – Free Entry Code and details

    By John Persico

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    Invitation to the 3rd Big Data Analytics World Championships 2017

    We invite members of the R-Bloggers/R-Users community to participate in the 2017 TEXATA Big Data Analytics World Championships (www.texata.com).

    Your free discount code is: 2017RUSERS (normally a $30 paid entry).

    Here are important dates for the TEXATA educational business competition:

    Round 1 on 30th September 2017 (Online)

    Round 2 on 14th October 2017 (Online)

    World Finals on 15th-16th November 2017 (Austin)

    About TEXATA

    The competition is a celebration of big data analytics skills and communities. It is aimed at students and young professionals looking to learn more and showcase their creative business skills in big data analytics. It’s ideal for people working in, studying, or interested in: Technology, Computer Science, Statistics and Data, Data Science, IT, Software Engineering, and Analytical, Quantitative and Engineering disciplines. The competition is a great way to learn more about leading companies, communities, universities and institutions in the big data and business analytics industries. As part of joining the TEXATA event, all participants will receive updates throughout 2017 and 2018 about free offers, upcoming conferences and educational events from our community partners. Best of all, it’s free with this discount code thanks to R-Bloggers/R-Users.

    How It Works

    The structure involves two online rounds followed by a live World Finals in Austin, Texas. Each online round lasts 4 hours. The top finalists will be flown in from around the world to compete in a case-study-based finals challenge before the finals judges. The organizers hold similar world championship events in other professional services industries, including Finance and Financial Modeling (www.modeloff.com) and the High IQ World Championships (www.hiqora.com), which gives participants a good sense of the educational, fun and challenging nature of these elite business and professional competitions.

    Visit www.texata.com for more information.

    To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News