Geocomputation with R: workshop at eRum

By jannesm


(This article was first published on R – jannesm, and kindly contributed to R-bloggers)

This is a post by Robin Lovelace, Jakub Nowosad and Jannes Muenchow. Together we’re writing an open source book called Geocomputation with R. The project aims to introduce people to R’s rapidly evolving geographic data capabilities and to provide a foundation for developing scripts, functions and applications for geographic data science. We recently presented some contents of the in-progress book at the eRum
conference, where Jannes ran a workshop on the topic. In this article we share teaching materials from eRum for the benefit of those who couldn’t be there in person and provide a ‘heads-up’ to the R-Spatial community about plans for the book. We’ll start with an overview of ‘geocomputation’ (and define what we mean by the term) and finish by describing how R can be used as a bridge to access dedicated GIS software.

Geocomp ‘base’-ics

The first thing many people will be asking is “what is geocomputation anyway”? As Jannes mentioned in his talk, the choice of name was influenced by the fact that the term seems to have originated in Leeds, where one of the authors (Robin) is based. The first conference on the subject was in Leeds in 1996, with associated extended abstracts including old-school computer graphics still available here; and there was a 21 year home-coming anniversary in 2017 where Robin and Jakub presented. A more practical reason is that the term is unambiguous: it’s about using computing techniques to do new things with geographic data, as indicated in Section 1.1 of the book. Our approach differs in one way from the early conception of geocomputation, however:

Unlike early works in the field all the work presented in this book is reproducible using code and example data supplied alongside the book (using R, a great language for reproducible research).

Like many open source projects R is evolving. Although ‘base R’ is conservative (as demonstrated in Roger Bivand’s keynote, in which he did a live demo using a version of R from 1997 that still runs!), the ‘ecosystem’ of packages that extend its capabilities changes fast (video here, slides at rsbivand/eRum18).

To clarify what we mean by ‘base R’, we can identify base packages with the following code (source: stackoverflow):

x = installed.packages()
row.names(x)[!is.na(x[ ,"Priority"])]

## [1] "base" "boot" "class" "cluster" "codetools"
## [6] "compiler" "datasets" "foreign" "graphics" "grDevices"
## [11] "grid" "KernSmooth" "lattice" "MASS" "Matrix"
## [16] "methods" "mgcv" "nlme" "nnet" "parallel"
## [21] "rpart" "spatial" "splines" "stats" "stats4"
## [26] "survival" "tcltk" "tools" "utils"

The output shows there are 29 packages that are currently part of the base distribution (R Core makes “base R”, as Martin Maechler put it
during another keynote). These can be relied on to change very little in terms of their API, although bug fixes and performance improvements happen continuously.

The same cannot be said of contributed packages. Packages are created,
die (or are abandoned) and change, sometimes dramatically. And this applies as much (or more) to r-spatial as to any other part of R’s ecosystem, as can be seen by looking at any one of R’s task views. At the time of writing the Spatial task view alone listed 177 packages, many of them recently contributed and in-development.

In this context it helps to have an understanding of the history (Bivand, Pebesma, and Gómez-Rubio 2013). Like in politics, knowing the past can help navigate the future. More specifically, knowing which packages are mature or up-and-coming can help decide which one to use!

For this reason, after a slide on set-up (which is described in detail in chapter 2 of the book), the workshop spent a decent amount of time talking about the history of spatial data in R, as illustrated in slide 20. A more detailed account of the history of R-spatial is provided in section 1.5 of the book.

The slides outlining the basics of Geocomputation with R (which is roughly a synonym for r-spatial) can be found here.

Vector data

Spatial vector data are best suited to objects with discrete, well-defined borders, such as bus stops (points), streets (lines) and houses (polygons). For instance, we can represent ‘Budapest’ (the city where eRum 2018 was held) as a spatial point with the help of the sf package (Pebesma 2018) as follows:

budapest_df = data.frame(
  name = "Budapest",
  x = 19.0,
  y = 47.5
)
class(budapest_df)

## [1] "data.frame"

budapest_sf = sf::st_as_sf(budapest_df, coords = c("x", "y"))
class(budapest_sf)

## [1] "sf" "data.frame"

Why bother creating a new class if both objects contain the same essential data? It’s what you can do with an object that’s important. The reason for using the sf class can be understood as follows: it gives budapest_sf spatial superpowers. We can, for example, now identify which country the point is in using a spatial function such as the spatial join implemented in st_join() (spatial subsetting would also do the trick, as covered in section 4.2.1). First, we need to load the ‘world’ dataset and set the ‘CRS’ of the object:

# set-up:
library(spData)
sf::st_crs(budapest_sf) = 4326

# spatial join:
sf::st_join(budapest_sf, world)

## Simple feature collection with 1 feature and 11 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 19 ymin: 47.5 xmax: 19 ymax: 47.5
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## name iso_a2 name_long continent region_un subregion
## 1 Budapest HU Hungary Europe Europe Eastern Europe
## type area_km2 pop lifeExp gdpPercap geometry
## 1 Sovereign country 92476.46 9866468 75.87317 24016.3 POINT (19 47.5)
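
As noted above, spatial subsetting is an alternative route to the same answer: subsetting world by budapest_sf keeps only the country that intersects the point. A minimal sketch (the column selection is just to keep the output short):

# which rows of 'world' intersect the Budapest point?
world[budapest_sf, c("name_long", "continent")]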

The slides describing vector data in R can be found here.

Raster data

Raster data, on the other hand, represent continuous surfaces in the form of a regular grid. You can think of a raster as a matrix that also carries information about its spatial location. It has rows and columns, each cell has a value (which can be NA), and its spatial properties are described by the cell resolution (res), the outer borders (bounding box – xmn, xmx, ymn, ymx), and the coordinate reference system (crs). In R the raster package supports this spatial raster format (Hijmans 2017).

library(raster)
elev = raster(nrow = 6, ncol = 6,
              vals = 1:36,
              res = 0.5,
              xmn = -1.5, xmx = 1.5,
              ymn = -1.5, ymx = 1.5,
              crs = "+proj=longlat")

The data structure makes raster processing much more efficient and faster than vector data processing.

elev2 = elev^2
elev3 = elev / elev2
elev4 = (elev2 - elev3) * log(elev)

elev_stack = stack(elev, elev2, elev3, elev4)
plot(elev_stack)

Raster objects can be subsetted (by index, coordinates, or other spatial objects), transformed using local, focal, zonal and global operations, and summarized. Importantly, there are many tools allowing for interactions between raster and vector data models, and transformation between them.
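
To make this concrete, here are a few illustrative operations on the elev object created above (these examples are ours, not code from the workshop):

elev[1, 1]                                    # subset a single cell by row/column index
raster::extract(elev, cbind(0.25, 0.25))      # extract the value at a coordinate
focal(elev, w = matrix(1, 3, 3), fun = mean)  # focal operation: 3x3 moving-window mean
cellStats(elev, max)                          # global operation: summary over all cells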

The slides associated with the raster data part of the workshop can be found here.

Visualizing spatial data

The spatial powers mentioned previously have numerous advantages. One of the most attractive is that geographic data in an appropriate class can be visualized on a map, the topic of Chapter 9 of the book. The workshop was an opportunity to expand on the contents of that chapter and ask: what’s the purpose of maps in the first place? To answer that question we used an early data visualization / infographic created by Alexander von Humboldt, illustrated below. The point is that it’s not always the accuracy of a map that matters most (although accuracy is important): the meaning you wish to convey and the target audience should be central to the map’s design (in Humboldt’s case, the unity of nature, for an audience of Enlightenment book readers!):

In the context of geographic data in R, it is easier than ever to create attractive maps to tell a story. The previously created point representing Budapest, for example, can be visualized using the tmap package as follows:

library(tmap)
budapest_df = data.frame(name = "Budapest", x = 19, y = 47.5)
class(budapest_df)
#> [1] "data.frame"
budapest_sf = sf::st_as_sf(budapest_df, coords = c("x", "y"))
class(budapest_sf)
#> [1] "sf" "data.frame"
tmap_mode("view")
#> tmap mode set to interactive viewing
m = tm_shape(budapest_sf) + tm_dots() +
  tm_view(basemaps = "OpenStreetMap", set.view = 9)
tmap_leaflet(m)

A range of mapping techniques were covered in the workshop including the plot() method from the sf package that generates multiple maps from a single object by default, such as this one representing the nz (short for New Zealand) object from the spData package:
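
The code behind such a plot is short; a minimal sketch (assuming sf is attached so that its plot method is dispatched):

library(sf)
library(spData)
plot(nz)               # by default: one small map per attribute column
plot(st_geometry(nz))  # geometry only, as a single map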

More advanced maps were demonstrated, including this animated map of the United States (for information on how to make animated maps with R, see section 9.3 of Geocomputation with R).

The slides forming the basis of the visualization part of the tutorial can be found here.

Last but not least was a section on GIS bridges

Defining a Geographic Information System as a system for the analysis, manipulation and visualization of geographic data (Longley 2015), we can safely claim that R has already become a GIS. However, R also has its shortcomings when it comes to spatial data analysis. To name but a few: R is not particularly good at handling large geographic data, it is not a geodatabase, and it is missing literally hundreds of geoalgorithms readily available in GIS software packages and spatial libraries. Fortunately, R has been designed from the beginning as an interactive interface to other languages and software packages (Chambers 2016). Hence, as long as we can access the functionality of GIS software from within R, we can easily overcome R’s spatial data analysis shortcomings. For instance, when the sf package is attached, it automatically links to GEOS, GDAL and proj.4, which means sf automatically gives the R user access to the functionality of these spatial libraries. Equally, a number of packages provide access to the geoalgorithms of major open source desktop GIS software packages:

  1. rgrass7 provides access to GRASS7
  2. RSAGA provides access to SAGA GIS
  3. RQGIS provides access to QGIS. For more details and background information, please check out the corresponding R Journal publication.

Note that you must have installed the GIS software on your machine before you can use it through R.[1]

In the workshop we briefly presented how to use RQGIS. The corresponding slides can be found here. In the book we additionally demonstrate how to use RSAGA and rgrass7 in Chapter 10.
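
To give a flavour of the bridge, a typical RQGIS session looks roughly like the sketch below. This is an illustration rather than the workshop code: the algorithm names and parameters depend on the QGIS version installed, and some_sf_object is a placeholder.

library(RQGIS)
set_env()                   # detect the local QGIS installation
open_app()                  # open the tunnel to the QGIS API
find_algorithms("buffer")   # search the available geoalgorithms
get_usage("native:buffer")  # show the parameters of one algorithm
# result = run_qgis("native:buffer", INPUT = some_sf_object, DISTANCE = 1000,
#                   OUTPUT = file.path(tempdir(), "buffer.shp"), load_output = TRUE)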

Background on the book

Geocomputation with R is a collaborative project. We joined forces because each of us has been teaching and contributing to R’s spatial ecosystem for years and we all had a similar vision of a book to disseminate R’s impressive geographic capabilities more widely.

As described in a previous article by Jakub, we’re making good progress towards finishing the book by the end of summer 2018, meaning Geocomputation with R will be published before the end of the year. The target audience is broad but we think it will be especially useful to postgraduate and undergraduate students, R users wanting to work with spatial data, and GIS users wanting to get to grips with command-line statistical modeling software. A reason for publishing the article here is that we have around 3 months (until the end of August) to gather as much feedback on the book as possible before it’s published. We plan to keep the code and prose up-to-date after that, but now is the ideal time to get involved. We welcome comments and suggestions on the issue tracker, especially from people with experience in the R-Spatial world in relation to:

  • Bugs: issues with lines of prose or code that could be improved.
  • Future-proofing: will the code and advice given stand the test of time? If you think some parts will go out of date quickly, please let us know!
  • Anything else: ideas for other topics to cover, for example.

We would like to thank the anonymous peer reviewers who have provided feedback so far. We’re still working on changes to respond to their excellent comments. If you’re interested in getting involved in this project, please see the project’s GitHub repo at github.com/Robinlovelace/geocompr and check-out the in-progress chapters at geocompr.robinlovelace.net.

References

Bivand, Roger S., Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R. 2nd ed. New York: Springer.

Chambers, John M. 2016. Extending R. CRC Press.

Hijmans, Robert J. 2017. Raster: Geographic Data Analysis and Modeling.

Longley, Paul. 2015. Geographic Information Science & Systems. Fourth edition. Hoboken, NJ: Wiley.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal.

[1] Note also that RPyGeo provides access to ArcMap, which is commercial desktop GIS software.


New round of R Consortium grants announced

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The R Consortium has just announced its latest round of project grants. After reviewing the proposals submitted by the R community, the Infrastructure Steering Committee has elected to fund the following projects for the Spring 2018 season:

  • Further updates to the DBI package, to provide a consistent interface between R and databases
  • Updating the infrastructure in R for building binary packages on Windows and Mac
  • A collaboration with Statisticians in the Pharmaceutical Industry to validate R packages to regulatory standards
  • A project to validate and maintain some of the historic numerical algorithms used by R
  • A working group to facilitate working with US census data in R
  • Consolidation of tools and workflows for handling missing data in R
  • Templates and tools for the development of teaching resources with R

This latest round of grants brings the total funding for projects proposed by R community members to more than $650,000. The R Consortium itself is funded by its members (including my own employer, Microsoft), so if you’d like to see more such projects why not ask your own employer to become a member?

R Consortium: Announcing the R Consortium ISC Funded Project grant recipients for Spring 2018


Three ways of visualizing a graph on a map

By Markus Konrad

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

When visualizing a network with nodes that refer to a geographic place, it is often useful to put these nodes on a map and draw the connections (edges) between them. By this, we can directly see the geographic distribution of nodes and their connections in our network. This is different to a traditional network plot, where the placement of the nodes depends on the layout algorithm that is used (which may for example form clusters of strongly interconnected nodes).

In this blog post, I’ll present three ways of visualizing network graphs on a map using R with the packages igraph, ggplot2 and optionally ggraph. Several properties of our graph should be visualized along with the positions on the map and the connections between them. Specifically, the size of a node on the map should reflect its degree, the width of an edge between two nodes should represent the weight (strength) of this connection (since we can’t use proximity to illustrate the strength of a connection when we place the nodes on a map), and the color of an edge should illustrate the type of connection (some categorical variable, e.g. a type of treaty between two international partners).

Preparation

We’ll need to load the following libraries first:

library(assertthat)
library(dplyr)
library(purrr)
library(igraph)
library(ggplot2)
library(ggraph)
library(ggmap)

Now, let’s load some example nodes. I’ve picked some random countries with their geo-coordinates:

country_coords_txt 

So we now have 15 countries, each with an ID, geo-coordinates (lon and lat) and a name. These are our graph nodes. We’ll now create some random connections (edges) between our nodes:

set.seed(123)  # set random generator state for the same output

N_EDGES_PER_NODE_MIN % mutate(category = as.factor(category))
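
One way to generate such random edges is sketched below. The constants, the sampling scheme and the category values here are assumptions for illustration, not the values used in the original post:

N_EDGES_PER_NODE_MIN <- 1   # assumed minimum number of edges per node
N_EDGES_PER_NODE_MAX <- 4   # assumed maximum number of edges per node
N_CATEGORIES <- 4           # assumed number of connection categories

edges <- map_dfr(nodes$id, function(id) {
  n <- sample(N_EDGES_PER_NODE_MIN:N_EDGES_PER_NODE_MAX, 1)
  to <- sample(setdiff(nodes$id, id), n)               # connect to n other nodes
  tibble(from = id,
         to = to,
         weight = runif(n),                            # connection strength
         category = sample(N_CATEGORIES, n, replace = TRUE))
}) %>%
  mutate(category = as.factor(category))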

Each of these edges defines a connection via the node IDs in the from and to columns and additionally we generated random connection categories and weights. Such properties are often used in graph analysis and will later be visualized too.

Our nodes and edges fully describe a graph so we can now generate a graph structure g with the igraph library. This is especially necessary for fast calculation of the degree or other properties of each node later.

g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)  # directed = FALSE is an assumption

We now create some data structures that will be needed for all the plots that we will generate. At first, we create a data frame for plotting the edges. This data frame will be the same like the edges data frame but with four additional columns that define the start and end points for each edge (x, y and xend, yend):

edges_for_plot <- edges %>%
  inner_join(nodes %>% select(id, lon, lat), by = c('from' = 'id')) %>%
  rename(x = lon, y = lat) %>%
  inner_join(nodes %>% select(id, lon, lat), by = c('to' = 'id')) %>%
  rename(xend = lon, yend = lat)

assert_that(nrow(edges_for_plot) == nrow(edges))

Let’s give each node a weight and use the degree metric for this. This will be reflected by the node sizes on the map later.

nodes$weight = degree(g)

Now we define a common ggplot2 theme that is suitable for displaying maps (sans axes and grids):

maptheme 

Not only will the theme be the same for all plots; they will also share the same world map as “background” (using map_data('world')) and the same fixed-ratio coordinate system, which also specifies the limits of the longitude and latitude coordinates.

country_shapes 
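
For reference, such a map theme, country polygon layer and fixed-ratio coordinate system might be defined roughly as follows. The colours, axis limits and theme details here are assumptions, not necessarily those used for the original figures:

maptheme <- theme_minimal() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_rect(fill = "#596673"),  # assumed background colour
        legend.position = "bottom")

country_shapes <- geom_polygon(aes(x = long, y = lat, group = group),
                               data = map_data("world"),
                               fill = "#CECECE", color = "#515151",  # assumed colours
                               size = 0.15)

mapcoords <- coord_fixed(xlim = c(-150, 180), ylim = c(-55, 80))     # assumed limits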

Plot 1: Pure ggplot2

Let’s start simple by using ggplot2. We’ll need three geometric objects (geoms) in addition to the country polygons from the world map (country_shapes): nodes can be drawn as points using geom_point and their labels with geom_text; edges between nodes can be realized as curves using geom_curve. For each geom we need to define aesthetic mappings that “describe how variables in the data are mapped to visual properties” in the plot. For the nodes we map the geo-coordinates to the x and y positions in the plot and make the node size dependent on its weight (aes(x = lon, y = lat, size = weight)). For the edges, we pass our edges_for_plot data frame and use x, y and xend, yend as the start and end points of the curves. Additionally, we make each edge’s color dependent on its category, and its “size” (which refers to its line width) dependent on the edge’s weight (we will see that the latter will fail). Note that the order of the geoms is important, as it defines which object is drawn first and can be occluded by an object drawn in a later geom layer. Hence we draw the edges first, then the node points, and finally the labels on top:

ggplot(nodes) + country_shapes +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             data = edges_for_plot, curvature = 0.33,
             alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.25, 2)) + # scale for edge widths
  geom_point(aes(x = lon, y = lat, size = weight),           # draw nodes
             shape = 21, fill = 'white',
             color = 'black', stroke = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(1, 6)) +    # scale for node size
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

A warning will be displayed in the console saying “Scale for ‘size’ is already present. Adding another scale for ‘size’, which will replace the existing scale.”. This is because we used the “size” aesthetic and its scale twice, once for the node size and once for the line width of the curves. Unfortunately you cannot use two different scales for the same aesthetic even when they’re used for different geoms (here: “size” for both node size and the edges’ line widths). There is also no alternative to “size” I know of for controlling a line’s width in ggplot2.

With ggplot2, we’re left with deciding which geom’s size we want to scale. Here, I go for a static node size and a dynamic line width for the edges:

ggplot(nodes) + country_shapes +
  geom_curve(aes(x = x, y = y, xend = xend, yend = yend,     # draw edges as arcs
                 color = category, size = weight),
             data = edges_for_plot, curvature = 0.33,
             alpha = 0.5) +
  scale_size_continuous(guide = FALSE, range = c(0.25, 2)) + # scale for edge widths
  geom_point(aes(x = lon, y = lat),                          # draw nodes
             shape = 21, size = 3, fill = 'white',
             color = 'black', stroke = 0.5) +
  geom_text(aes(x = lon, y = lat, label = name),             # draw text labels
            hjust = 0, nudge_x = 1, nudge_y = 4,
            size = 3, color = "white", fontface = "bold") +
  mapcoords + maptheme

Plot 2: ggplot2 + ggraph

Luckily, there is an extension to ggplot2 called ggraph with geoms and aesthetics added specifically for plotting network graphs. This allows us to use separate scales for the nodes and edges. By default, ggraph will place the nodes according to a layout algorithm that you can specify. However, we can also define our own custom layout using the geo-coordinates as node positions:

node_pos <- nodes %>%
  select(lon, lat) %>%
  rename(x = lon, y = lat)   # node positions must be called x, y
lay 
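
With a recent ggraph version, building such a manual layout might look like this sketch (older ggraph releases took a node.positions data frame instead of x and y):

# use the geographic coordinates as a manual node layout
lay <- create_layout(g, layout = "manual",
                     x = node_pos$x, y = node_pos$y)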

We pass the layout lay and use ggraph’s geoms geom_edge_arc and geom_node_point for plotting:

ggraph(lay) + country_shapes +
  geom_edge_arc(aes(color = category, edge_width = weight,   # draw edges as arcs
                    circular = FALSE),
                data = edges_for_plot, curvature = 0.33,
                alpha = 0.5) +
  scale_edge_width_continuous(range = c(0.5, 2),             # scale for edge widths
                              guide = FALSE) +
  geom_node_point(aes(size = weight), shape = 21,            # draw nodes
                  fill = "white", color = "black",
                  stroke = 0.5) +
  scale_size_continuous(range = c(1, 6), guide = FALSE) +    # scale for node sizes
  geom_node_text(aes(label = name), repel = TRUE, size = 3,
                 color = "white", fontface = "bold") +
  mapcoords + maptheme

The edges’ widths can be controlled with the edge_width aesthetic and its scale functions scale_edge_width_*. The nodes’ sizes are controlled with size as before. Another nice feature is that geom_node_text has an option to distribute node labels with repel = TRUE so that they do not occlude each other that much.

Note that the plot’s edges are differently drawn than with the ggplot2 graphics before. The connections are still the same only the placement is different due to different layout algorithms that are used by ggraph. For example, the turquoise edge line between Canada and Japan has moved from the very north to south across the center of Africa.

Plot 3: the hacky way (overlay several ggplot2 “plot grobs”)

I do not want to withhold another option which may be considered a dirty hack: You can overlay several separately created plots (with transparent background) by annotating them as “grobs” (short for “graphical objects”). This is probably not how grob annotations should be used, but anyway it can come in handy when you really need to overcome the aesthetics limitation of ggplot2 described above in plot 1.

As explained, we will produce separate plots and “stack” them. The first plot will be the “background” which displays the world map as before. The second plot will be an overlay that only displays the edges. Finally, a third overlay shows only the points for the nodes and their labels. With this setup, we can control the edges’ line widths and the nodes’ point sizes separately because they are generated in separate plots.

The two overlays need to have a transparent background so we define it with a theme:

theme_transp_overlay 
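
Such a transparent-background theme might be defined like this (a sketch using standard ggplot2 theme elements):

theme_transp_overlay <- theme(
  panel.background = element_rect(fill = "transparent", color = NA),
  plot.background = element_rect(fill = "transparent", color = NA)
)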

The base or “background” plot is easy to make and only shows the map:

p_base 

Now we create the first overlay with the edges whose line width is scaled according to the edges’ weights:

p_edges 

The second overlay shows the node points and their labels:

p_nodes 

Finally we combine the overlays using grob annotations. Note that proper positioning of the grobs can be tedious. I found that using ymin works quite well but manual tweaking of the parameter seems necessary.

p 
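
Combining the base plot with the two overlays via grob annotations could then look roughly like this. The ymin offsets are placeholders that, as noted above, need manual tweaking:

p <- p_base +
  annotation_custom(ggplotGrob(p_edges), ymin = -74) +   # edges overlay
  annotation_custom(ggplotGrob(p_nodes), ymin = -74)     # nodes and labels overlay

print(p)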

As explained before, this is a hacky solution and should be used with care. Still it is useful also in other circumstances. For example when you need to use different scales for point sizes and line widths in line graphs or need to use different color scales in a single plot this way might be an option to consider.

All in all, network graphs displayed on maps can be useful to show connections between the nodes in your graph on a geographic scale. A downside is that it can look quite cluttered when you have many geographically close points and many overlapping connections. It can be useful then to show only certain details of a map or add some jitter to the edges’ anchor points.

The full R script is available as gist on github.


Defining Marketing with the Rvest and Tidytext Packages

By Peter Prevos

Defining Marketing with the Rvest and Tidytext Packages: Word cloud

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

I am preparing to facilitate another session of the marketing course for the La Trobe University MBA. The first lecture delves into the definition of marketing. Like most other social phenomena, marketing is tough to define. Definitions of social constructs often rely on the perspective taken by the person or group writing the definition. As such, definitions also change over time. While a few decades ago, definitions of marketing revolved around sales and advertising, contemporary definitions are more holistic and reference creating value.

Heidi Cohen wrote a blog post where she collated 72 definitions of marketing. So rather than arguing over which definition is the best, why not use all definitions simultaneously? This article attempts to define a new definition of marketing, using a data science approach. We can use the R language to scrape the 72 definitions from Heidi’s website and attempt text analysis to extract the essence of marketing from this data set.

I have mentioned in a previous post about qualitative data science that automated text analysis is not always a useful method to extract meaning from a text. I decided to delve a little deeper into automated text analysis to see if we find out anything useful about marketing using the rvest and tidytext packages.

The presentation below shows the slides I use in my introductory lecture into marketing. The code and analyses are shown below the slideshow. You can download the most recent version of the code from my GitHub Repository.

Scraping text with Rvest

Web scraping is a common technique to download data from websites where this data is not available as a clean data source. Web scraping starts with downloading the HTML code from the website and the filtering the wanted text from this file. The rvest package makes this process very easy.

The code for this article uses a pipe (%>%) with three rvest commands. The first step is to download the wanted HTML code from the web using the read_html function. The output of this function is piped to the html_nodes function, which does all the hard work. In this case, the code picks all lines of text that are embedded in an <ol> container, i.e. ordered lists. You can use the SelectorGadget to target the text you would like to scrape. The last scraping step cleans the text by piping the output of the previous commands to the html_text function.

The result of the scraping process is converted to a Tibble, which is a type of data frame used in the Tidyverse. The definition number is added to the data, and the Tibble is converted to the format required by the Tidytext package. The resulting data frame is much longer than the 72 definitions because there are other lists on the page. Unfortunately, I could not find a way to filter only the 72 definitions.

library(tidyverse)
library(rvest)
definitions <- read_html("https://...") %>% # URL of Heidi Cohen's post listing the definitions
    html_nodes("ol li") %>%
    html_text() %>%
    as_data_frame() %>%
    mutate(No = 1:nrow(.)) %>%
    select(No, Definition = value)

Tidying the Text

The Tidytext package extends the tidy data philosophy to a text. In this approach to text analysis, a corpus consists of a data frame where each word is a separate item. The code snippet below takes the first 72 rows and the unnest_tokens function extracts each word from the 72 definitions. This function can also extract ngrams and other word groups from the text. The Tidytext package is an extremely versatile piece of software which goes far beyond the scope of this article. Julia Silge and David Robinson have written a book about text mining using this package, which provides a very clear introduction to the craft of analysing text.

The last section of the pipe removes the trailing “s” from each word to convert plurals into singular form. The mutate function in the Tidyverse creates a new variable or modifies an existing one in a data frame.

library(tidytext)
def_words <- definitions[1:72, ] %>%
    unnest_tokens(word, Definition) %>%
    mutate(word = gsub("s$", "", word))

This section creates a data frame with two variables. The No variable indicates the definition number (1–72) and the word variable is a word within the definition. The order of the words is preserved in the row name. To check the data frame you can run unique(def_words$No[which(def_words$word == "marketing")]). This line finds all definition numbers with the word “marketing”, which results, as expected, in the numbers 1 to 72.

Using Rvest and Tidytext to define marketing

We can now proceed to analyse the definitions scraped from the website with Rvest and cleaned with Tidytext. The first step is to create a word cloud, which is a popular way to visualise word frequencies. This code creates a data frame for each unique word, excluding the word marketing itself, and uses the wordcloud package to visualise the fifty most common words.

library(wordcloud)
library(RColorBrewer)

word_freq <- def_words %>%
    anti_join(stop_words) %>%
    count(word) %>%
    filter(word != "marketing")

word_freq %>%
    with(wordcloud(word, n, max.words = 50, rot.per = .5,
                   colors = rev(brewer.pal(5, "Dark2"))))

While a word cloud is certainly a pretty way to visualise the bag of words in a text, it is not the most useful way to get the reader to understand the data. The words are jumbled, and the reader needs to search for meaning. A better way to visualise word frequencies is a bar chart. This code takes the data frame created in the previous snippet and determines the ten most frequently occurring words. The mutate statement reorders the factor levels so that the words are plotted in order.

word_freq %>%
    top_n(10) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) + 
        geom_col(fill = "dodgerblue4") +
        coord_flip() +
        theme(text = element_text(size=20))

A first look at the word cloud and bar chart suggests that marketing is about customers and products and services. Marketing is a process that includes branding and communication; a simplistic but functional definition.

Topic Modeling using Tidytext

Word frequencies are a weak method to analyse text because they treat each word as a solitary unit. Topic modelling is a more advanced method that analyses the relationships between words, i.e. the distance between them. The first step is to create a Document-Term Matrix, which is a matrix that indicates how often a word appears in a text. As each of the 72 texts is very short, I decided to treat the collection of definitions as one text about marketing. The cast_dtm function converts the data frame to a Document-Term Matrix.

The following pipe determines the top words in the topics. Just like k-means clustering, the analyst needs to choose the number of topics before analysing the text. In this case, I have opted for 4 topics. The code determines the contribution of each word to the four topics and selects the five most common words in each topic. The faceted bar chart shows each of the words in the four topics.

marketing_dtm <- word_freq %>%
    mutate(doc = 1) %>%
    cast_dtm(doc, word, n)

marketing_lda <- LDA(marketing_dtm, k = 4) %>% # LDA() comes from the topicmodels package
    tidy(matrix = "beta") %>%
    group_by(topic) %>%
    top_n(5, beta) %>%
    ungroup() %>%
    arrange(topic, -beta)

marketing_lda %>%
    mutate(term = reorder(term, beta)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
       geom_col(show.legend = FALSE) +
       facet_wrap(~topic, scales = "free") +
       coord_flip() +
       theme(text = element_text(size=20))

This example also does not tell me much more about what marketing is, other than giving a slightly more sophisticated version of the word frequency charts. This chart shows me that marketing is about customers that enjoy a service and a product. Perhaps the original definitions are not distinctive enough to be separated from each other. The persistence of the word “president” is interesting as it seems to suggest that marketing is something that occurs at the highest levels in the business.

Defining Marketing with the Rvest and Tidytext Packages: Topic modeling
This article is only a weak summary of the great work by Julia Silge who coauthored the Tidytext package. The video below provides a comprehensive introduction to topic modelling.

What have we learnt?

This excursion into text analysis using rvest and Tidytext shows that data science can help us to make some sense out of an unread text. If I did not know what this page was about, then perhaps this analysis would enlighten me. This kind of analysis can assist us in wading through large amounts of text to select the ones we want to read. I am still not convinced that this type of analysis will provide any knowledge beyond what can be obtained from actually reading and engaging with a text.

Although I am a data scientist and want to maximise the use of code in analysing data, I am very much in favour of developing human intelligence before we worry about the artificial kind.

The post Defining Marketing with the Rvest and Tidytext Packages appeared first on The Devil is in the Data.


Quick guide for converting from JAGS or BUGS to NIMBLE

By nimble-admin

(This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

Converting to NIMBLE from JAGS, OpenBUGS or WinBUGS

NIMBLE is a hierarchical modeling package that uses nearly the same modeling language as the popular MCMC packages WinBUGS, OpenBUGS and JAGS. NIMBLE makes the modeling language extensible — you can add distributions and functions — and also allows customization of MCMC or other algorithms that use models. Here is a quick summary of steps to convert existing code from WinBUGS, OpenBUGS or JAGS to NIMBLE. For more information, see examples on r-nimble.org or the NIMBLE User Manual.

Main steps for converting existing code

These steps assume you are familiar with running WinBUGS, OpenBUGS or JAGS through an R package such as R2WinBUGS, R2jags, rjags, or jagsUI.

  1. Wrap your model code in nimbleCode({}), directly in R.
  • This replaces the step of writing or generating a separate file containing the model code.
  • Alternatively, you can read standard JAGS- and BUGS-formatted code and data files using
    readBUGSmodel.
  2. Provide information about missing or empty indices
    • Example: If x is a matrix, you must write at least x[,] to show it has two dimensions.
    • If other declarations make the size of x clear, x[,] will work in some circumstances.
    • If not, either provide index ranges (e.g. x[1:n, 1:m]) or use the dimensions argument to nimbleModel to provide the sizes in each dimension.
  3. Choose how you want to run MCMC.
    • Use nimbleMCMC() as the just-do-it way to run an MCMC. This will take all steps to
      set up and run an MCMC using NIMBLE’s default configuration.
    • To use NIMBLE’s full flexibility: build the model, configure and build the MCMC, and compile both the model and MCMC. Then run the MCMC via runMCMC or by calling the run function of the compiled MCMC. See the NIMBLE User Manual to learn more about what you can do.

    See below for a list of some more nitty-gritty additional steps you may need to consider for some models.
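
    For orientation, the full-flexibility workflow mentioned above might look roughly like the sketch below; the object names are placeholders and the MCMC settings simply mirror the nimbleMCMC() call used later in this example.

    # a minimal sketch of NIMBLE's step-by-step workflow (names are placeholders)
    model   <- nimbleModel(code, constants = constants, data = data, inits = inits)
    conf    <- configureMCMC(model, monitors = params)  # customize samplers here if desired
    mcmc    <- buildMCMC(conf)
    Cmodel  <- compileNimble(model)
    Cmcmc   <- compileNimble(mcmc, project = model)
    samples <- runMCMC(Cmcmc, niter = 22000, nburnin = 2000, thin = 10)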

    Example: An animal abundance model

    This example is adapted from Chapter 6, Section 6.4 of Applied Hierarchical Modeling in Ecology: Analysis of distribution, abundance and species richness in R and BUGS. Volume I: Prelude and Static Models by Marc Kéry and J. Andrew Royle (2015, Academic Press). The book’s web site provides code for its examples.

    Original code

    The original model code looks like this:

    
    cat(file = "model2.txt","
    model {
    # Priors
    for(k in 1:3){                # Loop over 3 levels of hab or time factors
       alpha0[k] ~ dunif(-10, 10) # Detection intercepts
       alpha1[k] ~ dunif(-10, 10) # Detection slopes
       beta0[k] ~ dunif(-10, 10)  # Abundance intercepts
       beta1[k] ~ dunif(-10, 10)  # Abundance slopes
    }
    
    # Likelihood
    # Ecological model for true abundance
    for (i in 1:M){
       N[i] ~ dpois(lambda[i])
       log(lambda[i]) 

    Brief summary of the model

    This is known as an “N-mixture” model in ecology. The details aren’t really important for illustrating the mechanics of converting this model to NIMBLE, but here is a brief summary anyway. The latent abundances N[i] at sites i = 1...M are assumed to follow a Poisson. The j-th count at the i-th site, C[i, j], is assumed to follow a binomial with detection probability p[i, j]. The abundance at each site depends on a habitat-specific intercept and coefficient for vegetation height, with a log link. The detection probability for each sampling occasion depends on a date-specific intercept and coefficient for wind speed. Kéry and Royle concocted this as a simulated example to illustrate the hierarchical modeling approaches for estimating abundance from count data on repeated visits to multiple sites.

    NIMBLE version of the model code

    Here is the model converted for use in NIMBLE. In this case, the only changes to the code are to insert some missing index ranges (see comments).

    library(nimble)
    Section6p4_code <- nimbleCode({
        # Priors
        for(k in 1:3) {                # Loop over 3 levels of hab or time factors
          alpha0[k] ~ dunif(-10, 10) # Detection intercepts
          alpha1[k] ~ dunif(-10, 10) # Detection slopes
          beta0[k] ~ dunif(-10, 10)  # Abundance intercepts
          beta1[k] ~ dunif(-10, 10)  # Abundance slopes
        }
    
        # Likelihood
        # Ecological model for true abundance
        for (i in 1:M){
          N[i] ~ dpois(lambda[i])
          log(lambda[i]) <- beta0[hab[i]] + beta1[hab[i]] * vegHt[i]
          # Some intermediate derived quantities
          critical[i] <- step(2 - N[i]) # yields 1 whenever N is 2 or less
          z[i] <- step(N[i] - 0.5)      # Indicator for occupied site
          # Observation model for replicated counts
          for (j in 1:J){
            C[i,j] ~ dbin(p[i,j], N[i])
            logit(p[i,j]) <- alpha0[j] + alpha1[j] * wind[i,j]
            }
        }
    
        # Derived quantities; unnecessary when running for inference purposes
        # NIMBLE: We have filled in indices in the next two lines.
        Nocc <- sum(z[1:100])         # Number of occupied sites among sample of M
        Ntotal <- sum(N[1:100])       # Total population size at M sites combined
        Nhab[1] <- sum(N[1:33])  # Total abundance for sites in hab A
        Nhab[2] <- sum(N[34:66]) # Total abundance for sites in hab B
        Nhab[3] <- sum(N[67:100])# Total abundance for sites in hab C
        for(k in 1:100){         # Predictions of lambda and p ...
          for(level in 1:3){    #    ... for each level of hab and time factors
            lam.pred[k, level] <- exp(beta0[level] + beta1[level] * XvegHt[k])
            logit(p.pred[k, level]) <- alpha0[level] + alpha1[level] * Xwind[k]
            }
          }
        # NIMBLE: We have filled in indices in the next line. 
        N.critical <- sum(critical[1:100]) # Number of populations with critical size
      })
    

    Simulated data

    To carry this example further, we need some simulated data. Kéry and Royle provide separate code to do this. With NIMBLE we could use the model itself to simulate data rather than writing separate simulation code. But for our goals here, we simply copy Kéry and Royle’s simulation code, and we compact it somewhat:

    # Code from Kery and Royle (2015)
    # Choose sample sizes and prepare obs. data array y
    set.seed(1)                   # So we all get same data set
    M <- 100                      # Number of sites
    J <- 3                        # Number of repeated abundance measurements
    C <- matrix(NA, nrow = M, ncol = J) # to contain the observed data
    
    # Create a covariate called vegHt
    vegHt <- sort(runif(M, -1, 1)) # sort for graphical convenience
    
    # Choose parameter values for abundance model and compute lambda
    beta0 <- 0                    # Log-scale intercept
    beta1 <- 2                    # Log-scale slope for vegHt
    lambda <- exp(beta0 + beta1 * vegHt) # Expected abundance
    
    # Draw local abundance
    N <- rpois(M, lambda)
    
    # Create a covariate called wind
    wind <- array(runif(M * J, -1, 1), dim = c(M, J))
    
    # Choose parameter values for measurement error model and compute detectability
    alpha0 <- -2                        # Logit-scale intercept
    alpha1 <- -3                        # Logit-scale slope for wind
    p <- plogis(alpha0 + alpha1 * wind) # Detection probability
    
    # Take J = 3 abundance measurements at each site
    for(j in 1:J) {
      C[,j] <- rbinom(M, N, p[,j])
    }
    
    # Create factors
    time <- matrix(rep(as.character(1:J), M), ncol = J, byrow = TRUE)
    hab <- c(rep("A", 33), rep("B", 33), rep("C", 34))  # assumes M = 100
    
    # Bundle data
    # NIMBLE: For full flexibility, we could separate this list
    #         into constants and data lists.  For simplicity we will keep
    #         it as one list to be provided as the "constants" argument.
    #         See comments about how we would split it if desired.
    win.data <- list(
        ## NIMBLE: C is the actual data
        C = C,
        ## NIMBLE: Covariates can be data or constants
        ##         If they are data, you could modify them after the model is built
        wind = wind,
        vegHt = vegHt,
        XvegHt = seq(-1, 1,, 100), # Used only for derived quantities
        Xwind = seq(-1, 1,,100),   # Used only for derived quantities
        ## NIMBLE: The rest of these are constants, needed for model definition
        ## We can provide them in the same list and NIMBLE will figure it out.
        M = nrow(C),
        J = ncol(C),
        hab = as.numeric(factor(hab))
    )
    

    Initial values

    Next we need to set up initial values and choose parameters to monitor in the MCMC output. To do so we will again directly use Kéry and Royle’s code.

    Nst <- apply(C, 1, max) + 1   # Important to give good inits for latent N
    inits <- function() list(N = Nst,
                             alpha0 = rnorm(3),
                             alpha1 = rnorm(3),
                             beta0 = rnorm(3),
                             beta1 = rnorm(3))
    
    # Parameters monitored
    # could also estimate N, bayesian counterpart to BUPs before: simply add "N" to the list
    params <- c("alpha0", "alpha1", "beta0", "beta1", "Nocc", "Ntotal", "Nhab", "N.critical", "lam.pred", "p.pred")
    

    Run MCMC with nimbleMCMC

    Now we are ready to run an MCMC in nimble. We will run only one chain, using the same settings as Kéry and Royle.

    samples <- nimbleMCMC(
        code = Section6p4_code,
        constants = win.data, ## provide the combined data & constants as constants
        inits = inits,
        monitors = params,
        niter = 22000,
        nburnin = 2000,
        thin = 10)
    
    ## defining model...
    
    ## Detected C as data within 'constants'.
    
    ## Adding C as data for building model.
    
    ## building model...
    
    ## setting data and initial values...
    
    ## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
    ## checking model sizes and dimensions...
    ## checking model calculations...
    ## model building finished.
    ## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
    ## compilation finished.
    ## runMCMC's handling of nburnin changed in nimble version 0.6-11. Previously, nburnin samples were discarded *post-thinning*.  Now nburnin samples are discarded *pre-thinning*.  The number of samples returned will be floor((niter-nburnin)/thin).
    ## running chain 1...
    
    ## |-------------|-------------|-------------|-------------|
    ## |-------------------------------------------------------|
    

    Work with the samples

    Finally we want to look at our samples. NIMBLE returns samples as a simple matrix with named columns. There are numerous packages for processing MCMC output. If you want to use the coda package, you can convert a matrix to a coda mcmc object like this:

    library(coda)
    coda.samples <- as.mcmc(samples)
    

    Alternatively, if you call nimbleMCMC with the argument samplesAsCodaMCMC = TRUE, the samples will be returned as a coda object.

    To show that MCMC really happened, here is a plot of N.critical:

    plot(jitter(samples[, "N.critical"]), xlab = "iteration", ylab = "N.critical",
         main = "Number of populations with critical size",
         type = "l")
    

    Next steps

    NIMBLE allows users to customize MCMC and other algorithms in many ways. See the NIMBLE User Manual and web site for more ideas.

    Smaller steps you may need for converting existing code

    If the main steps above aren’t sufficient, consider these additional steps when converting from JAGS, WinBUGS or OpenBUGS to NIMBLE.

    1. Convert any use of truncation syntax
      • e.g. x ~ dnorm(0, tau) T(a, b) should be re-written as x ~ T(dnorm(0, tau), a, b).
      • If reading model code from a file using readBUGSmodel, the x ~ dnorm(0, tau) T(a, b) syntax will work.
    2. Possibly split the data into data and constants for NIMBLE.
      • NIMBLE has a more general concept of data, so NIMBLE makes a distinction between data and constants.
      • Constants are necessary to define the model, such as nsite in for(i in 1:nsite) {...} and constant vectors of factor indices (e.g. block in mu[block[i]]).
      • Data are observed values of some variables.
      • Alternatively, one can provide a list of both constants and data for the constants argument to nimbleModel, and NIMBLE will try to determine which is which. Usually this will work, but when in doubt, try separating them.
    3. Possibly update initial values (inits).
      • In some cases, NIMBLE likes to have more complete inits than the other packages.
      • In a model with stochastic indices, those indices should have inits values.
      • When using nimbleMCMC or runMCMC, inits can be a function, as in R packages for calling WinBUGS, OpenBUGS or JAGS. Alternatively, it can be a list.
      • When you build a model with nimbleModel for more control than nimbleMCMC, you can provide inits as a list. This sets defaults that can be over-ridden with the inits argument to runMCMC.


    OS Secrets Exposed: Extracting Extended File Attributes and Exploring Hidden Download URLs With The xattrs Package

    By hrbrmstr


    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

    Attributes are nothing more than a series of key/value pairs. The key must be a character value and unique, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

    When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

    $ cd ~/Downloads
    $ ls -l
    total 264856
    -rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
    -rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
    -rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
    -rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
    -rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
    -rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
    -rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
    -rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
    -rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg
    

    You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

    I didn’t think so. Thus begat the xattrs package.

    Exploring Past Downloads

    Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

    We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

    First, let’s see what is hidden with the RStudio disk image:

    
    library(xattrs)
    library(reticulate) # not 100% necessary but you'll see why later
    library(tidyverse) # we'll need this later
    
    list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
    ## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"    
    ## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"   
    

    There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:

    
    get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
    ##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
    ##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
    ##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
    ##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
    ## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
    ## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
    ## [169] 00 00 00 89
    

    Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:

    
    get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
    ## [1] "bplist00xa20102_20Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_20,https://dailies.rstudio.com/rstudio/oss/mac/bvZ"
    

    So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

    There will eventually be a native Rust-backed property list reading package for R, but we can work with that binary plist data in two ways: first, via the read_bplist() function that comes with the xattrs package and wraps Linux/BSD or macOS system utilities (which are super expensive since it also means writing out data to a file each time) or turn to Python which already has this capability. We’re going to use the latter.

    I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python — don’t judge — installs and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.

    
    biplist <- import("biplist")
    biplist$readPlistFromString(
      get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
    )

    That’s much better.

    Let’s work with metadata for the whole directory:

    
    list.files("~/Downloads", full.names = TRUE) %>% 
      keep(has_xattrs) %>% 
      set_names(basename(.)) %>% 
      map_df(read_xattrs, .id="file") -> xdf
    
    xdf
    ## # A tibble: 24 x 4
    ##    file            name                                  size contents   
    ##         
    ##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16  
    ##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110 
    ##  3 1109.1968.pdf   com.apple.quarantine                    74  
    ##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16  
    ##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116 
    ##  6 1109.1968v1.pdf com.apple.quarantine                    74  
    ##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86  
    ##  8 1804.09970.pdf  com.apple.quarantine                    82  
    ##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16  
    ## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104 
    ## # ... with 14 more rows
    
    ## count(xdf, name, sort=TRUE)
    ## # A tibble: 5 x 2
    ##   name                                     n
    ##   
    ## 1 com.apple.metadata:kMDItemWhereFroms     9
    ## 2 com.apple.quarantine                     9
    ## 3 com.apple.lastuseddate#PS                4
    ## 4 com.apple.diskimages.fsck                1
    ## 5 com.apple.diskimages.recentcksum         1
    

    Now we can focus on the task at hand: recovering the URLs:

    
    list.files("~/Downloads", full.names = TRUE) %>% 
      keep(has_xattrs) %>% 
      set_names(basename(.)) %>% 
      map_df(read_xattrs, .id="file") %>% 
      filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
      mutate(where_from = map(contents, biplist$readPlistFromString)) %>% 
      select(file, where_from) %>% 
      unnest() %>% 
      filter(!where_from == "")
    ## # A tibble: 15 x 2
    ##    file                where_from                                                       
    ##                                                                
    ##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                              
    ##  2 1109.1968.pdf       https://www.google.com/                                          
    ##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                        
    ##  4 1109.1968v1.pdf     https://www.google.com/                                          
    ##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                             
    ##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf           
    ##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                             
    ##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf        
    ##  9 athena-ug.pdf       https://www.google.com/                                          
    ## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz 
    ## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                    
    ## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
    ## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                      
    ## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
    ## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/             
    

    (There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

    Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you’ll need to deal with an ugly list object vs some quaint text vectors.

    FIN

    Have some fun exploring what other secrets your OS may be hiding from you and if you’re on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

    Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don’t always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
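
    For example, here is a minimal sketch of that tagging idea. The file path and attribute name are made up for illustration, and it assumes set_xattr() takes a path, a name and a value while rm_xattr() takes a path and a name, so check the package documentation before relying on it:

    
    # tag a local file with a custom attribute, read it back, then remove it
    # (path and attribute name are hypothetical)
    set_xattr("analysis-notes.csv", "is.rud.note", "cleaned 2018-06-01")
    get_xattr("analysis-notes.csv", "is.rud.note")
    rm_xattr("analysis-notes.csv", "is.rud.note")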

    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


    New Version of ggplot2

    By Ari Lamstein

    (This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

    I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here’s what choroplethr users need to know about this new version of ggplot2.

    Choroplethr Update Required

    The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R CMD check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25.

    This change is the third or fourth time I’ve had to update choroplethr in recent years as a result of changes to ggplot2. This experience reminds me of one of the first lessons I learned as a software engineer: “More time is spent maintaining old software than writing new software.”

    Simple Features Support

    One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been “I can’t do that until ggplot2 adds support for Simple Features.” Thankfully, this new version of ggplot2 introduces that support!
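
    For readers who haven’t tried it yet, here is a minimal sketch of what that support looks like, using the North Carolina shapefile that ships with the sf package (plain ggplot2 + sf, not choroplethr’s own API):

    
    library(sf)
    library(ggplot2)
    
    # read a Simple Features object and plot it with the new geom_sf()
    nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
    ggplot(nc) +
      geom_sf(aes(fill = AREA))
    

    No fortify() step is required; geom_sf() works on the sf object directly.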

    Currently all maps in the choroplethr ecosystem are stored as ggplot2 “fortified” dataframes. This is a format unique to ggplot2. Storing the maps in this format makes it possible to render the maps as quickly as possible. The downside is that:

    1. ggplot2 does not support interactive graphics.
    2. The main interactive mapping library in CRAN (leaflet) cannot render fortified data frames. It can only render maps stored as Shapefiles or Simple Features.

    Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:

    1. Updating choroplethr to be able to render maps stored in the Simple Features format.
    2. Migrating choroplethr’s existing maps to the Simple Features format.

    After that, I can start experimenting with adding interactive graphics support to choroplethr.

    Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved.

    Thoughts on the CRAN Ecosystem

    This issue has caused me to reflect a bit about the stability of the CRAN ecosystem.

    ggplot2 is used by about 1,700 packages. It’s unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25?
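
    One way to get a current count yourself (a quick sketch; the number changes as CRAN changes):

    
    # count CRAN packages that Depend on or Import ggplot2
    db <- available.packages(repos = "https://cloud.r-project.org")
    revdeps <- tools::package_dependencies("ggplot2", db = db,
                                           which = c("Depends", "Imports"),
                                           reverse = TRUE)
    length(revdeps$ggplot2)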

    And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated.

    This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user’s code.

    The post New Version of ggplot2 appeared first on AriLamstein.com.

    To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.


    Does financial support in Australia favour residents born elsewhere? Responding to racism with data

    By Simon Jackson

    erp_over_time-1.png

    (This article was first published on blogR, and kindly contributed to R-bloggers)

    Seeing a racist outburst made me wonder whether the Australian Government unfairly supports people based on their background. Using data from the Australian Government and Bureau of Statistics, I couldn’t find compelling evidence of this being true. Don’t believe me? Read on and see what you make of the data.

    Australian racism goes viral, again

    Australian racism went viral again this year when a man was filmed abusing staff at Centrelink, which delivers social security payments and services to Australians (see story here). The man yells that he didn’t vote for multiculturalism and that Centrelink is supporting everyone except “Australians”. It is distressing to watch, especially as someone whose ancestors found a home in Australia having escaped persecution. He can’t take it back, but the man did publicly apologise and may be suffering from mental illness (see story here).

    This topic is still polarising. Many of us want to vilify this man while others may applaud him. But hate begets hate, and fighting fire with fire makes things worse. As a data scientist, the best way I know to respond to the assumptions and stereotypes that fuel racism is with evidence. So, without prejudice, let us investigate the data and uncover to whom the Australian government provides support through Centrelink.

    Centrelink supports Australians, so who are we talking about?

    With rare exceptions, Centrelink supports Australian residents living in Australia (see here and here). So, the claim that Centrelink supports everyone but Australians is misguided. Perhaps the reference to “multiculturalism” can direct us to a more testable question. Centrelink offers support to Australian residents who can be born anywhere in the world. So in this article, I’ll use publicly accessible data to investigate differences in support given to residents born in Australia or elsewhere.

    Estimated Residential Population

    The Figure below shows changes in Australia’s Estimated Residential Population, which is an official value published by the Australian Bureau of Statistics and used for policy formation and decision making.

    The residential population has been increasing from about 17 million in 1992 to over 24 million in 2016. In contrast, the percentage of residents who are Australian-born has decreased from 76.9% to 71.5%. This will guide our sense of whether Centrelink payments are unbiased.

    As a side note, Census statistics reported that the percentage of Australian-born residents in 2016 was 66.7% (4.8% lower than the official estimate above). This discrepancy is the result of the Australian Bureau of Statistics making adjustments that you can learn about here.

    All Centrelink Payments

    Centrelink data is published quarterly and has included country-of-birth breakdowns since December 2016 (which aligns with the last available population data reported above). At this time, Centrelink made 15 million payments to Australian residents.

    In December 2016, 71.5% of Australia’s Estimated Residential population was Australian-born. Comparably, 68.8% of all Centrelink payments went to Australian-born residents.

    The data shows that Centrelink payments are made to residents born in Australia or elsewhere in approximately the same proportions as these groups are represented in the population. The difference of a couple of percent indicates that slightly fewer payments were going to Australian-born residents than we’d expect. As we’ll see in the following section, this difference can be almost wholly accounted for by the Age Pension. Still, the difference is small enough to negate the claim that Centrelink substantially favours residents born outside Australia.

    Breakdown by Type

    It’s also possible to break down these total numbers into the specific payment types shown below (detailed list here).

    It’s expected that these payment types, which support specific needs, will show biases in favour of certain groups. For example, ABSTUDY supports study costs and housing for Aboriginal or Torres Strait Islander residents. This should mostly go to Australian-born residents. To investigate, we can extend the Figure above to include the number of Australian-born recipients:

    payment_type_total_v_aus-1.png

    Looking at this Figure, most Centrelink payments fall along the dotted line, which is what we’d expect from a fair system (if 71.5% of the recipients were Australian-born).

    The outlier is the Age Pension, which falls below the line. More recipients of the Age Pension are born outside Australia than is reflected in the total population. I cannot say from this data alone why there is some bias in the Age Pension and perhaps a knowledgeable reader can comment. Nonetheless, this discrepancy is large enough that removing the Age Pension from consideration results in 70.5% of all other Centrelink payments going to Australian-born residents – almost exactly the proportion in the population.

    Ignoring Total Numbers

    The Figure below shows the percentage of Australian-born recipients for each payment type, ignoring totals.

    payment_type_p_aus-1.png

    At the upper end of the scale, we can see Australian-born recipients being over-represented for ABSTUDY and Youth Allowance payments. At the lower end, residents who are born outside Australia are over-represented for Wife and Widow pension and allowance.

    These payments with large biases (in either direction) have some common features. They have very specific eligibility criteria and are among the least-awarded services (shown in earlier Figures). Furthermore, the granting of payments to new recipients has been stopped in some cases such as the Wife Pension.

    These findings are consistent with the expectation that specific types of payments should be biased in specific ways. It also shows that substantial biases only arise for specific payments that are awarded to very few individuals.

    Concluding remarks

    In response to a racist outburst, I sought out publicly available data to investigate whether there was evidence that the Australian Government unfairly supported residents based on their country of origin. I found that the percentage of residents born outside Australia has increased over time. However, with the minor exception of the Age Pension (which the outraged man was not eligible for), residents born in Australia or elsewhere were fairly represented in the total number of Centrelink payments.

    I’d like to thank the Australian Government and Australian Bureau of Statistics for publicising this data and making it possible for me to respond to racism with evidence. If you’d like to reproduce this work or dig into the data yourself, everything from explaining where I got the data to create this article is freely available on GitHub. You can also keep in touch with me on LinkedIn or by following @drsimonj on Twitter.

    To leave a comment for the author, please follow the link and comment on their blog: blogR.


    Summertime Fun: R Graphs with LaCroixColoR, Magick Animations + More!

    By Laura Ellis


    (This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

    Now that we are back from the Memorial day long weekend, does anyone else feel like we’ve kicked off summer? I know that summer doesn’t begin officially until June 21, but I am already in the spirit. It might seem strange to “be in the spirit” since I’m living in Austin, Texas and we get temperatures up to 95-99 degrees Fahrenheit (35 to 37 degrees Celsius). But I think as a Canadian, I’ve been programmed to believe summer is fun. Heat is FUN darn it! I mean what is really to complain about? We’ve got AC, water parks, and LaCroix to keep us cool.

    Image courtesy of medium.com

    LaCroix is for Summer

    To get in the mood for summer, this blog is LaCroix themed! We will be exploring the stock prices of National Beverage Corp (FIZZ), which is the LaCroix parent company. Data was provided courtesy of nasdaq.com. We will be exploring the data in the color palette of my favorite LaCroix flavor: Peach Pear. This is made possible with the new LaCroixColoR package created by Johannes Björk. We will then be adding appropriate gif stickers to the graphs, courtesy of giphy.com and the magick package created by Jeroen Ooms. Note that when I was creating this blog, I referenced both the magick tutorial and an awesome “how to” blog on adding animations by Daniel Hadley.

    Set Up

    The first thing we need to do is install and load all required packages for our summertime fun work with R. These are all of the install.packages() and library() commands below. Note that some packages can be installed directly via CRAN and some need to be installed from github via the devtools package. I wrote a blog on navigating various R package install issues here. We then download the data set from my github profile through the fread() function. We can then create a few helper date columns which make it easier to display in the graph.

    ######## Set Up Work ########
    
    #Install necessary packages
    install.packages("ggplot2")
    install.packages("magrittr")
    install.packages("magick")
    install.packages("devtools")
    install.packages("curl")
    install.packages("data.table")
    install.packages("lubridate")
    install.packages("devtools")
    library(devtools)
    install_github("johannesbjork/LaCroixColoR")
    
    #Load necessary packages
    library(LaCroixColoR)
    library(ggplot2)
    library(magick)
    library(magrittr) 
    library(curl)
    library(data.table)
    library(lubridate)
    
    #Bring in the nasdaq  data from Nasdaq.com for the LaCroix parent company:
    #National Beverage Corp
    #https://www.nasdaq.com/symbol/fizz/historical
    
    df <- fread('https://raw.githubusercontent.com/lgellis/MiscTutorial/master/lacroix/FIZZ_Stock.csv')
    attach(df)
    # helper date columns (the 'date' column name is assumed from the Nasdaq CSV)
    df$mdy <- mdy(df$date)
    df$yr  <- year(df$mdy)
    

    Create the LaCroix plot with the LaCroix color palette

    Next, we want to create our graph which plots the FIZZ stock price over time. To give the graph a more appropriate LaCroix style flair, we will employ the LaCroixColoR package. We will call on the lacroix_palette function when assigning the colors through the scale_color_manual function in ggplot2. When calling the lacroix_palette function, we need to specify a flavor and of course I picked my favorite flavor: Peach Pear!

    ######## Image One - The LaCroix Curve ########
    
    # Create base plot (the aesthetic column names below are assumed, and the
    # colours come from the Peach Pear palette via scale_color_manual)
    fizz <- ggplot(df, aes(x = mdy, y = close, color = factor(yr))) +
      geom_point() +
      scale_color_manual(values = lacroix_palette("PeachPear", n = 6, type = "continuous"))
    fizz
    

    LaCroix-FIZZ.png

    Adding Animations

    Now it's time to inject even more fun into these graphs – ANIMATIONS! In case you are wondering, yes this is the most important thing to spend your time doing. Come on this is fun! Who doesn't want to have a cartoon character climbing their graph? Or have Vincent Vega pointing at your x-axis? This type of activity is known as data nerd humor, or graph comedy.

    In this blog, we are breaking the graph into three parts of the stock journey and adding an animation for each phase: initial stock climb, climbing the mountain and the sad decline.

    To add an animation to your saved plot, there are 5 key parts (see the sketch after this list):

    • Bring in your gif and apply any transformations (scale, rotation, etc)
    • Set the background image
    • Flatten the animation and background into frames
    • Turn the frame into an animation
    • Print and save
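
    Putting those parts together looks roughly like the sketch below. The file names and the offset are stand-ins (the real gif URLs come from giphy.com), and the base plot is assumed to have been saved to disk first, e.g. with ggsave():

    
    # read the saved base plot and the gif sticker (file names are placeholders)
    background <- image_read("LaCroix-FIZZ.png") %>%
      image_scale("1000")
    sticker <- image_read("happy-dance.gif") %>%
      image_scale("150") %>%
      image_rotate(-30)
    
    # composite every frame of the gif onto the static plot
    frames <- image_join(lapply(seq_along(sticker), function(i) {
      image_composite(background, sticker[i], offset = "+600+300")
    }))
    
    # turn the frames into an animation, then print and save it
    animation <- image_animate(frames, fps = 10)
    animation
    image_write(animation, "laCroixImage1.gif")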

    Animation 1 – The curve starts climbing

    For this image, I wanted to bring in a sticker with a lot of zest! This lucky guy just realized his stock is climbing. He is stoked and doing a happy dance.

    #Bring in the gif - scale and rotate
    #(the gif file name is a placeholder for the original giphy URL)
    laCroixAnimation <- image_read("happy-dance.gif") %>%
      image_scale("150") %>%
      image_rotate(-30)
    laCroixAnimation
    
    # Combine the plot and animation
    # Set the Background image (plot file name assumed; the compositing and
    # animation steps follow the sketch shown above)
    background <- image_read("LaCroix-FIZZ.png") %>% image_scale("1000")
    

    laCroixImage1.gif

    Animation 2 – Climbing the Mountain

    For this image, I wanted to bring in a sticker with an earned sense of confidence. This stock climb is steep but the little buddy is sticking with it. He deserves to give a triumphant wave.

    #Use base plot previously created
    #Bring in the gif - scale and rotate (gif file name is a placeholder)
    laCroixAnimation <- image_read("triumphant-wave.gif") %>%
      image_scale("300") %>%
      image_rotate(10)
    laCroixAnimation
    # Combine the plot and animation
    # Background image (plot file name assumed; see the sketch above)
    background <- image_read("LaCroix-FIZZ.png") %>% image_scale("1000")
    

    laCroixImage2.gif

    Animation 3 – The Sad Decline

    For this image, I wanted to bring in a sticker that reflects the sad realization of a stock hitting somewhat of a free fall. I am doing my part to boost the LaCroix stock prices, but I'm only one woman. And therefore, we need a crying cat at the end of this curve to reflect the cooling off of stock prices in 2017/2018.

    ######## Image Three - Sad Decline ########
    
    #Use base plot previously created
    #Bring in the gif - scale (gif file name is a placeholder)
    laCroixAnimation <- image_read("crying-cat.gif") %>%
      image_scale("80")
    laCroixAnimation
    
    # Background image (plot file name assumed; see the sketch above)
    background <- image_read("LaCroix-FIZZ.png") %>% image_scale("1000")
    

    laCroixImage3.gif

    Thank You and Enjoy Your LaCroix

    Thanks for reading along while we took a silly little journey through LaCroix stock prices, the LaCroixColoR package and Magick animations. Please feel free to let me know your thoughts in the comments or on twitter.

    Note that the full code is available on my github repo. If you have trouble downloading the file from github, go to the main page of the repo and select “Clone or Download” and then “Download Zip”.

    To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data.


    A recipe for recipes

    By That’s so Random


    (This article was first published on That’s so Random, and kindly contributed to R-bloggers)

    If you build statistical or machine learning models, the recipes package can be useful for data preparation. A recipe object is a container that holds all the steps that should be performed to go from the raw data set to the set that is fed into a model algorithm. Once your recipe is ready it can be executed on a data set at once to perform all the steps: not only on the train set on which the recipe was created, but also on new data, such as test sets and data that should be scored by the model. This assures that new data gets the exact same preparation as the train set, and thus can be validly scored by the learned model. The author of recipes, Max Kuhn, has provided abundant material to familiarize yourself with the richness of the package, see here, here, and here. I will not dwell on how to use the package. Rather, I’d like to share what in my opinion is a good way to create new steps and checks for the package1. Use of the package is probably intuitive. Developing new steps and checks, however, does require a little more understanding of the package’s inner workings. With this procedure, or recipe if you like, I hope you will find adding (and maybe even contributing) your own steps and checks becomes easier and more organized.

    S3 classes in recipes

    Let’s build a very simple recipe:

    library(tidyverse)
    library(recipes)
    rec <- recipe(mtcars) %>%
      step_center(everything()) %>%
      step_scale(everything())  %>%
      check_missing(everything())
    rec %>% class()
    
    ## [1] "recipe"
    

    As mentioned, a recipe is a container for the steps and checks. It is a list of class recipe on which the prep, bake, print and tidy methods do the work as described in their respective documentation files. The steps and checks added to the recipe are stored inside this list. As you can see below, each of them is of its own subclass, as well as of the generic step or check class.

    rec$steps %>% map(class)
    
    ## [[1]]
    ## [1] "step_center" "step"       
    ## 
    ## [[2]]
    ## [1] "step_scale" "step"      
    ## 
    ## [[3]]
    ## [1] "check_missing" "check"
    

    Each subclass has the same four methods defined as the recipe class. When one of the methods is called on the recipe, it will call the same method on all the steps and checks that are added to the recipe. (The exception is prep.recipe, which not only calls prep on all the steps and checks, but also bake.) This means that when implementing a new step or check, you should provide these four methods. Additionally, we’ll need the function that is called by the user to add it to the recipe object and a constructor (if you are not sure what this is, see 15.3.1 of Advanced R by Hadley Wickham).

    The recipes for recipes

    When writing a new step or check you will probably be inclined to copy-paste an existing step and start tweaking it from the top: first writing the function, then the constructor, and then the methods one by one. I think this is suboptimal and can get messy quickly. My preferred way is to start by not bothering about recipes at all, but to write a general function that does the preparation work on a single vector. Once you are happy with this function, you sit and think about which arguments to this function should be provided upfront. These should be added as arguments to the function that is called by the user. Next you think about which function arguments are statistics that should be learned on the train set in the prep part of the recipe. You then go on and write the constructor that is called by both the main function and the prep method. Only then do you write the function that is called by the user. Your custom step or check is completed by writing the four methods; since the functionality is already created, these are mostly bookkeeping.

    For both checks and steps I made a skeleton available, in which all the “overhead” that should be in a new step or check is present. This is more convenient than copy-pasting an existing example and trying to figure out what is step specific and should be erased. You’ll find the skeletons on github. Note that for creating your own steps and checks you should clone the package source code, since the helper functions that are used are not exported by the package.

    Putting it to practice

    We are going to do two examples in which the recipe for recipes is applied.

    Example 1: A signed log

    First up is a signed log (inspired by Practical Data Science with R), which takes the log of the absolute value of a variable and multiplies it by the original sign, enabling us to take logs of negative values. If a variable is between -1 and 1, we’ll set it to 0, otherwise things get messy.

    1) preparing the function

    This is what the function on a vector could look like, if we did not bother about recipes:

    signed_log <- function(x, base = exp(1)) {
      ifelse(abs(x) < 1, 
             0, 
             sign(x) * log(abs(x), base = base))
    }
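
    A quick sanity check on a plain vector (values picked for illustration) before wiring it into recipes:

    
    signed_log(c(-10, -0.5, 0.5, 10))
    
    ## [1] -2.302585  0.000000  0.000000  2.302585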
    

    2) think about the arguments

    The only argument of the function is base, that should be provided upfront when adding the step to a recipe object. There are no statistics to be learned in the prep step.

    3) the constructor

    Now we are going to write the first of the recipes functions. This is the constructor that produces new instances of the object of class step_signed_log. The first four arguments are present in each step or check; they are therefore part of the skeletons. The terms argument will hold the information on which columns the step should be performed. For role, trained and skip, please see the documentation in one of the skeletons. base is step_signed_log specific, as used in 1). In prep.step_signed_log the tidyselect arguments will be converted to a character vector holding the actual column names. columns will store these names in the step_signed_log object. This container argument is not necessary if the column names are also present in another way. For instance, step_center has the argument means, which will be populated with the means of the variables of the train set by its prep method. In the bake method the names of the columns to be prepared are already provided by the names of the means, so there is no need to use the columns argument.

    step_signed_log_new <-
      function(terms   = NULL,
               role    = NA,
               skip    = FALSE,
               trained = FALSE,
               base    = NULL,
               columns = NULL) {
        step(
          subclass = "signed_log",
          terms    = terms,
          role     = role,
          skip     = skip,
          trained  = trained,
          base     = base,
          columns  = columns
        )
      }
    

    4) the function to add it to the recipe

    Next up is the function that will be called by the user to add the step to its recipe. You’ll see the internal helper function add_step is called. It will expand the recipe with the step_signed_log object that is produced by the constructor we just created.

    step_signed_log <-
      function(recipe,
               ...,
               role    = NA,
               skip    = FALSE,
               trained = FALSE,
               base    = exp(1),
               columns = NULL) {
        add_step(
          recipe,
          step_signed_log_new(
            terms   = ellipse_check(...),
            role    = role,
            skip    = skip,
            trained = trained,
            base    = base,
            columns = columns
          )
        )
      }
    

    5) the prep method

    As recognized in 2), we don’t have to do much in the prep method of this particular step, since the preparation of new sets does not depend on statistics learned on the train set. The only thing we do here is apply the internal terms_select function to replace the tidyselect selections with the actual names of the columns on which step_signed_log should be applied. We call the constructor again, indicating that the step is trained and supplying the column names via the columns argument.

    prep.step_signed_log <- function(x,
                                     training,
                                     info = NULL, 
                                      ...) {
      col_names <- terms_select(x$terms, info = info)
      step_signed_log_new(
        terms   = x$terms,
        role    = x$role,
        skip    = x$skip,
        trained = TRUE,
        base    = x$base,
        columns = col_names
      )
    }
    

    6) the bake method

    We are now ready to apply the baking function, designed at 1), inside the recipes framework. We loop through the variables, apply the function and return the updated data frame.

    bake.step_signed_log <- function(object,
                                     newdata,
                                     ...) {
      col_names <- object$columns
      for (i in seq_along(col_names)) {
        col <- newdata[[ col_names[i] ]]
        newdata[, col_names[i]] <-
          ifelse(abs(col) < 1, 
                 0, 
                 sign(col) * log(abs(col), base = object$base))
      }
      as_tibble(newdata)
    }
    

    7) the print method

    This assures pretty printing of the recipe object to which step_signed_log is added. You use the internal printer function with a message specific for the step.

    print.step_signed_log <-
      function(x, width = max(20, options()$width - 30), ...) {
        cat("Taking the signed log for ", sep = "")
        printer(x$columns, x$terms, x$trained, width = width)
        invisible(x)
    }
    

    8) the tidy method

    Finally, tidy will add a line for this step to the data frame when the tidy method is called on a recipe.

    tidy.step_signed_log <- function(x, ...) {
      if (is_trained(x)) {
        res <- tibble(terms = x$columns)
      } else {
        res <- tibble(terms = sel2char(x$terms))
      }
      res
    }
    

    Let’s do a quick check to see if it works as expected

    recipe(data_frame(x = 1)) %>% 
      step_signed_log(x) %>% 
      prep() %>% 
      bake(data_frame(x = -3:3))
    
    ## # A tibble: 7 x 1
    ##            x
    ##        
    ## 1 -1.0986123
    ## 2 -0.6931472
    ## 3  0.0000000
    ## 4  0.0000000
    ## 5  0.0000000
    ## 6  0.6931472
    ## 7  1.0986123
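
    The tidy method can be exercised the same way; once the recipe is prepped, it reports the column the step was trained on (a small sketch):

    
    recipe(data_frame(x = 1)) %>% 
      step_signed_log(x) %>% 
      prep() %>% 
      tidy(number = 1)
    # returns a one-row tibble with a terms column containing "x"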
    

    Example 2: A range check

    Model predictions might be invalid when the range of a variable in new data is shifted from the range of the variable in the train set. Let’s do a second example in which we check whether the range of a numeric variable is approximately equal to the range of the variable in the train set. We do so by checking that the variable’s minimum value in the new data is not smaller than its minimum value in the train set, and that its maximum value in the new data does not exceed the maximum in the train set. We allow for some slack (a proportion of the variable’s range in the train set) to account for natural variation.
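
    To make the slack concrete: with the default slack_prop of 0.05 and a train range of 0 to 100 (the situation in the final example of this post), the permitted range in new data becomes -5 to 105:

    
    lower <- 0; upper <- 100; slack_prop <- 0.05
    slack <- (upper - lower) * slack_prop  # 5
    c(lower - slack, upper + slack)        # -5 105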

    1) preparing the function

    As mentioned, checks are about throwing informative errors if assumptions are not met. This is a function we could apply on new variables, without bothering about recipes:

    range_check_func <- function(x,
                                 lower,
                                 upper,
                                 slack_prop = 0.05,
                                 colname = "x") {
      min_x <- min(x)
      max_x <- max(x)
      slack <- (upper - lower) * slack_prop
      if (min_x < (lower - slack) & max_x > (upper + slack)) {
        stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack,
             "\n", "max x is ", max_x, ", upper bound is ", upper + slack, 
             call. = FALSE)
      } else if (min_x < (lower - slack)) {
        stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack, 
             call. = FALSE)
      } else if (max_x > (upper + slack)) {
        stop("max ", colname, " is ", max_x, ", upper bound is ", upper + slack, 
             call. = FALSE)
      }
    }
    

    2) think about the arguments

    The slack_prop is a choice that the user of the check should make upfront. It is thus an argument of check_range. Then there are two statistics to be learned in the prep method: lower and upper. These should be arguments of the function and the constructor as well. However, when calling the function these are always NULL; they are container arguments that will be filled when prep.check_range is called.

    3) the constructor

    We start again with the four arguments present in every step or check. Subsequently we add the three arguments that we recognized to be part of the check.

    check_range_new <-
      function(terms = NULL,
               role  = NA,
               skip    = FALSE,
               trained = FALSE,
               lower   = NULL,
               upper   = NULL,
               slack_prop = NULL) {
        check(subclass = "range",
              terms    = terms,
              role     = role,
              trained  = trained,
              lower    = lower,
              upper    = upper,
              slack_prop = slack_prop)
      }
    

    4) the function to add it to the recipe

    As we know by now, it is just calling the constructor and adding it to the recipe.

    check_range <-
      function(recipe,
               ...,
               role = NA,
               skip    = FALSE,
               trained = FALSE,
               lower   = NULL,
               upper   = NULL,
               slack_prop = 0.05) {
        add_check(
          recipe,
          check_range_new(
            terms   = ellipse_check(...),
            role    = role,
            skip    = skip,
            trained = trained,
            lower   = lower,
            upper   = upper,
            slack_prop = slack_prop
          )
        )
      }
    

    5) the prep method

    Here the method is getting a lot more interesting, because we actually have work to do. We are calling vapply on each of the columns the check should be applied on, to derive the minimum and maximum. Again the constructor is called and the learned statistics are now populating the lower and upper arguments.

    prep.check_range <-
      function(x,
               training,
               info = NULL,
               ...) {
        col_names <- terms_select(x$terms, info = info)
        lower_vals <- vapply(training[ ,col_names], min, c(min = 1), 
                             na.rm = TRUE)
        upper_vals <- vapply(training[ ,col_names], max, c(max = 1), 
                             na.rm = TRUE)
        check_range_new(
          x$terms,
          role = x$role,
          trained = TRUE,
          lower   = lower_vals,
          upper   = upper_vals,
          slack_prop = x$slack_prop
        )
      }
    

    6) the bake method

    The hard work has been done already. We just get the columns on which to apply the check and check them with the function we created at 1).

    bake.check_range <- function(object,
                                 newdata,
                                 ...) {
      col_names <- names(object$lower)
      for (i in seq_along(col_names)) {
        colname <- col_names[i]
        range_check_func(newdata[[ colname ]],
                         object$lower[colname],
                         object$upper[colname],
                         object$slack_prop,
                         colname)
      }
      as_tibble(newdata)
    }
    

    7) the print method

    print.check_range <-
      function(x, width = max(20, options()$width - 30), ...) {
        cat("Checking range of ", sep = "")
        printer(names(x$lower), x$terms, x$trained, width = width)
        invisible(x)
      }
    

    8) the tidy method

    tidy.check_range <- function(x, ...) {
      if (is_trained(x)) {
        # the trained check stores its column names as the names of lower/upper
        res <- tibble(terms = names(x$lower))
      } else {
        res <- tibble(terms = sel2char(x$terms))
      }
      res
    }
    

    Again, we check quickly if it works

    cr1 <- data_frame(x = 0:100)
    cr2 <- data_frame(x = -1:101)
    cr3 <- data_frame(x = -6:100)
    cr4 <- data_frame(x = 0:106)
    recipe_cr <- recipe(cr1) %>% check_range(x) %>% prep()
    cr2_baked <- recipe_cr %>% bake(cr2)
    cr3_baked <- recipe_cr %>% bake(cr3)
    
    ## Error: min x is -6, lower bound is -5
    
    cr4_baked <- recipe_cr %>% bake(cr4)
    
    ## Error: max x is 106, upper bound is 105
    

    Conclusion

    If you would like to add your own data preparation steps and data checks to the recipes package, I advise you to do this in a structured way so you are not distracted by the bookkeeping while implementing the functionality. I propose eight subsequent parts to develop a new step or check:

    1) Create a function on a vector that could be applied in the bake method, but does not bother about recipes yet.
    2) Recognize which function arguments should be provided upfront and which should be learned in the prep method.
    3) Create a constructor in the form step_<name>_new or check_<name>_new.
    4) Create the actual function to add the step or check to a recipe, in the form step_<name> or check_<name>.
    5) Write the prep method.
    6) Write the bake method.
    7) Write the print method.
    8) Write the tidy method.

    As mentioned, the source code is maintained here. Make sure to clone the latest version to access the package internal functions.

    1. Together with Max I added the checks framework to the package. Where steps are transformations of variables, checks are assertions about them. If a check passes, nothing happens. If it fails, it will break the bake method of the recipe and throw an informative error.

    To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.
