When A Fire Starts to Burn – Fiery 1.0 released

By Data Imaginist – R posts

(This article was first published on Data Imaginist – R posts, and kindly contributed to R-bloggers)

I’m pleased to announce that fiery has been updated to version 1.0 and is now
available on CRAN. As the version bump suggests, this is a rather major update
to the package, fixing and improving upon the framework based on my experience
with it, as well as introducing a number of breaking changes. Below, I will go
through the major points of the update, but also give an overview of the
framework itself, as I did not have this blog when it was first released and
information on the framework is thus scarce on the internet (except this
nice little post
by Bob Rudis).

Significant changes in v1.0

The new version of fiery introduces both small and large changes. I’ll start
by listing the breaking changes that one should be aware of in existing fiery
servers, and then continue to describe other major changes.

Embracing reqres BREAKING

My reqres package was
recently released and has been
adopted by fiery as the interface for working with HTTP messaging. I have been
a bit torn on whether to build reqres into fiery or simply let
routr use it internally, but in the end
the benefits of a more powerful interface to HTTP requests and responses far
outweighed the added dependency and breaking change.

The change means that everywhere a request is handed to an event handler
(e.g. handlers listening to the request event), it is no longer a rook
environment that gets passed along but a Request object. The easiest fix in
existing code is to extract the rook environment from the Request object using
the origin field (this, of course, will not allow you to experience the joy of
reqres).
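For an existing handler, that quick fix might look something like the following sketch (the handler body is illustrative, not taken from the release; extra arguments are simply absorbed by ...):

app$on('request', function(server, request, ...) {
    rook <- request$origin  # the raw rook environment, as in pre-1.0 fiery
    # ... existing code that expects a rook environment goes here ...
})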

The change to reqres also brings other, smaller, changes to the code base.
header event handlers are now expected to return either TRUE or FALSE to
indicate whether to proceed or terminate, respectively. Prior to v1.0 they were
expected to return either NULL or a rook-compliant list response, but as
responses are now linked to requests, returning them no longer makes sense. In
the same vein, the return values of request event handlers are ignored, and the
response is not passed to after-request event handlers, as the response can be
extracted directly from the request.
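As a hedged sketch of the new contract, a header handler might look like this (the API-key rule is purely hypothetical, and the named arguments beyond ... are assumptions about fiery's handler convention):

app$on('header', function(server, id, request, ...) {
    # Hypothetical rule: reject requests that lack an API key header
    if (is.null(request$get_header('X-API-Key'))) {
        response <- request$respond()
        response$status <- 401L
        return(FALSE)  # terminate; the response is sent as-is
    }
    TRUE  # proceed to the request handlers
})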

Arguments from before-request and before-message event handlers BREAKING

The before-request and before-message events are fired prior to the actual
HTTP request and WebSocket message handling. The return values from any handlers
are passed on as arguments to the request and message handlers respectively,
so these events can be used to inject data into the main request and
message handling. Prior to v1.0 these values were passed in directly as named
arguments, but they are now passed in as a list in the arg_list argument. This
is both easier and more consistent to work with. An example of the change is:

# Old interface
app <- Fire$new()
app$on('before-request', function(...) {
    list(arg1 = 'Hello', arg2 = 'World')
})
app$on('request', function(arg1, arg2, ...) {
    message(arg1, ' ', arg2)
})

# New interface
app <- Fire$new()
app$on('before-request', function(...) {
    list(arg1 = 'Hello', arg2 = 'World')
})
app$on('request', function(arg_list, ...) {
    message(arg_list$arg1, ' ', arg_list$arg2)
})

As can be seen, the code ends up being a bit more verbose, but the argument list
becomes much more predictable.

Embracing snake_case BREAKING

When I first started developing fiery I was young and confused
(😜). Bottom line, I don’t think my
naming scheme was very elegant. While consistent (snake_case for methods and
camelCase for fields), the mix is a bit foreign, and I’ve decided to use this
major release to clean up the naming and use snake_case consistently
throughout fiery. This has the effect of renaming the triggerDir field to
trigger_dir and refreshRate to refresh_rate. Furthermore, the change is
taken to its conclusion by also changing the plugin interface, requiring
plugins to expose an on_attach() method rather than an onAttach() method.
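A minimal sketch of a plugin under the renamed interface might look like the following (the plugin itself, its name field, and the handler body are illustrative assumptions, not part of the release notes):

# A hypothetical logging plugin using the new on_attach() interface
logger <- list(
    name = 'logger',
    on_attach = function(server, ...) {
        server$on('request', function(server, request, ...) {
            message('Request received for ', request$path)
        })
    }
)
app$attach(logger)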

Keeping the event cycle in non-blocking mode

fiery supports running the server in both a blocking and a non-blocking way
(that is, whether or not control is returned to the user after the server is
started). Before v1.0 the two modes were not equal in their life cycle
events, as only the blocking server had support for cycle-start and cycle-end
events as well as handling of timed, delayed, and async evaluation. This has
changed, and the life cycle of an app running in the two different modes is now
the same. To achieve this fiery uses the
later package to continually schedule cycle
evaluation for execution. This means that no matter the timing, cycles will only
be executed when the R process is idle, and it also has the slight inconvenience
of not allowing you to stop the server as part of a cycle event (bug report here:
https://github.com/rstudio/httpuv/issues/78). Parallel to the refresh rate of
a blocking server, the refresh rate of a non-blocking server can be set using
the refresh_rate_nb field. By default it is longer than that of a blocking
server, to give the R process more room to receive instructions from the
console.
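For example, a quick sketch of tuning the non-blocking cycle rate (the value is arbitrary):

app <- Fire$new()
app$refresh_rate_nb <- 0.5  # seconds between cycles when running non-blocking
app$ignite(block = FALSE)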

Mounting a server

With v1.0 it is now possible to specify the root of a fiery server. The root
is the part of the URL path that is stripped from the path before requests are
sent on to the handlers. This means that it is possible to create sub-apps in
fiery that do not care at which location they are run. If e.g. the root is set
to /demo/app, then requests made for /demo/app/... will look like /...
internally, and switching the location of the app does not require any change in
the underlying app logic or routing. The root defaults to '' (nothing), but
can be changed with the root field.
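A minimal sketch of mounting an app under a sub-path (the paths are illustrative):

app <- Fire$new()
app$root <- '/demo/app'
# A request for /demo/app/predict now reaches the handlers as /predict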

Package documentation

Documentation can never be too good. The state of affairs for documenting
classes based on reference semantics is not perfect in R, and I still struggle
with the best setup. Still, the current iteration of the documentation is a vast
improvement compared to the previous release. Notable changes include separate
entries for documentation of events and plugins.

Grab bag

The host and port can now be set during construction using the host and port
arguments in Fire$new(). Fire objects now have a print method, making them
much nicer to look at. The host, port, and root are now advertised when a server
starts. WebSocket connections can now be closed from the server using the
close_ws_con method.
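A quick sketch of the constructor arguments and print method (the host and port values are arbitrary):

app <- Fire$new(host = '127.0.0.1', port = 9090)
print(app)  # Fire objects now have a print method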

A Fiery Overview

As promised in the beginning, I’ll end by giving an overview of how fiery is
used. I’ll do this by updating Bob’s prediction server to the bright future where
routr and reqres make life easy for you:

We’ll start by making our fancy AI-machine-learning model of linear
regressiveness:

set.seed(1492)
x <- rnorm(15)
y <- x + rnorm(15)
fit <- lm(y ~ x)
saveRDS(fit, "model.rds")

With this at our disposal, we can begin to build up our app:

library(fiery)
library(routr)
app <- Fire$new()

# When the app starts, we'll load the model we saved. Instead of
# polluting our namespace we'll use the internal data store

app$on('start', function(server, ...) {
  server$set_data('model', readRDS('model.rds'))
  message('Model loaded')
})

# Just to show off, we'll make it so that the model is automatically
# passed on to the request handlers

app$on('before-request', function(server, ...) {
    list(model = server$get_data('model'))
})

# Now comes the biggest deviation. We'll use routr to define our request
# logic, as this is much nicer
router <- RouteStack$new()
route <- Route$new()
router$add_route(route, 'main')

# We start with a catch-all route that provides a welcoming html page
route$add_handler('get', '*', function(request, response, keys, ...) {
    response$type <- 'html'
    response$status <- 200L
    response$body <- '<h1>All your AI are belong to us</h1>'
    TRUE
})

# Then on to the /info route
route$add_handler('get', '/info', function(request, response, keys, ...) {
    response$status <- 200L
    response$body <- structure(R.Version(), class = 'list')
    response$format(json = reqres::format_json())
    TRUE
})

# Lastly we add the /predict route
route$add_handler('get', '/predict', function(request, response, keys, arg_list, ...) {
    response$body <- predict(
        arg_list$model,
        data.frame(x = as.numeric(request$query$val)),
        se.fit = TRUE
    )
    response$status <- 200L
    response$format(json = reqres::format_json())
    TRUE
})

# And just to show off reqres file handling, we'll add a route
# for getting a model plot
route$add_handler('get', '/plot', function(request, response, keys, arg_list, ...) {
    f_path <- tempfile(fileext = '.png')
    png(f_path)
    plot(arg_list$model)
    dev.off()
    response$status <- 200L
    response$file <- f_path
    TRUE
})

# Finally we attach the router to the fiery server
app$attach(router)
app$ignite(block = FALSE)
## Fire started at 127.0.0.1:8080
## Model loaded

As can be seen, routr makes the request logic nice and compartmentalized,
while reqres makes it easy to work with HTTP messages. What is less apparent is
the work that fiery is doing underneath, but that is exactly the point. While
it is possible to use a lot of the advanced features in fiery, you don’t have
to – often it is as simple as building up a router and attaching it to a fiery
instance. Even WebSocket messaging can be offloaded to the router if you so
wish.

Of course a simple prediction service is easy to build in most frameworks –
it is the To-Do app of data science web server tutorials. I hope to get the time
to create some more fully fledged example apps soon. Next up in the fiery
stack pipeline is getting routr on CRAN as well, and then beginning work on some
of the plugins that will facilitate security, authentication, data storage, etc.

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist – R posts.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Buzzfeed trains an AI to find spy planes

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the data journalism team no doubt realizing that the Flightradar24 data set hosted many more stories of public interest, the challenge lay in separating routine, day-to-day aircraft traffic from the more unusual, covert activities.

So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to identify flight patterns similar to those of covert aircraft identified in their earlier “Spies in the Skies” story. When that model was applied to the almost 20,000 flights in the FlightRadar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were false positives, but further journalistic inquiry into the provenance of the registrations led to several interesting stories.

Using this model, Buzzfeed News identified several surveillance aircraft in action during a four-month period in late 2015. These included a spy plane operated by US Marshals to hunt drug cartels in Mexico; aircraft covertly registered to US Customs and Border Protection patrolling the US-Mexico border; and a US Navy contractor operating planes circling several points over land in the San Francisco Bay Area — ostensibly for harbor porpoise research.

You can learn more about the stories Buzzfeed News uncovered in the flight data here, and for details on the implementation of the AI model in R, follow the link below.

Github (Buzzfeed): BuzzFeed News Trained A Computer To Search For Hidden Spy Planes. This Is What We Found.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Working with air quality and meteorological data Exercises (Part-1)

By Nosha Assare

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use the openair package to work with and analyze air quality and meteorological data. This package provides tools to directly import data from the air quality measurement networks across the UK, as well as tools to analyse the data and produce reports. In this exercise set we will import and analyze data from the MY1 station, which is located on Marylebone Road in London, UK.

Answers to the exercises are available here.

Please install and load the package openair before starting the exercises.

Exercise 1
Import the MY1 data for the year 2016 and save it into a dataframe called my1data.

Exercise 2
Get basic statistical summaries of the my1data dataframe.

Exercise 3
Calculate monthly means of:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

You can use air quality data and weather patterns in combination with spatial data visualization. Learn more about spatial data in the online course
[Intermediate] Spatial Data Analysis with R, QGIS & More
. In this course you will learn how to:

  • Work with Spatial data and maps
  • Learn about different tools to develop spatial data next to R
  • And much more

Exercise 4
Calculate daily means of:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Exercise 5
Calculate daily maxima of:
a. nox
b. no

Source:: R News

Simple practice: basic maps with the Tidyverse

By Sharp Sight

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

To master data science, you need to practice.

This sounds easy enough, but in reality, many people have no idea how to practice.

A full explanation of how to practice is beyond the scope of this blog post, but I can give you a quick tip here:

You need to master the most important techniques, and then practice those techniques in small scripts until you know them backwards and forwards.

Your goal should be to be able to write code “with your eyes closed.”

With that in mind, I want to give you a few small scripts that you can practice. Here are a few small scripts to create a set of maps. (As you examine these, pay close attention to how simple they are. How many functions and commands do you actually need to know?)

Let’s start with a simple one: create a map of the United States.

# INSTALL PACKAGE: tidyverse
library(tidyverse)

# MAP USA
map_data("usa") %>% 
  ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon()

Three lines of code (four, if you count the library() function).

That’s it. Three lines to make a map of the US.

And if you change just 5 characters (change usa to state) you can generate a map of the states of the United States:

# MAP USA (STATES)
map_data("state") %>% 
  ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon()

If you add an additional line, you can use filter to subset your data and map only a few states.

# MAP CALIFORNIA, NEVADA, OREGON, WASHINGTON
# - to do this, we're using dplyr::filter()
#   ... otherwise, it's almost exactly the same as the previous code
map_data("state") %>% 
  filter(region %in% c("california","nevada","oregon","washington")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon()



What I want to emphasize is how easy this is if you just know how filter works and how the pipe operator (%>%) works.

If you make another simple change, you can create a county level map of those states:

# MAP CALIFORNIA, NEVADA, OREGON, WASHINGTON
#  (Counties)

map_data("county") %>% 
  filter(region %in% c("california","nevada","oregon","washington")) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon()



Finally, a few more changes can get you a map of the world:

# MAP WORLD

map_data("world") %>%
  ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon()



… and then a map of a single country, like Japan:

# MAP JAPAN

map_data("world") %>%
  filter(region == 'Japan') %>%
  ggplot(aes(x = long, y = lat, group = group)) +
    geom_polygon()



I want to point out again that all of these maps were created with very simple variations on our original 3 line script.

It really doesn’t take that much to achieve competence in creating basic maps like this.

Practice small to learn big

You might be asking: why bother practicing these? They’re not terribly useful.

You need to understand that large projects are built from dozens of small snippets of code like these.

Moreover, these little snippets of code are made up of a small set of functions that you can break down and learn one command at a time.

So the path to mastery involves first mastering syntax of small individual commands and functions. After you’ve memorized the syntax of individual functions and commands, little scripts like this give you an opportunity to put those commands together into something a little more complex.

Later, these little scripts can be put together with other code snippets to perform more complicated analyses.

If you can practice (and master) enough small scripts like this, it becomes dramatically easier to execute larger projects quickly and confidently.

For example, by mastering the little scripts above, you put yourself on a path to creating something more like this:



If you want to achieve real fluency, you need to practice small before you practice big. So find little scripts like this, break them down, and drill them until you can type them without hesitation. You’ll thank me later.

(You’re welcome.)

A quick exercise for you

Count up the commands and arguments that we used in the little scripts above.

For the sake of simplicity, don’t count the data features that you might need (like the “region” column, etc), just count the number of functions, commands, and arguments that you need to know.

How many? If you really, really pushed yourself, how long would it take to memorize those commands?

Leave a comment below and tell me.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need a plan.

You need to be highly systematic.

Sign up for our email list right now, and you’ll get our “Data Science Crash Course.”

In it you’ll discover:

  • A step-by-step learning plan
  • How to create the essential data visualizations
  • How to perform the essential data wrangling techniques
  • How to get started with machine learning
  • How much math you need to learn (and when to learn it)
  • And more …

SIGN UP NOW

The post Simple practice: basic maps with the Tidyverse appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Magick 1.0: 🎩 ✨🐇 Advanced Graphics and Image Processing in R

By Jeroen Ooms

drought

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Last week, version 1.0 of the magick package appeared on CRAN: an ambitious effort to modernize and simplify high quality image processing in R. This R package builds upon the Magick++ STL which exposes a powerful C++ API to the famous ImageMagick library.

The best place to start learning about magick is the vignette which gives a brief overview of the overwhelming amount of functionality in this package.

Towards Release 1.0

Last year around this time rOpenSci announced the first release of the magick package: a new powerful toolkit for image reading, writing, converting, editing, transformation, annotation, and animation in R. Since the initial release there have been several updates with additional functionality, and many useRs have started to discover the power of this package to take visualization in R to the next level.

For example Bob Rudis uses magick to visualize California drought data from the U.S. Drought Monitor (click on the image to go find out more):

R-ladies Lucy D’Agostino McGowan and Maëlle Salmon demonstrate how to make a beautiful collage:

collage

And Daniel P. Hadley lets Vincent Vega explain Cars:

travolta

Now, 1 year later, the 1.0 release marks an important milestone: the addition of a new native graphics device (which serves as a hybrid between a magick image object and an R plot) bridges the gap between graphics and image processing in R.

This blog post explains how the magick device allows you to seamlessly combine graphing with image processing in R. You can either use it to post-process your R graphics, or draw on imported images using the native R plotting machinery. We hope that this unified interface will make it easier to produce beautiful, reproducible images with R.

Native Magick Graphics

The image_graph() function opens a new graphics device similar to e.g. png() or x11(). It returns an image object to which the plot(s) will be written. Each page in the plotting device will become a frame (layer) in the image object.

# Produce image using graphics device
fig <- image_graph(res = 96)
ggplot2::qplot(mpg, wt, data = mtcars, colour = cyl)
dev.off()

The fig object now contains the image that we can easily post-process. For example we can overlay another image:

logo <- image_read("https://www.r-project.org/logo/Rlogo.png")
out <- image_composite(fig, image_scale(logo, "x150"), offset = "+80+380")

# Show preview
image_browse(out)

# Write to file
image_write(out, "myplot.png")

out

Drawing Device

The image_draw() function opens a graphics device to draw on top of an existing image using pixel coordinates.

# Open a file
library(magick)
frink <- image_read("https://jeroen.github.io/images/frink.png")
drawing <- image_draw(frink)

frink

We can now use R’s native low-level graphics functions for drawing on top of the image:

rect(20, 20, 200, 100, border = "red", lty = "dashed", lwd = 5)
abline(h = 300, col = 'blue', lwd = '10', lty = "dotted")
text(10, 250, "Hoiven-Glaven", family = "courier", cex = 4, srt = 90)
palette(rainbow(11, end = 0.9))
symbols(rep(200, 11), seq(0, 400, 40), circles = runif(11, 5, 35),
  bg = 1:11, inches = FALSE, add = TRUE)

At any point you can inspect the current result:

image_browse(drawing)

drawing

Once you are done you can close the device and save the result.

dev.off()
image_write(drawing, 'drawing.png')

By default image_draw() sets all margins to 0 and uses graphics coordinates to match image size in pixels (width x height) where (0,0) is the top left corner. Note that this means the y axis increases from top to bottom which is the opposite of typical graphics coordinates. You can override all this by passing custom xlim, ylim or mar values to image_draw().
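For instance, a hedged sketch of overriding those defaults so that drawing happens in normalized coordinates (the limits, text, and colour are arbitrary choices):

library(magick)
frink <- image_read("https://jeroen.github.io/images/frink.png")
drawing <- image_draw(frink, xlim = c(0, 1), ylim = c(0, 1))  # normalized coordinates
text(0.5, 0.5, "centered", col = "firebrick", cex = 3)        # (0.5, 0.5) is now the middle
dev.off()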

Animated Graphics

The graphics device supports multiple frames, which makes it easy to create animated graphics. The example below shows how you would implement the example from the very cool gganimate package using magick.

library(gapminder)
library(ggplot2)
library(magick)
img <- image_graph(res = 96)
datalist <- split(gapminder, gapminder$year)
out <- lapply(datalist, function(data){
  p <- ggplot(data, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
    scale_size("population", limits = range(gapminder$pop)) +
    scale_x_log10(limits = range(gapminder$gdpPercap)) +
    geom_point() + ylim(20, 90) +  ggtitle(data$year) + theme_classic()
  print(p)
})
dev.off()
animation <- image_animate(img, fps = 2)
image_write(animation, "animation.gif")

animation

We hope that the magick package can provide a more robust back-end for packages like gganimate to produce interactive graphics in R without requiring the user to manually install external image editing software.

Porting ImageMagick Commands to R

The magick 1.0 release now has the core image processing functionality that you expect from an image processing package. But there is still a lot of room for improvement to make magick the image processing package in R.

A lot of R users and packages currently shell out to ImageMagick command line tools for performing image manipulations. The goal is to support all these operations in the magick package, so that the images can be produced (and reproduced!) on any platform without requiring the user to install additional software.
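As a hedged illustration of the kind of porting meant here, a shell invocation such as convert input.png -resize 300 output.png could instead be written with functions magick already provides (the file names are placeholders):

library(magick)
img <- image_read("input.png")     # read the source image (placeholder file name)
img <- image_scale(img, "300")     # resize to 300 pixels wide, like `-resize 300`
image_write(img, "output.png")     # write the converted image back to disk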

Note that the ImageMagick library is over 26 years old and has accumulated an enormous number of features in those years. Porting all of this to R is quite a bit of work, for which feedback from users is important. If there is an ImageMagick operation that you would like to do in R but cannot figure out how, please open an issue on GitHub. If the functionality is not supported yet, we will try to add it to the next version.

Image Analysis

Currently magick is focused on generating and editing images. There is yet another, entirely different set of features we would like to support, related to analyzing images. Image analysis can involve anything from calculating color distributions to more sophisticated feature extraction and vision tools. I am not very familiar with this field, so again we could use suggestions from users and experts.

One feature that is already available is the image_ocr() function which extracts text from the image using the rOpenSci tesseract package. Another cool example of using image analysis is the collage package which calculates color histograms to select appropriate tile images for creating a collage.
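For example, a minimal sketch of the OCR feature (the file name is a placeholder, and the tesseract package must be installed):

library(magick)
scan <- image_read("scanned_page.png")  # placeholder file name
text <- image_ocr(scan)                 # returns the recognized text as a character string
cat(text)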

histogram

As part of supporting analysis tools we plan to extract the bitmap (raster) classes into a separate package. This will enable package authors to write R extensions that analyze and manipulate the raw image data without necessarily depending on magick. Yet the user can always rely on magick as a powerful toolkit to import/export images and graphics into such low-level bitmaps.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

set_na_where(): a nonstandard evaluation use case

By Higher Order Functions

Offscreen looks occurred every two or three samples.

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

In this post, I describe a recent case where I used rlang’s
tidy evaluation
system to do some data-cleaning. This example is not particularly involved, but
it demonstrates a basic but powerful idea: that we can capture the
expressions that a user writes, pass them around as data, and make some
:dizzy: magic :sparkles: happen. This technique in R is called
nonstandard evaluation.

Strange eyetracking data

Last week, I had to deal with a file with some eyetracking data from a
sequence-learning experiment. The eyetracker records the participant’s gaze
location at a rate of 60 frames per second—except for this weird file which
wrote out ~80 frames each second. In this kind of data, we have one row per
eyetracking sample, and each sample records a timestamp and the gaze location
:eyes: on the computer screen at each timestamp. In this particular dataset, we
have x and y gaze coordinates in pixels (both eyes averaged together,
GazeX and GazeY) or in screen proportions (for each eye, in the EyeCoord
columns).

library(dplyr)
library(ggplot2)
library(rlang)
# the data is bundled with an R package I wrote
# devtools::install_github("tjmahr/fillgaze")

df <- system.file("test-gaze.csv", package = "fillgaze") %>% 
  readr::read_csv() %>% 
  mutate(Time = Time - min(Time)) %>% 
  select(Time:REyeCoordY) %>% 
  round(3) %>% 
  mutate_at(vars(Time), round, 1) %>% 
  mutate_at(vars(GazeX, GazeY), round, 0)
df
#> # A tibble: 14,823 x 8
#>     Time Trial GazeX GazeY LEyeCoordX LEyeCoordY REyeCoordX REyeCoordY
#>    <dbl> <int> <dbl> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1   0.0     1  1176   643      0.659      0.589      0.566      0.602
#>  2   3.5     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#>  3  20.2     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#>  4  36.8     1  1184   648      0.664      0.593      0.570      0.606
#>  5  40.0     1  1225   617      0.685      0.564      0.591      0.579
#>  6  56.7     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#>  7  73.4     1  1188   641      0.665      0.587      0.572      0.600
#>  8  76.6     1  1204   621      0.674      0.568      0.580      0.582
#>  9  93.3     1 -1920 -1080     -1.000     -1.000     -1.000     -1.000
#> 10 109.9     1  1189   665      0.666      0.609      0.572      0.622
#> # ... with 14,813 more rows

In this particular eyetracking setup, offscreen looks are coded as negative gaze
coordinates, and what’s extra weird here is that every second or third point is
incorrectly placed offscreen. We see that in the frequent -1920 values in
GazeX. Plotting the first few x and y pixel locations shows the
pattern as well.

p <- ggplot(head(df, 40)) + 
  aes(x = Time) + 
  geom_hline(yintercept = 0, size = 2, color = "white") + 
  geom_point(aes(y = GazeX, color = "GazeX")) +
  geom_point(aes(y = GazeY, color = "GazeY")) + 
  labs(x = "Time (ms)", y = "Screen location (pixels)", 
       color = "Variable")

p + 
  annotate("text", x = 50, y = -200, 
           label = "offscreen", color = "grey20") + 
  annotate("text", x = 50, y = 200, 
           label = "onscreen", color = "grey20") 

It is physiologically impossible for a person’s gaze to oscillate so quickly and
with such magnitude (the gaze is tracked on a large screen display), so
obviously something weird was going on with the experiment software.

This file motivated me to develop a general-purpose package for interpolating
missing data in eyetracking experiments.
This package was always something I wanted to do, and this file moved it from
the someday list to the today list.

A function to recode values in many columns as NA

The first step in handling this problematic dataset is to convert the offscreen
values into actual missing (NA) values. Because we have several columns of
data, I wanted a succinct way to recode values in multiple columns into NA
values.

First, we sketch out the code we want to write when we’re done.

set_na_where <- function(data, ...) {
  # do things
}

set_na_where(
  data = df,
  GazeX = GazeX < -500 | 2200 < GazeX,
  GazeY = GazeY < -200 | 1200 < GazeY)

That is, after specifying the data, we list off an arbitrary number of column
names, and with each name, we provide a rule to determine whether a value in
that column is offscreen and should be set to NA. For example, we want every
value in GazeX where GazeX < -500 | 2200 < GazeX evaluates to TRUE to be replaced
with NA.

Bottling up magic spells

Lines of computer code are magic spells: We say the incantations and things
happen around us. Put more formally, the code contains expressions that are
evaluated in an environment.

hey  "Hello!"
message(hey)
#> Hello!

exists("x")
#> [1] FALSE

x <- pi ^ 2
exists("x")
#> [1] TRUE

print(x)
#> [1] 9.869604

stop("what are you doing?")
#> Error in eval(expr, envir, enclos): what are you doing?

In our function signature, function(data, ...), the expressions are collected
in the special “dots” argument (...). In normal circumstances, we can view the
contents of the dots by storing them in a list. Consider:

hello_dots <- function(...) {
  str(list(...))
}
hello_dots(x = pi, y = 1:10, z = NA)
#> List of 3
#>  $ x: num 3.14
#>  $ y: int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ z: logi NA

But we are not passing in regular data; we are passing expressions that need to be
evaluated in a particular location. Below, the magic words are uttered and we get
an error because they mention things that do not exist in the current environment.

hello_dots(GazeX = GazeX < -500 | 2200 < GazeX)
#> Error in str(list(...)): object 'GazeX' not found

What we need to do is prevent these words from being uttered until the time and
place are right. Nonstandard evaluation is a way of bottling up magic spells
and changing how or where they are cast—sometimes we even change the magic
words themselves. We bottle up or capture the expressions given by the user by
quoting them. quo() quotes a single expression, and quos() (plural) will
quote a list of expressions. Below, we capture the expressions stored in the
dots :speech_balloon: and then make sure that their names match column names in
the dataframe.

set_na_where <- function(data, ...) {
  dots <- quos(...)
  stopifnot(names(dots) %in% names(data), !anyDuplicated(names(dots)))
  
  dots
  # more to come
}

spells <- set_na_where(
  data = df,
  GazeX = GazeX < -500 | 2200 < GazeX, 
  GazeY = GazeY < -200 | 1200 < GazeY)
spells
#> $GazeX
#> <quosure: global>
#> ~GazeX < -500 | 2200 < GazeX
#> 
#> $GazeY
#> <quosure: global>
#> ~GazeY < -200 | 1200 < GazeY
#> 
#> attr(,"class")
#> [1] "quosures"

I call these results spells because it just contains the expressions stored as
data. We can interrogate these results like data. We can query the names of the
stored data, and we can extract values (the quoted expressions).

names(spells)
#> [1] "GazeX" "GazeY"
spells[[1]]
#> <quosure: global>
#> ~GazeX < -500 | 2200 < GazeX

Casting spells

We can cast a spell by evaluating an expression. To keep the incantation from
fizzling out, we specify that we want to evaluate the expression inside of the
dataframe. The function eval_tidy(expr, data) lets us do just that: evaluate
an expression expr inside of some data.

# Evaluate the first expression inside of the data
xs_to_set_na <- eval_tidy(spells[[1]], data = df)

# Just the first few bc there are 10000+ values
xs_to_set_na[1:20]
#>  [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
#> [12]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

In fact, we can evaluate them all at once by applying eval_tidy() on each
listed expression.

to_set_na <- lapply(spells, eval_tidy, data = df)
str(to_set_na)
#> List of 2
#>  $ GazeX: logi [1:14823] FALSE TRUE TRUE FALSE FALSE TRUE ...
#>  $ GazeY: logi [1:14823] FALSE TRUE TRUE FALSE FALSE TRUE ...

Finishing touches

Now, the rest of the function is straightforward. Evaluate each NA-rule on
the named columns, and then set each row where the rule is TRUE to NA.

set_na_where <- function(data, ...) {
  dots <- quos(...)
  stopifnot(names(dots) %in% names(data), !anyDuplicated(names(dots)))
  
  set_to_na <- lapply(dots, eval_tidy, data = data)
  
  for (col in names(set_to_na)) {
    data[set_to_na[[col]], col] <- NA
  }
  
  data
}

results <- set_na_where(
  data = df,
  GazeX = GazeX < -500 | 2200 < GazeX, 
  GazeY = GazeY < -200 | 1200 < GazeY)
results
#> # A tibble: 14,823 x 8
#>     Time Trial GazeX GazeY LEyeCoordX LEyeCoordY REyeCoordX REyeCoordY
#>    <dbl> <int> <dbl> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1   0.0     1  1176   643      0.659      0.589      0.566      0.602
#>  2   3.5     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#>  3  20.2     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#>  4  36.8     1  1184   648      0.664      0.593      0.570      0.606
#>  5  40.0     1  1225   617      0.685      0.564      0.591      0.579
#>  6  56.7     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#>  7  73.4     1  1188   641      0.665      0.587      0.572      0.600
#>  8  76.6     1  1204   621      0.674      0.568      0.580      0.582
#>  9  93.3     1    NA    NA     -1.000     -1.000     -1.000     -1.000
#> 10 109.9     1  1189   665      0.666      0.609      0.572      0.622
#> # ... with 14,813 more rows

Visually, we can see that the offscreen values are no longer plotted. Plus, we
are told that our data now has missing values.

# `plot %+% data`: replace the data in `plot` with `data`
p %+% head(results, 40)
#> Warning: Removed 15 rows containing missing values (geom_point).

#> Warning: Removed 15 rows containing missing values (geom_point).

Offscreen looks are no longer plotted.

One of the quirks about some eyetracking data is that during a blink, sometimes
the device will record the x location but not the y location. (I think this
happens because blinks move vertically so the horizontal detail can still be
inferred in a half-closed eye.) This effect shows up in the data when there are
more NA values for the y values than for the x values:

count_na <- function(data, ...) {
  subset <- select(data, ...)
  lapply(subset, function(xs) sum(is.na(xs)))
}

count_na(results, GazeX, GazeY)
#> $GazeX
#> [1] 2808
#> 
#> $GazeY
#> [1] 3064

We can equalize these counts by running the function a second time with new rules.

df %>% 
  set_na_where(
    GazeX = GazeX < -500 | 2200 < GazeX, 
    GazeY = GazeY < -200 | 1200 < GazeY) %>% 
  set_na_where(
    GazeX = is.na(GazeY), 
    GazeY = is.na(GazeX)) %>% 
  count_na(GazeX, GazeY)
#> $GazeX
#> [1] 3069
#> 
#> $GazeY
#> [1] 3069

Alternatively, we can do this all at once by using the same NA-filtering rule
on GazeX and GazeY.

df %>% 
  set_na_where(
    GazeX = GazeX < -500 | 2200 < GazeX | GazeY < -200 | 1200 < GazeY, 
    GazeY = GazeX < -500 | 2200 < GazeX | GazeY < -200 | 1200 < GazeY) %>% 
  count_na(GazeX, GazeY)
#> $GazeX
#> [1] 3069
#> 
#> $GazeY
#> [1] 3069

These last examples, where we compare different rules, showcase how nonstandard
evaluation lets us write in a very succinct and convenient manner and quickly
iterate over possible rules. Works like magic, indeed.

To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

#9: Compacting your Shared Libraries

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the ninth post in the recognisably rancid R randomness series, or R4 for short. Following on the heels of last week’s post, we aim to look into the shared libraries created by R.

We love the R build process. It is robust, cross-platform, reliable and rather predictable. It. Just. Works.

One minor issue, though, which has come up once or twice in the past is the (in)ability to fully control all compilation options. R will always recall CFLAGS, CXXFLAGS, etc. as used when it was compiled. This often entails the -g flag for debugging, which can seriously inflate the size of the generated object code. And once stored in ${RHOME}/etc/Makeconf we cannot override these values on the fly.

But there is always a way. Sometimes even two.

The first is local and can be used via the (personal) ~/.R/Makevars file (about which I will have more to say in another post). But something I have been using quite a bit lately uses the flags for the shared library linker. Given that we can have different code flavours and compilation choices—between C, Fortran and the different C++ standards—one can end up with a few lines. I currently use the following, which uses -Wl, to pass the -S (or --strip-debug) option to the linker (and also reiterates the desire for a shared library, presumably superfluous):

SHLIB_CXXLDFLAGS = -Wl,-S -shared
SHLIB_CXX11LDFLAGS = -Wl,-S -shared
SHLIB_CXX14LDFLAGS = -Wl,-S -shared
SHLIB_FCLDFLAGS = -Wl,-S -shared
SHLIB_LDFLAGS = -Wl,-S -shared

Let’s consider an example: my most recently uploaded package RProtoBuf. Built under a standard 64-bit Linux setup (Ubuntu 17.04, g++ 6.3) and not using the above, we end up with a library containing 12 megabytes (!!) of object code:

edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ ls -lh src/RProtoBuf.so
-rwxr-xr-x 1 edd edd 12M Aug 14 20:22 src/RProtoBuf.so
edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ 

However, if we use the flags shown above in .R/Makevars, we end up with much less:

edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ ls -lh src/RProtoBuf.so 
-rwxr-xr-x 1 edd edd 626K Aug 14 20:29 src/RProtoBuf.so
edd@brad:~/git/rprotobuf(feature/fewer_warnings)$ 

So we reduced the size from 12mb to 0.6mb, an 18-fold decrease. And the file tool still shows the file as ‘not stripped’ as it still contains the symbols. Only debugging information was removed.

What reduction in size can one expect, generally speaking? I have seen substantial reductions for C++ code, particularly when using templated code. More old-fashioned C code will be less affected. It seems a little difficult to tell—but this method is now my default build setting as I continually find rather substantial reductions in size (as I tend to work mostly with C++-based packages).

The second option only occurred to me this evening, and complements the first, which is after all only applicable locally via the ~/.R/Makevars file. What if we wanted it to affect each installation of a package? The following addition to its src/Makevars should do:

strippedLib: $(SHLIB)
        if test -e "/usr/bin/strip"; then /usr/bin/strip --strip-debug $(SHLIB); fi

.phony: strippedLib

We declare a new Makefile target strippedLib. By making it dependent on $(SHLIB), we ensure the standard target of this Makefile is built. And by making the target .phony we ensure it will always be executed. It simply tests for the strip tool and invokes it on the library after it has been built. Needless to say, we get the same reduction in size. And this scheme may even pass muster with CRAN, but I have not yet tried.

Lastly, an acknowledgement. Everything in this post has benefited from discussion with my former colleague Dan Dillon, who went as far as setting up tooling in his r-stripper repository. What we have here may be simpler, but it would not have happened without what Dan had put together earlier.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

rstudio::conf(2018): Contributed talks, e-posters, and diversity scholarships

By Hadley Wickham

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

rstudio::conf, the conference on all things R and RStudio, will take place February 2 and 3, 2018 in San Diego, California, preceded by Training Days on January 31 and February 1. We are pleased to announce that this year’s conference includes contributed talks and e-posters, and diversity scholarships. More information below!

Contributed talks and e-posters

rstudio::conf() is accepting proposals for contributed talks and e-posters for the first time! Contributed talks are 20 minutes long, and will be scheduled alongside talks by RStudio employees and invited speakers. E-posters will be shown during the opening reception on Thursday evening: we’ll provide a big screen, power, and internet; you’ll provide a laptop with an innovative display or demo.

We are particularly interested in talks and e-posters that:

  • Showcase the use of R and RStudio’s tools to solve real problems.

  • Expand the tidyverse to reach new domains and audiences.

  • Communicate using R, whether it’s building on top of RMarkdown, Shiny, ggplot2, or something else altogether.

  • Discuss how to teach R effectively.

To present you’ll also need to register for the conference.

Apply now!

Applications close Sept 15, and you’ll be notified of our decision by Oct 1.

Diversity scholarships

We’re also continuing our tradition of diversity scholarships, and this year we’re doubling the program to twenty recipients. We will support underrepresented minorities in the R community by covering their registration (including workshops), travel, and accommodation.

Apply now!

Applications close Sept 15, and you’ll be notified of our decision by Oct 1.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Reproducibility: A cautionary tale from data journalism

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, Timo was surprised when the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different?

Image credit: Timo Grossenbacher

The version of R Timo was using had been updated, but that wasn’t the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown reason, a change in the dplyr package in the intervening period caused some data rows (shown in red above) to be deleted during the data preparation process, and so the results changed.

Timo was able to recreate the original results by forcing the script to run with package versions as they existed back in August 2015. This is easy to do with the checkpoint package: just add a line like

library(checkpoint); checkpoint("2015-08-11")

to the top of your R script. We have been taking daily snapshots of every R package on CRAN since September 2014 to address exactly this situation, and the checkpoint package makes it super-easy to find and install all of the packages you need to make your script reproducible, without changing your main R installation or affecting any other projects you may have. (The checkpoint package is available on CRAN, and also included with all editions of Microsoft R.)

I’ve been including a call to checkpoint at the top of most of my R scripts for several years now, and it’s saved me from failing scripts many times. Likewise, Timo has created a structure and process to support truly reproducible data analysis with R, and it advocates using the checkpoint package to manage package versions. You can find a description of the process here: A (truly) reproducible R workflow, and find the template on Github.

By the way, SRF Data — the data journalism arm of the national broadcaster in Switzerland — has published some outstanding stories over the past few years, and has even been nominated for Data Journalism Website of the Year. At the useR!2017 conference earlier this year, Timo presented several fascinating insights into the data journalism process at SRF Data, which you can see in his slides and talk (embedded below):

Timo Grossenbacher: This is what happens when you use different package versions, Larry! and A (truly) reproducible R workflow

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

acs v2.1.1 is now on CRAN

By Ari Lamstein

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

A new version of the acs package is now on CRAN. I recommend that all users of choroplethr update to this version. Here is how to do it:

update.packages()

packageVersion('acs')

[1] '2.1.1'

As a reminder, after updating the acs package you might need to reinstall your Census API key with ?api.key.install.

New Performance Issue

Internally, choroplethr uses the acs package to fetch demographic data from the Census Bureau’s API. Unfortunately, this version of the acs package introduces a performance issue (and solution) when fetching data. Here is an example of the problem:


library(choroplethr)

time_demographic_get = function()
{
    start.time = Sys.time()
    df = get_state_demographics()
    end.time = Sys.time()
    end.time - start.time
}

time_demographic_get() # 1.9 minutes

Performance Issue Fix

The fix for this performance issue is simply to call the function ?acs.tables.install. You only need to call this function once. Doing so will dramatically speed up the performance of choroplethr’s various “get_*_demographics” functions:


acs.tables.install()

time_demographic_get() # 9.4 seconds

A big thank you to Ezra Haber Glenn, the author of the acs package, for his continued work maintaining the package.

The post acs v2.1.1 is now on CRAN appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News