Microsoft R Open 3.4.3 now available

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.3 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R (version 3.4.3) and updates the bundled packages (specifically: checkpoint, curl, doParallel, foreach, and iterators) to new versions.

MRO is 100% compatible with all R packages. MRO 3.4.3 points to a fixed CRAN snapshot taken on January 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).
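For example, pinning a script to a specific snapshot is a one-liner. A minimal sketch (the date here is arbitrary):

library(checkpoint)
# install and load this script's packages from the MRAN snapshot of the given date
checkpoint("2017-12-01")

checkpoint() scans your project for library() and require() calls and installs those packages from the dated snapshot, which keeps analyses reproducible across MRO versions.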

MRO 3.4.3 is based on R 3.4.3, a minor update to the R engine (you can see the detailed list of updates to R here). This update is backwards-compatible with R 3.4.2 (and MRO 3.4.2), so you shouldn't encounter any new issues when upgrading.

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

MRAN: Download Microsoft R Open


PWFSLSmoke 1.0: Visualizing Wildfire Smoke Data

By Helen Miller

(This article was first published on R – Working With Data, and kindly contributed to R-bloggers)

Mazama Science has released the first official version (1.0) of the PWFSLSmoke R package for working with PM2.5 monitoring data. A beta version was released last year, along with an accompanying blog post. In this post, we discuss the purpose and uses of the PWFSLSmoke package and demonstrate some of the core functionality.

The PWFSLSmoke package provides tools to see how smoke affects communities across the country through analyzing and visualizing PM2.5 data. Its capabilities include:

  • providing a versatile data model for dealing with PM2.5 data with the ws_monitor object
  • loading real-time and archival raw or pre-processed PM2.5 data from permanent and temporary monitors
  • quality-control options for vetting raw data
  • mapping and plotting functions for visualizing extended and short-term data series
  • algorithms for calculating NowCast values, rolling means, daily statistics, etc.
  • aggregation functionality for manipulating and analyzing PM2.5 data

Why study PM2.5 Data?

Mazama Science created the PWFSLSmoke package for the AirFire team at the USFS Pacific Wildland Fire Sciences Lab (PWFSL) as part of their suite of tools to analyze and visualize data from PM2.5 monitoring stations all over the country. PM2.5 refers to particulate matter under 2.5 micrometers in diameter that can come from many different sources: car exhaust, power plants, agricultural burning, etc. Breathing air with high PM2.5 levels is linked to all sorts of cardiovascular diseases and can worsen or trigger conditions like asthma and other chronic respiratory problems. For many communities outside of large metro areas, the main source of PM2.5 is wildfire smoke, and PM2.5 levels regularly reach hazardous levels during wildfire season. 2017 was a particularly bad smoke year across the Pacific Northwest, and even large cities like Seattle, Portland, and San Francisco felt the effects of wildfire smoke during the height of the fire season.

There are three main organizations that aggregate PM2.5 monitoring data in the United States: AirNow, WRCC, and AIRSIS. The PWFSLSmoke R package includes capabilities for ingesting, parsing, and quality-controlling raw data from these sites or loading RData files of pre-processed, real-time and archival data.

Napa Valley Fires

The 2017 wildfire season in the US was one of the most destructive ever. California was hit particularly hard, with uncontrolled wildfires tearing through Napa wine country in October, destroying thousands of homes, businesses, wineries and vineyards, and taking close to 30 lives. Clouds of smoke choked communities miles away from the direct path of the fires, with concentrations climbing to unhealthy and hazardous levels.

We can see how smoke affected communities near the fires by using the PWFSLSmoke package to explore PM2.5 data from October of 2017. (For readability, we have omitted R code used to generate graphics and only mention relevant functions. However, this blog post was written as an R notebook and the complete source can be found on GitHub.)

Loading the Data

Pre-processed archival data can easily be loaded using airnow_load(). We can select monitors that are within 100 kilometers of the Tubbs fire, the largest of the Napa Valley fires, with monitor_subset() and monitor_subsetByDistance().

Since we are looking at smoke from specific fires, let’s load some fire data. Cal Fire has an open archive of fire data that we can use. There were three large fires in the Napa Valley region which all started on October 9th. Let’s take a look at these fires in particular and data from monitors within 100km of the largest of the three.

suppressPackageStartupMessages(library(PWFSLSmoke))

# Fires started on Oct 9
# fireDF is assumed to hold the Cal Fire records (with longitude and latitude
# columns) for the three Napa Valley fires, the Tubbs fire in the first row
fireDF <- read.csv("calfire_napa_fires.csv")   # placeholder for the Cal Fire data step

# monitors within 100 km of the Tubbs fire (argument names for airnow_load() assumed)
napa_monitors <- airnow_load(year = 2017, month = 10) %>%
 monitor_subset(stateCode = "CA") %>%
 monitor_subsetByDistance(fireDF$longitude[1], fireDF$latitude[1],
                          radius = 100)

Now that we’ve got the data loaded into the environment, let’s delve into PWFSLSmoke’s plotting capabilities to see what it can tell us.

Mapping

A good place to start is by mapping monitor and fire locations. There are several different functions for mapping monitors. monitorLeaflet() will generate an interactive leaflet map that will be displayed in RStudio’s ‘Viewer’ tab. There are three functions for creating static maps: monitorMap() will plot monitors over the outlines of states and counties, monitorEsriMap() will plot monitors over a base image from ESRI, and monitorGoogleMap() will plot monitor locations over a base image, sourced, unsurprisingly, from GoogleMaps.
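A minimal sketch of that step, assuming the napa_monitors object built above (the object name is ours, the functions are the package's):

# interactive leaflet map, shown in RStudio's Viewer tab
monitorLeaflet(napa_monitors)

# static map on a Google base image
monitorGoogleMap(napa_monitors)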

Particulate concentrations can be classified by the Air Quality Index (AQI) category that they fall into. AQI levels are defined by the EPA for regulating and warning people about air quality issues. The AQI cutoffs and the official colors associated with them are built into the package and used in several different plotting functions. All of the various monitor mapping functions color the monitor locations based on the AQI level of the maximum hourly PM2.5 value by default. There are some other built-in images which can be used in mapping, and added with the addIcon() function. For example, you could add the location of fires to a monitor map with little pictures of a fire as in the map below.

This map tells us that smoke reached hazardous levels in those communities closest to the fires. As far away as San Francisco, smoke levels were very unhealthy. Of course, this does not tell us anything about when the smoke affected these communities or for how long.

Timeseries Plots

monitorPlot_timeseries() is designed for visualizing time series data. It has a ‘style’ argument with a couple of built-in plotting styles for telling different kinds of stories. The plot below uses the ‘gnats’ style, which quickly plots many points. Red bars under the plot represent the duration of the different fires.
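A call along these lines produces that kind of plot (a sketch using the napa_monitors object assumed above):

monitorPlot_timeseries(napa_monitors, style = 'gnats')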

This plot shows that PM2.5 levels were relatively constant around a baseline until shortly after the three fires started on October 9. The fires ignited and quickly grew to sizes large enough to send thick smoke to all neighboring communities. The first couple of days were the smokiest. Some monitors recorded normal levels again around October 12 while many were still engulfed in smoke. After a brief respite, the baseline started creeping up again around October 16. All monitors returned to baseline levels around October 18, and stayed there for the rest of the month.

This gives us an idea of what air quality was like in the general region surrounding the fires. However, a curious observer of wildfire smoke might like to know how smoke affected a particular community. The city of Napa was directly hit by the Atlas fire, so let’s take a look at smoke levels in Napa to see if it gives us any insight into the effects of the fires there.

During this period, violent winds and flames meant that smoke levels could jump wildly from hour to hour. One way to smooth out those changes a bit is by using NowCast values. NowCast is used for smoothing data and can be used to estimate values for missing data. The monitor_nowcast() function will calculate hourly NowCast values for a ws_monitor object. Using monitorPlot_timeseries() to plot hourly values, and monitor_nowcast() to calculate and plot NowCast values on top of them, we can get a pretty good idea about what happened in Napa while the wildfires raged nearby.
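Roughly, that combination looks like this (a sketch; the monitor ID is a placeholder and plotting arguments are left at their defaults):

# subset to the Napa monitor (ID below is a placeholder)
napa <- monitor_subset(napa_monitors, monitorIDs = "NAPA_MONITOR_ID")

# hourly values, then NowCast values computed from the same ws_monitor object
monitorPlot_timeseries(napa)
napa_nowcast <- monitor_nowcast(napa)
monitorPlot_timeseries(napa_nowcast)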

According to the data, PM2.5 levels were healthy up until close to midnight on the night of October 8, when the Atlas Fire ignited and quickly exploded into a raging blaze only kilometers away. Smoke levels swung between moderate and very unhealthy throughout October 9, perhaps corresponding to changes in wind and weather. There is over a full day of missing values between October 10th and 11th, suggesting that the fire became so intense that the monitor was unable to record smoke values. The monitor came back online during the 11th, recording dangerously high PM2.5 values for several days and indicating that clouds of smoke kept billowing into the town before finally easing off around October 19th.

Plotting Aggregated Data

This gives us a fairly detailed picture of how smoke affected a community very close to a fire. While examining a specific subset of the data like this might give us some important insights, viewing aggregated data can tell different stories. Let's say we want to understand how far smoke from these fires traveled and how communities farther from the fires experienced it. One option is to look at how smoke levels differed in locations at varying distances from the fires. The plot below does this, using monitorPlot_dailyBarplot() to plot the mean PM2.5 value for each day at Napa, Vallejo, and San Francisco, with the distance from the Atlas Fire calculated using monitor_distance().
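In outline, the ingredients are these (a sketch; the monitor ID and the fireDF row for the Atlas fire are placeholders, and the monitor_distance() argument order is assumed):

# distance (km) of each monitor from the Atlas fire
monitor_distance(napa_monitors, fireDF$longitude[2], fireDF$latitude[2])

# daily mean PM2.5 for one location, e.g. Napa
monitorPlot_dailyBarplot(monitor_subset(napa_monitors, monitorIDs = "NAPA_MONITOR_ID"))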

The shape of the data for the three different locations is pretty similar, with overall PM2.5 levels decreasing as the distance increases. Unfortunately, we are missing a day’s worth of data at Napa. However, the information from other monitors might help us guess what it was. At both Vallejo and San Francisco, there was an increase in smoke from October 9 to 10, and a decrease from October 10 to 11. If smoke in Napa followed the same pattern, which would make sense if the smoke is coming from the same source in all three cities, we could speculate that smoke levels on October 10 would probably have been between the values for October 11 and October 13.

We hope this small foray into smoke analysis inspires you to look at data with the PWFSLSmoke package the next time wildfires erupt in North America.

Best Hopes for Healthy Air in 2018!


Data Reshaping with cdata

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I’ve just shared a short webcast on data reshaping in R using the cdata package.

(link)

We also have two really nifty articles on the theory and methods:

Please give it a try!
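If you would like a taste before watching, here is a tiny reshaping example in the spirit of the webcast. It is a sketch using current cdata verbs, whose names may differ from the version shown in the video:

library(cdata)

d <- data.frame(id = c(1, 2), AUC = c(0.6, 0.7), R2 = c(0.2, 0.3))

# move the AUC and R2 measurements from columns into rows ("blocks")
unpivot_to_blocks(d,
                  nameForNewKeyColumn   = "measurement",
                  nameForNewValueColumn = "value",
                  columnsToTakeFrom     = c("AUC", "R2"))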

This is the material I recently presented at the January 2018 BARUG Meetup.


simmer.bricks 0.1.0: new add-on for simmer

By Iñaki Úcar

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The new package simmer.bricks has found its way to CRAN. The simmer package provides a rich and flexible API to build discrete-event simulations. However, there are certain recurring patterns that are typed over and over again: higher-level tasks which can be conceptualised as concrete activity sequences. This new package is intended to capture these kinds of patterns in usable bricks, i.e., methods that can be used like simmer activities but return an arrangement of activities implementing a higher-level task.

For instance, consider an entity visiting a resource:

library(simmer)

trajectory("customer") %>%
  seize("clerk") %>%
  timeout(10) %>%
  release("clerk")
## trajectory: customer, 3 activities
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Timeout      | delay: 10 }
## { Activity: Release      | resource: clerk, amount: 1 }

The simmer.bricks package wraps this pattern into the visit() brick:

library(simmer.bricks)

trajectory("customer") %>%
  visit("clerk", 10)
## trajectory: customer, 3 activities
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Timeout      | delay: 10 }
## { Activity: Release      | resource: clerk, amount: 1 }

This is a very naive example though. As a more compelling use case, consider a resource that becomes inoperative for some time after each release; i.e., the clerk above needs to do some paperwork after each customer leaves. There are several ways of programming this with simmer. The most compact implementation requires a clone() activity to let a clone hold the resource for some more time while the original entity continues its way. This package encapsulates all this logic in a very easy-to-use brick called delayed_release():

env <- simmer()

customer <- trajectory("customer") %>%
  log_("waiting") %>%
  seize("clerk") %>%
  log_("being attended") %>%
  timeout(10) %>%
  # inoperative for 5 units of time
  delayed_release(env, "clerk", 5) %>%
  log_("leaving")

env %>%
  add_resource("clerk") %>%
  add_generator("customer", customer, at(0, 1)) %>%
  run() %>% invisible
## 0: customer0: waiting
## 0: customer0: being attended
## 1: customer1: waiting
## 10: customer0: leaving
## 15: customer1: being attended
## 25: customer1: leaving

The reference index lists all the available bricks included in this initial release. The examples included in the help page for each method show the equivalence in plain activities. This is very important if you want to mix bricks with rollbacks to produce loops, since the rollback() activity works in terms of the number of activities. For instance, this is what a delayed_release() does behind the scenes:

customer
## trajectory: customer, 11 activities
## { Activity: Log          | message }
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Log          | message }
## { Activity: Timeout      | delay: 10 }
## { Activity: Clone        | n: 2 }
##   Fork 1, continue,  trajectory: anonymous, 2 activities
##   { Activity: SetCapacity  | resource: clerk, value: 0x55a7c5b524c0 }
##   { Activity: Release      | resource: clerk, amount: 1 }
##   Fork 2, continue,  trajectory: anonymous, 2 activities
##   { Activity: Timeout      | delay: 5 }
##   { Activity: SetCapacity  | resource: clerk, value: 0x55a7c59ddc18 }
## { Activity: Synchronize  | wait: 0 }
## { Activity: Log          | message }
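One practical consequence: because rollback() counts raw activities rather than bricks, a loop over a brick has to count the activities the brick expands to. A sketch (not from the original post): visit() expands to three activities (Seize, Timeout, Release), so stepping back over the whole visit means amount = 3:

library(simmer.bricks)

# repeat the visit twice more; rollback() must step back over the 3 activities
# that visit() expands to
trajectory("customer") %>%
  visit("clerk", 10) %>%
  rollback(amount = 3, times = 2)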

As always, we are more than happy to receive feedback and suggestions, either via the mailing list or via GitHub issues and PRs. If you identify any pattern that you frequently use in your simulations and you think it could become a useful simmer brick, please don’t hesitate to share it!

Article originally published in Enchufa2.es: simmer.bricks 0.1.0: new add-on for simmer.


RcppMsgPack 0.2.1

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

An update of RcppMsgPack got onto CRAN today. It contains a number of enhancements Travers had been working on, as well as a change CRAN asked us to make: using a suggested package conditionally.

MessagePack itself is an efficient binary serialization format. It lets you exchange data among multiple languages, like JSON, but it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. RcppMsgPack brings both the C++ headers of MessagePack as well as clever code (in both R and C++) Travers wrote to access MsgPack-encoded objects directly from R.
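As a quick illustration of the R-level interface, here is a sketch using the msgpack_pack()/msgpack_unpack() helpers; see the package documentation for the exact conversion rules between R types and MessagePack types:

library(RcppMsgPack)

# serialize an R object to a raw MessagePack buffer, then decode it again
buf <- msgpack_pack(list(1L, "two", 3.5))
msgpack_unpack(buf)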

Changes in version 0.2.1 (2018-01-15)

  • Some corrections and update to DESCRIPTION, README.md, msgpack.org.md and vignette (#6).

  • Update to c_pack.cpp and tests (#7).

  • More efficient packing of vectors (#8).

  • Support for timestamps and NAs (#9).

  • Conditional use of microbenchmark in tests/ as required for Suggests: package [CRAN request] (#10).

  • Minor polish to tests relaxing comparison of timestamp, and avoiding a few g++ warnings (#12 addressing #11).

Courtesy of CRANberries, there is also a diffstat report for this release. More information is on the RcppMsgPack page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


fulltext v1: text-mining scholarly works

By rOpenSci – open tools for open science

(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

The problem

Text-mining – the art of answering questions by extracting patterns, data, etc. out of the published literature – is not easy.

It's made incredibly difficult because of publishers. It is a fact that the vast majority of publicly funded research across the globe is published in paywall journals. That is, taxpayers pay twice for research: once for the grant to fund the work, then again to be able to read it. These paywalls mean that everyone who wants to text-mine will have different access: some have access through their university, some may have access through their company, and others may only have access to whatever happens to be open access. On top of that, access to paywall journals often depends on your IP address – something most people don't generally keep in mind.

Another hardship with text-mining is the huge number of publishers together with no standardized way to figure out the URL for full text versions of a scholarly work. There is the DOI (Digital Object Identifier) system used by Crossref, Datacite and others, but those generally help you sort out the location of the scholarly work on a web page – the html version. What one probably wants for text-mining is the PDF or XML version if available. Publishers can optionally choose to include URLs for full text (PDF and/or XML) with Crossref’s metadata (e.g., see this Crossref API call and search for “link” on the page), but the problem is that it’s optional.

fulltext is a package to help R users address the above problems: getting published literature from the web in its many forms, and across all publishers.

the fulltext package

fulltext tries to make the following use cases as easy as possible:

  • Search for articles
  • Fetch abstracts
  • Fetch full text articles
  • Get links for full text articles (xml, pdf)
  • Extract text from articles
  • Collect sections of articles that you actually need (e.g., titles)
  • Download supplementary materials

fulltext organizes functions around the above use cases, then provides flexibility to query many data sources within each use case (i.e. function). For example, fulltext::ft_search searches for articles – you can choose among one or more of many data sources to search, passing options to each source as needed.

What does a workflow with fulltext look like?

  • Search for articles with ft_search()
  • Fetch articles with ft_get() using the output of the previous step
  • Collect the text into an object with ft_collect()
  • Extract sections of articles needed with ft_chunks(), or
  • Combine texts into a data.frame ready for quanteda or similar text-mining packages
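Put together, that workflow can be as compact as this (a sketch; the query is a placeholder and only PLOS is searched):

library(fulltext)

res <- ft_search(query = "climate change", from = "plos")
ft_get(res) %>%
  ft_collect() %>%
  ft_chunks(c("doi", "title")) %>%
  ft_tabularize()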

Package overhaul

fulltext has undergone a re-organization, which includes a bump in the major version to v1 to reinforce the large changes the package has undergone. Changes include:

  • Function name standardization with the ft_ prefix, e.g., chunks is now ft_chunks
  • ft_get has undergone major re-organization – biggest of which may be that all full text XML/plain text/PDF goes to disk to simplify the user interface. Along with this we’ve changed to using DOIs/IDs as the file names
  • We no longer store files as rds – but as the format they are, pdf, txt or xml
  • storr is now imported to manage mapping between real DOIs and file paths that include normalized DOIs – and aids in the function ft_table() for creating a data.frame of text results
  • Note that with the ft_get() overhaul, the only option is to write to disk. Before we attempted to provide many different options for saving XML and PDF data, but it was too complicated. This has implications for using the output of ft_get() – the output includes the paths to the files – use ft_collect() to collect the text if you want to use ft_chunks() or other fulltext functions downstream.
  • A number of functions have been removed to further hone the scope of the package
  • A function ft_abstract is introduced to fetch abstracts for when you just need abstracts
  • A function ft_table has been introduced to gather all your documents into a data.frame to make it easy to do downstream analyses with other packages
  • Two new data sources have been added: Scopus and Microsoft Academic – both of which are available via ft_search() and ft_abstract()
  • New functions have been added for the user to find out what plugins are available: ft_get_ls(), ft_links_ls(), and ft_search_ls()

We’ve battle tested ft_get() on a lot of DOIs – but there still may be errors – let us know if you have any problems.

Documentation

Along with the overhaul of the package we have made a new manual for fulltext. Check it out at https://ropensci.github.io/fulltext-book/

Setup

Install fulltext

At the time of writing, binaries are not yet available on CRAN, so you'll have to install from source from CRAN (which shouldn't cause any problems since there's no compiled code in the package), or install from GitHub.

install.packages("fulltext")

Or get the development version:

devtools::install_github("ropensci/fulltext")
library(fulltext)

Below I'll discuss some of the new features of the package rather than give an exhaustive tutorial. Check out the manual for more details: https://ropensci.github.io/fulltext-book/

Fetch abstracts: ft_abstract

ft_abstract() is a new function in fulltext. It gives you access to abstracts from the following data sources:

  • crossref
  • microsoft
  • plos
  • scopus

A quick example. Search for articles in PLOS.

# an illustrative search; the original query is not preserved here
res <- ft_search(query = "biology", from = "plos", limit = 100)

Now pass the DOIs to ft_abstract() to get abstracts:

ft_abstract(x = res$plos$data$id, from = "plos")
## 
## Found:
##   [PLOS: 90; Scopus: 0; Microsoft: 0; Crossref: 0]

Fetch articles: ft_get

The function ft_get is the workhorse for getting full text articles.

Using this function can be tricky depending on where you want to get articles from. While searching (ft_search) usually doesn’t present any barriers or stumbling blocks because search portals are generally open (except Web of Science), ft_get can get frustrating because so many publishers paywall their articles. The combination of paywalls and their patchwork of who gets to get through them means that we can’t easily predict who will run into problems with Elsevier, Wiley, Springer, etc. (well, mostly those big three because they publish such a large portion of the papers).

With this version we’ve tried to bulk up the documentation (see the manual) to make jumping over these barriers as painless as possible.

Let’s do an example to demonstrate how to use ft_get() and some of the new
features.

Get DOIs from PLOS (excluding partial document types)

library("rplos")
# an illustrative rplos query; the original call's filters are assumed
dois <- searchplos(q = "*:*", fl = "id",
                   fq = list("doc_type:full", 'article_type:"research article"'),
                   limit = 50)$data$id

Once we have DOIs we can go to ft_get():

res <- ft_get(dois, from = "plos")

Internally, ft_get() attempts to write the file to disk if we can successfully access the file – if an error occurs for any reason (see ft_get errors in the manual) we delete that file so you don’t end up with partial/empty files.

Since ft_get() writes files to your machine’s disk, even if a function call to ft_get() fails at some point in the process, the articles that we’ve successfully retrieved aren’t affected.

In addition, we have fixed reusing cached files on disk. Thus, even if you get a failure in a call to ft_get(), you can rerun it and the files already retrieved will make the function call faster.

Having a look at the output of ft_get(), we can see that only one list element (plos) has data in it, because we only searched for articles from one publisher.

vapply(res, function(x) is.null(x$data), logical(1))
##     plos   entrez    elife  pensoft    arxiv  biorxiv elsevier    wiley 
##    FALSE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE

For each data source searched, the output is a list with the following elements:

  • found: number of works retrieved
  • dois: character vector of DOIs
  • data (a list)
    • backend: the backend
    • cache_path: the root cache path
    • path
      • DOI (a list named by a DOI is repeated for each DOI)
        • path: the complete path to the file on disk
        • id: the id (usually a DOI)
        • type: xml, plain or pdf
        • error: an error message, if any
    • data: this is NULL until you use ft_collect()
  • opts: your options

The backend can only be one value – ext, representing the file extension. We’re retaining that information for now because we may decide to add additional backends in the future.

In the last version of fulltext you could get the extracted text from XML or PDF in the output of ft_get(). This has changed: you now only get metadata and the path to the file on disk. Getting the text into the object is a separate function call to ft_collect():

ft_collect(res)

This returns the same class of object as ft_get(), but the data slot is now populated with the text.

Remember that with this change you should now use ft_collect() before passing the output of ft_get() to ft_chunks():

ft_get(dois, from = "plos") %>% 
  ft_collect() %>% 
  ft_chunks(c("doi","history")) %>% 
  ft_tabularize()

Extract text: ft_extract

ft_extract() used to have options for extracting text from PDFs with different pieces of software. To simplify the function, it now only uses the pdftools package.

path <- system.file("examples", "example1.pdf", package = "fulltext")  # example PDF shipped with the package
ft_extract(path)
## /Library/Frameworks/R.framework/Versions/3.4/Resources/library/fulltext/examples/example1.pdf
##   Title: Suffering and mental health among older people living in nursing homes---a mixed-methods study
##   Producer: pdfTeX-1.40.10
##   Creation date: 2015-07-17

Gather text into a data.frame: ft_table

ft_table() is a new function to gather the text from all your articles into a data.frame. This should simplify analysis for most users.

(x <- ft_table())
## # A tibble: 192 x 4
##    dois                       ids_norm                   text        paths
##  * 
##  1 10.1002/9783527696109.ch41 10_1002_9783527696109_ch41 "         … /Use…
##  2 10.1002/chin.199038056     10_1002_chin_199038056     "ChemInfor… /Use…
##  3 10.1002/cite.330221605     10_1002_cite_330221605     " Versamml… /Use…
##  4 10.1002/dvg.22402          10_1002_dvg_22402          "C 2013 Wi… /Use…
##  5 10.1002/jctb.5010090209    10_1002_jctb_5010090209    "         … /Use…
##  6 10.1002/qua.560200801      10_1002_qua_560200801      "Internati… /Use…
##  7 10.1002/risk.200590063     10_1002_risk_200590063     "         … /Use…
##  8 10.1002/scin.5591692420    10_1002_scin_5591692420    "Booksn  … /Use…
##  9 10.1006/bbrc.1994.2001     10_1006_bbrc_1994_2001     "http://ap… /Use…
## 10 10.1007/11946465_42        10_1007_11946465_42        " Hoon Cho… /Use…
## # ... with 182 more rows

(you can optionally only extract text from PDFs, or only from XMLs)

We give the DOI, the normalized DOI that we used for the file path, the text, and the file path. You can then use this output in quanteda or other text-mining packages (the function quanteda::kwic() locates keywords in context):

library(quanteda)
# assumed reconstruction: build a corpus with DOIs as document names, then
# look for "cell" in context
z <- corpus(x$text, docnames = x$dois)
kwic(z, "cell")
##                                                                           
##   [10.1002/9783527696109.ch41, 253] Basic Concepts A lithium-ion battery |
##   [10.1002/9783527696109.ch41, 397]     in a typical lithium-ion battery |
##   [10.1002/9783527696109.ch41, 461]     in a typical lithium-ion battery |
##   [10.1002/9783527696109.ch41, 764]        material 1 Lithium LCO LiCoO2 |
##  [10.1002/9783527696109.ch41, 2744]               of about 3.6-3.8 V per |
##  [10.1002/9783527696109.ch41, 6237]                 nate/ anode and hour |
##                                          
##  cell | consists of a positive electrode 
##  cell | . material. During discharging   
##  cell | . The main reactions occurring   
##  Cell | phones, High capacity cobalt     
##  cell | and highest energy densities with
##  cell | cathode 4. Electrovaya,

Todo

We have lots of ideas to make fulltext even better. Check out what we’ll be working on in the issue tracker.

Feedback!

Please do upgrade/install fulltext v1.0.0 and let us know what you think.


A simple way to set up a SparklyR cluster on Azure

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark’s distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.
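In code, that interaction pattern looks roughly like this (a sketch against a local Spark instance; on the Azure cluster described below you would pass the cluster's master URL to spark_connect()):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")            # or the cluster master URL

# copy an R data frame into Spark and get back a dplyr-compatible reference
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Spark's distributed ML algorithms are exposed as R functions
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

spark_disconnect(sc)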

If you don’t happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it’s still more expensive than running a “vanilla” Spark-and-R cluster. If you’d like to take the vanilla route, a new guide details how to set up a Spark cluster on Azure for use with SparklyR.

All of the details are provided in the link below, but the guide basically provides the Azure Distributed Data Engineering Toolkit shell commands to provision a Spark cluster, connect SparklyR to the cluster, and then interact with it via RStudio Server. This includes the ability to launch the cluster with pre-emptable low-priority VMs, a cost-effective option (up to 80% cheaper!) for non-critical workloads. Check out the details at the link below.

Github (Azure): How to use SparklyR on Azure with AZTK


conapomx data package

By En El Margen – R-English

(This article was first published on En El Margen – R-English, and kindly contributed to R-bloggers)

I have created a new data package for R to help users obtain national population statistics of Mexico.

The conapomx package contains the mxpopulation data set, which is a tidy data.frame containing population estimates from the CONAPO (National Population Commission) official agency. The estimates are divided by age groups, gender, municipality and year.

To install, just grab it from CRAN (or GitHub if you are reading this before it is accepted).

install.packages("conapomx") 
# or... 
library(devtools)
install_github("eflores89/conapomx")

# explore the dataset
library(conapomx)
head(mxpopulation)
y st id_st mun id_mun geoid gender age population
2010 Aguascalientes 1 Aguascalientes 1 1001 0 0-14 124263.71
2010 Aguascalientes 1 Aguascalientes 1 1001 0 15-29 106695.14
2010 Aguascalientes 1 Aguascalientes 1 1001 0 30-44 81088.65
2010 Aguascalientes 1 Aguascalientes 1 1001 0 45-64 60379.29
2010 Aguascalientes 1 Aguascalientes 1 1001 0 65+ 17679.26
2010 Aguascalientes 1 Aguascalientes 1 1001 1 0-14 119535.77

Here is the data for 2018, just out of curiosity:

library(dplyr)

mxpopulation %>% 
  filter(y == "2018") %>%
  group_by(st) %>% 
  summarize("Population" = sum(population))
st Population
Aguascalientes 1337792.5
Baja California 3633772.2
Baja California Sur 832827.2
Campeche 948459.3
Chiapas 5445232.7
Chihuahua 3816865.4
Coahuila 3063662.5
Colima 759686.5
Distrito Federal 8788140.6
Durango 1815965.6
Guanajuato 5952086.5
Guerrero 3625040.0
Hidalgo 2980532.2
Jalisco 8197483.1
Mexico 17604619.1
Michoacan 4687210.5
Morelos 1987595.8
Nayarit 1290518.8
Nuevo Leon 5300618.6
Oaxaca 4084674.0
Puebla 6371380.8
Queretaro 2091823.2
Quintana Roo 1709478.7
San Luis Potosi 2824976.0
Sinaloa 3059321.7
Sonora 3050472.7
Tabasco 2454294.5
Tamaulipas 3661161.7
Tlaxcala 1330142.6
Veracruz 8220321.9
Yucatan 2199617.6
Zacatecas 1612014.2

Hopefully this helps to avoid the notoriously bad datos.gob webpage! Happy data wrangling!


Extracting data from Twitter for @hrbrmstr’s #nom foodie images

By Jasmine Dumas' R Blog


(This article was first published on Jasmine Dumas’ R Blog, and kindly contributed to R-bloggers)

Bob Rudis (@hrbrmstr) is a famed expert, author and developer in Data Security and the Chief Security Data Scientist at Rapid7. Bob also creates the most deliciously vivid images of his meals, documented with the #nom hashtag. I'm going to use a method similar to the one from my previous projects (Hipster Veggies & Machine Learning Flashcards) to wrangle all those images into a nice collection – mostly for me to look at for inspiration in recipe planning.

Yum! Have you ever thought about collecting all these recipes & images into a cookbook?!

— Jasmine Dumas (@jasdumas) January 15, 2018

Source Repository: jasdumas/bobs-noms

Analysis

library(rtweet) # devtools::install_github("mkearney/rtweet")
library(tidyverse)
library(dplyr)
library(stringr)
library(magick)
library(knitr)
library(kableExtra)
# get all of bob's recent tweets
bobs_tweets <- get_timeline(user = "hrbrmstr", n = 3200)

# filter noms with images only
bobs_noms <- bobs_tweets %>%
  dplyr::filter(str_detect(hashtags, "nom"), !is.na(media_url))

bobs_noms$clean_text <- bobs_noms$text
bobs_noms$clean_text <- str_replace(bobs_noms$clean_text, "#[a-zA-Z0-9]{1,}", "") # remove the hashtag
bobs_noms$clean_text <- str_replace(bobs_noms$clean_text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "") # remove the url link
bobs_noms$clean_text <- str_replace(bobs_noms$clean_text, "[[:punct:]]", "") # remove punctuation

# let's look at these images in a smaller data set
bobs_noms_small <- bobs_noms %>% select(created_at, clean_text, media_url)

bobs_noms_small$img_md <- paste0("![", bobs_noms_small$clean_text, "](", bobs_noms_small$media_url, ")")
data.frame(images = bobs_noms_small$img_md) %>% 
kable( format = "markdown") %>%
  kable_styling(full_width = F, position = 'center') 

|images |
|:------|
| |
|Tsukune with tare tonight |
|Lamb roast isnt too shabby either |
|The pain de mie thankfully came out well |
|Sage rosemary & espresso infused salt rubbed roast lamb. Goose fat roasted potatoes _almost _ done |
| |
|Ham & turkey frittata time! |
|Postconfit |
|PostPBC |
| |
| is home #2's Wedding Sunday. 20 ppl over tonight for #joy #nom |
|Definitely an Indonesian spring rolls kind of night |
|Homemade breadsticks for the homemade pasta and meatballs tonight |
| |
|Bonein PBC smoked pork roast |
|Prosciutto de Parma Cacio di Bosco & spinach omelettes this morning |
|Our Friday night is shaping up well How's yours going? |
|Pork tenderloin on the PBC tonight |
|Overnight nutmeg-infused yeast waffles with sautéd local picked Maine apples & Maine maple syrup |

# create a function to save these images!
# create a function to save these images!
save_image <- function(df){
  for (i in c(1:nrow(df))){
    image <- try(image_read(df$media_url[[i]]), silent = F)
    if(class(image)[1] != "try-error"){
      image %>%
        image_scale("1200x700") %>%
        image_write(paste0("../post_data/data/", bobs_noms$clean_text[i], ".jpg"))
    }
  }
  cat("saved images...\n")
}

save_image(bobs_noms)
## saved images...

How to make your machine learning model available as an API with the plumber package

By Dr. Shirin Glander


(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

The plumber package for R makes it easy to expose existing R code as a webservice via an API (Trestle Technology, LLC 2017).

You take an existing R script and make it accessible with plumber by simply adding a few lines of comments. If you have worked with Roxygen before, e.g. when building a package, you will already be familiar with the core concepts. If not, here are the most important things to know:

  • you define the output or endpoint
  • you can add additional annotation to customize your input, output and other functionalities of your API
  • you can define every input parameter that will go into your function
  • every such annotation will begin with either #' or #*
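For instance, a minimal plumber file with a single annotated endpoint looks like this (a generic echo example, not the model API built below):

# minimal_plumber.R

#* Echo back a message
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}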

With this setup, we can take a trained machine learning model and make it available as an API. With this API, other programs can access it and use it to make predictions.

What are APIs and webservices?

With plumber, we can build so called HTTP APIs. HTTP stands for Hypertext Transfer Protocol and is used to transmit information on the web; API stands for Application Programming Interface and governs the connection between some software and underlying applications. Software can then communicate via HTTP APIs. This way, our R script can be called from other software, even if the other program is not written in R and we have built a tool for machine-to-machine communication, i.e. a webservice.

How to convert your R script into an API with plumber

Training and saving a model

Let’s say we have trained a machine learning model as in this post about LIME. I loaded a data set on chronic kidney disease, did some preprocessing (converting categorical features into dummy variables, scaling and centering), split it into training and test data and trained a Random Forest model with caret. We can use this trained model to make predictions for one test case with the following code:

library(tidyverse)

# load test and train data
load("../../data/test_data.RData")
load("../../data/train_data.RData")

# load model
load("../../data/model_rf.RData")

# take first test case for prediction
input_data <- test_data[1, ] %>%
  select(-class)

# predict test case using model
pred <- predict(model_rf, input_data)
cat("----------------\nTest case predicted to be", as.character(pred), "\n----------------")
## ----------------
## Test case predicted to be ckd 
## ----------------

The input

For our API to work, we need to define the input, in our case the features of the test data. When we look at the model object, we see that it expects the following parameters:

var_names <- model_rf$finalModel$xNames   # feature names as stored by caret on the final model (assumed source)
##  [1] "age"            "bp"             "sg_1.005"       "sg_1.010"      
##  [5] "sg_1.015"       "sg_1.020"       "sg_1.025"       "al_0"          
##  [9] "al_1"           "al_2"           "al_3"           "al_4"          
## [13] "al_5"           "su_0"           "su_1"           "su_2"          
## [17] "su_3"           "su_4"           "su_5"           "rbc_normal"    
## [21] "rbc_abnormal"   "pc_normal"      "pc_abnormal"    "pcc_present"   
## [25] "pcc_notpresent" "ba_present"     "ba_notpresent"  "bgr"           
## [29] "bu"             "sc"             "sod"            "pot"           
## [33] "hemo"           "pcv"            "wbcc"           "rbcc"          
## [37] "htn_yes"        "htn_no"         "dm_yes"         "dm_no"         
## [41] "cad_yes"        "cad_no"         "appet_good"     "appet_poor"    
## [45] "pe_yes"         "pe_no"          "ane_yes"        "ane_no"

Good practice is to write the input parameter definitions into your API's Swagger UI, but the code would work without these annotations. We define the parameters by annotating them with name and description in our R script using @param. For this purpose, I want to know the type and min/max values for each of my variables in the training data. Because categorical data has been converted to dummy variables and then scaled and centered, these values will all be numeric and between 0 and 1 in this example. If I were building this script for a real case, I'd use the raw data as input and add a preprocessing function to my script, though!

# show parameter definition for the first three features
for (i in 1:3) {
# if you wanted to see it for all features, use
#for (i in 1:length(var_names)) {
  var <- var_names[i]
  # look up the feature in the training data and print its type and range
  train_col <- train_data[, var]
  cat("Variable:", var, "is of type:", class(train_col), "\n",
      "Min value in training data =", min(train_col), "\n",
      "Max value in training data =", max(train_col), "\n----------\n")
}
## Variable: age is of type: numeric 
##  Min value in training data = 0 
##  Max value in training data = 0.9777778 
## ----------
## Variable: bp is of type: numeric 
##  Min value in training data = 0 
##  Max value in training data = 0.7222222 
## ----------
## Variable: sg_1.005 is of type: numeric 
##  Min value in training data = 0 
##  Max value in training data = 1 
## ----------

Unless otherwise instructed, all parameters passed into plumber endpoints from query strings or dynamic paths will be character strings. https://www.rplumber.io/docs/routing-and-input.html#typed-dynamic-routes

This means that we need to convert numeric values before we process them further. Or we can define the parameter type explicitly, e.g. by writing variable_1:numeric if we want to specify that variable_1 is supposed to be numeric.

To make sure that the model will perform as expected, it is also advisable to add a few validation functions. Here, I will validate

  • whether every parameter is numeric/integer by checking for NAs (which would have resulted from as.numeric()/as.integer() applied to data of character type)
  • whether every parameter is between 0 and 1

In order for plumber to work with our input, it needs to be part of the HTTP request, which can then be routed to our R function. The plumber documentation describes how to use query strings as inputs. But in our case, manually writing query strings is not practical because we have so many parameters. Of course, there are programs that let us generate query strings, but the easiest way I found to format the input from a line of data is to use JSON.

The toJSON() function from the rjson package converts our input line to JSON format:

library(rjson)
test_case_json <- toJSON(input_data)
test_case_json
## {"age":0.511111111111111,"bp":0.111111111111111,"sg_1.005":1,"sg_1.010":0,"sg_1.015":0,"sg_1.020":0,"sg_1.025":0,"al_0":0,"al_1":0,"al_2":0,"al_3":0,"al_4":1,"al_5":0,"su_0":1,"su_1":0,"su_2":0,"su_3":0,"su_4":0,"su_5":0,"rbc_normal":1,"rbc_abnormal":0,"pc_normal":0,"pc_abnormal":1,"pcc_present":1,"pcc_notpresent":0,"ba_present":0,"ba_notpresent":1,"bgr":0.193877551020408,"bu":0.139386189258312,"sc":0.0447368421052632,"sod":0.653374233128834,"pot":0,"hemo":0.455056179775281,"pcv":0.425925925925926,"wbcc":0.170454545454545,"rbcc":0.225,"htn_yes":1,"htn_no":0,"dm_yes":0,"dm_no":1,"cad_yes":0,"cad_no":1,"appet_good":0,"appet_poor":1,"pe_yes":1,"pe_no":0,"ane_yes":1,"ane_no":0}

Defining the endpoint and output

In order to convert this very simple script into an API, we need to define the endpoint(s). Endpoints return an output; in our case, the output of the predict() function pasted into a line of text (e.g. “Test case predicted to be ckd”). Here, we want the predictions returned, so we annotate the entire function with @get. This endpoint in the API gets a custom name so that we can call it later; here we call it predict and therefore write #' @get /predict.

According to the design of the HTTP specification, GET (along with HEAD) requests are used only to read data and not change it. Therefore, when used this way, they are considered safe. That is, they can be called without risk of data modification or corruption — calling it once has the same effect as calling it 10 times, or none at all. Additionally, GET (and HEAD) is idempotent, which means that making multiple identical requests ends up having the same result as a single request. http://www.restapitutorial.com/lessons/httpmethods.html

In this case, we could also consider using @post to avoid caching issues, but for this example I’ll leave it as @get.

The POST verb is most-often utilized to create new resources. In particular, it’s used to create subordinate resources. That is, subordinate to some other (e.g. parent) resource. In other words, when creating a new resource, POST to the parent and the service takes care of associating the new resource with the parent, assigning an ID (new resource URI), etc. On successful creation, return HTTP status 201, returning a Location header with a link to the newly-created resource with the 201 HTTP status. POST is neither safe nor idempotent. It is therefore recommended for non-idempotent resource requests. Making two identical POST requests will most-likely result in two resources containing the same information. http://www.restapitutorial.com/lessons/httpmethods.html

We can also customize the output. Keep in mind though, that the output should be “serialized”. By default, the output will be in JSON format. Here, I want to have a text output, so I’ll specify @html without html formatting specifications, although I could add them if I wanted to display the text on a website. If we were to store the data in a database, however, this would not be a good idea. In that case, it would be better to output the result as a JSON object.

Logging with filters

It is also useful to provide some sort of logging for your API. Here, I am using the simple example from the plumber documentation that uses filters and outputs the logs to the console or your API server's logs. You could also write your logging output to a file. In production, it would be better to use a real logging setup that stores information about each request, e.g. the time stamp, whether any errors or warnings occurred, etc. The forward() part of the logging function passes control on to the next handler in the pipeline, here our predict function.

Running the plumber script

We need to save the entire script with annotations as an .R file as seen below. The regular comments # describe what each section does.

# script name:
# plumber.R

# set API title and description to show up in http://localhost:8000/__swagger__/

#' @apiTitle Run predictions for Chronic Kidney Disease with Random Forest Model
#' @apiDescription This API takes patient data on Chronic Kidney Disease and returns a prediction of whether the lab values
#' indicate Chronic Kidney Disease (ckd) or not (notckd).
#' For details on how the model is built, see https://shirinsplayground.netlify.com/2017/12/lime_sketchnotes/
#' For further explanations of this plumber function, see https://shirinsplayground.netlify.com/2018/01/plumber/

# load model
# this path would have to be adapted if you would deploy this
load("/Users/shiringlander/Documents/Github/shirinsplayground/data/model_rf.RData")

#' Log system time, request method and HTTP user agent of the incoming request
#' @filter logger
function(req){
  cat("System time:", as.character(Sys.time()), "n",
      "Request method:", req$REQUEST_METHOD, req$PATH_INFO, "n",
      "HTTP user agent:", req$HTTP_USER_AGENT, "@", req$REMOTE_ADDR, "n")
  plumber::forward()
}

# core function follows below:
# define parameters with type and description
# name endpoint
# return output as html/text
# specify 200 (okay) return

#' predict Chronic Kidney Disease of test case with Random Forest model
#' @param age:numeric The age of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param bp:numeric The blood pressure of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param sg_1.005:int The urinary specific gravity of the patient, integer (1: sg = 1.005, otherwise 0)
#' @param sg_1.010:int The urinary specific gravity of the patient, integer (1: sg = 1.010, otherwise 0)
#' @param sg_1.015:int The urinary specific gravity of the patient, integer (1: sg = 1.015, otherwise 0)
#' @param sg_1.020:int The urinary specific gravity of the patient, integer (1: sg = 1.020, otherwise 0)
#' @param sg_1.025:int The urinary specific gravity of the patient, integer (1: sg = 1.025, otherwise 0)
#' @param al_0:int The urine albumin level of the patient, integer (1: al = 0, otherwise 0)
#' @param al_1:int The urine albumin level of the patient, integer (1: al = 1, otherwise 0)
#' @param al_2:int The urine albumin level of the patient, integer (1: al = 2, otherwise 0)
#' @param al_3:int The urine albumin level of the patient, integer (1: al = 3, otherwise 0)
#' @param al_4:int The urine albumin level of the patient, integer (1: al = 4, otherwise 0)
#' @param al_5:int The urine albumin level of the patient, integer (1: al = 5, otherwise 0)
#' @param su_0:int The sugar level of the patient, integer (1: su = 0, otherwise 0)
#' @param su_1:int The sugar level of the patient, integer (1: su = 1, otherwise 0)
#' @param su_2:int The sugar level of the patient, integer (1: su = 2, otherwise 0)
#' @param su_3:int The sugar level of the patient, integer (1: su = 3, otherwise 0)
#' @param su_4:int The sugar level of the patient, integer (1: su = 4, otherwise 0)
#' @param su_5:int The sugar level of the patient, integer (1: su = 5, otherwise 0)
#' @param rbc_normal:int The red blood cell count of the patient, integer (1: rbc = normal, otherwise 0)
#' @param rbc_abnormal:int The red blood cell count of the patient, integer (1: rbc = abnormal, otherwise 0)
#' @param pc_normal:int The pus cell level of the patient, integer (1: pc = normal, otherwise 0)
#' @param pc_abnormal:int The pus cell level of the patient, integer (1: pc = abnormal, otherwise 0)
#' @param pcc_present:int The puc cell clumps status of the patient, integer (1: pcc = present, otherwise 0)
#' @param pcc_notpresent:int The puc cell clumps status of the patient, integer (1: pcc = notpresent, otherwise 0)
#' @param ba_present:int The bacteria status of the patient, integer (1: ba = present, otherwise 0)
#' @param ba_notpresent:int The bacteria status of the patient, integer (1: ba = notpresent, otherwise 0)
#' @param bgr:numeric The blood glucose random level of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param bu:numeric The blood urea level of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param sc:numeric The serum creatinine level of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param sod:numeric The sodium level of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param pot:numeric The potassium level of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param hemo:numeric The hemoglobin level of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param pcv:numeric The packed cell volume of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param wbcc:numeric The white blood cell count of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param rbcc:numeric The red blood cell count of the patient, numeric (scaled and centered to be btw 0 and 1)
#' @param htn_yes:int The hypertension status of the patient, integer (1: htn = yes, otherwise 0)
#' @param htn_no:int The hypertension status of the patient, integer (1: htn = no, otherwise 0)
#' @param dm_yes:int The diabetes mellitus status of the patient, integer (1: dm = yes, otherwise 0)
#' @param dm_no:int The diabetes mellitus status of the patient, integer (1: dm = no, otherwise 0)
#' @param cad_yes:int The coronary artery disease status of the patient, integer (1: cad = yes, otherwise 0)
#' @param cad_no:int The coronary artery disease status of the patient, integer (1: cad = no, otherwise 0)
#' @param appet_good:int The appetite of the patient, integer (1: appet = good, otherwise 0)
#' @param appet_poor:int The appetite of the patient, integer (1: appet = poor, otherwise 0)
#' @param pe_yes:int The pedal edema status of the patient, integer (1: pe = yes, otherwise 0)
#' @param pe_no:int The pedal edema status of the patient, integer (1: pe = no, otherwise 0)
#' @param ane_yes:int The anemia status of the patient, integer (1: ane = yes, otherwise 0)
#' @param ane_no:int The anemia status of the patient, integer (1: ane = no, otherwise 0)
#' @get /predict
#' @html
#' @response 200 Returns the class (ckd or notckd) prediction from the Random Forest model; ckd = Chronic Kidney Disease
# the body below is a condensed sketch: instead of listing all 48 parameters
# explicitly in the signature, they are captured via `...`
calculate_prediction <- function(req, res, ...) {
  # convert all incoming parameters to numeric
  input_data <- as.data.frame(lapply(list(...), as.numeric))

  # validation: conversion to numeric/integer must not have produced NAs
  if (any(is.na(input_data))) {
    res$status <- 400
    return("Parameters have to be numeric or integer values.")
  }

  # validation: scaled and centered inputs must lie between 0 and 1
  if (any(input_data < 0) || any(input_data > 1)) {
    res$status <- 400
    return("Parameters have to be between 0 and 1.")
  }

  # predict and return the result as text (note the double-assignment)
  pred <<- predict(model_rf, input_data)
  paste("----------------\nTest case predicted to be", as.character(pred), "\n----------------\n")
}

Note that I am using the “double-assignment” operator in my function, because I want to make sure that objects are overwritten at the top level (i.e. globally). This would have been relevant had I set a global parameter, but to show it in the example, I decided to use it here as well.

We can now call our script with the plumb() function, run it with run() and open it on port 8000. Calling plumb() creates an environment in which all our functions are evaluated.

library(plumber)
r <- plumb("plumber.R")   # path to the plumber script shown above
r$run(port = 8000)

We will now see the following message in our R console:

Starting server to listen on port 8000
Running the swagger UI at http://127.0.0.1:8000/__swagger__/

If you go to *http://localhost:8000/__swagger__/*, you could now try out the function by manually choosing values for all the parameters we defined in the script.

(Screenshots: the Swagger UI at http://localhost:8000/__swagger__/)

Because we annotated the calculate_prediction() function in our script with #' @get /predict we can access it via *http://localhost:8000/predict*. But because we have no input specified as of yet, we will only see an error on this site. So, we still need to put our JSON formatted input into the function. To do this, we can use curl from the terminal and feed in the JSON string from above. If you are using RStudio in the latest version, you have a handy terminal window open in your working directory. You find it right next to the Console.

Terminal in RStudio

curl -H "Content-Type: application/json" -X GET -d '{"age":0.511111111111111,"bp":0.111111111111111,"sg_1.005":1,"sg_1.010":0,"sg_1.015":0,"sg_1.020":0,"sg_1.025":0,"al_0":0,"al_1":0,"al_2":0,"al_3":0,"al_4":1,"al_5":0,"su_0":1,"su_1":0,"su_2":0,"su_3":0,"su_4":0,"su_5":0,"rbc_normal":1,"rbc_abnormal":0,"pc_normal":0,"pc_abnormal":1,"pcc_present":1,"pcc_notpresent":0,"ba_present":0,"ba_notpresent":1,"bgr":0.193877551020408,"bu":0.139386189258312,"sc":0.0447368421052632,"sod":0.653374233128834,"pot":0,"hemo":0.455056179775281,"pcv":0.425925925925926,"wbcc":0.170454545454545,"rbcc":0.225,"htn_yes":1,"htn_no":0,"dm_yes":0,"dm_no":1,"cad_yes":0,"cad_no":1,"appet_good":0,"appet_poor":1,"pe_yes":1,"pe_no":0,"ane_yes":1,"ane_no":0}' "http://localhost:8000/predict"

-H defines an extra header to include in the request when sending HTTP to a server (https://curl.haxx.se/docs/manpage.html#-H).

-X specifies a custom request method to use when communicating with the HTTP server (https://curl.haxx.se/docs/manpage.html#-X).

-d sends the specified data in a request to the HTTP server, in the same way that a browser does when a user has filled in an HTML form and presses the submit button. This will cause curl to pass the data to the server using the content-type application/x-www-form-urlencoded (https://curl.haxx.se/docs/manpage.html#-d).

This will return the following output:

  • cat() outputs to the R console if you use R interactively; if you use R on a server, it will be included in the server logs.
System time: 2018-01-15 13:34:32 
 Request method: GET /predict 
 HTTP user agent: curl/7.54.0 @ 127.0.0.1 
  • the paste() output is returned as the response and shown in the terminal
----------------
Test case predicted to be ckd 
----------------

Security

This example shows a pretty simple R-script API. But if you plan on deploying your API to production, you should consider the security section of the plumber documentation. It gives additional information about how you can make your code (more) secure.

Finalize

If you wanted to deploy this API you would need to host it, i.e. provide the model and run an R environment with plumber, ideally on a server. A good way to do this, would be to package everything in a Docker container and run this. Docker will ensure that you have a working snapshot of the system settings, R and package versions that won’t change. For more information on dockerizing your API, check out https://hub.docker.com/r/trestletech/plumber/.


sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] plumber_0.4.4   rjson_0.2.15    forcats_0.2.0   stringr_1.2.0  
##  [5] dplyr_0.7.4     purrr_0.2.4     readr_1.1.1     tidyr_0.7.2    
##  [9] tibble_1.3.4    ggplot2_2.2.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.3.1          ddalpha_1.3.1       sfsmisc_1.1-1      
##  [4] jsonlite_1.5        splines_3.4.2       foreach_1.4.3      
##  [7] prodlim_1.6.1       modelr_0.1.1        assertthat_0.2.0   
## [10] stats4_3.4.2        DRR_0.0.2           cellranger_1.1.0   
## [13] yaml_2.1.15         robustbase_0.92-8   ipred_0.9-6        
## [16] backports_1.1.1     lattice_0.20-35     glue_1.2.0         
## [19] digest_0.6.12       randomForest_4.6-12 rvest_0.3.2        
## [22] colorspace_1.3-2    recipes_0.1.1       httpuv_1.3.5       
## [25] htmltools_0.3.6     Matrix_1.2-12       plyr_1.8.4         
## [28] psych_1.7.8         timeDate_3042.101   pkgconfig_2.0.1    
## [31] CVST_0.2-1          broom_0.4.3         haven_1.1.0        
## [34] caret_6.0-77        bookdown_0.5        scales_0.5.0       
## [37] gower_0.1.2         lava_1.5.1          withr_2.1.0        
## [40] nnet_7.3-12         lazyeval_0.2.1      cli_1.0.0          
## [43] mnormt_1.5-5        survival_2.41-3     magrittr_1.5       
## [46] crayon_1.3.4        readxl_1.0.0        evaluate_0.10.1    
## [49] nlme_3.1-131        MASS_7.3-47         xml2_1.1.1         
## [52] dimRed_0.1.0        foreign_0.8-69      class_7.3-14       
## [55] blogdown_0.3        tools_3.4.2         hms_0.4.0          
## [58] kernlab_0.9-25      munsell_0.4.3       bindrcpp_0.2       
## [61] compiler_3.4.2      RcppRoll_0.2.2      rlang_0.1.4        
## [64] grid_3.4.2          iterators_1.0.8     rstudioapi_0.7     
## [67] rmarkdown_1.8       gtable_0.2.0        ModelMetrics_1.1.0 
## [70] codetools_0.2-15    reshape2_1.4.2      R6_2.2.2           
## [73] lubridate_1.7.1     knitr_1.17          bindr_0.1          
## [76] rprojroot_1.2       stringi_1.1.6       parallel_3.4.2     
## [79] Rcpp_0.12.14        rpart_4.1-11        tidyselect_0.2.3   
## [82] DEoptimR_1.0-8