## Forecasting in NYC: 25-27 June 2018

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

In late June, I will be in New York to teach my 3-day workshop on Forecasting using R. Tickets are available at Eventbrite.
This is the first time I’ve taught this workshop in the US, having previously run it in the Netherlands and Australia. It will be based on the 2nd edition of my book “Forecasting: Principles and Practice” with George Athanasopoulos. All participants will get a print version of the book.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Mapping earthquakes off the Irish Coast using the leaflet package in R

By S. Walsh

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

Ireland is a typically stable region from a seismic activity perspective as it is distant from major plate boundaries where subduction and sea-floor spreading occur. However, in reading the following article, I was surprised to discover that earthquakes occur quite frequently both off the Irish coast and within the country itself. Most of these events occur as a result of tension and pressure release in the crustal rock.

The Irish National Seismic Network (INSN) maintains a dataset on 144 local earthquakes dating back to February 1980. The variables in the dataset are event date, event time, latitude, longitude, magnitude, region and subjective intensity (with levels such as “felt”, “feltIII”, “feltIV”).

## Distribution of local earthquake magnitudes

The median magnitude recorded since 1980 is 1.5 on the Richter Scale. The largest intensity recorded was magnitude 5.4 on the 19th July 1984 in the Lleyn Peninsula region of Wales. This event occurred near the village of Llanaelhaearn and is the location of the biggest earthquake in the past 50 years.

## Average waiting time between events

How frequently do seismic events occur around Ireland and its coast? It turns out that the median time between events is just 25 days! The greatest interval (1,335 days) is between an event recorded in New Ross, Co. Waterford on the 19th April 2002 and a magnitude 2.8 in the Irish Sea on the 14th December 2005. The distribution of waiting times is plotted below with outliers included and removed.

## Geospatial mapping of earthquake distribution

The leaflet package for R was used to map the data. This wonderful package allows one to create excellent interactive data visualisations. The map below is a static .png with no interactivity but it shows the distribution well.

With the interactive version, each data point can be investigated by clicking on it to bring up a box containing additional information about the event. Zooming in and out is also possible with the interactive version as is changing the base map layer and other aesthetics.

## Packages used

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.

Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2017). dplyr: A Grammar of Data Manipulation. R package version 0.7.4. https://CRAN.R-project.org/package=dplyr

Joe Cheng, Bhaskar Karambelkar and Yihui Xie (2017). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. R package version 1.1.0. https://CRAN.R-project.org/package=leaflet

Ramnath Vaidyanathan, Yihui Xie, JJ Allaire, Joe Cheng and Kenton Russell (2018).
htmlwidgets: HTML Widgets for R. R package version 1.0.
https://CRAN.R-project.org/package=htmlwidgets

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Predatory Journals and R

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

My paper about the penetration of predatory journals in Brazil, Is predatory publishing a real threat? Evidence from a large database study, just got published in Scientometrics!. The working paper version is available in SSRN.

This is a nice example of a data-intensive scientific work cycle, from gathering data to reporting results. Everything was done in R, using web scrapping algorithms, parallel processing, tidyverse packages and more. This was a special project for me, given its implications in science making in Brazil. It took me nearly one year to produce and execute the whole code. It is also a nice case of the capabilities of package ggplot2 in producing publication-ready figures. As a side output, our database of predatory journals is available as a shiny app.

More details about the study itself is available in the paper. Our abstract is as follows:

Using a database of potential, possible, or probable predatory scholarly open-access journals, the objective of this research is to study the penetration of predatory publications in the Brazilian academic system and the profile of authors in a cross-section empirical study. Based on a massive amount of publications from Brazilian researchers of all disciplines during the 2000–2015 period, we were able to analyze the extent of predatory publications using an econometric modeling. Descriptive statistics indicate that predatory publications represent a small overall proportion, but grew exponentially in the last 5 years. Departing from prior studies, our analysis shows that experienced researchers with a high number of non-indexed publications and PhD obtained locally are more likely to publish in predatory journals. Further analysis shows that once a journal regarded as predatory is listed in the local ranking system, the Qualis, it starts to receive more publications than non-predatory ones.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Update: Can we predict flu outcome with Machine Learning in R?

(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

Since I migrated my blog from Github Pages to blogdown and Netlify, I wanted to start migrating (most of) my old posts too – and use that opportunity to update them and make sure the code still works.

Here I am updating my very first machine learning post from 27 Nov 2016: Can we predict flu deaths with Machine Learning and R?. Changes are marked as bold comments.

The main changes I made are:

• using the `tidyverse` more consistently throughout the analysis
• focusing on comparing multiple imputations from the `mice` package, rather than comparing different algorithms
• using `purrr`, `map()`, `nest()` and `unnest()` to model and predict the machine learning algorithm over the different imputed datasets

Among the many nice R packages containing data collections is the outbreaks package. It contains a dataset on epidemics and among them is data from the 2013 outbreak of influenza A H7N9 in China as analysed by Kucharski et al. (2014):

A. Kucharski, H. Mills, A. Pinsent, C. Fraser, M. Van Kerkhove, C. A. Donnelly, and S. Riley. 2014. Distinguishing between reservoir exposure and human-to-human transmission for emerging pathogens using case onset data. PLOS Currents Outbreaks. Mar 7, edition 1. doi: 10.1371/currents.outbreaks.e1473d9bfc99d080ca242139a06c455f.

A. Kucharski, H. Mills, A. Pinsent, C. Fraser, M. Van Kerkhove, C. A. Donnelly, and S. Riley. 2014. Data from: Distinguishing between reservoir exposure and human-to-human transmission for emerging pathogens using case onset data. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.2g43n.

I will be using their data as an example to show how to use Machine Learning algorithms for predicting disease outcome.

``````library(outbreaks)
library(tidyverse)
library(plyr)
library(mice)
library(caret)
library(purrr)``````

# The data

The dataset contains case ID, date of onset, date of hospitalization, date of outcome, gender, age, province and of course outcome: Death or Recovery.

## Pre-processing

Change: variable names (i.e. column names) have been renamed, dots have been replaced with underscores, letters are all lower case now.

Change: I am using the tidyverse notation more consistently.

First, I’m doing some preprocessing, including:

• renaming missing data as NA
• setting column types
• gathering date columns
• changing factor names of dates (to make them look nicer in plots) and of `province` (to combine provinces with few cases)
``````fluH7N9_china_2013\$age[which(fluH7N9_china_2013\$age == "?")] %
mutate(case_id = paste("case", case_id, sep = "_"),
age = as.numeric(age)) %>%
gather(Group, Date, date_of_onset:date_of_outcome) %>%
mutate(Group = as.factor(mapvalues(Group, from = c("date_of_onset", "date_of_hospitalisation", "date_of_outcome"),
to = c("date of onset", "date of hospitalisation", "date of outcome"))),
province = mapvalues(province, from = c("Anhui", "Beijing", "Fujian", "Guangdong", "Hebei", "Henan", "Hunan", "Jiangxi", "Shandong", "Taiwan"), to = rep("Other", 10)))``````

I’m also

• adding a third gender level for unknown gender
``levels(fluH7N9_china_2013_gather\$gender) ``
``````##   case_id outcome gender age province         Group       Date
## 1  case_1   Death      m  58 Shanghai date of onset 2013-02-19
## 2  case_2   Death      m   7 Shanghai date of onset 2013-02-27
## 3  case_3   Death      f  11    Other date of onset 2013-03-09
## 4  case_4          f  18  Jiangsu date of onset 2013-03-19
## 5  case_5 Recover      f  20  Jiangsu date of onset 2013-03-19
## 6  case_6   Death      f   9  Jiangsu date of onset 2013-03-21``````

For plotting, I am defining a custom `ggplot2` theme:

``my_theme ``

And use that theme to visualize the data:

``````ggplot(data = fluH7N9_china_2013_gather, aes(x = Date, y = age, fill = outcome)) +
stat_density2d(aes(alpha = ..level..), geom = "polygon") +
geom_jitter(aes(color = outcome, shape = gender), size = 1.5) +
geom_rug(aes(color = outcome)) +
scale_y_continuous(limits = c(0, 90)) +
labs(
fill = "Outcome",
color = "Outcome",
alpha = "Level",
shape = "Gender",
x = "Date in 2013",
y = "Age",
title = "2013 Influenza A H7N9 cases in China",
subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
caption = ""
) +
facet_grid(Group ~ province) +
my_theme() +
scale_shape_manual(values = c(15, 16, 17)) +
scale_color_brewer(palette="Set1", na.value = "grey50") +
scale_fill_brewer(palette="Set1")``````

``````ggplot(data = fluH7N9_china_2013_gather, aes(x = Date, y = age, color = outcome)) +
geom_point(aes(color = outcome, shape = gender), size = 1.5, alpha = 0.6) +
geom_path(aes(group = case_id)) +
facet_wrap( ~ province, ncol = 2) +
my_theme() +
scale_shape_manual(values = c(15, 16, 17)) +
scale_color_brewer(palette="Set1", na.value = "grey50") +
scale_fill_brewer(palette="Set1") +
labs(
color = "Outcome",
shape = "Gender",
x = "Date in 2013",
y = "Age",
title = "2013 Influenza A H7N9 cases in China",
subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
caption = "nTime from onset of flu to outcome."
)``````

# Features

In machine learning-speak features are what we call the variables used for model training. Using the right features dramatically influences the accuracy and success of your model.

For this example, I am keeping age, but I am also generating new features from the date information and converting gender and province into numerical values.

``````dataset %
mutate(hospital = as.factor(ifelse(is.na(date_of_hospitalisation), 0, 1)),
gender_f = as.factor(ifelse(gender == "f", 1, 0)),
province_Jiangsu = as.factor(ifelse(province == "Jiangsu", 1, 0)),
province_Shanghai = as.factor(ifelse(province == "Shanghai", 1, 0)),
province_Zhejiang = as.factor(ifelse(province == "Zhejiang", 1, 0)),
province_other = as.factor(ifelse(province == "Zhejiang" | province == "Jiangsu" | province == "Shanghai", 0, 1)),
days_onset_to_outcome = as.numeric(as.character(gsub(" days", "",
as.Date(as.character(date_of_outcome), format = "%Y-%m-%d") -
as.Date(as.character(date_of_onset), format = "%Y-%m-%d")))),
days_onset_to_hospital = as.numeric(as.character(gsub(" days", "",
as.Date(as.character(date_of_hospitalisation), format = "%Y-%m-%d") -
as.Date(as.character(date_of_onset), format = "%Y-%m-%d")))),
age = age,
early_onset = as.factor(ifelse(date_of_onset %
subset(select = -c(2:4, 6, 8))
rownames(dataset) ``````
``````##   case_id outcome age hospital gender_f province_Jiangsu province_Shanghai
## 1       1   Death  87        0        0                0                 1
## 2       2   Death  27        1        0                0                 1
## 3       3   Death  35        1        1                0                 0
## 4       4      45        1        1                1                 0
## 5       5 Recover  48        1        1                1                 0
## 6       6   Death  32        1        1                1                 0
##   province_Zhejiang province_other days_onset_to_outcome
## 1                 0              0                    13
## 2                 0              0                    11
## 3                 0              1                    31
## 4                 0              0                    NA
## 5                 0              0                    57
## 6                 0              0                    36
##   days_onset_to_hospital early_onset early_outcome
## 1                     NA           1             1
## 2                      4           1             1
## 3                     10           1             1
## 4                      8           1            NA
## 5                     11           1             0
## 6                      7           1             1``````
``summary(dataset\$outcome)``
``````##   Death Recover    NA's
##      32      47      57``````

## Imputing missing values

I am using the `mice` package for imputing missing values

Note: Since publishing this blogpost I learned that the idea behind using `mice` is to compare different imputations to see how stable they are, instead of picking one imputed set as fixed for the remainder of the analysis. Therefore, I changed the focus of this post a little bit: in the old post I compared many different algorithms and their outcome; in this updated version I am only showing the Random Forest algorithm and focus on comparing the different imputed datasets. I am ignoring feature importance and feature plots because nothing changed compared to the old post.

• `md.pattern()` shows the pattern of missingness in the data:
``md.pattern(dataset)``
``````##    case_id hospital province_Jiangsu province_Shanghai province_Zhejiang
## 42       1        1                1                 1                 1
## 27       1        1                1                 1                 1
##  2       1        1                1                 1                 1
##  2       1        1                1                 1                 1
## 18       1        1                1                 1                 1
##  1       1        1                1                 1                 1
## 36       1        1                1                 1                 1
##  3       1        1                1                 1                 1
##  3       1        1                1                 1                 1
##  2       1        1                1                 1                 1
##          0        0                0                 0                 0
##    province_other age gender_f early_onset outcome early_outcome
## 42              1   1        1           1       1             1
## 27              1   1        1           1       1             1
##  2              1   1        1           1       1             0
##  2              1   1        1           0       1             1
## 18              1   1        1           1       0             0
##  1              1   1        1           1       1             0
## 36              1   1        1           1       0             0
##  3              1   1        1           0       1             0
##  3              1   1        1           0       0             0
##  2              1   0        0           0       1             0
##                 0   2        2          10      57            65
##    days_onset_to_outcome days_onset_to_hospital
## 42                     1                      1   0
## 27                     1                      0   1
##  2                     0                      1   2
##  2                     0                      0   3
## 18                     0                      1   3
##  1                     0                      0   3
## 36                     0                      0   4
##  3                     0                      0   4
##  3                     0                      0   5
##  2                     0                      0   6
##                       67                     74 277``````
• `mice()` generates the imputations
``dataset_impute ``
• by default, `mice()` calculates five (m = 5) imputed data sets
• we can combine them all in one output with the `complete("long")` function
• I did not want to impute missing values in the `outcome` column, so I have to merge it back in with the imputed data
``````datasets_complete %
select(-.id)``````
``head(datasets_complete)``
``````##   case_id outcome .imp age hospital gender_f province_Jiangsu
## 1       1   Death    1  87        0        0                0
## 2       2   Death    1  27        1        0                0
## 3       3   Death    1  35        1        1                0
## 4       4        1  45        1        1                1
## 5       5 Recover    1  48        1        1                1
## 6       6   Death    1  32        1        1                1
##   province_Shanghai province_Zhejiang province_other days_onset_to_outcome
## 1                 1                 0              0                    13
## 2                 1                 0              0                    11
## 3                 0                 0              1                    31
## 4                 0                 0              0                    86
## 5                 0                 0              0                    57
## 6                 0                 0              0                    36
##   days_onset_to_hospital early_onset early_outcome
## 1                      1           1             1
## 2                      4           1             1
## 3                     10           1             1
## 4                      8           1             0
## 5                     11           1             0
## 6                      7           1             1``````

Let’s compare the distributions of the five different imputed datasets:

``````datasets_complete %>%
gather(x, y, age:early_outcome) %>%
ggplot(aes(x = y, fill = .imp, color = .imp)) +
facet_wrap(~ x, ncol = 3, scales = "free") +
geom_density(alpha = 0.4) +
scale_fill_brewer(palette="Set1", na.value = "grey50") +
scale_color_brewer(palette="Set1", na.value = "grey50") +
my_theme()``````

# Test, train and validation data sets

Now, we can go ahead with machine learning!

The dataset contains a few missing values in the `outcome` column; those will be the test set used for final predictions (see the old blog post for this).

``train_index ``

The remainder of the data will be used for modeling. Here, I am splitting the data into 70% training and 30% test data.

Because I want to model each imputed dataset separately, I am using the `nest()` and `map()` functions.

``````set.seed(42)
val_data %
group_by(.imp) %>%
nest() %>%
mutate(val_index = map(data, ~ createDataPartition(.\$outcome, p = 0.7, list = FALSE)),
val_train_data = map2(data, val_index, ~ .x[.y, ]),
val_test_data = map2(data, val_index, ~ .x[-.y, ]))``````

# Machine Learning algorithms

## Random Forest

To make the code tidier, I am first defining the modeling function with the parameters I want.

``model_function ``

Next, I am using the nested tibble from before to `map()` the model function, predict the outcome and calculate confusion matrices.

``````set.seed(42)
val_data_model %
mutate(model = map(val_train_data, ~ model_function(.x)),
predict = map2(model, val_test_data, ~ data.frame(prediction = predict(.x, .y[, -2]))),
predict_prob = map2(model, val_test_data, ~ data.frame(outcome = .y[, 2],
prediction = predict(.x, .y[, -2], type = "prob"))),
confusion_matrix = map2(val_test_data, predict, ~ confusionMatrix(.x\$outcome, .y\$prediction)),
confusion_matrix_tbl = map(confusion_matrix, ~ as.tibble(.x\$table)))``````

## Comparing accuracy of models

To compare how the different imputations did, I am plotting

• the confusion matrices:
``````val_data_model %>%
unnest(confusion_matrix_tbl) %>%
ggplot(aes(x = Prediction, y = Reference, fill = n)) +
facet_wrap(~ .imp, ncol = 5, scales = "free") +
geom_tile() +
scale_fill_viridis_c() +
my_theme()``````

• and the prediction probabilities for correct and wrong predictions:
``````val_data_model %>%
unnest(predict_prob) %>%
gather(x, y, prediction.Death:prediction.Recover) %>%
ggplot(aes(x = x, y = y, fill = outcome)) +
facet_wrap(~ .imp, ncol = 5, scales = "free") +
geom_boxplot() +
scale_fill_brewer(palette="Set1", na.value = "grey50") +
my_theme()``````

Hope, you found that example interesting and helpful!

``sessionInfo()``
``````## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base
##
## other attached packages:
##  [1] bindrcpp_0.2       caret_6.0-78       mice_2.46.0
##  [4] lattice_0.20-35    plyr_1.8.4         forcats_0.3.0
##  [7] stringr_1.3.0      dplyr_0.7.4        purrr_0.2.4
## [13] ggplot2_2.2.1.9000 tidyverse_1.2.1    outbreaks_1.3.0
##
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131.1      lubridate_1.7.3     dimRed_0.1.0
##  [4] RColorBrewer_1.1-2  httr_1.3.1          rprojroot_1.3-2
##  [7] tools_3.4.3         backports_1.1.2     R6_2.2.2
## [10] rpart_4.1-13        lazyeval_0.2.1      colorspace_1.3-2
## [13] nnet_7.3-12         withr_2.1.1.9000    tidyselect_0.2.4
## [16] mnormt_1.5-5        compiler_3.4.3      cli_1.0.0
## [19] rvest_0.3.2         xml2_1.2.0          labeling_0.3
## [22] bookdown_0.7        scales_0.5.0.9000   sfsmisc_1.1-1
## [25] DEoptimR_1.0-8      psych_1.7.8         robustbase_0.92-8
## [28] randomForest_4.6-12 digest_0.6.15       foreign_0.8-69
## [31] rmarkdown_1.8       pkgconfig_2.0.1     htmltools_0.3.6
## [37] rstudioapi_0.7      bindr_0.1           jsonlite_1.5
## [40] ModelMetrics_1.1.0  magrittr_1.5        Matrix_1.2-12
## [43] Rcpp_0.12.15        munsell_0.4.3       stringi_1.1.6
## [46] yaml_2.1.17         MASS_7.3-49         recipes_0.1.2
## [49] grid_3.4.3          parallel_3.4.3      crayon_1.3.4
## [52] haven_1.1.1         splines_3.4.3       hms_0.4.1
## [55] knitr_1.20          pillar_1.2.1        reshape2_1.4.3
## [58] codetools_0.2-15    stats4_3.4.3        CVST_0.2-1
## [61] glue_1.2.0          evaluate_0.10.1     blogdown_0.5
## [64] modelr_0.1.1        foreach_1.4.4       cellranger_1.1.0
## [67] gtable_0.2.0        kernlab_0.9-25      assertthat_0.2.0
## [70] DRR_0.0.3           xfun_0.1            gower_0.1.2
## [73] prodlim_1.6.1       broom_0.4.3         e1071_1.6-8
## [76] class_7.3-14        survival_2.41-3     viridisLite_0.3.0
## [79] timeDate_3043.102   RcppRoll_0.2.2      iterators_1.0.9
## [82] lava_1.6            ipred_0.9-6``````

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Sketchnotes from TWiML&AI #121: Reproducibility and the Philosophy of Data with Clare Gollnick

(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Reproducibility and the Philosophy of Data with Clare Gollnick:

Sketchnotes from TWiMLAI talk #121: Reproducibility and the Philosophy of Data with Clare Gollnick

You can listen to the podcast here.

In this episode, i’m joined by Clare Gollnick, CTO of Terbium Labs, to discuss her thoughts on the “reproducibility crisis” currently haunting the scientific landscape. For a little background, a “Nature” survey in 2016 showed that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Clare gives us her take on the situation, and how it applies to data science, along with some great nuggets about the philosophy of data and a few interesting use cases as well. We also cover her thoughts on Bayesian vs Frequentist techniques and while we’re at it, the Vim vs Emacs debate. No, actually I’m just kidding on that last one. But this was indeed a very fun conversation that I think you’ll enjoy! https://twimlai.com/twiml-talk-121-reproducibility-philosophy-data-clare-gollnick/

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Fitting GAMs with brms: part 1

(This article was first published on From the Bottom of the Heap – R, and kindly contributed to R-bloggers)

Regular readers will know that I have a somewhat unhealthy relationship with GAMs and the mgcv package. I use these models all the time in my research but recently we’ve been hitting the limits of the range of models that mgcv can fit. So I’ve been looking into alternative ways to fit the GAMs I want to fit but which can handle the kinds of data or distributions that have been cropping up in our work. The brms package (Bürkner, 2017) is an excellent resource for modellers, providing a high-level R front end to a vast array of model types, all fitted using Stan. brms is the perfect package to go beyond the limits of mgcv because brms even uses the smooth functions provided by mgcv, making the transition easier. In this post I take a look at how to fit a simple GAM in brms and compare it with the same model fitted using mgcv.

In this post we’ll use the following packages. If you don’t know schoenberg, it’s a package I’m writing to provide `ggplot` versions of plots that can be produced by mgcv from fitted GAM objects. schoenberg is in early development, but it currentl works well enough to plot the models we fit here. If you’ve never come across this package before, you can install it from Github using `devtools::install_github(‘gavinsimpson/schoenberg')`

To illustrate brms GAM-fittgin chops, we’ll use the `mcycle` data set that comes with the MASS package. It contains a set of measurements of the acceleration force on a rider’s head during a simulated motorcycle collision and the time, in milliseconds, post collision. The data are loaded using `data()` and we take a look at the first few rows

The aim is to model the acceleration force (`accel`) as a function of time post collision (`times`). The plot below shows the data.

We’ll model acceleration as a smooth function of time using a GAM and the default thin plate regression spline basis. This can be done using the `gam()` function in mgcv ad for comparison with the fully bayesian model we’ll fit shortly, we use `method = “REML” to estimate the smoothness parameter for the spline in mixed model form and REML

As we can see from the model summary, the estimated smooth uses about 8.5 effective degrees of freedom and in the test of zero effect, the null hypothesis is strongly rejected. The fitted spline explains about 80% of the variance or deviance in the data.

To plot the fitted smooth we could use the `plot()` method provided by mgcv, but this uses base graphics. Instead we can use the `draw()` method from schoenberg, which can currently handle most of the univariate smooths in mgcv plus 2-d tensor product smooths

The equivalent model can be estimated using a fully-bayesian approach via the `brm()` function in the brms package. In fact, `brm()` will use the smooth specification functions from mgcv, making our lives much easier. The major difference though is that you can’t use `te()` or `ti()` smooths in `brm()` models; you need to use `t2()` tensor product smooths instead. This is because the smooths in the model are going to be treated as random effects and the model estimated as a GLMM, which exploits the duality of splines as random effects. In this representation, the wiggly parts of the spline basis are treated as a random effect and their associated variance parameter controls the degree of wiggliness of the fitted spline. The perfectly smooth parts of the basis are treated as a fixed effect. In this form, the GAM can be estimated using standard GLMM software; it’s what allows the `gamm4()` function for example to fit GAMMs using the lme4 package for example. This is also the reason why we can’t use `te()` or `ti()` smooths; those smooths do not have nicely separable penalties which means they can’t be written in the form required to be fitted using typical mixed model software.

The `brm()` version of the GAM is fitted using the code below. Note that I have changed a few things from their default values as

1. the model required more than the default number of MCMC samples — `iter = 4000`,
2. the samples needed thinning to deal with some strong autocorrelation in the Markov chains — `thin = 10`,
3. the `adapt.delta` parameter, a tuning parameter in he NUTS sampler for Hamiltonian Monte Carlo, potentially needed raising — there was a warning about a potential divergent transition but I should have looked to see if it was one or not; instead I just increased the tuning parameter to `0.99`,
4. four chains fitted by default but I wanted these to be fitted using 4 CPU `cores`,
5. `seed` sets the internal random number generator seed, which allows reproducibility of models, and
6. for this post I didn’t want to print out the progress of the sampler — `refresh = 0` — typically you won’t want to do this so you can see how the sampling is progressing.

The rest of the model is pretty similar to the `gam()` version we fitted earlier. The main difference is that I use the `bf()` function to create a special brms formula specifying the model. You don’t actually need to do this for such a simple model, but in a later post we’ll use this to fit distributional GAMs. Note that I’m leaving all the priors in the model at the default values. I’ll look at defining priors in a later post; for now I’m just going to use the default priors that `brm()` uses

Once the model has finished compiling and sampling we can output the model summary

This outputs details of the model fitted plus parameter estimates (as posterior means), standard errors, (by default) 95% credible intervals and two other diagnostics:

1. `Eff.Sample` is the effective sample size of the posterior samples in the model, and
2. `Rhat` is the potential scale reduction factor or Gelman-Rubin diagnostic and is a measure of how well the chains have converged and ideally should be equal to `1`.

The summary includes two entries for the smooth of `times`:

1. `sds(stimes_1)` is the variance parameter, which has the effect of controlling the wiggliness of the smooth — the larger this value the more wiggly the smooth. We can see that the credible interval doesn’t include 0 so there is evidence that a smooth is required over and above a linear parametric effect of `times`, details of which are given next,
2. `stimes_1` is the fixed effect part of the spline, which is the linear function that is perfectly smooth.

The final parameter table includes information on the variance of the data about the conditional mean of the response.

How does this model compare with the one fitted using `gam()`? We can use the `gam.vcomp()` function to compute the variance component representation of the smooth estimated via `gam()`. To make it comparable with the value shown for the brms model, we don’t undo the rescaling of the penalty matrix that `gam()` performs to help with numeric stability during model fitting.

This gives a posterior mean of 807.89 with 95% confidence interval of 480.66–1357.88, which compares well with posterior mean and credible interval of the `brm()` version of 722.44 (450.17 – 1150.27).

The `marginal_smooths()` function is used to extract the marginal effect of the spline.

This function extracts enough information about the estimated spline to plot it using the `plot()` method

Given the similarity in the variance components of the two models it is not surprising the two estimated smooth also look similar. The `marginal_smooths()` is effectively the equivalent of the `plot()` method for mgcv-based GAMs.

There’s a lot that we can and should do to check the model fit. For now, we’ll look at two posterior predictive check plots that brms, via the bayesplot package (Gabry and Mahr, 2018), makes very easy to produce using the `pp_check()` function.

The default produces a density plot overlay of the original response values (the thick black line) with 10 draws from the posterior distribution of the model. If the model is a good fit to the data, samples of data sampled from it at the observed values of the covariate(s) should be similar to one another.

Another type of posterior predictive check plot is the empirical cumulative distribution function of the observations and random draws from the model posterior, which we can produce with `type = “ecdf_overlay”`

Both plots show significant deviations between the the posterior simulations and the observed data. The poor posterior predictive check results are in large part due to the non-constant variance of the acceleration data conditional upon the covariate. Both models assumed that the observation are distributed Gaussian with means equal to the fitted values (estimated expectation of the response) with the same variance (^2). The observations appear to have different variances, which we can model with a distributional model, which allow all parameters of the distribution of the response to be modelled with linear predictors. We’ll take a look at these models in a future post.

### References

Bürkner, P.-C. (2017). brms: An R package for bayesian multilevel models using Stan. Journal of Statistical Software 80, 1–28. doi:10.18637/jss.v080.i01.

Gabry, J., and Mahr, T. (2018). Bayesplot: Plotting for bayesian models. Available at: https://CRAN.R-project.org/package=bayesplot.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## R/Finance 2018 Registration

This year marks the 10th anniversary of the R/Finance Conference! As in prior years, we expect more than 250 attendees from around the world. R users from industry, academia, and government will joining 50+ presenters covering all areas of finance with R. The conference will take place on June 1st and 2nd, at UIC in Chicago.

You can find registration information on the conference website, or you can go directly to the Cvent registration page.

Note that registration fees will increase by 50% at the end of early registration on May 21, 2018.

We are very excited about keynote presentations by JJ Allaire, Li Deng, and Norm Matloff. The conference agenda (currently) includes 18 full presentations and 33 shorter “lightning talks”. As in previous years, several (optional) pre-conference seminars are offered on Friday morning. We’re still working on the agenda, but we have another great lineup of speakers this year!

There is also an (optional) conference dinner at Wyndham Grand Chicago Riverfront in the 39th Floor Penthouse Ballroom and Terrace. Situated directly on the riverfront, it is a perfect venue to continue conversations while dining and drinking.

We would to thank our 2018 Sponsors for the continued support enabling us to host such an exciting conference:

On behalf of the committee and sponsors, we look forward to seeing you in Chicago!

Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson, Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Painless ODBC + dplyr Connections to Amazon Athena and Apache Drill with R & odbc

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my `metis` package which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R so I’m also comfortable with Java, JDBC and keeping R working with Java. I notified the #rstats Twitterverse about it and it started this thread (click on the embed to go to it — and, yes, this means Twitter is tracking you via this post unless you’ve blocked their JavaScript):

The (GitHub only for now) #rstats metis package for wiring up R to @amazonathema via RJDBC now uses & includes the new Simba Athena JDBC Driver 2.0.2 JAR https://t.co/wvwV6IxCNd (cc: @dabdine)

— hrbrmstr (@hrbrmstr) April 20, 2018

If you do scroll through the thread you’ll see @hadleywickham suggested using the `odbc` package with the ODBC driver for Athena.

I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an `odbc` + `dplyr` connection almost instantly:

``````library(odbc)
library(tidyverse)

DBI::dbConnect(
odbc::odbc(),
driver = "Simba Athena ODBC Driver",
Schema = "redacted",
AwsRegion = "us-east-1",
AuthenticationType = "Default Credentials",
S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl ``````

Apologies for the redaction and lack of output but we’ve removed the default example databases from our work Athena environment and I’m not near my personal systems, so a more complete example will have to wait until later.

The TLDR is that I can now use 100% `dplyr` idioms with Athena vs add one to the RJDBC driver I made for `metis`. The `metis` package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).

The downside is that I’m now even more likely to run up the AWS bill

I also maintain the `sergeant` package which provides REST API and REST query access to Apache Drill along with a REST API `DBI` driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago so I made the package (which was also a great learning experience).

I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.

``````library(odbc)
library(tidyverse)

DBI::dbConnect(
odbc::odbc(),
driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
ConnectionType = "Zookeeper",
AuthenticationType = "No Authentication",
ZKCLusterID = "CLUSTERID",
ZkQuorum = "HOST:2181",
ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con

(employee
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1
## # ... with more rows, and 9 more variables: department_id , birth_date ,
## #   hire_date , salary , supervisor_id , education_level ,
## #   marital_status , gender , management_role ##

count(employee, position_title, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n
##
##  1 Store Temporary Checker   268
##  2 Store Temporary Stocker   264
##  3 Store Permanent Checker   226
##  4 Store Permanent Stocker   222
##  5 Store Shift Supervisor    52
##  6 Store Permanent Butcher   32
##  7 Store Manager             24
##  8 Store Assistant Manager   24
##  9 Store Information Systems 16
## 10 HQ Finance and Accounting 8
## # ... with more rows##``````

Apart from having to do that `sql(…)` to make the table connection work, it was pretty painless and I had both Athena and Drill working with `dplyr` verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters and I’ll be updating my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing `sergeant` package but also be including some additional methods provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.

### FIN

The `odbc` package is super-slick and it’s refreshing to be able to use `dplyr` verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary as they are far more optimized than what would likely get pieced together via the `dplyr` verbs. However, those queries can also be put right into `sql()` with the Athena ODBC driver connection and used via the same `dplyr` verb magic afterwards.

Today is, indeed, a good day to query!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Packaging Shiny applications: A deep dive

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

(Or, how to write a Shiny app.R file that only contains a single line of code)

##### Mark Sellors, Head of Data Engineering

This post is long overdue. The information contained herein has been built up over years of deploying and hosting Shiny apps, particularly in production environments, and mainly where those Shiny apps are very large and contain a lot of code.

Last year, during some of my conference talks, I told the story of Mango’s early adoption of Shiny and how it wasn’t always an easy path to production for us. In this post I’d like to fill in some of the technical background and provide some information about Shiny app publishing and packaging that is hopefully useful to a wider audience.

I’ve figured out some of this for myself, but the most pivotal piece of information came from Shiny creator, Joe Cheng. Joe told me some time ago, that all you really need in an app.R file is a function that returns a Shiny application object. When he told me this, I was heavily embedded in the publication side and I didn’t immediately understand the implications.

Over time though I came to understand the power and flexibility that this model provides and, to a large extent, that’s what this post is about.

### What is Shiny?

Hopefully if you’re reading this you already know, but Shiny is a web application framework for Ri. It allows R users to develop powerful web applications entirely in R without having to understand HTML, CSS and JavaScript. It also allows us to embed the statistical power of R directly into those web applications.

Shiny apps generally consist of either a ui.R and a server.R (containing user interface and server-side logic respectively) or a single app.R which contains both.
Why package a Shiny app anyway?

If your app is small enough to fit comfortably in a single file, then packaging your application is unlikely to be worth it. As with any R script though, when it gets too large to be comfortably worked with as a single file, it can be useful to break it up into discrete components.

Publishing a packaged app will be more difficult, but to some extent that will depend on the infrastructure you have available to you.

### Pros of packaging

Packaging is one of the many great features of the R language. Packages are fairly straightforward, quick to create and you can build them with a host of useful features like built-in documentation and unit tests.

They also integrate really nicely into Continuous Integration (CI) pipelines and are supported by tools like Travis. You can also get test coverage reports using things like codecov.io.

They’re also really easy to share. Even if you don’t publish your package to CRAN, you can still share it on GitHub and have people install it with devtools, or build the package and share that around, or publish the package on a CRAN-like system within your organisation’s firewall.

### Cons of packaging

Before you get all excited and start to package your Shiny applications, you should be aware that — depending on your publishing environment — packaging a Shiny application may make it difficult or even impossible to publish to a system like Shiny Server or RStudio Connect, without first unpacking it again.

### A little bit of Mango history

This is where Mango were in the early days of our Shiny use. We had a significant disconnect between our data scientists writing the Shiny apps and the IT team tasked with supporting the infrastructure they used. This was before we’d committed to having an engineering team that could sit in the middle and provide a bridge between the two.

When our data scientists would write apps that got a little large or that they wanted robust tests and documentation for, they would stick them in packages and send them over to me to publish to our original Shiny Server. The problem was: R packages didn’t really mean anything to me at the time. I knew how to install them, but that was about as far as it went. I knew from the Shiny docs that a Shiny app needs certain files (server.R and ui.R or an app.R) file, but that wasn’t what I got, so I’d send it back to the data science team and tell them that I needed those files or I wouldn’t be able to publish it.

More than once I got back a response along the lines of, “but you just need to load it up and then do runApp()”. But, that’s just not how Shiny Server works. Over time, we’ve evolved a set of best practices around when and how to package a Shiny application.

The first step was taking the leap into understanding Shiny and R packages better. It was here that I started to work in the space between data science and IT.

### How to package a Shiny application

If you’ve seen the simple app you get when you choose to create a new Shiny application in RStudio, you’ll be familiar with the basic structure of a Shiny application. You need to have a UI object and a server function.

If you have a look inside the UI object you’ll see that it contains the html that will be used for building your user interface. It’s not everything that will get served to the user when they access the web application — some of that is added by the Shiny framework when it runs the application — but it covers off the elements you’ve defined yourself.

The server function defines the server-side logic that will be executed for your application. This includes code to handle your inputs and produce outputs in response.

The great thing about Shiny is that you can create something awesome quite quickly, but once you’ve mastered the basics, the only limit is your imagination.

For our purposes here, we’re going to stick with the ‘geyser’ application that RStudio gives you when you click to create a new Shiny Web Application. If you open up RStudio, and create a new Shiny app — choosing the single file app.R version — you’ll be able to see what we’re talking about. The small size of the geyser app makes it ideal for further study.

If you look through the code you’ll see that there are essentially three components: the UI object, the server function, and the `shinyApp()` function that actually runs the app.

Building an R package of just those three components is a case of breaking them out into the constituent parts and inserting them into a blank package structure. We have a version of this up on GitHub that you can check out if you want.

The directory layout of the demo project looks like this:

``````|-- DESCRIPTION
|-- NAMESPACE
|-- R
|   |-- launchApp.R
|   |-- shinyAppServer.R
|   `-- shinyAppUI.R
|-- inst
|   `-- shinyApp
|       `-- app.R
|-- man
|   |-- launchApp.Rd
|   |-- shinyAppServer.Rd
|   `-- shinyAppUI.Rd
`-- shinyAppDemo.Rproj``````

Once the app has been adapted to sit within the standard R package structure we’re almost done. The UI object and server function don’t really need to be exported, and we’ve just put a really thin wrapper function around `shinyApp()` — I’ve called it `launchApp()` — which we’ll actually use to launch the app. If you install the package from GitHub with devtools, you can see it in action.

``````library(shinyAppDemo)
launchApp()``````

This will start the Shiny application running locally.

The approach outlined here also works fine with Shiny Modules, either in the same package, or called from a separate package.

And that’s almost it! The only thing remaining is how we might deploy this app to Shiny server (including Shiny Server Pro) or RStudio Connect.

### Publishing your packaged Shiny app

We already know that Shiny Server and RStudio Connect expect either a ui.R and a server.R or an app.R file. We’re running our application out of a package with none of this, so we won’t be able to publish it until we fix this problem.

The solution we’ve arrived at is to create a directory called ‘shinyApp’ inside the inst directory of the package. For those of you who are new to R packaging, the contents of the ‘inst’ directory are essentially ignored during the package build process, so it’s an ideal place to put little extras like this.

The name ‘shinyApp’ was chosen for consistency with Shiny Server which uses a ‘shinyApps’ directory if a user is allowed to serve applications from their home directory.

Inside this directory we create a single ‘app.R’ file with the following line in it:

``shinyAppDemo::launchApp()``

And that really is it. This one file will allow us to publish our packaged application under some circumstances, which we’ll discuss shortly.

Here’s where having a packaged Shiny app can get tricky, so we’re going to talk you through the options and do what we can to point out the pitfalls.

### Shiny Server and Shiny Server Pro

Perhaps surprisingly — given that Shiny Server is the oldest method of Shiny app publication — it’s also the easiest one to use with these sorts of packaged Shiny apps. There are basically two ways to publish on Shiny Server. From your home directory on the server — also known as self-publishing — or publishing from a central location, usually the directory ‘/srv/shiny-server’.

The central benefit of this approach is the ability to update the application just by installing a newer version of the package. Sadly though, it’s not always an easy approach to take.

#### Apps served from home directory (AKA self-publishing)

The first publication method is from a users’ home directory. This is generally used in conjunction with RStudio Server. In the self-publishing model, Shiny Server (and Pro) expect apps to be found in a directory called ‘ShinyApps’, within the users home directory. This means that if we install a Shiny app in a package the final location of the app directory will be inside the installed package, not in the ShinyApps directory. In order to work around this, we create a link from where the app is expected to be, to where it actually is within the installed package structure.

So in the example of our package, we’d do something like this in a terminal session:

``````# make sure we're in our home directory
cd
# change into the shinyApps directory
cd shinyApps
# create a link from our app directory inside the package
ln -s /home/sellorm/R/x86_64-pc-linux-gnu-library/3.4/shinyAppDemo/shinyApp ./testApp``````

Note: The path you will find your libraries in will differ from the above. Check by running `.libPaths()[1]` and then `dir(.libPaths()[1])` to see if that’s where your packages are installed.

Once this is done, the app should be available at ‘http://:3838//’ and can be updated by updating the installed version of the package. Update the package and the updates will be published via Shiny Server straight away.

#### Apps Server from a central location (usually /srv/shiny-server)

This is essentially the same as above, but the task of publishing the application generally falls to an administrator of some sort.

Since they would have to transfer files to the server and log in anyway, it shouldn’t be too much of an additional burden to install a package while they’re there. Especially if that makes life easier from then on.

The admin would need to transfer the package to the server, install it and then create a link — just like in the example above — from the expected location, to the installed location.

The great thing with this approach is that when updates are due to be installed the admin only has to update the installed package and not any other files.

### RStudio Connect

Connect is the next generation Shiny Server. In terms of features and performance, it’s far superior to its predecessor. One of the best features is the ability to push Shiny app code directly from the RStudio IDE. For the vast majority of users, this is a huge productivity boost, since you no longer have to wait for an administrator to publish your app for you.

Since publishing doesn’t require anyone to directly log into the server as part of the publishing process, there aren’t really any straightforward opportunities to install a custom package. This means that, in general, publishing a packaged shiny application isn’t really possible.

There’s only one real workaround for this situation that I’m aware of. If you have an internal CRAN-like repository for your custom packages, you should be able to use that to update Connect, with a little work.

You’d need to have your dev environment and Connect hooked up to the same repo. The updated app package needs to be available in that repo and installed in your dev environment. Then, you could publish and then update the single line app.R for each successive package version you publish.

Connect uses packrat under the hood, so when you publish the app.R the packrat manifest will also be sent to the server. Connect will use the manifest to decide which packages are required to run your app. If you’re using a custom package this would get picked up and installed or updated during deployment.

### shinyapps.io

It’s not currently possible to publish a packaged application to shinyapps.io. You’d need to make sure your app followed the accepted conventions for creating Shiny apps and only uses files, rather than any custom packages.

### Conclusion

Packaging Shiny apps can be a real productivity boon for you and your team. In situations where you can integrate that process into other processes, such as automatically running your unit tests or automated publishing it can also help you adopt devops-style workflows.

However, in some instances, the practice can actually make things worse and really slow you down. It’s essential to understand what the publishing workflow is in your organisation before embarking on any significant Shiny packaging project as this will help steer you towards the best course of action.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Monkeying around with Code and Paying it Forward

(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)
``````library(tidyverse)
library(monkeylearn)
``````

This is a story (mostly) about how I started contributing to the rOpenSci package monkeylearn. I can’t promise any life flipturning upside down, but there will be a small discussion about git best practices which is almost as good . The tl;dr here is nothing novel but is something I wish I’d experienced firsthand sooner. That is, that tinkering with and improving on the code others have written is more rewarding for you and more valuable to others when you contribute it back to the original source.

We all write code all the time to graft aditional features onto existing tools or reshape output into forms that fit better in our particular pipelines. Chances are, these are improvements our fellow package users could take advantage of. Plus, if they’re integrated into the package source code, then we no longer need our own wrappers and reshapers and speeder-uppers. That means less code and fewer chances of bugs all around . So, tinkering with and improving on the code others have written is more rewarding for you and more valuable to others when you contribute it back to the original source.

### Some Backstory

My first brush with the `monkeylearn` package was at work one day when I was looking around for an easy way to classify groups of texts using R. I made the very clever first move of Googling “easy way to classify groups of texts using R” and thanks to the magic of what I suppose used to be PageRank I landed upon a GitHub README for a package called monkeylearn.

A quick `install.packages("monkeylearn")` and creation of an API key later it started looking like this package would fit my use case. I loved that it sported only two functions, `monkeylearn_classify()` and `monkeylearn_extract()`, which did exactly what they said on the tin. They accept a vector of texts and return a dataframe of classifications or keyword extractions, respectively.

For a bit of background, the `monkeylearn` package hooks into the MonkeyLearn API, which uses natural language processing techniques to take a text input and hands back a vector of outputs (keyword extractions or classifications) along with metadata such as their confidence in relevance of the classification. There are a set of built-in “modules” (e.g., retail classifier, profanity extractor) but users can also create their own “custom” modules1 by supplying their own labeled training data.

The monkeylearn R package serves as a friendly interface to that API, allowing users to process data using the built-in modules (it doesn’t yet support creating and training of custom modules). In the rOpenSci tradition it’s peer-reviewed and was contributed via the onboarding process.

I began using the package to attach classifications to around 70,000 texts. I soon discovered a major stumbling block: I could not send texts to the MonkeyLearn API in batches. This wasn’t because the `monkeylearn_classify()` and `monkeylearn_extract()` functions themselves didn’t accept multiple inputs. Instead, it was because they didn’t explicitly relate inputs to outputs. This became a problem because inputs and outputs are not 1:1; if I send a vector of three texts for classification, my output dataframe might be 10 rows long. However, there was no user-friendly way to know for sure2 whether the first two or the first four output rows, for example, belonged to the first input text.

Here’s an example of what I mean.

``texts ``
``````## # A tibble: 11 x 4
##    category_id probability label                text_md5
##
##  1    18314767      0.0620 Books                af55421029d7236ca6ecbb281…
##  2    18314954      0.0470 Mystery & Suspense   af55421029d7236ca6ecbb281…
##  3    18314957      0.102  Police Procedural    af55421029d7236ca6ecbb281…
##  4    18313210      0.0820 Party & Occasions    602f1ab2654b88f5c7f5c90e4…
##  5    18313231      0.176  "Party Supplies "    602f1ab2654b88f5c7f5c90e4…
##  6    18313235      0.134  Party Decorations    602f1ab2654b88f5c7f5c90e4…
##  7    18313236      0.406  Decorations          602f1ab2654b88f5c7f5c90e4…
##  8    18314767      0.0630 Books                bdb9881250321ce8abecacd4d…
##  9    18314870      0.0460 Literature & Fiction bdb9881250321ce8abecacd4d…
## 10    18314876      0.0400 Mystery & Suspense   bdb9881250321ce8abecacd4d…
## 11    18314878      0.289  Suspense             bdb9881250321ce8abecacd4d…
``````

So we can see we’ve now got classifications for the `texts` we fed in as input. The MD5 hash can be used to disambiguate which outputs correspond to which inputs in some cases (see Maëlle’s fantastic Guardian Experience post!). This works great if you either don’t care about classifying your inputs independently of one another or you know that your inputs will never contain empty strings or other values that won’t be sent to the API. In my case, though, my inputs were independent of one another and also could not be counted on to be well-formed. I determined that each had to be classified separately so that I could guarantee a 1:1 match between input and output.

### Initial Workaround

My first approach to this problem was to simply treat each text as a separate call. I wrapped `monkeylearn_classify()` in a function that would send a vector of texts and return a dataframe relating the input in one column to the output in the others. Here is a simplified version of it, sans the error handling and other bells and whistles:

``````initial_workaround %
mutate(
tags = NA_character_
)

for (i in 1:nrow(df)) {
this_text %
select(!!quo_col) %>%
slice(i) %>% as_vector()

this_classification %
select(-text_md5) %>% list()

out[i, ]\$tags ``````

Since `initial_workaround()` takes a dataframe as input rather than a vector, let’s turn our sample into a tibble before feeding it in.

``texts_df ``

And now we’ll run the workaround:

``initial_out ``
``````## # A tibble: 3 x 2
##   texts                                                           tags
##
## 1 It is a truth universally acknowledged, that a single man in p… ``````

We see that this retains the 1:1 relationship between input and output, but still allows the output list-col to be unnested.

``````(initial_out %>% unnest())
``````
``````## # A tibble: 11 x 4
##    texts                                   category_id probability label
##
##  1 It is a truth universally acknowledged…    18314767      0.0620 Books
##  2 It is a truth universally acknowledged…    18314954      0.0470 Myster…
##  3 It is a truth universally acknowledged…    18314957      0.102  Police…
##  4 When Mr. Bilbo Baggins of Bag End anno…    18313210      0.0820 Party …
##  5 When Mr. Bilbo Baggins of Bag End anno…    18313231      0.176  "Party…
##  6 When Mr. Bilbo Baggins of Bag End anno…    18313235      0.134  Party …
##  7 When Mr. Bilbo Baggins of Bag End anno…    18313236      0.406  Decora…
##  8 I'm not an ambiturner. I can't turn le…    18314767      0.0630 Books
##  9 I'm not an ambiturner. I can't turn le…    18314870      0.0460 Litera…
## 10 I'm not an ambiturner. I can't turn le…    18314876      0.0400 Myster…
## 11 I'm not an ambiturner. I can't turn le…    18314878      0.289  Suspen…
``````

But, the catch: this approach was quite slow. The real bottleneck here isn’t the for loop; it’s that this requires a round trip to the MonkeyLearn API for each individual text. For just these three meager texts, let’s see how long `initial_workaround()` takes to finish.

``(benchmark ``
``````##    user  system elapsed
##   0.036   0.001  15.609
``````

It was clear that if classifying 3 inputs was going to take 15.6 seconds, even classifying my relatively small data was going to take a looong time, like on the order of 4 days, just for the first batch of data . I updated the function to write each row out to an RDS file after it was classified inside the loop (with an addition along the lines of `write_rds(out[i, ], glue::glue("some_directory/{i}.rds"))`) so that I wouldn’t have to rely on the function successfully finishing execution in one run. Still, I didn’t like my options.

This classification job was intended to be run every night, and with an unknown amount of input text data coming in every day, I didn’t want it to run for more than 24 hours one day and either a) prevent the next night’s job from running or b) necessitate spinning up a second server to handle the next night’s data.

### Diving In

Now that I’m starting to think

I’m just about at the point where I have to start making myself useful.

I’d seen in the package docs and on the MonkeyLearn FAQ that batching up to 200 texts was possible2. So, I decide to first look into the mechanics of how text batching is done in the `monkeylearn` package.

Was the MonkeyLearn API returning JSON that didn’t relate each input individual and output? I sort of doubted it. You’d think that an API that was sent a JSON “array” of inputs would send back a hierarchical array to match. My hunch was that either the package was concatenating the input before shooting it off to the API (which would save user on API queries) or rowbinding the output after it was returned. (The rowbinding itself would be fine if each input could somehow be related to its one or many outputs.)

So I fork the package repo and set about rummaging through the source code. Blissfully, everything is nicely commented and the code was quite readable.

I step through `monkeylearn_classify()` in the debugger and narrow in on a call to what looks like a utility function: `monkeylearn_parse()`. I find it in `utils.R`.

The lines in `monkeylearn_parse()` that matter for our purposes are:

``text ``

So this is where the rowbinding happens – after the `fromJSON()` call!

This is good news because it means that the MonkeyLearn API is sending differentiated outputs back in a nested JSON object. The package converts this to a list with `fromJSON()` and only then is the rbinding applied. That’s why the `text_md5` hash is generated during this step: to be able to group outputs that all correspond to a single input (same hash means same input).

I set about copy-pasting `monkeylearn_parse()` and did a bit of surgery on it, emerging with `monkeylearn_parse_each()`. `monkeylearn_parse_each()` skips the rbinding and retains the list structure of each output, which means that its output can be turned into a nested tibble with each row corresponding to one input. That nested tibble can then be related to each corresponding element of the input vector. All that remained was to use create a new enclosing analog to `monkeylearn_classify()` that could use `monkeylearn_parse_each()`.

#### Thinking PR thoughts

At this point, I thought that such a function might be useful to some other people using the package so I started building it with an eye toward making a pull request.

Since I’d found it useful to be able to pass in an input dataframe in `initial_workaround()`, I figured I’d retain that feature of the function. I wanted users to still be able to pass in a bare column name but the package seemed to be light on tidyverse functions unless there was no alternative, so I un-tidyeval’d the function (using `deparse(substitute())` instead of a quosure) and gave it the imaginative name…`monkeylearn_classify_df()`. The rest of the original code was so airtight I didn’t have to change much more to get it working.

A nice side effect of my plumbing through the guts of the package was that I caught a couple minor bugs (things like the remnants of a for loop remaining in what had been revamped into a while loop) and noticed where there could be some quick wins for improving the package.

After a few more checks I wrote up the description for the pull request which outlined the issue and the solution (though I probably should have first opened an issue, waited for a response, and then submitted a PR referencing the issue as Mara Averick suggests in her excellent guide to contributing to the tidyverse).

I checked the list of package contributors to see if I knew anyone. Far and away the main contributor was Maëlle Salmon! I’d heard of her through the magic of #rstats Twitter and the R-Ladies Global Slack. A minute or two after submitting it I headed over to Slack to give her a heads up that a PR would be heading her way.

In what I would come to know as her usual cheerful, perpetually-on-top-of-it form, Maëlle had already seen it and liked the idea for the new function.

### Continuing Work

To make a short story shorter, Maëlle asked me if I’d like to create the extractor counterpart to `monkeylearn_classify_df()` and become an author on the package with push access to the repo. I said yes, of course, and so we began to strategize over Slack about tradeoffs like which package dependencies we were okay with taking on, whether to go the tidyeval or base route, what the best naming conventions for the new functions should be, etc.

On the naming front, we decided to gracefully deprecate `monkeylearn_classify()` and `monkeylearn_extract()` as the newer functions could cover all of the functionality that the older ones did. I don’t know much about cache invalidation, but the naming problem was hard as usual. We settled on naming their counterparts `monkey_classify()` (which replaced the original `monkeylearn_classify_df()`) and `monkey_extract()`.

#### gitflow

Early on in the process we started talking git conventions. Rather than both work off a development branch, I floated a structure that we typically follow at my company, where each ticket (or in this case, GitHub Issue) becomes its own branch off of dev. For instance, issue #33 becomes branch `T33` (T for ticket). Each of these feature branches come off of dev (unless they’re hotfixes) and are merged back into dev and deleted when they pass all the necessary checks. This approach, I am told, stems from the “gitflow” philosophy which, as far as I understand it, is one of many ways to structure a git workflow that mostly doesn’t end in tears.

Like most git strategies, the idea here is to make pull requests as bite-sized as possible; in this case, a PR can only be as big as the issue it’s named from. An added benefit for me, at least, is that this keeps me from wandering off into other parts of the code without first documenting the point in a separate issue, and then creating a branch. At most one person is assinged to each ticket/issue, which minimizes merge conflicts. You also leave a nice paper trail because the branch name directly references the issue front and center in its title. This means you don’t have to explicitly name the issue in the commit or rely on GitHub’s (albeit awesome) keyword branch closing system3.

Finally, since the system is so tied to issues themselves, it encourages very frequent communication between collaborators. Since the issue must necessarily be made before the branch and the accompanying changes to the code, the other contributors have a chance to weigh in on the issue or the approach suggested in its comments before any code is written. In our case, it’s certainly made frequent communication the path of least resistance.

While this branch and PR-naming convention isn’t particular to gitflow (to my knowledge), it did spark a short conversation on Twitter that I think is useful to have:

Thomas Lin Pedersen makes a good point on the topic:

I prefer named PRs as it gives a quick overview over opened PRs. While cross referencing with open issues is possible it is very tedious when you try to get an overview

— Thomas Lin Pedersen (@thomasp85)
March 6, 2018

<!–

<!–

–>

<!–

–>

<!– <a target="_blank" href=" –>

<!– –>

<!–

–>

<!– –>

This insight got me thinking that the best approach might be to explicitly name the issue number and give a description in the branch name, like a slug of sorts. I started using a branch syntax like `T31-fix-bug-i-just-created` which has worked out well for Maëlle and me thus far, making the history a bit more readable.

#### Main Improvements

As I mentioned, the package was so good to begin with it was difficult to find ways to improve it. Most of the subsequent work I did on `monkeylearn` was to improve the new `monkey_` functions.

The original `monkeylearn_` functions discarded inputs such as empty strings that could not be sent to the API. We now retain those empty inputs and return NAs in the response columns for that row. This means that the output is always of the same dimensions as the input. We return an unnested dataframe by default, as the original functions did, but allow the output to be nested if the `unnest` flag is set to FALSE.

The functions also got more informative messages about which batches are currently being processed and which texts those batches corresponded to.

``text_w_empties ``
``````## The following indices were empty strings and could not be sent to the API: 3, 5. They will still be included in the output.
``````
``````## Processing batch 1 of 2 batches: texts 1 to 2
``````
``````## Processing batch 2 of 2 batches: texts 2 to 3
``````
``````## # A tibble: 8 x 4
##   req                                      category_id probability label
##
## 1 It is a truth universally acknowledged,…       64708       0.289 Society
## 2 It is a truth universally acknowledged,…       64711       0.490 Relati…
## 3 When Mr. Bilbo Baggins of Bag End annou…       64708       0.348 Society
## 4 When Mr. Bilbo Baggins of Bag End annou…       64713       0.724 Specia…
## 5 ""                                                NA      NA
## 6 I'm not an ambiturner. I can't turn lef…       64708       0.125 Society
## 7 I'm not an ambiturner. I can't turn lef…       64710       0.377 Parent…
## 8 " "                                               NA      NA     ``````

So even though the empty string inputs like the 3rd and 5th, aren’t sent to the API, we can see they’re still included in the output dataframe and assigned the same column names as all of the other outputs. That means that even if `unnest` is set to FALSE, the output can still be unnested with `tidyr::unnest()` after the fact.

If a dataframe is supplied, there is now a `.keep_all` option which allows for all columns of the input to be retained, not just the column that contains the text to be classified. This makes the `monkey_` functions work even more like a `mutate()`; rather than returning an object that has to be joined on the original input, we do that association for the user.

``````sw %
dplyr::select(name, height) %>%
dplyr::sample_n(length(text_w_empties))

df %
dplyr::bind_cols(sw)

df %>% monkey_classify(text, classifier_id = "cl_5icAVzKR", unnest = FALSE, .keep_all = TRUE, verbose = FALSE)
``````
``````## # A tibble: 5 x 4
##   name           height text                                       res
##
## 1 Ackbar            180 It is a truth universally acknowledged, t… ``````

We see that the input column, `text` is sandwiched between the other columns of the original dataframe (the Starwars ones) and the output column `res`.

The hope is that all of this serves to improve the data safety and user experience of the package.

#### Developing functions in tandem

Something I’ve been thinking about while working on the twin functions `monkey_extract()` and `monkey_classify()` is what the best practice is for developing very similar functions in sync with one another. These two functions are different enough to have different default values (for example, `monkey_extract()` has a default `extractor_id` while `monkey_classify()` has a default `classifier_id`) but are so similar in other regards as to be sort of embarrassingly parallel.

What I’ve been turning over in my head is the question of how in sync these functions should be during development. As soon as you make a change to one function, should you immediately make the same change to the other? Or is it instead better to work on one function at a time, and, at some checkpoints then batch these changes over to the other function in a big copy-paste job? I’ve been tending toward the latter but it’s seemed a little dangerous to me.

Since there are only two functions to worry about here, creating a function factory to handle them seemed like overkill, but might technically be the best practice. I’d love to hear people’s thoughts on how they go about navigating this facet of package development.

### Last Thoughts

My work on the `monkeylearn package` so far has been rewarding to say the least. It’s inspired me to be not just a consumer but more of an active contributor to open source. Some wise words from Maëlle on this front:

You too can become a contributor to an rOpenSci package! Have a look at the issues tracker of your favorite rOpenSci package(s) e.g. rgbif. Browse issues suitable for beginners over many rOpenSci repos thanks to Lucy D’Agostino McGowan’s contributr Shiny app. Always first ask in a new or existing issue whether your contribution would be welcome, plan a bit with the maintainer, and then have fun! We’d be happy to have you.

Maëlle’s been a fantastic mentor, providing guidance in at least four languages – English, French, R, and emoji, despite the time difference and (!). When it comes to `monkeylearn`, the hope is to keep improving the core package features, add some more niceties, and look into building out an R-centric way for users to create and train their own custom modules on MonkeyLearn.

On y va!

1. Custom, to a point. As of this writing, two types of classifier models you can create use either Naive Bayes or Support Vector Machines, though you can specify other parameters such as `use_stemmer` and `strip_stopwords`. Custom extractor modules are coming soon.
2. That MD5 hash almost provided the solution; each row of the output gets a hash that corresponds to a single input row, so it seemed like the hash was meant to be used to be able to map inputs to outputs. Provided that I knew that all of my inputs were non-empty strings, which are filtered out before they can be sent to the API, and could be classified I could have nested the output based on its MD5 sum and mapped the indices of the inputs and the outputs 1:1. The trouble was that I knew that my input data would be changing and I wasn’t convinced that all of my inputs would be receive well-formed responses from the API. If some of the text couldn’t receive a corresponding set of classification, such a nested output would have fewer rows than the input vector’s length. There would be no way to tell which input corresponded to which nested output.
3. Keywords in commits don’t automatically close issues until they’re merged into master, and since we were working off of dev for quite a long time, if we relied on keywords to automatically close issues our Open iIsues list wouldn’t accurately reflect the issues that we actually still had to address. Would be cool for GitHub to allow flags like maybe “fixes #33 –dev” could close issue #33 when the PR with that phrase in the commit was merged into dev.