11 Jobs for R users from around the world (2018-06-19)

By Tal Galili


To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Full-Time
    Research Fellow UC Hastings Institute for Innovation Law – Posted by feldmanr
    San Francisco California, United States
    19 Jun 2018
  2. Full-Time
    Technical Support Engineer at RStudio RStudio – Posted by agadrow
    Anywhere
    19 Jun 2018
  3. Full-Time
    Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
    New York New York, United States
    15 Jun 2018
  4. Full-Time
    Data Scientist / Senior Strategic Analyst (Communications & Marketing) Memorial Sloan Kettering Cancer Center – Posted by MSKCC
    New York New York, United States
    30 May 2018
  5. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    2 May 2018

All New R Jobs

  1. Full-Time
    Research Fellow UC Hastings Institute for Innovation Law – Posted by feldmanr
    San Francisco California, United States
    19 Jun 2018
  2. Full-Time
    Technical Support Engineer at RStudio RStudio – Posted by agadrow
    Anywhere
    19 Jun 2018
  3. Full-Time
    postdoc in psychiatry: machine learning in human genomics University of Iowa – Posted by michaelsonlab
    Anywhere
    18 Jun 2018
  4. Full-Time
    Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
    New York New York, United States
    15 Jun 2018
  5. Full-Time
    Research Data Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    15 Jun 2018
  6. Full-Time
    Data Scientist / Senior Strategic Analyst (Communications & Marketing) Memorial Sloan Kettering Cancer Center – Posted by MSKCC
    New York New York, United States
    30 May 2018
  7. Full-Time
    Market Research Analyst: Mobility for RSG RSG – Posted by patricia.holland@rsginc.com
    Anywhere
    25 May 2018
  8. Full-Time
    Data Scientist @ New Delhi, India Amulozyme Inc. – Posted by Amulozyme
    New Delhi Delhi, India
    25 May 2018
  9. Full-Time
    Data Scientist/Programmer @ Milwaukee, Wisconsin, U.S. ConsensioHealth – Posted by ericadar
    Milwaukee Wisconsin, United States
    25 May 2018
  10. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    2 May 2018
  11. Full-Time
    Lead Data Scientist @ Washington, District of Columbia, U.S. AFL-CIO – Posted by carterkalchik
    Washington District of Columbia, United States
    27 Apr 2018

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)

Source:: R News

Copy/paste t-tests Directly to Manuscripts

By Dominique Makowski

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

One of the most time-consuming parts of data analysis in psychology is copy-pasting specific values from R output into a manuscript or report. This task is frustrating, prone to errors, and increases the variability of statistical reporting. This is an important issue, as standardizing what to report and how to report it might be key to overcoming the reproducibility crisis in psychology. The psycho package was designed specifically for this job: at first for complex Bayesian mixed models, but the package is now compatible with basic methods, such as t-tests.

Do a t-test

# Load packages
library(tidyverse)

# devtools::install_github("neuropsychology/psycho.R")  # Install the latest psycho version
library(psycho)

df <- psycho::affective  # Load the data

results <- t.test(df$Age ~ df$Sex)  # Perform a simple t-test

APA formatted output

You simply run the analyze() function on the t-test object.

psycho::analyze(results)
The Welch Two Sample t-test suggests that the difference of df$Age by df$Sex (mean in group F = 26.78, mean in group M = 27.45, difference = -0.67) is not significant (t(360.68) = -0.86, 95% CI [-2.21, 0.87], p > .1).

Flexibility

It works for all the different t-test variants.

t.test(df$Adjusting ~ df$Sex,
       var.equal=TRUE, 
       conf.level = .90) %>% 
  psycho::analyze()
The Two Sample t-test suggests that the difference of df$Adjusting by df$Sex (mean in group F = 3.72, mean in group M = 4.13, difference = -0.41) is significant (t(1249) = -4.13, 90% CI [-0.58, -0.25], p < .001).
t.test(df$Adjusting,
       mu=0,
       conf.level = .90) %>% 
  psycho::analyze()
The One Sample t-test suggests that the difference between df$Adjusting (mean = 3.80) and mu = 0 is significant (t(1250) = 93.93, 90% CI [3.74, 3.87], p < .001).

Dataframe of Values

It is also possible to have all the values stored in a dataframe by running a summary on the analyzed object.

library(tidyverse)

t.test(df$Adjusting ~ df$Sex) %>% 
  psycho::analyze() %>% 
  summary()
effect statistic df p CI_lower CI_higher
-0.4149661 -4.067008 377.8364 5.8e-05 -0.6155884 -0.2143439
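For a rough picture of where these numbers come from, the same values can also be read straight off the base htest object. Below is a minimal sketch with simulated data (stand-ins for the Age-by-Sex comparison, not the psycho::affective dataset) that assembles an APA-style string with sprintf():

```r
set.seed(42)
# simulated stand-ins for the two groups being compared
age_f <- rnorm(180, mean = 26.8, sd = 7)
age_m <- rnorm(180, mean = 27.4, sd = 7)

tt <- t.test(age_f, age_m)  # Welch two-sample t-test (an htest object)

# pull the reported values straight from the htest fields
apa_line <- sprintf("t(%.2f) = %.2f, 95%% CI [%.2f, %.2f], p = %.3f",
                    tt$parameter, tt$statistic,
                    tt$conf.int[1], tt$conf.int[2], tt$p.value)
apa_line
```

This is only a fallback: analyze() additionally applies reporting conventions (e.g. printing p > .1) and writes the verbal template for you.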

Contribute

Of course, these reporting standards should change depending on new expert recommendations or official guidelines. The goal of this package is to flexibly adapt to such changes and accompany the evolution of best practices. Therefore, if you have any advice, opinions or ideas, we encourage you to let us know by opening an issue or, even better, to implement the changes yourself by contributing to the code.

Credits

Did this package help you? Don’t forget to cite the various packages you used 🙂

You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: An Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

Previous blogposts

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…


R vs Python: Image Classification with Keras

By Dmitry Kisler

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Even though libraries for calling R from Python, or Python from R, have existed for years, and despite the recent announcement of the Ursa Labs foundation by Wes McKinney, who aims to join forces with RStudio (Hadley Wickham in particular; find more here) to improve data scientists’ workflows and to unify libraries for use not only in Python but in any language data scientists work with, some data professionals are still very strict about the language used for ANN models, limiting their dev. environment exclusively to Python.

As a continuation of my R vs. Python comparison, I decided to test the performance of both languages in terms of the time required to train a convolutional neural network based model for image recognition. As a starting point, I took the blog post by Dr. Shirin Glander on how easy it is to build a CNN model in R using Keras.

A few words about Keras. It is a Python library for artificial neural network ML models which provides a high-level frontend to various deep learning frameworks, with Tensorflow being the default one.
Keras has many pros, with the following among others:

  • Easy to build complex models in just a few lines of code => perfect for a dev. cycle of quickly experimenting and checking your ideas
  • Code recycling: one can easily swap the backend framework (let’s say from CNTK to Tensorflow or vice versa) => DRY principle
  • Seamless use of GPU => perfect for fast model tuning and experimenting

Since Keras is written in Python, it may be natural to choose Python for your dev. environment. And that was the case until about a year ago, when RStudio founder J.J. Allaire announced the release of the Keras library for R in May 2017. I consider this a turning point for data scientists: now we can be more flexible with our dev. environment and deliver results more efficiently, with the opportunity to extend existing solutions we may have written in R.

This brings me to the point of this post.
My hypothesis is that, when it comes to building ANN ML models with Keras, Python is not a must: depending on your client’s request or tech stack, R can be used without limitations and with similar efficiency.

Image Classification with Keras

In order to test my hypothesis, I am going to perform image classification using the fruit images data from Kaggle and train a CNN model with four hidden layers: two 2D convolutional layers, one pooling layer and one dense layer. RMSProp is used as the optimizer function.

Tech stack

Hardware:
CPU: Intel Core i7-7700HQ: 4 cores (8 threads), 2800 – 3800 (Boost) MHz core clock
GPU: Nvidia Geforce GTX 1050 Ti Mobile: 4Gb vRAM, 1493 – 1620 (Boost) MHz core clock
RAM: 16 Gb

Software:
OS: Linux Ubuntu 16.04
R: ver. 3.4.4
Python: ver. 3.6.3
Keras: ver. 2.2
Tensorflow: ver. 1.7.0
CUDA: ver. 9.0 (note that the current tensorflow version supports ver. 9.0 while the up-to-date version of cuda is 9.2)
cuDNN: ver. 7.0.5 (note that the current tensorflow version supports ver. 7.0 while the up-to-date version of cuDNN is 7.1)

Code

The R and Python code snippets used for CNN model building are presented below. Thanks to the fruitful collaboration between F. Chollet and J.J. Allaire, the logic and function names in R closely resemble the Python ones.

R

## Courtesy: Dr. Shirin Glander. Code source: https://shirinsplayground.netlify.com/2018/06/keras_fruits/

library(keras)
start <- Sys.time()

# fruits categories
fruit_list <- c("Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange",
                "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate")
# number of output classes (i.e. fruits)
output_n <- length(fruit_list)
# image size to scale down to (original images are 100 x 100 px)
img_width <- 20
img_height <- 20
target_size <- c(img_width, img_height)
# image RGB channels number
channels <- 3
# path to image folders
path <- "path/to/folder/with/data"
train_image_files_path <- paste0(path, "fruits-360/Training")
valid_image_files_path <- paste0(path, "fruits-360/Test")

## input data augmentation/modification
# training images
train_data_gen <- image_data_generator(
  rescale = 1/255
)

# Validation data shouldn't be augmented! But it should also be scaled.
valid_data_gen <- image_data_generator(
  rescale = 1/255
)

## getting data
# training images
train_image_array_gen <- flow_images_from_directory(train_image_files_path,
                                                    train_data_gen,
                                                    target_size = target_size,
                                                    class_mode = 'categorical',
                                                    classes = fruit_list,
                                                    seed = 42)
# validation images
valid_image_array_gen <- flow_images_from_directory(valid_image_files_path,
                                                    valid_data_gen,
                                                    target_size = target_size,
                                                    class_mode = 'categorical',
                                                    classes = fruit_list,
                                                    seed = 42)

## model definition
# number of training samples
train_samples <- train_image_array_gen$n
# number of validation samples
valid_samples <- valid_image_array_gen$n
# define batch size and number of epochs
batch_size <- 32
epochs <- 10

# initialise model
model <- keras_model_sequential()

model %>%
  # First hidden layer
  layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = 'same', input_shape = c(img_width, img_height, channels)) %>%
  layer_activation('relu') %>%
  
  # Second hidden layer
  layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = 'same') %>%
  layer_activation_leaky_relu(0.5) %>%
  layer_batch_normalization() %>%
  
  # Use max pooling
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%
  
  # Flatten max filtered output into feature vector 
  # and feed into dense layer
  layer_flatten() %>%
  layer_dense(100) %>%
  layer_activation('relu') %>%
  layer_dropout(0.5) %>%
  
  # Outputs from dense layer are projected onto output layer
  layer_dense(output_n) %>% 
  layer_activation('softmax')

# compile
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6),
  metrics = 'accuracy'
)
# fit
hist <- model %>% fit_generator(
  # training data
  train_image_array_gen,

  # epochs
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = epochs,

  # validation data
  validation_data = valid_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size),

  # print progress
  verbose = 2,
  callbacks = list(
    # save best model after every epoch
    callback_model_checkpoint("fruits_checkpoints.h5", save_best_only = TRUE),
    # only needed for visualising with TensorBoard
    callback_tensorboard(log_dir = "fruits_logs")
  )
)

df_out <- hist$metrics %>% 
  {data.frame(acc = .$acc[epochs], val_acc = .$val_acc[epochs], elapsed_time = as.integer(Sys.time()) - as.integer(start))}

Python

## Author: D. Kisler - adaptation of the R code from https://shirinsplayground.netlify.com/2018/06/keras_fruits/

from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import (Conv2D,
                          Dense,
                          LeakyReLU,
                          BatchNormalization, 
                          MaxPooling2D, 
                          Dropout,
                          Flatten)
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, TensorBoard
import PIL.Image
from datetime import datetime as dt

start = dt.now()

# fruits categories
fruit_list = ["Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange",
                "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate"]
# number of output classes (i.e. fruits)
output_n = len(fruit_list)
# image size to scale down to (original images are 100 x 100 px)
img_width = 20
img_height = 20
target_size = (img_width, img_height)
# image RGB channels number
channels = 3
# path to image folders
path = "path/to/folder/with/data"
train_image_files_path = path + "fruits-360/Training"
valid_image_files_path = path + "fruits-360/Test"

## input data augmentation/modification
# training images
train_data_gen = ImageDataGenerator(
  rescale = 1./255
)
# validation images
valid_data_gen = ImageDataGenerator(
  rescale = 1./255
)

## getting data
# training images
train_image_array_gen = train_data_gen.flow_from_directory(train_image_files_path,                                                            
                                                           target_size = target_size,
                                                           classes = fruit_list, 
                                                           class_mode = 'categorical',
                                                           seed = 42)

# validation images
valid_image_array_gen = valid_data_gen.flow_from_directory(valid_image_files_path, 
                                                           target_size = target_size,
                                                           classes = fruit_list,
                                                           class_mode = 'categorical',
                                                           seed = 42)

## model definition
# number of training samples
train_samples = train_image_array_gen.n
# number of validation samples
valid_samples = valid_image_array_gen.n
# define batch size and number of epochs
batch_size = 32
epochs = 10

# initialise model
model = Sequential()

# add layers
# input layer
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same', input_shape = (img_width, img_height, channels), activation = 'relu'))
# hidden conv layer
model.add(Conv2D(filters = 16, kernel_size = (3,3), padding = 'same'))
model.add(LeakyReLU(.5))
model.add(BatchNormalization())
# using max pooling
model.add(MaxPooling2D(pool_size = (2,2)))
# randomly switch off 25% of the nodes per epoch step to avoid overfitting
model.add(Dropout(.25))
# flatten max filtered output into feature vector
model.add(Flatten())
# output features onto a dense layer
model.add(Dense(units = 100, activation = 'relu'))
# randomly switch off 50% of the nodes per epoch step to avoid overfitting
model.add(Dropout(.5))
# output layer with the number of units equal to the number of categories
model.add(Dense(units = output_n, activation = 'softmax'))

# compile the model
model.compile(loss = 'categorical_crossentropy', 
              metrics = ['accuracy'], 
              optimizer = RMSprop(lr = 1e-4, decay = 1e-6))

# train the model
hist = model.fit_generator(
  # training data
  train_image_array_gen,

  # epochs
  steps_per_epoch = train_samples // batch_size, 
  epochs = epochs, 

  # validation data
  validation_data = valid_image_array_gen,
  validation_steps = valid_samples // batch_size,

  # print progress
  verbose = 2,
  callbacks = [
    # save best model after every epoch
    ModelCheckpoint("fruits_checkpoints.h5", save_best_only = True),
    # only needed for visualising with TensorBoard
    TensorBoard(log_dir = "fruits_logs")
  ]
)

df_out = {'acc': hist.history['acc'][epochs - 1], 'val_acc': hist.history['val_acc'][epochs - 1], 'elapsed_time': (dt.now() - start).seconds}

Experiment

The models above were trained 10 times each in R and in Python, on GPU and on CPU; the elapsed time and the final accuracy after 10 epochs were measured. The results of the measurements are presented in the plots below (click a plot to be redirected to the interactive plotly version).

From the plots above, one can see that:

  • the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance, and test accuracy is around 99% as well).
  • even though 10 measurements may not be conclusive, Python reduces the time required to train this CNN model by up to 15%. This is somewhat expected, because R uses Python under the hood when executing Keras functions.

Let’s perform an unpaired t-test, assuming that all our observations are normally distributed.

library(dplyr)
library(data.table)
# fetch the data used to plot graphs
d <- fread("...") %>%
    group_by(lang, eng) %>%
    summarise(el_mean = mean(elapsed_time),
              el_std = sd(elapsed_time),
              n = n()) %>% data.frame() %>%
    group_by(eng) %>%
    summarise(t_score = round(diff(el_mean)/sqrt(sum(el_std^2/n)), 2))
eng t_score
cpu 11.38
gpu 9.64

The t-scores reflect a significant difference between the time required to train a CNN model in R and in Python, as we saw in the plots above.
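The t-score in the pipeline above is the classic unpaired statistic computed from summary statistics: the difference of means over the square root of the summed squared standard errors. Here is a quick sanity check against R’s built-in Welch test, using simulated timings (illustrative values, not the benchmark data):

```r
set.seed(7)
r_times  <- rnorm(10, mean = 360, sd = 20)  # hypothetical elapsed times, seconds
py_times <- rnorm(10, mean = 310, sd = 20)

# unpaired t-score from summary statistics, as in the dplyr pipeline above
t_manual <- (mean(r_times) - mean(py_times)) /
  sqrt(sd(r_times)^2 / 10 + sd(py_times)^2 / 10)

# matches the statistic of the Welch two-sample t-test
t_welch <- unname(t.test(r_times, py_times)$statistic)
all.equal(t_manual, t_welch)
```

With 10 runs per group the statistic has few degrees of freedom, which is why the post hedges on how convincing the measurements are.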

Summary

Building and training a CNN model in R using Keras is as “easy” as in Python, with the same coding logic and function naming conventions. The final accuracy of your Keras model will depend on the neural net architecture, hyperparameter tuning, training duration, the amount of train/test data, etc., but not on the programming language you use for your DS project. Training a CNN Keras model in Python may be up to 15% faster compared to R.

P.S.

If you would like to know more about Keras and to be able to build models with this awesome library, I recommend these books:

As well as this Udemy course to start your journey with Keras.

Thanks a lot for your attention! I hope this post helps aspiring data scientists gain an understanding of the use cases for different technologies and avoid bias when selecting tools for their DS projects.

    Related Post

    1. Update: Can we predict flu outcome with Machine Learning in R?
    2. Evaluation of Topic Modeling: Topic Coherence
    3. Natural Language Generation with Markovify in Python
    4. Anomaly Detection in R – The Tidy Way
    5. Forecasting with ARIMA – Part I
    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


    Statistics Sunday: Accessing the YouTube API with tuber

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

    I haven’t had a lot of time to play with this, but yesterday I discovered the tuber R package, which allows you to interact with the YouTube API.

    To use the tuber package, not only do you need to install it in R – you’ll also need a Google account and will have to authorize 4 APIs through the Developer Console: all 3 YouTube APIs (though the Data API will be doing the heavy lifting) and the Freebase API. Before you authorize the first API, Google will have you create a project to tie the APIs to. Then you’ll find the APIs in the API Library to add to this project. Click on each API and, on the next screen, select Enable. You’ll need to create credentials for each of the YouTube APIs. When asked to identify the type of app that will be accessing the YouTube API, select “Other”.

    The tuber package requires two pieces of information from you, which you’ll get when you set up credentials for the OAuth 2.0 Client: client ID and client secret. Once you set those up, you can download them at any time in JSON format by going to the Credentials dashboard and clicking the download button on the far right.

    In R, load the tuber package, then call the yt_oauth function, using the client ID (which should end with something like “apps.googleusercontent.com”) and client secret (a string of letters and numbers). R will launch a browser window to authorize tuber to access the APIs. Once that’s done, you’ll be able to use the tuber package to, for instance, access data about a channel or get information about a video. My plan is to use my Facebook dataset to pull out the head songs I’ve shared and get the video information to generate a dataset of my songs. Look for more on that later. In the meantime, this great post on insightr can give you some details about using the tuber package.

    [Apologies for the short and late post – I’ve been sick and haven’t had as much time to write recently. Hopefully I’ll get back to normal over the next week.]

    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.


    Effectively scaling Shiny in enterprise

    By Mango Solutions

    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    James Blair, RStudio

    Scalability is a hot word these days, and for good reason. As data continues to grow in volume and importance, the ability to reliably access and reason about that data increases in importance. Enterprises expect data analysis and reporting solutions that are robust and allow several hundred, even thousands, of concurrent users while offering up-to-date security options.

    Shiny is a highly flexible and widely used framework for creating web applications using R. It enables data scientists and analysts to create dynamic content that provides straightforward access to their work for those with no working knowledge of R. While Shiny has been around for quite some time, recent introductions to the Shiny ecosystem make Shiny simpler and safer to deploy in an enterprise environment where security and scalability are paramount. These new tools in connection with RStudio Connect provide enterprise grade solutions that make Shiny an even more attractive option for data resource creation.

    Develop and Test

    Most Shiny applications are developed either locally on a personal computer or using an instance of RStudio Server. During development it can be helpful to understand application performance, specifically if there are any concerning bottlenecks. The profvis package provides functions for profiling R code and can profile the performance of Shiny applications. profvis provides a breakdown of code performance and can be useful for identifying potential areas for improving application responsiveness.

    The recently released promises package provides asynchronous capabilities to Shiny applications. Asynchronous programming can be used to improve application responsiveness when several concurrent users are accessing the same application. While there is some overhead involved in creating asynchronous applications, this method can improve application responsiveness.

    Once an application is fully developed and ready to be deployed, it’s useful to establish a set of behavioral expectations. These expectations can be used to ensure that future updates to the application don’t break or unexpectedly change behavior. Traditionally, most testing of Shiny applications has been done by hand, which is both time-consuming and error prone. The new shinytest package provides a clean interface for testing Shiny applications. Once an application is fully developed, a set of tests can be recorded and stored to compare against future application versions. These tests can be run programmatically and can even be used with continuous integration (CI) platforms. Robust testing for Shiny applications is a huge step forward in increasing the deployability and dependability of such applications.

    Deploy

    Once an application has been developed and tested to satisfaction, it must be deployed to a production environment in order to provide other users with application access. Production deployment of data resources within an enterprise centers on control. For example, access control and user authentication are important for controlling who has access to the application. Server resource control and monitoring are important for controlling application performance and server stability. These control points enable trustworthy and performant deployment.

    There are a few current solutions for deploying Shiny applications. Shiny Server provides both an open source and professional framework for publishing Shiny applications and making them available to a wide audience. The professional version provides features that are attractive for enterprise deployment, such as user authentication. RStudio Connect is a recent product from RStudio that provides several enhancements to Shiny Server. Specifically, RStudio Connect supports push button deployment and natively handles application dependencies, both of which simplify the deployment process. RStudio Connect also places resource control in the hands of the application developer, which lightens the load on system administrators and allows the developer to tune app performance to align with expectations and company priorities.

    Scale

    In order to be properly leveraged, a deployed application must scale to meet user demand. In some instances, applications will have low concurrent users and will not need any additional help to remain responsive. However, it is often the case in large enterprises that applications are widely distributed and concurrently accessed by several hundred or even thousands of users. RStudio Connect provides the ability to set up a cluster of servers to provide high availability (HA) and load balanced configurations in order to scale applications to meet the needs of concurrent users. Shiny itself has been shown to effectively scale to meet the demands of 10,000 concurrent users!

    As businesses continue searching for ways to efficiently capture and digest growing stores of data, R in connection with Shiny continues to establish itself as a robust and enterprise ready solution for data analysis and reporting.

    To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


    How Much Money Should Machines Earn?

    By @aschinchon

    (This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

    Every inch of sky’s got a star
    Every inch of skin’s got a scar
    (Everything Now, Arcade Fire)

    I think that a very good way to start with R is to do an interactive visualization of some open data, because you will train many important skills of a data scientist: loading, cleaning, transforming and combining data, and performing a suitable visualization. Making it interactive will give you an idea of the power of R as well, because you will also realise that you are able to indirectly handle other programming languages such as JavaScript.

    That’s precisely what I’ve done today. I combined two interesting datasets:

    • The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford using a Gaussian process classifier, published in this paper in 2013.
    • Job statistics (employment, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.

    Apart from using dplyr to manipulate the data and highcharter to do the visualization, I used the tabulizer package to extract the table of computerisation probabilities from the PDF: it makes this task extremely easy.

    This is the resulting plot:

    If you want to examine it in depth, here you have a full size version.

    These are some of my insights (the corresponding figures are obtained directly from the dataset):

    • There is a moderate negative correlation between wages and probability of computerisation.
    • Around 45% of US jobs are threatened by machines (i.e. have a computerisation probability higher than 80%): half of them do not require formal education for entry.
    • In fact, 78% of jobs which do not require formal education for entry are threatened by machines; 0% of those which require a master’s degree are.
    • Teachers are absolutely irreplaceable (0% are threatened by machines), but they earn 2.2% less than the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).
    • Don’t study to become a librarian or archivist: it seems a bad way to invest your time.
    • Mathematicians will survive the machines.

    The code of this experiment is available here.

    To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.


    Version 0.6-11 of NIMBLE released

    By Chris Paciorek

    (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

    We’ve released the newest version of NIMBLE on CRAN and on our website.

    Version 0.6-11 has important new features, notably support for Bayesian nonparametric mixture modeling, and more are on the way in the next few months.

    New features include:

    • support for Bayesian nonparametric mixture modeling using Dirichlet process mixtures, with specialized MCMC samplers automatically assigned in NIMBLE’s default MCMC (See Chapter 10 of the manual for details);
    • additional resampling methods available with the auxiliary and bootstrap particle filters;
    • user-defined filtering algorithms can be used with NIMBLE’s particle MCMC samplers;
    • MCMC thinning intervals can be modified at MCMC run-time;
    • both runMCMC() and nimbleMCMC() now drop burn-in samples before thinning, making their behavior consistent with each other;
    • increased functionality for the ‘setSeed’ argument in nimbleMCMC() and runMCMC();
    • new functionality for specifying the order in which sampler functions are executed in an MCMC; and
    • invalid dynamic indexes now result in a warning and NaN values but do not cause execution to error out, allowing MCMC sampling to continue.

    Please see the NEWS file in the installed package for more details.

    To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE.


    Prediction Interval, the wider sister of Confidence Interval

    By Roos Colman

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    In this post, I will illustrate the use of prediction intervals for the comparison of measurement methods. In the example, a new spectral method for measuring whole blood hemoglobin is compared with a reference method. But first, let’s start with discussing the large difference between a confidence interval and a prediction interval.

    Prediction interval versus Confidence interval

    Very often a confidence interval is misinterpreted as a prediction interval, leading to unrealistically “precise” predictions. As you will see, prediction intervals (PI) resemble confidence intervals (CI), but the width of the PI is by definition larger than the width of the CI.

    Let’s assume that we measure the whole blood hemoglobin concentration in a random sample of 100 persons. We obtain the estimated mean (Est_mean), limits of the confidence interval (CI_Lower and CI_Upper) and limits of the prediction interval (PI_Lower and PI_Upper): (The R-code to do this is in the next paragraph)

    Est_mean CI_Lower CI_Upper PI_Lower PI_Upper
    140 138 143 113 167

    The graphical presentation:

    A confidence interval (CI) is an interval of good estimates of the unknown true population parameter. For a 95% confidence interval for the mean, we can state that if we were to repeat our sampling process indefinitely, 95% of the constructed confidence intervals would contain the true population mean. In other words, there is a 95% chance of selecting a sample such that the 95% confidence interval calculated from that sample contains the true population mean.

    Interpretation of the 95% confidence interval in our example:

    • The values contained in the interval [138 g/L to 143 g/L] are good estimates of the unknown mean whole blood hemoglobin concentration in the population. In general, if we were to repeat our sampling process indefinitely, 95% of the confidence intervals so constructed would contain the true mean hemoglobin concentration.

    A prediction interval (PI) is an estimate of an interval in which a future observation will fall, with a certain confidence level, given the observations already made. For a 95% prediction interval, we can state that if we were to repeat our sampling process indefinitely, 95% of the constructed prediction intervals would contain the new observation.

    Interpretation of the 95% prediction interval in the above example:

    • Given the observed whole blood hemoglobin concentrations, the whole blood hemoglobin concentration of a new sample will be between 113 g/L and 167 g/L with a confidence of 95%. In general, if we were to repeat our sampling process indefinitely, 95% of the prediction intervals so constructed would contain the new hemoglobin concentration measurement.

    Remark: Very often we read the interpretation “The whole blood hemoglobin concentration of a new sample will be between 113g/L and 167g/L with a probability of 95%” (for example, on Wikipedia). This interpretation is correct only in the theoretical situation where the parameters (the true mean and standard deviation) are known.
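    In that theoretical situation the interval follows directly from the normal distribution. A quick sketch (the mean and standard deviation below are illustrative, not estimated from data):

    ```r
    # If the true mean and standard deviation were known, the 95% interval for a
    # new observation would simply be mu +/- 1.96 * sigma:
    mu <- 140     # illustrative "known" mean
    sigma <- 14   # illustrative "known" standard deviation

    c(lower = mu - qnorm(0.975) * sigma,
      upper = mu + qnorm(0.975) * sigma)
    ```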

    Estimating a prediction interval in R

    First, let’s simulate some data. The sample size in the plot above was (n=100). Now, to see the effect of the sample size on the width of the confidence interval and the prediction interval, let’s take a “sample” of 400 hemoglobin measurements using the same parameters:

    set.seed(123)
    hemoglobin <- rnorm(400, mean = 140, sd = 14)  # mean and sd are illustrative, chosen to roughly match the intervals shown below
    df <- data.frame(hemoglobin)

    Although we don’t need a linear regression yet, I’d like to use the lm() function, which makes it very easy to construct a confidence interval (CI) and a prediction interval (PI). We can estimate the mean by fitting a “regression model” with an intercept only (no slope). The default confidence level is 95%.

    Confidence interval:

    CI <- predict(lm(df$hemoglobin ~ 1), interval = "confidence")
    CI[1, ]
    ##      fit      lwr      upr 
    ## 139.2474 137.8425 140.6524
    

    The CI object has 400 rows (one per observation). But since there is no slope in our “model”, each row is exactly the same.
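    The widths predict() reports can be checked against the textbook formulas: the CI uses only the standard error of the mean, while the PI additionally includes the variance of a new observation. A self-contained sketch (simulation parameters are illustrative):

    ```r
    set.seed(123)
    x <- rnorm(400, mean = 140, sd = 14)  # illustrative parameters

    n <- length(x)
    xbar <- mean(x)
    s <- sd(x)                # equals the residual SE of an intercept-only lm
    tq <- qt(0.975, df = n - 1)

    # CI: xbar +/- t * s / sqrt(n);  PI: xbar +/- t * s * sqrt(1 + 1/n)
    CI_manual <- xbar + c(-1, 1) * tq * s / sqrt(n)
    PI_manual <- xbar + c(-1, 1) * tq * s * sqrt(1 + 1/n)
    ```

    These reproduce, row for row, the limits returned by predict(lm(x ~ 1), interval = "confidence") and predict(lm(x ~ 1), interval = "prediction").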

    Prediction interval:

    PI <- predict(lm(df$hemoglobin ~ 1), interval = "predict")
    ## Warning in predict.lm(lm(df$hemoglobin ~ 1), interval = "predict"): predictions on current data refer to _future_ responses
    
    PI[1,]
    ##      fit      lwr      upr 
    ## 139.2474 111.1134 167.3815
    

    We get a “warning” that “predictions on current data refer to future responses”. That’s exactly what we want, so no worries there. As you see, the column names of the objects CI and PI are the same. Now, let’s visualize the confidence and the prediction interval.

    The code below is not very elegant but I like the result (tips are welcome :-))

    library(ggplot2)
    limits_CI 

    Gives this plot:

    The width of the confidence interval is very small, now that we have this large sample size (n=400). This is not surprising, as the estimated mean is the only source of uncertainty. In contrast, the width of the prediction interval is still substantial. The prediction interval has two sources of uncertainty: the estimated mean (just like the confidence interval) and the random variance of new observations.

    Example: comparing a new measurement method with a reference method.

    A prediction interval can be useful when a new method is intended to replace a standard (or reference) method.
    If, given the measurement by the new method, we can predict well enough what the measurement by the reference method would be, then the two methods provide similar information and the new method can be used.

    For example, in (Tian, 2017) a new spectral (near-infrared) method to measure hemoglobin is compared with a gold standard. In contrast with the gold standard method, the new spectral method does not require reagents. Moreover, the new method is faster. We will investigate whether, based on the concentration measured by the new method, we can predict well enough what the measurement by the gold standard would be. (Note: the measured concentrations presented below are fictitious.)

    head(Hb)
    New Reference
    84.96576 87.24013
    99.91483 103.38143
    111.79984 116.71593
    116.95961 116.72065
    118.11140 113.51541
    118.21411 121.70586
    plot(Hb$New, Hb$Reference, 
         xlab="Hemoglobin concentration (g/L) - new method", 
         ylab="Hemoglobin concentration (g/L) - reference method")
    

    Gives this plot:

    Prediction Interval based on linear regression

    Now, let’s fit a linear regression model predicting the hemoglobin concentrations measured by the reference method, based on the concentrations measured with the new method.

    fit.lm <- lm(Reference ~ New, data = Hb)

    Adding the regression line:

    abline(a = fit.lm$coefficients[1], b = fit.lm$coefficients[2])
    

    Adding the identity line:

    abline (a=0, b=1, lty=2)
    

    The plot:

    If both measurement methods corresponded exactly, the intercept would be zero and the slope would be one (the “identity line”, dotted line). Now, let’s calculate the confidence interval for this linear regression.

    CI_ex <- predict(fit.lm, interval = "confidence")

    And the prediction interval:

    PI_ex <- predict(fit.lm, interval = "prediction")
    ## Warning in predict.lm(fit.lm, interval = "prediction"): predictions on current data refer to _future_ responses
    
    colnames(CI_ex) <- c("fit_CI", "lwr_CI", "upr_CI")
    colnames(PI_ex) <- c("fit_PI", "lwr_PI", "upr_PI")

    We can combine those results in one data frame and plot both the confidence interval and the prediction interval.

    Hb_results <- cbind(Hb, CI_ex, PI_ex)
    head(round(Hb_results), 1)
    New Reference fit_CI lwr_CI upr_CI fit_PI lwr_PI upr_PI
    85 87 91 87 94 91 82 99

    Visualizing the CI (in green) and the PI (in blue):

    plot(Hb$New, Hb$Reference, 
         xlab="Hemoglobin concentration (g/L) - new method", 
         ylab="Hemoglobin concentration (g/L) - reference method")
    Hb_results_s <- Hb_results[order(Hb_results$New), ]
    lines(Hb_results_s$New, Hb_results_s$lwr_CI, col = "#8AB63F", lwd = 2)
    lines(Hb_results_s$New, Hb_results_s$upr_CI, col = "#8AB63F", lwd = 2)
    lines(Hb_results_s$New, Hb_results_s$lwr_PI, col = "#1A425C", lwd = 2)
    lines(Hb_results_s$New, Hb_results_s$upr_PI, col = "#1A425C", lwd = 2)

    Gives this plot:

    In (Bland, Altman 2003) it is proposed to calculate the average width of this prediction interval and judge whether it is acceptable. Another approach is to compare the calculated PI with an “acceptance interval”: if the PI lies inside the acceptance interval over the measurement range of interest, the methods agree sufficiently (see Francq, 2016).
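    Bland and Altman’s check can be computed directly from the prediction matrix. A self-contained sketch with simulated concentrations (with the real data, the prediction matrix would come from fit.lm on the Hb data frame above):

    ```r
    set.seed(1)
    Hb_demo <- data.frame(New = rnorm(50, mean = 120, sd = 15))    # illustrative
    Hb_demo$Reference <- Hb_demo$New + rnorm(50, sd = 4)           # illustrative

    PI_demo <- suppressWarnings(
      predict(lm(Reference ~ New, data = Hb_demo), interval = "prediction"))

    # Average width of the prediction interval, in g/L:
    mean(PI_demo[, "upr"] - PI_demo[, "lwr"])
    ```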

    In the above example, both methods do have the same measurement scale (g/L), but the linear regression with prediction interval is particularly useful when the two methods of measurement have different units.

    However, the method has some disadvantages:

    • Prediction intervals are very sensitive to deviations from the normal distribution.
    • In “standard” linear regression (ordinary least squares, OLS), measurement error is allowed for in the Y-variable (here, the reference method) but not in the X-variable (the new method): the absence of error on the x-axis is one of the assumptions. Since we can expect some measurement error for the new method, this assumption is violated here.
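    The consequence of violating that assumption can be seen in a small simulation (all values illustrative): measurement error in the X-variable biases the OLS slope towards zero (attenuation).

    ```r
    set.seed(42)
    true_conc <- rnorm(200, mean = 140, sd = 15)   # true concentrations
    reference <- true_conc + rnorm(200, sd = 3)    # error in Y is allowed for
    new_method <- true_conc + rnorm(200, sd = 8)   # error in X is not

    coef(lm(reference ~ true_conc))[2]    # close to the true slope of 1
    coef(lm(reference ~ new_method))[2]   # attenuated towards zero
    ```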

    Taking into account errors on both axes

    In contrast to Ordinary Least Square (OLS) regression, Bivariate Least Square (BLS) regression takes into account the measurement errors of both methods (the New method and the Reference method). Interestingly, prediction intervals calculated with BLS are not affected when the axes are switched (del Rio, 2001).

    In 2017, a new R-package BivRegBLS was released. It offers several methods to assess the agreement in method comparison studies, including Bivariate Least Square (BLS) regression.

    If the data are unreplicated but the variance of the measurement error of each method is known, the BLS() and XY.plot() functions can be used to fit a BLS regression line and the corresponding confidence and prediction intervals.

    library (BivRegBLS)
    Hb.BLS = BLS (data = Hb, xcol = c("New"), 
    ycol = c("Reference"), var.y=10, var.x=8, conf.level=0.95)
    XY.plot (Hb.BLS,
             yname = "Hemoglobin concentration (g/L) - reference method",
             xname = "Hemoglobin concentration (g/L) - new method",
             graph.int = c("CI","PI"))
    

    Gives this plot:

    Now we would like to decide whether the new method can replace the reference method. We allow the methods to differ by up to a given threshold that is not clinically relevant. Based on this threshold an “acceptance interval” is created. Suppose that differences up to 10 g/L (the threshold) are not clinically relevant; then the acceptance interval can be defined as Y = X ± Δ, with Δ equal to 10. If the PI is inside the acceptance interval for the measurement range of interest, then the two measurement methods can be considered interchangeable (see Francq, 2016).

    The accept.int argument of the XY.plot() function allows for a visualization of this acceptance interval:

    XY.plot (Hb.BLS,
             yname = "Hemoglobin concentration (g/L) - reference method",
             xname = "Hemoglobin concentration (g/L) - new method",
             graph.int = c("CI","PI"), 
             accept.int=10)
    

    Gives this plot:

    For the measurement region 120 g/L to 150 g/L, we can conclude that the difference between both methods is acceptable. If the measurement regions below 120 g/L and above 150 g/L are important, the new method cannot replace the reference method.

    Regression on replicated data

    In method comparison studies, it is advised to create replicates (2 or more measurements of the same sample with the same method). An example of such a dataset:

    head(Hb_rep)
    New_rep1 New_rep2 Ref_rep1 Ref_rep2
    88 95 90 84

    When replicates are available, the variances of the measurement errors are estimated for both the new and the reference method and used to estimate the prediction interval. Again, the BLS() and XY.plot() functions are used to estimate and plot the BLS regression line and the corresponding CI and PI.

    Hb_rep.BLS = BLS (data = Hb_rep, 
                      xcol = c("New_rep1", "New_rep2"), 
                      ycol = c("Ref_rep1", "Ref_rep2"), 
                      qx = 1, qy = 1, 
                      conf.level=0.95, pred.level=0.95)
    
    XY.plot (Hb_rep.BLS,
             yname = "Hemoglobin concentration (g/L) - reference method",
             xname = "Hemoglobin concentration (g/L) - new method",
             graph.int = c("CI","PI"),
             accept.int=10)
    

    Gives this plot:

    It is clear that the prediction interval is not inside the acceptance interval here. The new method cannot replace the reference method. A possible solution is to average the repeats. The BivRegBLS package can create prediction intervals for the mean of (2 or more) future values, too! More information in this presentation (presented at useR!2017).
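    Averaging the repeats before the regression can be sketched as follows (toy values mirroring the Hb_rep column layout above; the prediction intervals for future means would then come from BLS() itself):

    ```r
    # Toy replicate data with the same column layout as Hb_rep (values invented):
    Hb_rep_demo <- data.frame(New_rep1 = c(88, 100, 112),
                              New_rep2 = c(95, 104, 110),
                              Ref_rep1 = c(90, 103, 115),
                              Ref_rep2 = c(84, 101, 113))

    # Average the two repeats per method:
    Hb_avg <- data.frame(
      New = rowMeans(Hb_rep_demo[, c("New_rep1", "New_rep2")]),
      Reference = rowMeans(Hb_rep_demo[, c("Ref_rep1", "Ref_rep2")]))
    ```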

    In the plot above, averages of the two replicates are calculated and plotted. I’d like to see the individual measurements:

    plot(x=c(Hb_rep$New_rep1, Hb_rep$New_rep2),
         y=c(Hb_rep$Ref_rep1, Hb_rep$Ref_rep2),
         xlab="Hemoglobin concentration (g/L) - new method", 
         ylab="Hemoglobin concentration (g/L) - reference method")
    lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), 
           y=as.numeric(Hb_rep.BLS$Pred.BLS[,2]), 
           lwd=2)
    lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), 
           y=as.numeric(Hb_rep.BLS$Pred.BLS[,3]), 
           col="#8AB63F", lwd=2)
    lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), 
           y=as.numeric(Hb_rep.BLS$Pred.BLS[,4]), 
           col="#8AB63F", lwd=2)
    lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), 
           y=as.numeric(Hb_rep.BLS$Pred.BLS[,5]), 
           col="#1A425C", lwd=2)
    lines (x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), 
           y=as.numeric(Hb_rep.BLS$Pred.BLS[,6]), 
           col="#1A425C", lwd=2)
    abline (a=0, b=1, lty=2)
    

    Gives this plot:

    Remarks

    • Although not appropriate in the context of method comparison studies, Pearson correlation is still frequently used. See Bland & Altman (2003) for an explanation of why correlation is not advised.
    • Methods presented in this blog post are not applicable to time series.

    References

    • Confidence interval and prediction interval: Applied Linear Statistical Models, 2005, Michael Kutner, Christopher Nachtsheim, John Neter, William Li. Section 2.5
    • Prediction interval for method comparison:
      Bland, J. M. and Altman, D. G. (2003), Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol, 22: 85-93. doi:10.1002/uog.12
      section: “Appropriate use of regression”.
    • Francq, B. G., and Govaerts, B. (2016) How to regress and predict in a Bland-Altman plot? Review and contribution based on tolerance intervals and correlated-errors-in-variables models. Statist. Med., 35: 2328-2358. doi: 10.1002/sim.6872.
    • del Río, F. J., Riu, J. and Rius, F. X. (2001), Prediction intervals in linear regression taking into account errors on both axes. J. Chemometrics, 15: 773-788. doi:10.1002/cem.663
    • Example of a method comparison study: H. Tian, M. Li, Y. Wang, D. Sheng, J. Liu, and L. Zhang, “Optical wavelength selection for portable hemoglobin determination by near-infrared spectroscopy method,” Infrared Phys. Techn 86, 98-102 (2017). doi.org/10.1016/j.infrared.2017.09.004.
    • the predict() and lm() functions of R: Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.



      Interpreting machine learning models with the lime package for R

      By David Smith

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it’s difficult to answer the question, “why that label, and not one of the others?”.

      One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it’s impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, near the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al., and implemented in the R package lime by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).

      You can read about how the lime package works in the introductory vignette Understanding Lime, but this limerick by Mara Averick also sums things up nicely:

      There once was a package called lime,
      Whose models were simply sublime,
      It gave explanations for their variations,
      One observation at a time.

      “One observation at a time” is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.
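      A minimal sketch of that workflow (the model choice and data here are illustrative, not from the article; lime has built-in support for caret-trained models):

      ```r
      library(caret)  # lime supports caret models out of the box
      library(lime)

      # Train any classifier; rpart is used here purely for illustration.
      model <- train(Species ~ ., data = iris, method = "rpart")

      # Build an explainer from the training data, then explain two predictions:
      explainer <- lime(iris[, -5], model)
      explanation <- explain(iris[c(1, 51), -5], explainer,
                             n_labels = 1, n_features = 3)

      # One panel per observation, showing the features that most support
      # (or contradict) the predicted label:
      plot_features(explanation)
      ```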

      The lime package also works with text data: for example, you may have a model that classifies the sentiment of a paragraph of text as “negative”, “neutral” or “positive”. In that case, lime will determine the words in that paragraph which are most important to determining (or contradicting) the classification. The package also helpfully provides a Shiny app making it easy to test out different sentences and see the local effect of the model.

      To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.

      GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)

      To leave a comment for the author, please follow the link and comment on their blog: Revolutions.



      R Tip: Be Wary of “…”

      By John Mount

      (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

      R Tip: be wary of “...“.

      The following code example contains an easy error in using the R function unique().

      vec1 <- c("a", "b", "c")
      vec2 <- c("c", "d")
      unique(vec1, vec2)
      # [1] "a" "b" "c"
      

      Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R‘s “...” function signature feature.

      In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.

      Data science coding is often done in a rush (deadlines, wanting to see results, and so on). Instead of moving fast, let's take the time to think a bit about programming theory using a very small concrete issue. This lets us show how one can work a bit safer (saving time in the end), without sacrificing notational power or legibility.

      A confounding issue is: unique() failed to alert me of my mistake because, unique()‘s function signature (like so many R functions) includes a “...” argument. I might have been careless or in a rush, but it seems like unique was also in a rush and did not care to supply argument inspection.

      In R a function that includes a “...” in its signature will accept arbitrary arguments and values in addition to the ones explicitly declared and documented. There are three primary uses of “...” in R: accepting unknown arguments that are to be passed to later functions, building variadic functions, and forcing later arguments to be bound by name (my favorite use). Unfortunately, “...” is also often used to avoid documenting arguments and turns off a lot of very useful error checking.
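      The third use, forcing later arguments to be bound by name, deserves a quick illustration (the function f below is invented for the example):

      ```r
      # Any argument declared after "..." can only be matched by name:
      f <- function(x, ..., scale = 1) {
        x * scale
      }

      f(2, scale = 3)  # 6: 'scale' is bound by name
      f(2, 3)          # 2: the 3 is silently absorbed by "...", 'scale' stays 1
      ```

      The silent absorption in the second call is exactly the failure mode that bit us with unique() above.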

      An example of the “accepting unknown arguments” use is lapply(). lapply() passes what it finds in “...” to whatever function it is working with. For example:

      lapply(c("a", "b"), paste, "!", sep = "")
      # [[1]]
      # [1] "a!"
      # 
      # [[2]]
      # [1] "b!"
      

      Notice the arguments “"!", sep = ""” were passed on to paste(). Since lapply() can not know what function the user will use ahead of time it uses the “...” abstraction to forward arguments. Personally I never use this form and tend to write the somewhat more explicit and verbose style shown below.

      lapply(c("a", "b"), 
         function(vi) { paste(vi, "!", sep = "") })
      

      I feel this form is more readable, as the arguments are seen where they are actually used. (Note: this is a notional example; in practice we would use “paste0(c("a", "b"), "!")” to directly produce the result as a vector.)

      An example of using “...” to supply a variadic interface is paste() itself.

      paste("a", "b", "c")
      # [1] "a b c"
      

      Other important examples include list() and c(). In fact I like list() and c() as they take only a “...” and no other arguments. Being variadic is so important to list() and c() that it is essentially all they do. One can often separate out the variadic intent with lists or vectors, as in:

      paste(c("a", "b", "c"), collapse = " ")
      # [1] "a b c"
      

      Even I don’t write code such as the above (that is too long even for me), unless the values are coming from somewhere else (such as a variable). However, with wrapr‘s reduce/expand operator we can completely separate the collection of variadic arguments from their application. The notation looks like the following:

      library("wrapr")
      
      values <- c("a", "b", "c")
      values %.|% paste
      # [1] "a b c"
      

      Essentially reduce/expand calls variadic functions with items taken from a vector or list as individual arguments (allowing one to program easily over variadic functions). %.|% is intended to place values in the “...” slot of a function (the variadic term). It is designed for a more perfect world where when a function declares “...” in its signature it is then the only user facing part of the signature. This is hardly ever actually the case in R as common functions such as paste() and sum() have additional optional named arguments (which we are here leaving at their default values), whereas c() and list() are pure (take only “...“).
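      For example, with sum (a sketch; the wrapr package must be installed):

      ```r
      library("wrapr")

      values <- c(1, 2, 3)
      values %.|% sum   # intended to be equivalent to sum(1, 2, 3)
      ```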

      With a few non-standard (name capturing) and variadic value constructors one does not in fact need other functions to be name capturing or variadic. With such tools one can have these conveniences everywhere. For example we can convert our incorrect use of unique() into correct code using c().

      unique(c(vec1, vec2))
      # [1] "a" "b" "c" "d"
      

      In the above code roles are kept separate: c() is collecting values and unique() is applying a calculation. We don’t need a powerful “super unique” or “super union” function, unique() is good enough if we remember to use c().

      In the spirit of our earlier article on function argument names we have defined a convenience function wrapr::uniques() that enforces the use of value carrying arguments. With wrapr::uniques() if one attempts the mistake I have been discussing one immediately gets a signaling error (instead of propagating incorrect calculations forward).

      library("wrapr")
      
      uniques(c(vec1, vec2))
      # [1] "a" "b" "c" "d"
      
      uniques(vec1, vec2)
      # Error: wrapr::uniques unexpected arguments: vec2 
      

      With uniques() we either get the right answer, or we get immediately stopped at the mistaken step. This is a good way to work.

      To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

