Le Monde puzzle [#1053]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

An easy arithmetic Le Monde mathematical puzzle again:

  1. If coins come in units of 1, x, and y, what is the optimal value of (x,y) that minimises the number of coins representing an arbitrary price between 1 and 149?
  2. If the number of units is now four, what is the optimal choice?

The first question is fairly easy to code, with a function coinz returning the worst-case number of coins over the 149 prices for a given pair (x,y).
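
Here is a minimal brute-force sketch (not the original coinz), assuming the greedy count described further below, i.e. spending the larger denominations first and paying the remainder in unit coins:

coinz <- function(x, y, top = 149) {
  # greedy count: as many y's as possible, then x's, then unit coins
  greedy <- function(p) p %/% y + (p %% y) %/% x + (p %% y) %% x
  max(sapply(1:top, greedy))   # worst case over all prices 1..top
}

# exhaustive search over 1 < x < y for the pair minimising that worst case
best <- Inf
for (y in 3:149) for (x in 2:(y - 1)) {
  M <- coinz(x, y)
  if (M < best) { best <- M; sol <- c(x, y) }
}
best   # 12
sol    # 4 22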

This returns M=12 as the smallest achievable maximal number of coins, attained for x=4 and y=22, the worst price tag being 129. For the second question, one unit is necessarily 1 (!) and it only takes an extra loop over a third non-unit denomination, which returns M=8, with the other units taking several possible values:

[1] 40 11  3
[1] 41 11  3
[1] 55 15  4
[1] 56 15  4
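
A similar sketch for the four-unit case, still assuming greedy counting (the triples above list the three non-unit denominations in decreasing order):

coinz4 <- function(x, y, z, top = 149) {
  greedy <- function(p) {
    a <- p %/% z; p <- p %% z   # largest denomination first
    b <- p %/% y; p <- p %% y
    a + b + p %/% x + p %% x
  }
  max(sapply(1:top, greedy))
}

# brute force over all triples x < y < z (slow but straightforward)
M <- Inf; sols <- list()
for (z in 4:149) for (y in 3:(z - 1)) for (x in 2:(y - 1)) {
  m <- coinz4(x, y, z)
  if (m < M) { M <- m; sols <- list() }
  if (m == M) sols[[length(sols) + 1]] <- c(z, y, x)
}
M      # 8
sols   # includes (40,11,3), (41,11,3), (55,15,4), (56,15,4)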

A quick search revealed that this problem (or a variant of it) is solved in many places, from stackexchange (for an average number of coins rather than the maximal one; why an average? it hardly makes sense when looking at real prices), to a paper by Shalit calling for the 18¢ coin, to Freakonomics, to Wikipedia, although the latter is about finding the minimum number of coins summing up to a given value with fixed currency denominations (a knapsack problem). This Wikipedia page made me realise that my solution is not necessarily optimal: my code uses the remainders from the larger denominations, while there may be more efficient decompositions. For instance, running the following dynamic programming code

coz=function(x,y){
  minco=1:149    # with unit coins only, price p takes p coins
  for (p in 1:149) for (d in c(x,y)[c(x,y)<=p])
    minco[p]=min(minco[p], if (d==p) 1 else minco[p-d]+1)
  max(minco)}    # worst case over the 149 prices

returns the lower value of M=11 (with x=7,y=23) in the first case and M=7 in the second one.
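
The same exhaustive search over pairs as before, now wrapped around coz, should recover the first of these values:

best=Inf
for (y in 3:149) for (x in 2:(y-1)){
  M=coz(x,y)
  if (M<best){ best=M; sol=c(x,y) }}
best   # 11
sol    # 7 23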

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Source:: R News

PYPL Language Rankings: Python ranks #1, R at #7 in popularity

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The new PYPL Popularity of Programming Languages (June 2018) index ranks Python at #1 and R at #7.

Like the similar TIOBE language index, the PYPL index uses Google search activity to rank language popularity. PYPL, however, focuses on people searching for tutorials in the respective languages as a proxy for popularity. By that measure, Python has always been more popular than R (as you’d expect from a more general-purpose language), but both have been growing at similar rates. The chart below includes the three data-oriented languages tracked by the index (and note the vertical scale is logarithmic).

Another language ranking was also released recently: the annual KDnuggets Analytics, Data Science and Machine Learning Poll. These rankings, however, are derived not from search trends but by self-selected poll respondents, which perhaps explains the presence of Rapidminer at the #2 spot.

Kdnuggets

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We here at Win-Vector LLC have some really big news that we would like the R community’s help in sharing.

vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
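
As a rough illustration of that workflow (separate from the examples linked below, and using iris as a stand-in for messy real-world data), the in-memory version looks something like this:

library(vtreat)

# design a y-aware treatment plan for a numeric outcome (iris is just a stand-in dataset)
plan <- designTreatmentsN(iris,
                          varlist = c("Species", "Sepal.Width"),
                          outcomename = "Sepal.Length")

# apply the plan: the prepared frame is all numeric, free of NAs, and ready for modeling
iris_treated <- prepare(plan, iris)
head(iris_treated)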

Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as PostgreSQL, Amazon RedShift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.

We have some basic examples of the new vtreat capabilities here and here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Source:: R News

Intro To Time Series Analysis Part 2 :Exercises

By Biswarup Ghosh

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below, we will explore Time Series analysis further. The previous exercise set is here; please follow the series in sequence.
Answers to these exercises are available here.

Exercise 1

Load the AirPassengers data, check its class, and see the start and end of the series.

Exercise 2
Check the cycle of the AirPassengers time series.

Exercise 3

Create the lag plot using gglagplot from the forecast package and check how the relationship changes as the lag increases.

Exercise 4

Also plot the correlation for each of the lags; you can see that when the lag is above 6 the correlation drops, climbs up again at lag 12, and drops again at lag 18.

Exercise 5

Plot the histogram of AirPassengers using gghistogram from the forecast package.

Exercise 6

Use tsdisplay to plot the autocorrelation, the time series, and the partial autocorrelation together in the same plot.

Exercise 7

Find the outliers in the time series.
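
These are not the official answers (linked above); the sketch below just gathers the forecast-package functions involved, assuming the package is installed:

library(forecast)

data("AirPassengers")
class(AirPassengers)                             # Exercise 1: "ts"
start(AirPassengers); end(AirPassengers)         # Exercise 1: start and end of the series
frequency(AirPassengers); cycle(AirPassengers)   # Exercise 2: monthly cycle

gglagplot(AirPassengers)     # Exercise 3: lag plots for increasing lags
ggAcf(AirPassengers)         # Exercise 4: correlation at each lag
gghistogram(AirPassengers)   # Exercise 5: histogram of the series
tsdisplay(AirPassengers)     # Exercise 6: time series, ACF and PACF together
tsoutliers(AirPassengers)    # Exercise 7: candidate outliers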

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

a chain of collapses

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

A quick resolution, during a committee meeting (!), of a short Riddler puzzle: 36 houses stand in a row and collapse at times t=1,2,…,36. In addition, once a house collapses, its still-standing neighbours collapse at the next time unit. What are the shortest and longest lifespans of this row?

Since every house collapses on its own by time 36 at the latest, the longest lifespan is at most 36, and it is achieved, even with the extra rule, when the collapsing times are perfectly ordered along the row. For the shortest lifespan, I ran a short R code implementing the rules and monitoring the minimum, which found 7 as the smallest value over 10⁵ random draws. However, with an optimal ordering, each house that collapses on its own at time j can bring down at most 2(k−j) neighbours by time k (one on each side per subsequent time unit), leading to a maximal number of collapsed houses after k time units of

[1 + 2(k−1)] + [1 + 2(k−2)] + ⋯ + [1 + 0] = k + k(k−1) = k²

which happens to equal 36 for k=6. (And 6 was indeed reached within 10⁶ draws!) This also gives the solution for any value of k.
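
For the record, a minimal simulation sketch of the rules, assuming the 36 collapse times form a random permutation of 1:36 and that neighbours of a collapsed house fall at the next time unit:

collapse_time <- function(times) {
  down <- rep(FALSE, 36)
  for (t in 1:36) {
    nb <- which(down)                                  # houses already down
    newly <- c(nb - 1L, nb + 1L, which(times == t))    # their neighbours plus the timer-t house
    newly <- newly[newly >= 1 & newly <= 36]
    down[newly] <- TRUE
    if (all(down)) return(t)
  }
  36L
}

min(replicate(1e5, collapse_time(sample(36))))   # 7 over 10^5 draws in the post, 6 over 10^6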

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.


Source:: R News

Searching For Unicorns (And Other NBA Myths)

By Sam Marks

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

A visual exploration of the 2017-2018 NBA landscape

The modern NBA landscape is rapidly changing.

Steph Curry has redefined the lead guard prototype with jaw-dropping shooting range coupled with unprecedented scoring efficiency for a guard. The likes of Marc Gasol, Al Horford and Kristaps Porzingis are paving the way for a younger generation of modern big men as defensive rim protectors who can space the floor on offense as three-point threats. Then there are the new-wave facilitators – LeBron James, Draymond Green, Ben Simmons – enormous athletes who can guard any position on defense and push the ball down court in transition.

For fans, analysts and NBA front offices alike, these are the prototypical players that make our mouths water. So what do they have in common?

For one, they are elite statistical outliers in at least two categories, and this serves as the primary motivation for my exploratory analysis tool: To identify NBA players in the 2017-2018 season that exhibited unique skill sets based on statistical correlations.

To access the tool, click here.

The Data

The tool uses box score data from the 2017-2018 NBA season (source: Kaggle) and focuses on the following categories: Points, rebounds, assists, turnovers, steals, blocks, 3-pointers made, FG% and FT%. I also used Dean Oliver’s formula for estimating a player’s total possessions (outlined here).

To assess all players on an equal scale, I normalized the box score data for each player. For ease of interpretability, I chose to use “per 36 minute” normalization, which takes a player’s per-minute production and extrapolates it to 36 minutes of playing time. In this way, the values displayed in the scatterplot represent each player’s production per 36 minutes of playing time.

To ensure that the per-36 minute calculations did not generate any outliers due to small statistical samples, I removed all players with fewer than nine games in the season, as well as players who averaged three minutes or less per game.
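
As an illustration, the per-36 normalization and the sample filter could be coded as in the sketch below, where the box_scores data frame and its column names are hypothetical stand-ins for the Kaggle box-score data:

library(dplyr)

# box_scores and its columns (player, game_id, minutes, points, ..., fg3m) are hypothetical
stats_per36 <- box_scores %>%
  group_by(player) %>%
  summarise(games = n_distinct(game_id),
            minutes = sum(minutes),
            across(c(points, rebounds, assists, turnovers, steals, blocks, fg3m), sum)) %>%
  # extrapolate per-minute production to 36 minutes of playing time
  mutate(across(c(points, rebounds, assists, turnovers, steals, blocks, fg3m), ~ .x / minutes * 36)) %>%
  # drop small samples: fewer than nine games, or three minutes or less per game
  filter(games >= 9, minutes / games > 3)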

Using the tool: A demonstration

The tool is a Shiny application intended to be used for exploratory analysis and player discovery. To most effectively understand and interpret the charts, you can follow these steps:

Step 1: Assess the correlation matrix

The correlation matrix uses the Pearson correlation coefficient as a reference to guide your use of the dynamic scatter plot. Each dot represents the league-wide correlation between two statistical categories.

The color scale indicates the direction of the correlation. That is, blue dots represent negatively correlated statistics, and red dots positively correlated statistics. The size of the dot indicates the magnitude of the correlation – that is, how strong the relationship is between the two statistics across the entire league. Large dots represent high correlation between two statistics, while small dots indicate that the two statistics do not have a linear relationship.
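
A sketch of how such a correlation matrix might be drawn (continuing from the hypothetical stats_per36 table in the previous sketch; color encodes the sign of the Pearson correlation, size its magnitude):

library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

cors <- stats_per36 %>%
  select(points:fg3m) %>%                   # the per-36 statistic columns
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("stat_x") %>%
  pivot_longer(-stat_x, names_to = "stat_y", values_to = "r")

ggplot(cors, aes(stat_x, stat_y, colour = r, size = abs(r))) +
  geom_point() +
  scale_colour_gradient2(low = "blue", high = "red", limits = c(-1, 1)) +
  labs(colour = "Pearson r", size = "|r|")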

Step 2: Select two statistics to plot for exploration

We can get a flavor of these relationships as we move to the scatterplot. (Follow along using the app.) For the purpose of identifying truly unique players, let’s look at a pairing of negatively correlated statistics with high magnitude (i.e. a blue, large dot): 3-pointers made (“3PM”) vs. Field goal percentage (“FG%”).

Step 3: Explore

It makes sense intuitively why these are negatively correlated – a player making a lot of threes is also attempting a lot of long-distance, low-percentage shots. Given the value of floor-spacing in today’s NBA, a high-volume 3-point shooter who is also an efficient scorer possesses unique abilities. So, let’s select FG% for our x-axis and 3PM for our y-axis (using the dropdowns in the menu bar), and see what we find…

The two dotted lines within the scatterplot represent the 50th percentile for each statistic. In the case of FG% vs. 3PM, we turn to the upper right quadrant, which represents the players who are above average in both FG% and 3-pointers made. To focus our analysis, we can zoom in on this quadrant for a close look.

To zoom, simply select and drag across the plotted space you want to zoom in to, in this case the upper right quadrant. You can also filter by position by simply selecting specific positions in the legend.

Hover over a point to see who the player is, as well as their per-36 statistics. At the top of our plot, no surprises here: Steph Curry. While his 4.7 threes per 36 minutes leads the league, what truly separates him is his 50% efficiency from the field. But we already know that Steph is an exceptional anomaly, so who else can we find?

While several superstars can also be found at the top of our plot – Kevin Durant, Kyrie Irving, and Klay Thompson stand out – we have quite a few role players up there as well: Kyle Korver, J.J. Redick, Kelly Olynyk and Joe Ingles. These are quality reserves who may not wow us with their overall statistical profiles, but play a crucial, high-value role on teams by spacing the floor without sacrificing scoring efficiency.

Step 4: Repeat

I recommend starting your exploration with the blue dots of the correlation matrix – blocks vs. threes, rebounds vs. threes, assists vs. blocks, for example. These are where you can identify players with the most unique skill pairings across the league. (Note: When plotting turnovers, be sure to focus below the median line, as it is better to have low turnovers than high.)

For fantasy basketball enthusiasts, this is a great tool to identify players with specific statistical strengths to construct a well-balanced team, or complement your roster core.

Conclusion

I really enjoyed building this tool and exploring its visualization of the NBA landscape. From an interpretability standpoint, however, it is not ideal that we can only focus on one player at a time. To improve on this, I plan to include an additional table that provides a deeper look at players that fall above the median line for both X and Y statistics. In this way, we can further analyze these players across a larger range of performance variables.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.


Source:: R News

Dear data scientists, how to ease your job

By eoda GmbH

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

You have got the best job of the 21st century. Like a treasure hunter, you are looking for a data treasure chest while sailing through data lakes. In many companies you are a digital maker – armed with skills that turn you into a modern-day polymath and a toolset so holistic and complex at once that even astronauts would feel dizzy.

However, there is still something you carry every day to work: the burden of high expectations and demands of others – whether it is a customer, your supervisor or colleagues. Being a data scientist is a dream job, but also very stressful as it requires creative approaches and new solutions every day.

Would it not be great if there were something that made your daily work easier?

Many requirements and your need for a solution

R, Python and Julia – does your department work with several programming languages? Would it not be great to have a solution that supports various languages and thus encourages what is so essential for data analysis: teamwork? Additionally, connectivity packages could enable you to work in a familiar environment such as RStudio.

Performance is everything: you create the most complex neural networks and process data quantities that make big data really deserve the word “big”. Imagine you could transfer your analyses into a controlled and scalable environment where your scripts not only run reliably, but also benefit from optimal load distribution. All this, including horizontal scaling and improved system performance.

Data sources, users and analysis scripts – you are in search of a tool that can bring all components together in a bundled analysis project to manage resources more efficiently, raise transparency and establish a compliant workflow. The best possible solution is a role management which can easily be extended to the specialist department.

Time is money. Of course, that also applies to your working time. You need a solution that can free you from time-consuming routine tasks, such as the monitoring and parameterization of script execution, or the scheduling of analyses via time-based triggers. Additionally, dynamic load distribution and the logging of script output ensure the operationalization of script execution in a business-critical environment.

Keep an eye on the big picture: your performant analysis will not bring you the deserved satisfaction if you are not able to embed it into the existing IT landscape. A tool with consistent interfaces to integrate your analysis scripts neatly into any existing application via a REST API would be perfect to ease your daily workload.

eoda | data science core: a solution from data scientists to data scientists

Imagine a data science tool that incorporates this experience and leverages your potential in bringing data science to the enterprise environment.

Based on many years of experience with analysis projects and knowledge of your daily challenges, we have developed a solution for you: the eoda | data science core. It lets you manage your analysis projects in a flexible, performant and secure way. It gives you the space you need to deal with expectations and to keep the love for the profession – as it is, after all, the best job in the world.

The eoda | data science environment provides a framework for creating and managing different containers with several setups for various applications.

In the eoda | data science environment scripts can access common intermediate results despite different languages like Rstats, Python or Julia and be managed in one data science project.

The eoda | data science core is the first main component of the eoda | data science environment. This will be complemented with the second component, the eoda | data science portal. How does the portal enable collaborative working, explorative analyses and a user-friendly visualization of results? Read all about it in the next article and find out.

For more information: www.data-science-environment.com

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.


Source:: R News

11 Jobs for R users from around the world (2018-06-19)

By Tal Galili

r_jobs

To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Full-Time
    Research Fellow UC Hastings Institute for Innovation Law – Posted by feldmanr
    San Francisco California, United States
    19 Jun 2018
  2. Full-Time
    Technical Support Engineer at RStudio RStudio – Posted by agadrow
    Anywhere
    19 Jun 2018
  3. Full-Time
    Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
    New York New York, United States
    15 Jun 2018
  4. Full-Time
    Data Scientist / Senior Strategic Analyst (Communications & Marketing) Memorial Sloan Kettering Cancer Center – Posted by MSKCC
    New York New York, United States
    30 May 2018
  5. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    2 May 2018

All New R Jobs

  1. Full-Time
    Research Fellow UC Hastings Institute for Innovation Law – Posted by feldmanr
    San Francisco California, United States
    19 Jun 2018
  2. Full-Time
    Technical Support Engineer at RStudio RStudio – Posted by agadrow
    Anywhere
    19 Jun 2018
  3. Full-Time
    postdoc in psychiatry: machine learning in human genomics University of Iowa – Posted by michaelsonlab
    Anywhere
    18 Jun 2018
  4. Full-Time
    Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
    New York New York, United States
    15 Jun 2018
  5. Full-Time
    Research Data Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    15 Jun 2018
  6. Full-Time
    Data Scientist / Senior Strategic Analyst (Communications & Marketing) Memorial Sloan Kettering Cancer Center – Posted by MSKCC
    New York New York, United States
    30 May 2018
  7. Full-Time
    Market Research Analyst: Mobility for RSG RSG – Posted by patricia.holland@rsginc.com
    Anywhere
    25 May 2018
  8. Full-Time
    Data Scientist @ New Delhi, India Amulozyme Inc. – Posted by Amulozyme
    New Delhi Delhi, India
    25 May 2018
  9. Full-Time
    Data Scientist/Programmer @ Milwaukee, Wisconsin, U.S. ConsensioHealth – Posted by ericadar
    Milwaukee Wisconsin, United States
    25 May 2018
  10. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    2 May 2018
  11. Full-Time
    Lead Data Scientist @ Washington, District of Columbia, U.S. AFL-CIO – Posted by carterkalchik
    Washington District of Columbia, United States
    27 Apr 2018

On R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)

Source:: R News

Copy/paste t-tests Directly to Manuscripts

By Dominique Makowski

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

One of the most time-consuming parts of data analysis in psychology is the copy-pasting of specific values from some R output into a manuscript or report. This task is frustrating, prone to errors, and increases the variability of statistical reporting. This is an important issue, as standardizing practices of what and how to report might be key to overcoming the reproducibility crisis of psychology. The psycho package was designed specifically to do this job. At first for complex Bayesian mixed models, but the package is now also compatible with basic methods, such as t-tests.

Do a t-test

# Load packages
library(tidyverse)

# devtools::install_github("neuropsychology/psycho.R")  # Install the latest psycho version
library(psycho)

df <- psycho::affective  # Load the data


results <- t.test(df$Age ~ df$Sex)  # Perform a simple t-test

APA formatted output

You simply run the analyze() function on the t-test object.

psycho::analyze(results)
The Welch Two Sample t-test suggests that the difference of df$Age by df$Sex (mean in group F = 26.78, mean in group M = 27.45, difference = -0.67) is not significant (t(360.68) = -0.86, 95% CI [-2.21, 0.87], p > .1).

Flexibility

It works for all kinds of different t-tests versions.

t.test(df$Adjusting ~ df$Sex,
       var.equal=TRUE, 
       conf.level = .90) %>% 
  psycho::analyze()
The  Two Sample t-test suggests that the difference of df$Adjusting by df$Sex (mean in group F = 3.72, mean in group M = 4.13, difference = -0.41) is significant (t(1249) = -4.13, 90% CI [-0.58, -0.25], p < .001).
t.test(df$Adjusting,
       mu=0,
       conf.level = .90) %>% 
  psycho::analyze()
The One Sample t-test suggests that the difference between df$Adjusting (mean = 3.80) and mu = 0 is significant (t(1250) = 93.93, 90% CI [3.74, 3.87], p < .001).

Dataframe of Values

It is also possible to have all the values stored in a dataframe by running a summary on the analyzed object.

library(tidyverse)

t.test(df$Adjusting ~ df$Sex) %>% 
  psycho::analyze() %>% 
  summary()
effect statistic df p CI_lower CI_higher
-0.4149661 -4.067008 377.8364 5.8e-05 -0.6155884 -0.2143439

Contribute

Of course, these reporting standards should change, depending on new expert recommendations or official guidelines. The goal of this package is to flexibly adapt to such changes and accompany the evolution of best practices. Therefore, if you have any advice, opinions or ideas, we encourage you to let us know by opening an issue or, even better, by trying to implement the changes yourself and contributing to the code.

Credits

This package helped you? Don’t forget to cite the various packages you used 🙂

You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: An Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

Previous blogposts

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.


Source:: R News

R vs Python: Image Classification with Keras

By Dmitry Kisler

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Even though libraries for running R from Python (and Python from R) have existed for years, and despite the recent announcement of the Ursa Labs foundation by Wes McKinney, who is aiming to join forces with RStudio (Hadley Wickham in particular; find more here) to improve data scientists’ workflows and to unify libraries so that they can be used not only in Python but in any programming language used by data scientists, some data professionals are still very strict about the language to be used for ANN models, limiting their dev environment exclusively to Python.

As a continuation of my R vs. Python comparison, I decided to test performance of both languages in terms of time required to train a convolutional neural network based model for image recognition. As the starting point, I took the blog post by Dr. Shirin Glander on how easy it is to build a CNN model in R using Keras.

A few words about Keras. It is a Python library for artificial neural network ML models which provides a high-level frontend to various deep learning frameworks, with Tensorflow being the default one.
Keras has many pros with the following among the others:

  • Easy to build complex models in just a few lines of code => perfect for a dev cycle of quickly experimenting and checking your ideas
  • Code recycling: one can easily swap the backend framework (let’s say from CNTK to Tensorflow or vice versa) => DRY principle
  • Seamless use of GPU => perfect for fast model tuning and experimenting

Since Keras is written in Python, it may be a natural choice for your dev environment to use Python. And that was the case until about a year ago, when RStudio founder J.J. Allaire announced the release of the Keras library for R in May 2017. I consider this a turning point for data scientists: we can now be more flexible with our dev environment, deliver results more efficiently, and extend existing solutions we may have written in R.

It brings me to the point of this post.
My hypothesis is that, when it comes to building ANN ML models with Keras, Python is not a must: depending on your client’s requests or tech stack, R can be used without limitations and with similar efficiency.

Image Classification with Keras

In order to test my hypothesis, I am going to perform image classification using the fruit images data from Kaggle and train a CNN model with four hidden layers: two 2D convolutional layers, one pooling layer and one dense layer. RMSProp is used as the optimizer.

Tech stack

Hardware:
CPU: Intel Core i7-7700HQ: 4 cores (8 threads), 2800 – 3800 (Boost) MHz core clock
GPU: Nvidia Geforce GTX 1050 Ti Mobile: 4Gb vRAM, 1493 – 1620 (Boost) MHz core clock
RAM: 16 Gb

Software:
OS: Linux Ubuntu 16.04
R: ver. 3.4.4
Python: ver. 3.6.3
Keras: ver. 2.2
Tensorflow: ver. 1.7.0
CUDA: ver. 9.0 (note that the current tensorflow version supports ver. 9.0 while the up-to-date version of cuda is 9.2)
cuDNN: ver. 7.0.5 (note that the current tensorflow version supports ver. 7.0 while the up-to-date version of cuDNN is 7.1)

Code

The R and Python code snippets used for CNN model building are presented below. Thanks to the fruitful collaboration between F. Chollet and J.J. Allaire, the logic and function names in R closely mirror the Python ones.

R

## Courtesy: Dr. Shirin Glander. Code source: https://shirinsplayground.netlify.com/2018/06/keras_fruits/

library(keras)
start <- Sys.time()

# same data, parameters and architecture as the Python version below
# fruits categories and number of output classes
fruit_list <- c("Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange",
                "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate")
output_n <- length(fruit_list)
# image size to scale down to (original images are 100 x 100 px), RGB channels
img_width <- 20; img_height <- 20
target_size <- c(img_width, img_height); channels <- 3
# path to image folders
path <- "path/to/folder/with/data"
train_image_files_path <- paste0(path, "fruits-360/Training")
valid_image_files_path <- paste0(path, "fruits-360/Test")

# input data rescaling: training images
train_data_gen <- image_data_generator(rescale = 1/255)
# Validation data shouldn't be augmented! But it should also be scaled.
valid_data_gen <- image_data_generator(rescale = 1/255)

# getting data: training and validation images
train_image_array_gen <- flow_images_from_directory(train_image_files_path, train_data_gen,
  target_size = target_size, class_mode = "categorical", classes = fruit_list, seed = 42)
valid_image_array_gen <- flow_images_from_directory(valid_image_files_path, valid_data_gen,
  target_size = target_size, class_mode = "categorical", classes = fruit_list, seed = 42)

# number of training and validation samples, batch size and number of epochs
train_samples <- train_image_array_gen$n
valid_samples <- valid_image_array_gen$n
batch_size <- 32
epochs <- 10

# initialise model and add layers (first hidden layer)
model <- keras_model_sequential() %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = 'same', input_shape = c(img_width, img_height, channels)) %>%
  layer_activation('relu') %>%
  
  # Second hidden layer
  layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = 'same') %>%
  layer_activation_leaky_relu(0.5) %>%
  layer_batch_normalization() %>%
  
  # Use max pooling
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%
  
  # Flatten max filtered output into feature vector 
  # and feed into dense layer
  layer_flatten() %>%
  layer_dense(100) %>%
  layer_activation('relu') %>%
  layer_dropout(0.5) %>%
  
  # Outputs from dense layer are projected onto output layer
  layer_dense(output_n) %>% 
  layer_activation('softmax')

# compile
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6),
  metrics = 'accuracy'
)
# fit
hist <- model %>% fit_generator(
  train_image_array_gen,
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = epochs,
  validation_data = valid_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size),
  verbose = 2,
  callbacks = list(
    callback_model_checkpoint("fruits_checkpoints.h5", save_best_only = TRUE),  # save best model after every epoch
    callback_tensorboard(log_dir = "fruits_logs")  # only needed for visualising with TensorBoard
  )
)

# final accuracy, validation accuracy and elapsed training time
hist$metrics %>% 
  {data.frame(acc = .$acc[epochs], val_acc = .$val_acc[epochs], elapsed_time = as.integer(Sys.time()) - as.integer(start))}

Python

## Author: D. Kisler - adaptation of the R code from https://shirinsplayground.netlify.com/2018/06/keras_fruits/

from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import (Conv2D,
                          Dense,
                          LeakyReLU,
                          BatchNormalization, 
                          MaxPooling2D, 
                          Dropout,
                          Flatten)
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, TensorBoard
import PIL.Image
from datetime import datetime as dt

start = dt.now()

# fruits categories
fruit_list = ["Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange",
                "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate"]
# number of output classes (i.e. fruits)
output_n = len(fruit_list)
# image size to scale down to (original images are 100 x 100 px)
img_width = 20
img_height = 20
target_size = (img_width, img_height)
# image RGB channels number
channels = 3
# path to image folders
path = "path/to/folder/with/data"
train_image_files_path = path + "fruits-360/Training"
valid_image_files_path = path + "fruits-360/Test"

## input data augmentation/modification
# training images
train_data_gen = ImageDataGenerator(
  rescale = 1./255
)
# validation images
valid_data_gen = ImageDataGenerator(
  rescale = 1./255
)

## getting data
# training images
train_image_array_gen = train_data_gen.flow_from_directory(train_image_files_path,                                                            
                                                           target_size = target_size,
                                                           classes = fruit_list, 
                                                           class_mode = 'categorical',
                                                           seed = 42)

# validation images
valid_image_array_gen = valid_data_gen.flow_from_directory(valid_image_files_path, 
                                                           target_size = target_size,
                                                           classes = fruit_list,
                                                           class_mode = 'categorical',
                                                           seed = 42)

## model definition
# number of training samples
train_samples = train_image_array_gen.n
# number of validation samples
valid_samples = valid_image_array_gen.n
# define batch size and number of epochs
batch_size = 32
epochs = 10

# initialise model
model = Sequential()

# add layers
# input layer
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same', input_shape = (img_width, img_height, channels), activation = 'relu'))
# hidden conv layer
model.add(Conv2D(filters = 16, kernel_size = (3,3), padding = 'same'))
model.add(LeakyReLU(.5))
model.add(BatchNormalization())
# using max pooling
model.add(MaxPooling2D(pool_size = (2,2)))
# randomly switch off 25% of the nodes per epoch step to avoid overfitting
model.add(Dropout(.25))
# flatten max filtered output into feature vector
model.add(Flatten())
# output features onto a dense layer
model.add(Dense(units = 100, activation = 'relu'))
# randomly switch off 50% of the nodes per epoch step to avoid overfitting
model.add(Dropout(.5))
# output layer with the number of units equal to the number of categories
model.add(Dense(units = output_n, activation = 'softmax'))

# compile the model
model.compile(loss = 'categorical_crossentropy', 
              metrics = ['accuracy'], 
              optimizer = RMSprop(lr = 1e-4, decay = 1e-6))

# train the model
hist = model.fit_generator(
  # training data
  train_image_array_gen,

  # epochs
  steps_per_epoch = train_samples // batch_size, 
  epochs = epochs, 

  # validation data
  validation_data = valid_image_array_gen,
  validation_steps = valid_samples // batch_size,

  # print progress
  verbose = 2,
  callbacks = [
    # save best model after every epoch
    ModelCheckpoint("fruits_checkpoints.h5", save_best_only = True),
    # only needed for visualising with TensorBoard
    TensorBoard(log_dir = "fruits_logs")
  ]
)

df_out = {'acc': hist.history['acc'][epochs - 1], 'val_acc': hist.history['val_acc'][epochs - 1], 'elapsed_time': (dt.now() - start).seconds}

Experiment

The models above were trained 10 times each in R and in Python, on both GPU and CPU; the elapsed time and the final accuracy after 10 epochs were measured. The results of the measurements are presented in the plots below (click a plot to be redirected to the interactive plotly version).

From the plots above, one can see that:

  • the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only training accuracy, but the model doesn’t have high variance and the validation accuracy is around 99% as well).
  • even though 10 measurements may not be fully convincing, Python reduces the time required to train this CNN model by up to 15%. This is somewhat expected, because R uses Python under the hood when executing Keras functions.

Let’s perform an unpaired t-test, assuming that all our observations are normally distributed.

library(dplyr)
library(data.table)
# fetch the data used to plot the graphs (placeholder path for the measurements file)
d <- fread("path/to/measurements.csv")
d %>%
    group_by(lang, eng) %>%
    summarise(el_mean = mean(elapsed_time),
              el_std = sd(elapsed_time),
              n = n()) %>% data.frame() %>%
    group_by(eng) %>%
    summarise(t_score = round(diff(el_mean)/sqrt(sum(el_std^2/n)), 2))
eng t_score
cpu 11.38
gpu 9.64

The t-scores reflect a significant difference between the time required to train the CNN model in R and in Python, as we saw in the plots above.

Summary

Building and training a CNN model in R using Keras is as “easy” as in Python, with the same coding logic and function-naming conventions. The final accuracy of your Keras model will depend on the neural net architecture, hyperparameter tuning, training duration, the amount of train/test data, etc., but not on the programming language you use for your DS project. Training a CNN Keras model in Python may be up to 15% faster compared to R.

P.S.

If you would like to know more about Keras and to be able to build models with this awesome library, I recommend you these books:

As well as this Udemy course to start your journey with Keras.

Thanks a lot for your attention! I hope this post will help an aspiring data scientist gain an understanding of use cases for different technologies and avoid being biased when selecting the tools for a DS project.

    Related Post

    1. Update: Can we predict flu outcome with Machine Learning in R?
    2. Evaluation of Topic Modeling: Topic Coherence
    3. Natural Language Generation with Markovify in Python
    4. Anomaly Detection in R – The Tidy Way
    5. Forecasting with ARIMA – Part I
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


Source:: R News