ggplot2 2.2.0 coming soon!

By Hadley Wickham

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version 2.1.0.9001. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")
devtools::install_github("hadley/ggplot2")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:

install.packages("ggplot2")

ggplot2 2.2.0 will be a relatively major release including:

  • Subtitles and captions
  • A large rewrite of the facetting system
  • Improved theming
  • Better stacking of bars

The majority of this work was carried out by Thomas Lin Pedersen, whom I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.

Subtitles and captions

Thanks to Bob Rudis, you can now add subtitles and captions:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

These are controlled by the theme settings plot.subtitle and plot.caption.

The plot title is now aligned to the left by default. To return to the previous centering, use theme(plot.title = element_text(hjust = 0.5)).
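
As a quick illustration (my own sketch, not code from the post; the styling values are arbitrary), the new elements can be themed and the title re-centred like any other text element:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),        # centre the title as before
    plot.subtitle = element_text(face = "italic"), # illustrative styling only
    plot.caption = element_text(colour = "grey50")
  )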

Facets

The facet and layout implementation has been moved to ggproto and received a large rewrite and refactoring. This will allow others to create their own facetting systems, as described in the Extending ggplot2 vignette. Along with the rewrite, a number of features and improvements have been added, most notably:

  • You can now use functions in facetting formulas, thanks to Dan Ruderman:
    ggplot(diamonds, aes(carat, price)) + 
      geom_hex(bins = 20) + 
      facet_wrap(~cut_number(depth, 6))
    
  • Previously, axes were dropped when the panels in facet_wrap() did not completely fill the rectangle. Now, an axis is drawn underneath the hanging panels:
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      facet_wrap(~class)

  • It is now possible to set the position of the axes through the position argument in the scale constructor:
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      scale_x_continuous(position = "top") + 
      scale_y_continuous(position = "right")

  • You can display a secondary axis that is a one-to-one transformation of the primary axis with the sec.axis argument:
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      scale_y_continuous(
        "mpg (US)", 
        sec.axis = sec_axis(~ . * 1.20, name = "mpg (UK)")
      )

  • Strips can be placed on any side, and the placement with respect to axes can be controlled with the strip.placement theme option.
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      facet_wrap(~ drv, strip.position = "bottom") + 
      theme(
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(face = "bold")
      ) +
      xlab(NULL)

Theming

  • Blank elements can now be overridden again so you get the expected behavior when setting e.g. axis.line.x.
  • element_line() gets an arrow argument that lets you put arrows on axes.
    arrow <- arrow(length = unit(0.4, "cm"), type = "closed")
    
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      theme_minimal() + 
      theme(
        axis.line = element_line(arrow = arrow)
      )

  • Control of legend styling has been improved. The whole legend area can be aligned according to the plot area and a box can be drawn around all legends:
    ggplot(mpg, aes(displ, hwy, shape = drv, colour = fl)) + 
      geom_point() + 
      theme(
        legend.justification = "top", 
        legend.box.margin = margin(3, 3, 3, 3, "mm"), 
        legend.box.background = element_rect(colour = "grey50")
      )

  • panel.margin and legend.margin have been renamed to panel.spacing and legend.spacing respectively, as this better indicates their roles. A new legend.margin property actually controls the margin around each legend (see the sketch after this list).
  • When computing the height of titles, ggplot2 now includes the height of the descenders (i.e. the bits of g and y that hang underneath). This improves the margins around titles, particularly the y axis label. I have also very slightly increased the inner margins of axis titles, and removed the outer margins.
  • The default themes have been tweaked by Jean-Olivier Irisson, making them better match theme_grey().
  • Lastly, the theme() function now has named arguments so autocomplete and documentation suggestions are vastly improved.
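
To illustrate the renamed spacing settings and the new legend.margin (a minimal sketch; the values are arbitrary, not recommendations):

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  facet_wrap(~drv) +
  theme(
    panel.spacing = unit(1, "lines"),         # formerly panel.margin
    legend.spacing = unit(0.5, "cm"),         # formerly legend.margin
    legend.margin = margin(2, 2, 2, 2, "mm")  # new: margin around each legend
  )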

Stacking bars

position_stack() and position_fill() now stack values in the reverse order of the grouping, which makes the default stack order match the legend.

library(dplyr)  # needed for %>% and the grouping/summarising verbs below

avg_price <- diamonds %>%
  group_by(cut, color) %>% 
  summarise(price = mean(price)) %>% 
  ungroup() %>% 
  mutate(price_rel = price - mean(price))

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price, fill = color))

(Note also the new geom_col() which is short-hand for geom_bar(stat = "identity"), contributed by Bob Rudis.)

Additionally, you can now stack negative values:

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price_rel, fill = color))

The overall ordering cannot necessarily be matched in the presence of negative values, but the ordering on either side of the x-axis will match.

If you want to stack in the opposite order, try forcats::fct_rev():

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price, fill = fct_rev(color)))

All the R Ladies

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Two groups are making an impact in improving the gender diversity of R users worldwide. The R-Ladies organization is creating chapters worldwide to facilitate female R programmers meeting and working together, and the Taskforce on Women in the R Community is working to improve the participation and experience of women in the R community.

R has more participation by women than many programming communities, but there’s still a long way to go towards equity: the R Foundation’s Women in R Taskforce estimates that between 11% and 15% of R package authors are women. (The count is based on package author first names; some manual corrections were needed because Hadley is categorized by genderizeR as a female name.)

In 2012, Gabriela de Queiroz founded the first women-focused R user group in San Francisco. Since then, the “R Ladies” concept has expanded to a global franchise in eleven cities.

This is a great first step to increasing the participation by women in a currently male-dominated field. As the R-Ladies leadership note in their grant application to the R Consortium (which was funded in July):

The R community suffers from an underrepresentation of women in every role and area of participation, whether as leaders, package developers, conference speakers, conference participants, educators, or users. The R community needs to promote the growth of this major untapped demographic by proactively supporting women to fulfill their potential, thus enabling and achieving greater participation.

In addition to the groups created by R-Ladies, the Women in R Taskforce is making strides towards achieving these goals. For example, the next useR! conference in Brussels will strive for gender balance amongst invited speakers, tutors, and committee and session chairs, and ensure participation by women on panel discussions. Childcare will also be provided at the conference, and gender statistics will be published on the website.

For more information on R-Ladies (including contacts to help you create a local chapter in your area), and the Women in R Taskforce, follow the links below.

R-Ladies Global : Women in R Taskforce

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

One Way Analysis of Variance Exercises

By Sammy Ngugi

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

When we are interested in finding whether there is a statistically significant difference between the means of two groups, we use the t test. When we have more than two groups we cannot use the t test; instead we have to use analysis of variance (ANOVA). In one way ANOVA we have one continuous dependent variable and one independent grouping variable or factor. When we have two groups, the t test and one way ANOVA are equivalent.
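
As a quick illustration of that equivalence (simulated data, not part of the exercises): with two groups, the pooled-variance t test and one way ANOVA return the same p-value.

set.seed(42)
d <- data.frame(y = c(rnorm(20, mean = 0), rnorm(20, mean = 0.5)),
                group = factor(rep(c("A", "B"), each = 20)))
t.test(y ~ group, data = d, var.equal = TRUE)$p.value  # pooled t test
summary(aov(y ~ group, data = d))[[1]][["Pr(>F)"]][1]  # one way ANOVA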

For our one way ANOVA results to be valid there are several assumptions that need to be satisfied. These assumptions are listed below.

  1. The dependent variable is required to be continuous
  2. The independent variable is required to be categorical with two or more categories.
  3. The dependent and independent variables have values for each row of data.
  4. Observations in each group are independent.
  5. The dependent variable is approximately normally distributed in each group.
  6. There is approximate equality of variance in all the groups.
  7. We should not have any outliers

When our data show non-normality, unequal variance, or outliers, we can transform the data or use a non-parametric test like Kruskal-Wallis. It is good to note that Kruskal-Wallis does not require normality but still requires equal variance across groups.
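
As a rough sketch of the workflow (not the exercise solutions), these are the base R functions typically used, assuming a data frame dat with a continuous response y and a grouping factor group:

fit <- aov(y ~ group, data = dat)      # one way ANOVA
summary(fit)
boxplot(y ~ group, data = dat)         # visual check for outliers
shapiro.test(residuals(fit))           # Shapiro-Wilk normality test
bartlett.test(y ~ group, data = dat)   # equality of variance
TukeyHSD(fit)                          # post hoc pairwise comparisons
kruskal.test(y ~ group, data = dat)    # non-parametric alternative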

For this exercise we will use data on patients having stomach, colon, ovary, bronchus, or breast cancer. The objective of the study was to identify whether the number of days a patient survived was influenced by the organ affected. Our dependent variable is Survival, measured in days. Our independent variable is Organ. The data are available at http://lib.stat.cmu.edu/DASL/Datafiles/CancerSurvival.html, and a cancer-survival file has been uploaded.

Solutions to these exercises can be found here

Exercise 1

Load the data into R

Exercise 2

Create summary statistics for each organ

Exercise 3

Check if we have any outliers using boxplot

Exercise 4

Check for normality using the Shapiro-Wilk test

Exercise 5

Check for equality of variance

Exercise 6

Transform your data and check for normality and equality of variance.

Exercise 7

Run one way ANOVA test

Exercise 8

Perform a Tukey HSD post hoc test

Exercise 9

Interpret results

Exercise 10

Use a Kruskal-Wallis test

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

GoodReads: Machine Learning (Part 3)

By Florent Buisson

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In the first installment of this series, we scraped reviews from Goodreads. In the second one, we performed exploratory data analysis and created new variables. We are now ready for the “main dish”: machine learning!

Setup and general data prep

Let’s start by loading the libraries and our dataset.

library(data.table)
library(dplyr)
library(caret)
library(RTextTools)
library(xgboost)
library(ROCR)

setwd("C:/Users/Florent/Desktop/Data_analysis_applications/GoodReads_TextMining")
data <- read.csv("GoodReadsCleanData.csv", stringsAsFactors = FALSE)

To recap, at this point, we have the following features in our dataset:

review.id
book
rating
review
review.length
mean.sentiment
median.sentiment
count.afinn.positive
count.afinn.negative
count.bing.negative
count.bing.positive

For this example, we’ll simplify the analysis by collapsing the 1 to 5 stars rating into a binary variable: whether the book was rated a “good read” (4 or 5 stars) or not (1 to 3 stars). This will allow us to use classification algorithms, and to have less unbalanced categories.

set.seed(1234)
# Creating the outcome value
data$good.read <- 0
data$good.read[data$rating == 4 | data$rating == 5] <- 1

The “good reads”, or positive reviews, represent about 85% of the dataset, and the “bad reads”, or negative reviews, with good.read == 0, about 15%. We then create the train and test subsets. The dataset is still fairly unbalanced, so we don’t just randomly assign data points to the train and test datasets; we make sure to preserve the percentage of good reads in each subset by using the caret function `createDataPartition` for stratified sampling.

trainIdx <- createDataPartition(data$good.read, 
                                p = .75, 
                                list = FALSE, 
                                times = 1)
train <- data[trainIdx, ]
test <- data[-trainIdx, ]

Creating the Document-Term Matrices (DTM)

Our goal is to use the frequency of individual words in the reviews as features in our machine learning algorithms. In order to do that, we need to start by counting the number of occurrences of each word in each review. Fortunately, there are tools to do just that, which will return a convenient “Document-Term Matrix”, with the reviews in rows and the words in columns; each entry in the matrix indicates the number of occurrences of that particular word in that particular review.

A typical DTM would look like this:

Reviews     about  across  ado  adult
Review 1        0       2    1      0
Review 2        1       0    0      1

We don’t want to catch every single word that appears in at least one review, because very rare words will increase the size of the DTM while having little predictive power. So we’ll only keep in our DTM words that appear in at least a certain percentage of all reviews, say 1%. This is controlled by the “sparsity” parameter in the following code, with sparsity = 1-0.01 = 0.99.

There is a challenge though. The premise of our analysis is that some words appear in negative reviews and not in positive reviews, and vice versa (or at least with different frequencies). But if we only keep words that appear in 1% of our overall training dataset, then because negative reviews represent only 15% of our dataset, we are effectively requiring that a negative word appears in 1%/15% = 6.67% of the negative reviews; this is too high a threshold and won’t do.

The solution is to create two different DTMs for our training dataset, one for positive reviews and one for negative reviews, and then to merge them together. This way, the effective threshold for negative words is to appear in only 1% of the negative reviews.

# Creating a DTM for the negative reviews
sparsity <- .99
bad.dtm <- create_matrix(train$review[train$good.read == 0], 
                         language = "english", 
                         removeStopwords = FALSE, 
                         removeNumbers = TRUE, 
                         stemWords = FALSE, 
                         removeSparseTerms = sparsity) 
#Converting the DTM in a data frame
bad.dtm.df <- as.data.frame(as.matrix(bad.dtm), 
                            row.names = train$review.id[train$good.read == 0])

# Creating a DTM for the positive reviews
good.dtm <- create_matrix(train$review[train$good.read == 1], 
                          language = "english",
                          removeStopwords = FALSE, 
                          removeNumbers = TRUE, 
                          stemWords = FALSE, 
                          removeSparseTerms = sparsity) 

good.dtm.df <- data.table(as.matrix(good.dtm), 
                          row.names = train$review.id[train$good.read == 1])

# Joining the two DTM together
train.dtm.df <- bind_rows(bad.dtm.df, good.dtm.df)
train.dtm.df$review.id <- c(train$review.id[train$good.read == 0],
                            train$review.id[train$good.read == 1])
train.dtm.df <- arrange(train.dtm.df, review.id)
train.dtm.df$good.read <- train$good.read

We also want to use our aggregate variables (review length, mean and median sentiment, counts of positive and negative words according to the two lexicons) in our analyses, so we join the DTM to the train dataset, by review id. We also convert all NA values in our data frames to 0 (these NA have been generated where words were absent from reviews, so that’s the correct way of dealing with them here; but kids, don’t convert NA to 0 at home without thinking about it first).

train.dtm.df <- train %>%
  select(-c(book, rating, review, good.read)) %>%
  inner_join(train.dtm.df, by = "review.id") %>%
  select(-review.id)

train.dtm.df[is.na(train.dtm.df)] <- 0
# Creating the test DTM
test.dtm <- create_matrix(test$review, 
                          language = "english", 
                          removeStopwords = FALSE, 
                          removeNumbers = TRUE, 
                          stemWords = FALSE, 
                          removeSparseTerms = sparsity) 
test.dtm.df <- data.table(as.matrix(test.dtm))
test.dtm.df$review.id <- test$review.id
test.dtm.df$good.read <- test$good.read

test.dtm.df <- test %>%
  select(-c(book, rating, review, good.read)) %>%
  inner_join(test.dtm.df, by = "review.id") %>%
  select(-review.id)

A challenge here is to ensure that the test DTM has the same columns as the train dataset. Obviously, some words may appear in the test dataset while being absent from the train dataset, but there’s nothing we can do about them as our algorithms won’t have anything to say about them. The trick we’re going to use relies on the flexibility of data.tables: when you bind two data.tables with different columns by rows, the resulting data.table automatically has all the columns of the two initial data.tables, with the missing values set to NA. So we are going to add a row of our training data.table to our test data.table and immediately remove it once the missing columns have been created; then we’ll keep only the columns which appear in the training dataset (i.e. discard all columns which appear only in the test dataset).

test.dtm.df <- head(bind_rows(test.dtm.df, train.dtm.df[1, ]), -1)
test.dtm.df <- test.dtm.df %>% 
  select(one_of(colnames(train.dtm.df)))
test.dtm.df[is.na(test.dtm.df)] <- 0

With this, we have our training and test datasets and we can start crunching numbers!

Machine Learning

We’ll be using XGBoost here, as it yields the best results (I tried Random Forests and Support Vector Machines too, but the resulting accuracy was too unstable with these to be reliable).

We start by calculating our baseline accuracy, what we would get by always predicting the most frequent category, and then we calibrate our model.

baseline.acc <- sum(test$good.read == "1") / nrow(test)

XGB.train <- as.matrix(select(train.dtm.df, -good.read),
                       dimnames = dimnames(train.dtm.df))
XGB.test <- as.matrix(select(test.dtm.df, -good.read),
                      dimnames=dimnames(test.dtm.df))
XGB.model <- xgboost(data = XGB.train, 
                     label = train.dtm.df$good.read,
                     nrounds = 400, 
                     objective = "binary:logistic")

XGB.predict <- predict(XGB.model, XGB.test)

XGB.results <- data.frame(good.read = test$good.read,
                          pred = XGB.predict)

The XGBoost algorithm yields a probabilistic prediction, so we need to determine a threshold over which we’ll classify a review as good. In order to do that, we’ll plot the ROC (Receiver Operating Characteristic) curve for the true negative rate against the false negative rate.

ROCR.pred <- prediction(XGB.results$pred, XGB.results$good.read)
ROCR.perf <- performance(ROCR.pred, 'tnr','fnr') 
plot(ROCR.perf, colorize = TRUE)

Things are looking pretty good. It seems that by using a threshold of about 0.8 (where the curve becomes red), we can correctly classify more than 50% of the negative reviews (the true negative rate) while misclassifying as negative reviews less than 10% of the positive reviews (the false negative rate).

XGB.table <- table(true = XGB.results$good.read, 
                   pred = as.integer(XGB.results$pred >= 0.80))
XGB.table
XGB.acc <- sum(diag(XGB.table)) / nrow(test)

Our overall accuracy is 87%, so we beat the benchmark of always predicting that a review is positive (which would yield an 83.4% accuracy here, to be precise), while catching 61.5% of the negative reviews. Not bad for a “black box” algorithm, without any parameter optimization or feature engineering!
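
For reference, the share of negative reviews caught can be read off the confusion matrix computed above (assuming, as produced by table() here, that rows are the true classes and columns the predicted classes):

# Share of true negative reviews correctly classified as negative
XGB.table["0", "0"] / sum(XGB.table["0", ])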

Directions for further analyses

If we wanted to go deeper in the analysis, a good starting point would be to look at the relative importance of features in the XGBoost algorithm:

### Feature analysis with XGBoost
names <- colnames(test.dtm.df)
importance.matrix <- xgb.importance(names, model = XGB.model)
xgb.plot.importance(importance.matrix[1:20, ])

As we can see, there are a few words, such as “colleen” or “you” that are unlikely to be useful in a more general setting, but overall, we find that the most predictive words are negative ones, which was to be expected. We also see that two of our aggregate variables, review.length and count.bing.negative, made the top 10.

There are several ways we could improve on the analysis at this point, such as:

  • using N-grams (i.e. sequences of words, such as “did not like”) in addition to single words, to better qualify negative terms. “was very disappointed” would obviously have a different impact compared to “was not disappointed”, even though on a word-by-word basis they could not be distinguished.
  • fine-tuning the parameters of the XGBoost algorithm.
  • looking at the negative reviews that have been misclassified, in order to determine what features to add to the analysis.

Conclusion

We have covered a lot of ground in this series: from webscraping to sentiment analysis to predictive analytics with machine learning. The main conclusion I would draw from this exercise is that we now have at our disposal a large number of powerful tools that can be used “off-the-shelf” to build fairly quickly a complete and meaningful analytical pipeline.

As with the first two installments, the complete R code for this part is available on my github.

    Related Post

    1. Machine Learning for Drug Adverse Event Discovery
    2. GoodReads: Exploratory data analysis and sentiment analysis (Part 2)
    3. GoodReads: Webscraping and Text Analysis with R (Part 1)
    4. Euro 2016 analytics: Who’s playing the toughest game?
    5. Integrating R with Apache Hadoop
    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    Danger, Caution H2O steam is very hot!!

    By Longhow Lam

    H2O has recently released its steam AI engine, a fully open source engine that supports the management and deployment of machine learning models. Both H2O on R and H2O steam are easy to set up and use. And both complement each other perfectly.

    A very simple example

    Use H2O on R to create some predictive models. Well, due to lack of inspiration I just used the iris set to create some binary classifiers.
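
    The original code appears only as an image in the post; as a stand-in, here is a minimal sketch of what such a model might look like with the h2o R package (the binary target and the choice of h2o.gbm() are my own illustrative assumptions, not the author's exact code):

    library(h2o)
    h2o.init()

    # Turn iris into a binary problem: virginica vs. the rest
    iris2 <- iris
    iris2$is_virginica <- as.factor(iris2$Species == "virginica")
    iris2$Species <- NULL

    iris_hex <- as.h2o(iris2)
    fit <- h2o.gbm(x = setdiff(names(iris2), "is_virginica"),
                   y = "is_virginica",
                   training_frame = iris_hex)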

    Once these models are trained, they are available for use in the H2O steam engine. A nice web interface allows you to set up a project in H2O steam to manage and display summary information of the models.

    In H2O steam you can select a model that you want to deploy. It becomes a service with a REST API, and a page is created to test the service.

    And that is it! Your predictive model is up and running and waiting to be called from any application that can make REST API calls.

    There is a lot more to explore in H2O steam, but be careful: H2O steam is very hot!

    R+H2O for marketing campaign modeling

    By Bogumił Kamiński

    (This article was first published on R snippets, and kindly contributed to R-bloggers)

    My last post about telco churn prediction with R+H2O attracted an unexpectedly high response. It seems that the R+H2O combo currently has very good momentum :). Therefore Wit Jakuczun decided to publish a case study, based on the same technology stack, that he uses in his R boot camps.

    You can find the shortened version of the case study material, along with the source code and data sets, on GitHub.

    I think it is nice not only because it is another example of how to use H2O in R, but also because it is a basic introduction to combining segmentation and prediction modeling for marketing campaign targeting.

    To leave a comment for the author, please follow the link and comment on their blog: R snippets.

    Watch: Highlights of the Microsoft Data Science Summit

    By David Smith

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    I just got back from Atlanta, the host of the Microsoft Machine Learning and Data Science Summit. This was the first year for this new conference, and it was a blast: the energy from the 1,000 attendees was palpable. I covered Joseph Sirosh’s keynote presentation yesterday, but today I wanted to highlight a few other talks from the program now that the recordings are available to stream.

    On deep learning and machine intelligence:

    On data science techniques and processes:

    On production-scale data analysis with R:

    But for me personally, the highlight of the event was the keynote presentation by Dr Edward Tufte on the Future of Data Analysis. I had the distinct honour of introducing Dr Tufte, who has long been a hero of mine, to the stage. Please enjoy his presentation, embedded below.

    There are many other great talks from the conference available for streaming, too many to mention here. Check out the full program at the link below.

    Channel 9: Microsoft Machine Learning & Data Science Summit 2016

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    A simple workflow for deep learning

    By Brian Lee Yung Rowe

    (This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

    As a follow-up to my Primer On Universal Function Approximation with Deep Learning, I’ve created a project on Github that provides a working example of building, training, and evaluating a neural network. Included are helper functions in Lua that I wrote to simplify creating the data and using some functional programming techniques.

    The basic workflow for the example is this:

    1. Create/acquire a training set;
    2. Analyze the data for traits, distributions, noise, etc.;
    3. Design a deep learning architecture, including the layers and activation functions (also make sure you understand the type of problem you are trying to solve);
    4. Choose hyperparameters, such as the cost function, optimizer, and learning rate;
    5. Train the model;
    6. Evaluate in-sample and out-of-sample performance.

    My personal preference is to limit the use of a deep learning framework to building and training models. To construct the datasets and analyze performance, it’s easier to use R (YMMV of course). What’s nice about this approach is that if you primarily work in Python or R, then you can continue to use the tools you’re most familiar with. It also means that it’s easy to swap out one deep learning framework with another without having to start over. These frameworks are also a bit of a bear to setup (I’m looking at you, TensorFlow), particularly if you want to leverage GPUs. It’s also convenient to use a Docker image for this purpose to isolate the effects of a specialized configuration and make it repeatable if you want to work on *gasp* a second computer.
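
    As a small illustration of that split (my own sketch, not code from the linked project), the training data for a function approximation task can be generated in R and written to CSV, leaving the model building to the deep learning framework:

    # Simulate a noisy target function and write it out for the Lua/Torch side
    set.seed(1)
    x <- runif(1000, min = -2, max = 2)
    y <- sin(3 * x) + rnorm(1000, sd = 0.1)  # illustrative function, not the one from the project
    write.csv(data.frame(x = x, y = y), "train.csv", row.names = FALSE)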

    Which Deep Learning Framework?

    Having some experience with TensorFlow, Theano, and Torch, I find Torch to have the friendliest high-level semantics. Theano and TensorFlow are much more low-level, which is not as well suited to practitioners or applied researchers. That means it’s a little harder to get started. The trade-off with Torch is that you have to learn Lua, which is a simple scripting language but also has some awkward paradigms (I’ve never been a fan of the Prototype object model).

    On the other hand, Theano and TensorFlow are built on Python, so most people will be familiar with the language. However, my time could be spent better if I didn’t have to write my own mini-batch algorithm. As an alternative, Keras provides a semantically rich high-level interface that works with both Theano and TensorFlow. I will be adding a corresponding function approximation example in the deep_learning_ex project to make it easier to compare. At that point, it will be easier to compare compute performance, as well as how close the optimizers are to each other.

    Deep learning can be painfully slow

    In terms of TensorFlow, unless you plan on working for Google, I wouldn’t recommend using it. The fact that it requires Google’s proprietary Bazel build system means it’s DOA for me. When I want to work with deep learning, I really don’t have the patience to wait for a 1.1 GB download of just the build system. I mean, I only have Time Warner Cable for crissakes. Others have reported that even using pre-built models, like SyntaxNet are slow, so unless you have the compute power and storage capacity of Google’s data centers along with the bandwidth of Google Fiber, you’re better off watching YouTube.

    Learning Deep Learning Frameworks

    Torch/Lua

    Learning Torch can be split into two tasks: learning Lua, and then understanding the Torch framework, specifically the nn package. Most people will find that learning Lua will take the majority of the time, as nn is nicely organized and easy to use.

    If you already are comfortable with programming languages, then this 15 minute tutorial is good. Alternatively, this other 15 minute tutorial is a bit more terse but rather comprehensive. This will cover the basics. Beyond that, you need to understand how to work with data, which is less well covered. The simplecsv module can simplify I/O.

    The actual data format that the optimizer needs is a table object with an attached size method. Each element of this table is itself a table with two elements: input and corresponding output. So this can be considered a row-major matrix representation of the data. To use the provided StochasticGradient optimizer, the data must be constructed this way, as shown in ex_fun_approx.lua. It is up to you to reserve some data for testing.

    From a practical perspective, you don’t need to know much about Torch itself. It’s probably more efficient to familiarize yourself with the nn package first. I spend most of my time in this documentation. At some later point, it might be worthwhile learning how Torch itself works, in which case their github repo is flush with documentation and examples. I haven’t needed to look elsewhere.

    Keras/Theano

    If Theano is like Torch, then Keras is like the nn package. Unless you need to descend into the bits, it’s probably best to stay high-level. Unlike nn, there are alternatives to Keras for Theano, which I won’t cover. Like Torch, Keras comes pre-installed in the Docker image provided in the deep_learning_ex repository. The best way to get started is to read the Keras documentation, which includes a working example of a simple neural network.

    As with nn, the trick is understanding the framework’s interface, particularly around what expectations it has for the data. Keras essentially expects a 4-tuple of (input training, input testing, output training, output testing). Their built-in datasets all return data organized like this (actually two pairs representing input and output).

    Conclusion

    Deep learning doesn’t need to be hard to learn. By following the prescribed workflow, using the provided Docker image, and streamlining your learning of deep learning frameworks to the essentials, you can get up to speed quickly.

    Have any resources you’d like to share? Add them in the comments!

    To leave a comment for the author, please follow the link and comment on their blog: R – Cartesian Faith.

    gcbd 0.2.6

    By Thinking inside the box

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A pure maintenance release 0.2.6 of the gcbd package is now on CRAN. The gcbd package proposes a benchmarking framework for LAPACK and BLAS operations (as the library can be exchanged in a plug-and-play sense on suitable OSs) and records results in a local database. Recent / upcoming changes to DBI and RSQLite led me to update the package; there are no actual functionality changes in this release.

    CRANberries also provides a diffstat report for the latest release.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

    RcppCNPy 0.2.6

    By Thinking inside the box

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new version of the RcppCNPy package arrived on CRAN a few days ago.

    RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
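
    For example (a minimal sketch of my own, not taken from the release notes), a matrix can be written to and read back from a .npy file:

    library(RcppCNPy)
    m <- matrix(rnorm(12), nrow = 3)  # any numeric matrix
    npySave("m.npy", m)               # write a NumPy file
    m2 <- npyLoad("m.npy")            # read it back
    all.equal(m, m2)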

    This new release reflects all the suggestions and comments I received during the review process for the Journal of Open Source Software submission. I am happy to say that about twenty-nine days after I submitted, the paper was accepted and is now published.

    Changes in version 0.2.6 (2016-09-25)

    • Expanded documentation in README.md

    • Added examples to help page

    • Added CITATION file for JOSS paper

    CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .
