Rocker – explanation and motivation for Docker containers usage in applications development

By Marcin Kosiński

(This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

“What is R?” I was asked at the end of my presentation at the 10th Cracow R Users Meetup, held last Friday (30.09.2016). I was surprised, but replied with conviction that R is the language of data science and is designed to perform statistical data analysis. Later I found out that a few of the listeners had come to the meetup to hear more about Docker than about R, as my topic was Rocker – explanation and motivation for Docker containers usage in applications development. In this post I present an overview of my presentation. If you are not familiar with using Docker in R application development, then this is a must-read for you!

Presentation

The presentation is available at my website and (as many people have asked) was prepared with the help of revealjs::revealjs_presentation, a great tool from RStudio. The front page wasn’t entirely specified in regular rmarkdown YAML, as I added a few Font Awesome icons, so you may also like to see the code of the presentation in my GitHub repository.

Overview

Below I present the description of the presentation.

As R users we mostly perform analyses, produce reports and create interactive Shiny applications. Those are mostly one-off efforts. Sometimes, however, the R developer enters the world of real software development, where R applications must be built once and then distributed and maintained on many machines.

What are the best practices for distributing software? How can we ensure the code will always run the same regardless of its environment? Is there a way to skip the long, manual installation process once the software has been built? Why is Docker crucial for shipping software?

Docker is the world’s leading software containerization platform. Docker containers wrap a piece of software in a complete file system that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server. This guarantees that the software will always run the same, regardless of its environment.
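
As a concrete illustration (a minimal sketch; the script name is a placeholder), the Rocker project publishes R base images on Docker Hub, so containerizing an R script can be as small as this Dockerfile:

FROM rocker/r-base
COPY analysis.R /analysis.R
CMD ["Rscript", "/analysis.R"]

Building the image once (docker build -t my-analysis .) then lets you run the identical environment anywhere Docker runs (docker run --rm my-analysis).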

In my presentation I’ll give a brief introduction to Docker and provide a full motivation for using this technology in regular R work. I strongly believe Docker can make data analysis more reproducible and can facilitate software development, so that distribution and maintenance become easier.

The presentation will be given by a person with an R background, and assumes that the audience is also experienced in R. Get motivated: http://r-addict.com/2016/05/13/Docker-Motivation.html

Acknowledgements

I would like to thank Zygmunt Zawadzki and Bartosz Sękiewicz, the organizers of this meetup, and Marta Karaś, the second speaker, whose great presentation on Convex Clustering and Biclustering with applications in R can be found here.


Proofing statistics in papers

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Recently saw a really fun article making the rounds: “The prevalence of statistical reporting errors in psychology (1985–2013),” Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M., et al., Behav Res (2015), doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for statistical errors. Please read on for how that is possible, some tools, and commentary.

Early automated analysis: trial model of a part of the Analytical Engine, built by Babbage, as displayed at the Science Museum, London (Wikipedia).


From the abstract of the Nuijten et al. paper:

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion.

How did they do that? Has science become so systematized that it is finally mechanically reproducible? Did they get access to one of the new open information extraction systems (please see “Open Information Extraction: the Second Generation” by Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam for some discussion)?

No, they used the fact that the American Psychological Association defines a formal style for reporting statistical significances, just like they define a formal style for citations. Roughly it looks for text like the following:

The results of the regression indicated the two predictors explained 35.8% of the variance (R2=.38, F(2,55)=5.56, p < .01).
(From a derived style guide found at the University of Connecticut.)

The software looks for fragments like: “(R2=.38, F(2,55)=5.56, p < .01)”. So really we are looking at statistics in psychology papers because they have standards clear enough to facilitate inspection.

These statistical summaries are often put into research papers by cutting and pasting from multiple sources, as not all stat packages report all these pieces in one contiguous string. So there are many chances for human error, and a high chance the pieces eventually drift out of sync. Think of a researcher using Microsoft Word, Microsoft Excel, and some complicated graphical-interface-driven software again and again as data and treatment change throughout a study. Eventually something gets out of sync. We can check for inconsistency because both the reported p-value and the R-squared are derivable from the F(numdf,dendf)=FValue portion.

In fact the cited example has errors. The “explained 35.8% of the variance” should likely be 38% (to match the R2 / coefficient of determination) and the “F(2,55)=5.56” bit would entail an R-squared closer to the following: F Test summary: (R2=0.17, F(2,55)=5.56, p≤0.00632) (we chose to show the actual p-value, but cutting off at a sensible limit is part of the guidelines). Likely this is a notional example itself built by copying and pasting to show the format (so we have no intent of mocking it). We derived this result by writing our own R function that takes the F-summaries and re-calculates the R-squared and p-value. In our case we performed the calculation by pasting the following into R: “formatAPAR2fromCite(numdf=2,dendf=55,FValue=5.56)” which performs the calculation and formats the result close to APA style.
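
The exact formatAPAR2fromCite function is in the code linked at the end of this post; the underlying arithmetic is simple enough to sketch directly (the helper names below are ours, for illustration):

# R-squared and p-value implied by an F-test summary F(numdf, dendf) = FValue
R2fromF <- function(numdf, dendf, FValue) {
  numdf * FValue / (numdf * FValue + dendf)
}
pfromF <- function(numdf, dendf, FValue) {
  pf(FValue, numdf, dendf, lower.tail = FALSE)
}

R2fromF(numdf = 2, dendf = 55, FValue = 5.56)  # 0.1682..., i.e. R2 = 0.17
pfromF(numdf = 2, dendf = 55, FValue = 5.56)   # 0.006321509, i.e. p <= 0.00632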

Really this helps point out why scientists should strongly prefer workflows that support reproducible research (a topic we teach using R, RStudio, knitr, Sweave, and optionally LaTeX). It would be better to have correct conclusions automatically transcribed into reports, instead of hoping to catch some fraction of the wrong ones later. This is one reason Charles Babbage specified a printer on both his Difference Engine 2 and Analytical Engine (circa 1847): to avoid transcription errors!

That being said, we recommend reading the original paper. The ability to detect errors makes it possible to collect statistics on errors over time, so there are a number of interesting observations to be made. For more work in this spirit we suggest “An empirical study of FORTRAN programs,” Knuth, Donald E., Software: Practice and Experience, Vol. 1, No. 2, 1971, doi:10.1002/spe.4380010203.

We can even try running statcheck on the guide’s example; it confirms the relation between the F-value and the p-value, but doesn’t seem to check the R-squared (probably not part of the intended check).
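
A call along these lines reproduces the check (a minimal sketch; the statcheck function scans raw text for APA-formatted test results):

# install.packages("statcheck")  # if not already installed
library(statcheck)
statcheck("The results of the regression indicated the two predictors explained 35.8% of the variance (R2=.38, F(2,55)=5.56, p < .01).")

Its one-row result, shown transposed: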


Source               1
Statistic            F
df1                  2
df2                  55
Test.Comparison      =
Value                5.56
Reported.Comparison  <
Reported.P.Value     0.01
Computed             0.006321509
Raw                  F(2,55)=5.56, p < .01
Error                FALSE
DecisionError        FALSE
OneTail              FALSE
OneTailedInTxt       FALSE
APAfactor            1

Our R code demonstrating how to automatically produce ready to go APA style F-summaries can be found here.


Network Analysis Part 2 Exercises

By Miodrag Sljukic


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises we shall practice the functions for network statistics, using the igraph package. If you don’t have the package installed already, install it using the following code:


install.packages("igraph")

and load it into the session using the following code:


library("igraph")

before proceeding. You can find more info about the package and about graphs in general here.

Answers to the exercises are available here.

If you have a different solution, feel free to post it. A sketch of the main igraph functions used in these exercises follows the exercise list.

A number of employees in a factory were interviewed with the question: “Do you like to work with your co-worker?” Possible answers were 1 for yes and 0 for no. Each employee gave an answer about every other employee, thus creating an adjacency matrix. You can download the dataset from here.

Exercise 1

Load the data and create an unweighted directed graph from the adjacency matrix. Name the nodes with the letters A to Y. Set the node color to yellow and the node shape to sphere. Set the edge color to gray and the arrow size to 0.2.

Exercise 2

Plot the graph.

Exercise 3

Calculate network diameter and average closeness.

Exercise 4

Calculate average network betweenness.

Exercise 5

Calculate network density and average degree.

Exercise 6

Calculate network reciprocity and average transitivity.

Exercise 7

Calculate average eccentricity of the vertices. What is the average distance between two nodes?

Exercise 8

Find the hubs and plot the graph with node sizes proportional to their hub scores. Which employee is the biggest hub?

Exercise 9

Find the authorities and plot the graph with node sizes proportional to their authority scores. Which employee is the biggest authority?

Exercise 10

Show the nodes that make up the diameter. Plot these nodes larger and in red, and plot the edges on this path thicker and in red.
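
As promised above, here is a sketch of the main igraph functions these exercises draw on (not the official answers; g denotes the directed graph built in Exercise 1 and adj the 0/1 adjacency matrix from the dataset):

g <- graph_from_adjacency_matrix(adj, mode = "directed")

diameter(g)                        # network diameter (Exercise 3)
mean(closeness(g))                 # average closeness (Exercise 3)
mean(betweenness(g))               # average betweenness (Exercise 4)
edge_density(g)                    # density (Exercise 5)
mean(degree(g))                    # average degree (Exercise 5)
reciprocity(g)                     # reciprocity (Exercise 6)
transitivity(g, type = "average")  # average transitivity (Exercise 6)
mean(eccentricity(g))              # average eccentricity (Exercise 7)
mean_distance(g)                   # average distance (Exercise 7)
hub_score(g)$vector                # hub scores (Exercise 8)
authority_score(g)$vector          # authority scores (Exercise 9)
get_diameter(g)                    # nodes along the diameter (Exercise 10)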


approximate lasso

By xi’an


(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

Here is a representation of the precision of a kernel density estimate (second axis) against the true value of the density (first axis), which looks like a lasso of sorts, hence the title. I am not sure this tells much, except that the estimated values are close to the true values and that a given value of f(x) is associated with two different estimates, predictably…
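
The code behind the picture isn’t shown, but a minimal sketch along these lines produces this kind of plot (assuming a standard normal sample and the default kernel bandwidth):

set.seed(1)
x <- rnorm(10^3)
d <- density(x)        # kernel density estimate, evaluated on the grid d$x
plot(dnorm(d$x), d$y,  # true density vs. estimated density
     xlab = "true density f(x)",
     ylab = "kernel density estimate")

Since the normal density is symmetric, every density value below the mode is attained at two different x’s, one on each side, which is why one value on the first axis pairs with two (slightly different) estimates.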

Filed under: pictures, R, Statistics Tagged: density, kernel density estimator, plot, R


RcppAnnoy 0.0.8

By Thinking inside the box

A new version 0.0.8 of RcppAnnoy, our Rcpp-based R integration of the nifty Annoy library by Erik, is now on CRAN. Annoy is a small, fast, and lightweight C++ template header library for approximate nearest neighbours.

This release pulls in a few suggested changes which had piled up since the last release.

Changes in this version are summarized here:

Changes in version 0.0.8 (2016-10-01)

  • New functions getNNsByItemList and getNNsByVectorList, by Michael Phan-Ba in #12

  • Added destructor (PR #14 by Michael Phan-Ba)

  • Extended templatization (PR #11 by Dan Dillon)

  • Switched to run.sh for Travis (PR #17)

  • Added test for admissible value to addItem (PR #18 closing issue #13)
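
For context, a minimal usage sketch of the index class the package exposes (the dimension and data here are made up for illustration):

library(RcppAnnoy)

d <- 10                                 # vector dimension
a <- new(AnnoyEuclidean, d)             # index under Euclidean distance
set.seed(123)
for (i in 0:99) a$addItem(i, runif(d))  # Annoy items are 0-indexed
a$build(50)                             # build the forest with 50 trees

a$getNNsByItem(0, 5)                    # 5 approximate nearest neighbours of item 0
a$getNNsByItemList(0, 5, -1, TRUE)      # new in 0.0.8: list form, with distances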

Courtesy of CRANberries, there is also a diffstat report for this release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


R code to accompany Real-World Machine Learning (Chapter 2)

By data prone – R


(This article was first published on data prone – R, and kindly contributed to R-bloggers)

Abstract

Introduces my GitHub repo providing R code to accompany the book “Real-World Machine Learning.”

Introducing rwml-R

The book “Real-World Machine Learning” attempts to prepare the reader for the realities of machine learning. It covers a basic framework for machine-learning projects, then dives into extended examples that show how that framework can be applied in realistic situations. It attempts to provide the “hidden wisdom” on how to go about implementing products and solutions based on machine learning. The book is a relatively easy read and definitely worth the investment in time, but all of the supplied code is contained in IPython notebooks. I’m working through the book, reproducing all of the code listings and figures in R Markdown, and I’m posting the results in a GitHub repo: rwml-R.
If you find this project helpful, find any errors, or have any suggestions,
please leave a comment below or use the Tweet button.

Example: Mosaic Plot in Figure 2.12

To reproduce the mosaic plot in Figure 2.12 of the book, I use the vcd package, which contains a plethora of excellent tools for exploring categorical data. The mosaic plot below shows the relationship between passenger gender and survival in the supplied Titanic passengers dataset.
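
The full listing is in the rwml-R repo; for a flavor of it, here is a minimal sketch using R’s built-in Titanic contingency table rather than the book’s passenger-level file (an assumption made here so the example is self-contained):

library(vcd)

# Mosaic of survival by gender; shade = TRUE colors cells by their
# deviation from independence
mosaic(~ Sex + Survived, data = Titanic, shade = TRUE)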

Feedback welcome

I’d love to hear from you if you find this project helpful or if you
have any suggestions. Please leave a comment below or use the Tweet button.
Also, feel free to fork the rwml-R repo and
submit a pull request if you want to contribute.



R Course Finder update

By Onno Dijt


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A month ago we launched R course finder, an online directory that helps you to find the right R course quickly. With so many R courses available online, we thought it was a good idea to offer a tool that helps people to compare these courses, before they decide where to spend their valuable time and (sometimes) money.

If you haven’t looked at it yet, go to the R Course Finder now by clicking here.

With your help and input we have improved it over the past month. As of today it lists 93 courses from 12 different online platforms, plus 1 offline learning institute.

This month we added courses from:

  • Pluralsight
  • NYC Data Science Academy

The NYC Data Science Academy is our first addition of an offline course. If this is something you are interested in and want to see more of, please let us know!

8 Filters

The R Course Finder allows you to filter courses on:

  • Level
  • Content
  • Duration
  • Price
  • Learning tools
  • Institute
  • Platform
  • Online/offline

Clever search box

Perhaps even better is a clever search box, which shows results immediately while you type. Just give it a try!

But we want to keep going! If you miss a course or know of a different platform, let us know, so we can keep adding to the most complete directory of R courses available online.

How you can help to make R Course Finder better

  • If you miss an important filter or search functionality, please let us know in the comments below.
  • If you already took one of the courses, please let all of us know about your experiences in the review section; an example is available here.
  • If you know of a course that is not included yet, please post a reminder in the comments and we’ll add it.

And, last but not least: If you like R Course Finder, please share this announcement with friends and colleagues using the buttons below.
