How to use H2O with R on HDInsight

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

H2O.ai is an open-source AI platform that provides a number of machine-learning algorithms that run on the Spark distributed computing framework. Azure HDInsight is Microsoft’s fully-managed Apache Hadoop platform in the cloud, which makes it easy to spin up and manage Azure clusters of any size. It’s also easy to run H2O on HDInsight: the H2O AI Platform is available as an application on HDInsight, which pre-installs everything you need as the cluster is created.

You can also drive H2O from R, but the R packages don’t come auto-installed on HDInsight. To make this easy, the Azure HDInsight team has provided a couple of scripts that will install the necessary components on the cluster for you. These include RStudio (to provide an R IDE on the cluster) and the rsparkling package. With these components installed, from R you can:

  • Query data in Spark using the dplyr interface, and add new columns to existing data sets.
  • Convert data for training, validation, and testing to “H2O Frames” in preparation for modeling.
  • Apply any of the machine learning models provided by Sparkling Water to your data, using the distributed computing capabilities provided by the HDInsight platform (a small code sketch follows this list).
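
Here is what that workflow might look like end to end – a hedged sketch, where the connection master, the mtcars dataset and the GLM are illustrative assumptions rather than HDInsight specifics:

library(sparklyr)
library(rsparkling)
library(h2o)
library(dplyr)

# Connect to Spark; on HDInsight the master would typically be YARN,
# "local" is used here only to keep the sketch self-contained
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr on the Spark table: add a new column
mtcars_tbl <- mtcars_tbl %>% mutate(hp_per_wt = hp / wt)

# Convert the Spark DataFrame to an H2O Frame and fit an H2O model on it
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl)
fit <- h2o.glm(x = c("wt", "hp_per_wt"), y = "mpg", training_frame = mtcars_h2o)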

For details on how to install the R components on HDInsight, follow the link below.

Azure Data Lake & Azure HDInsight Blog: Run H2O.ai in R on Azure HDInsight

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Counterfactual estimation on nonstationary data, be careful!!!

By insightr

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

In a recent paper that can be downloaded here, Carvalho, Masini and Medeiros show that estimating counterfactuals in a non-stationary framework (when I say non-stationary, I mean integrated) is a tricky task. It is intuitive that the models will not work properly in the absence of cointegration (the spurious case), but what the authors show is that even with cointegration, the average treatment effect (ATE) converges to a non-standard distribution. As a result, standard tests on the ATE will identify treatment effects in several cases where there is no effect at all.

For those who are not familiar with counterfactual models: normally, these models have a treated unit (we want to know the effects of the treatment) and several untreated units that we call peers. These units may be cities, countries, companies, etc. Assuming that only one unit was treated and that there is no contamination, we can use information from the peers to project the behaviour of the treated unit as if it had not been treated, which is the counterfactual. The models should also be able to identify if the treatment had no effect on the treated unit.

Simulation tests

Here I am going to generate cointegrated data with no treatment to show the behaviour of the Synthetic Control and the ArCo (other techniques will have the same features). The data is very simple and is based on the example in the article mentioned above. We are going to generate 100 observations of 10 time-series: 9 random walks and one series, the treated unit, that is the sum of these random walks plus an error. However, I will not include any treatment, and I want the models to show that there was no treatment. The hypothesis to be tested is that we had some treatment in the treated unit at t = 51.

library(ArCo)
library(glmnet)
library(Synth)

T=100 # Number of Observations
n=10 # Number of Time Series
set.seed(1) # Seed for replication

# Generate random walk peers
peers=matrix(0,T,n-1)
for(i in 2:T){
  peers[i,]=peers[i-1,]+rnorm(n-1)
}

# Generate the treated unit
y1=rowSums(peers)+rnorm(T)

# Put all the TS together
Y=cbind(y1,peers)

# Plot all TS. The black line is the "treated unit"
matplot(Y,type="l")

plot of chunk unnamed-chunk-16

First, I am going to estimate the ArCo. The $delta element shows the average treatment effect with a 95% confidence interval. As you can see, the test indicated that the treatment was statistically significant (zero is very far from the confidence interval). The plot also shows a dislocated counterfactual (blue line).

# Estimate the ArCo in the non-stationary data
arco=fitArCo(list(Y),cv.glmnet,predict,treated.unity = 1,t0=51)
arco$delta
##                  LB     delta        UB
## Variable1 -6.157713 -5.236705 -4.315697
plot(arco)

plot of chunk unnamed-chunk-17

Now I am going to estimate the Synthetic Control. The Synth package does not have an implementation of a formal test. However, we can have a good idea from the counterfactual plot, which also indicates that the treatment was effective (even though there was no treatment).

# Estimate the Synthetic Control in the nonstationary data

# Put the data in the Synth package notation
sY=as.vector(Y)
timeid=rep(1:T,ncol(Y))
unitid=sort(rep(1:ncol(Y),T))
synthdata=data.frame(time=timeid,unit=unitid,Y=sY)
synthdataprep<-
  dataprep(
    foo = synthdata,
    predictors = c("Y"),
    predictors.op = "mean",
    dependent = "Y",
    unit.variable = "unit",
    time.variable = "time",
    treatment.identifier = 1,
    controls.identifier = c(2:10),
    time.predictors.prior = c(1:50),
    time.optimize.ssr = c(1:50),
    time.plot = 1:100
)
SC=synth(synthdataprep)
path.plot(SC,synthdataprep)
abline(v=50,col="blue",lty=2)

plot of chunk unnamed-chunk-18

As you can see, both the ArCo and the Synthetic Control wrongly rejected the null hypothesis of no treatment. The best way to solve the problem is to take the first difference of all non-stationary time-series. However, there is an important thing to point out here. Imagine the random walk:

$$y_t = y_{t-1} + \varepsilon_t$$

Suppose we have a treatment in $t=t_0$ that simply makes $y_{t_0} = c + y_{t_0-1} + \varepsilon_{t_0}$ only for $t=t_0$. For any $t \neq t_0$ we will have $\Delta y_t = \varepsilon_t$ and for $t=t_0$ we will have $\Delta y_{t_0} = c + \varepsilon_{t_0}$. Therefore, if we take the first difference of $y_t$ the treatment will have an impact only at $t=t_0$, which makes it impossible for any counterfactual model to identify an intervention. In this example, we would need an intervention that changed the drift of the random walk in a way that for all $t \geq t_0$ we have:

$$y_t = d + y_{t-1} + \varepsilon_t$$

This is better illustrated with an example. The plot below shows the difference between the two treatments. The red line is the case we cannot identify if we take the first difference. The blue line changes the drift of the random walk, and therefore we can identify it. With the tools I am using here, it is impossible to tell whether we are on the black line or on the red line.

y=rep(0,T)
yt1=yt2=y
d=1
c=10
for(i in 2:T){
 e=rnorm(1)
 y[i]=y[i-1]+e
 if(i==51){
   yt1[i]=c+yt1[i-1]+e
 }else{
   yt1[i]=yt1[i-1]+e
 }
 if(i>=51){
   yt2[i]=d+yt2[i-1]+e
 }else{
   yt2[i]=yt2[i-1]+e
 }
}
plot(yt2,type="l",col=4,ylab="")
lines(yt1,col=2)
lines(y,col=1)
legend("topleft",legend=c("untreated","treated by a constant in t0","treated in the drift"),col=c(1,2,4),bty="n",seg.len=0.5,cex=0.8,lty=1)

plot of chunk unnamed-chunk-19

Now let’s get back to the first example. The results clearly indicated an intervention where there was none. Now I will show what happens if we take the first difference.

# Take the first difference
dY=diff(Y,1)
# Estimate the ArCo
darco=fitArCo(list(dY),cv.glmnet,predict,treated.unity = 1,t0=50)
darco$delta
##                   LB      delta        UB
## Variable1 -0.6892162 0.02737802 0.7439722
plot(darco)

plot of chunk unnamed-chunk-20

# Adjust the data to the Synth and estimate the model
dsY=as.vector(dY)
timeid=rep(2:T,ncol(Y))
unitid=sort(rep(1:ncol(Y),T-1))
dsynthdata=data.frame(time=timeid,unit=unitid,Y=dsY)

dsynthdataprep<-
  dataprep(
    foo = dsynthdata,
    predictors = c("Y"),
    predictors.op = "mean",
    dependent = "Y",
    unit.variable = "unit",
    time.variable = "time",
    treatment.identifier = 1,
    controls.identifier = c(2:10),
    time.predictors.prior = c(2:50),
    time.optimize.ssr = c(2:50),
    time.plot = 2:100
)

dSC=synth(dsynthdataprep)
##
## X1, X0, Z1, Z0 all come directly from dataprep object.
##
##
## ****************
##  optimization over w weights: computing synthtic control unit
##
##
##
## ****************
## ****************
## ****************
##
## MSPE (LOSS V): 8.176593
##
## solution.v:
##  1
##
## solution.w:
##  2.9e-09 1.4e-08 3.7e-09 0.9999999 7.4e-09 2.9e-09 1.16e-08 6.9e-09 4e-09
path.plot(dSC,dsynthdataprep)
abline(v=50,col="blue",lty=2)

plot of chunk unnamed-chunk-21

As you can see, the $delta in the ArCo showed that the treatment effect was statistically zero, which is true. The Synthetic Control plot also showed that there was no difference caused by the intervention, even though the model adjustment was not very good.

Finally, note that we have two scenarios. If there is no intervention and we do not take the first difference, we will probably wrongly reject the null hypothesis of no intervention. However, if we are in the “red line” case of the plot and we take the first difference, we will fail to reject the null even when it is false, depending on the type of variable we are dealing with.

References

Carvalho, Carlos, Ricardo Masini, and Marcelo C. Medeiros. “The perils of counterfactual analysis with integrated processes.” (2016).

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

15 Jobs for R users (2017-07-31) – from all over the world

By Tal Galili

r_jobs

To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Freelance
    Data Scientists – PhD Paradise – Germany Data Science Talent – Posted by datasciencetalent
    Frankfurt am Main, Hessen, Germany
    22 Jul 2017
  2. Full-Time
    Technical Project Manager Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  3. Full-Time
    Software Developer Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  4. Full-Time
    Marketing Manager RStudio – Posted by bill@rstudio.com
    Anywhere
    14 Jul 2017
  5. Full-Time
    Solutions Engineer RStudio – Posted by nwstephens
    Anywhere
    14 Jul 2017
  6. Full-Time
    Lead Data Scientist Golden Rat Studios – Posted by goldenrat
    Los Angeles, California, United States
    13 Jul 2017
  7. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    12 Jul 2017
  8. Full-Time
    Data analyst – infectious disease mapping John Drake (University of Georgia) – Posted by John Drake
    Athens, Georgia, United States
    16 Jun 2017

More New Jobs

  1. Full-Time
    Financial Analyst/Modeler @ Mesa, Arizona, U.S. MD Helicopters – Posted by swhalen
    Mesa, Arizona, United States
    31 Jul 2017
  2. Full-Time
    Research volunteer in Cardiac Surgery @ Philadelphia, Pennsylvania, U.S. Thomas Jefferson University – Posted by CVSurgery
    Philadelphia, Pennsylvania, United States
    31 Jul 2017
  3. Freelance
    Fit GLM distribution models to data using R nandini arora
    Anywhere
    31 Jul 2017
  4. Freelance
    Financial services startup looking for freelance Shiny/R/UI developer incuhed
    Anywhere
    29 Jul 2017
  5. Freelance
    Data Scientists – PhD Paradise – Germany Data Science Talent – Posted by datasciencetalent
    Frankfurt am Main, Hessen, Germany
    22 Jul 2017
  6. Full-Time
    Technical Project Manager Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  7. Full-Time
    Software Developer Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    15 Jul 2017
  8. Full-Time
    Software Developer (with R experience) @ Arlington, Virginia, U.S. Virginia Tech Applied Research Corporation – Posted by miller703
    Arlington, Virginia, United States
    14 Jul 2017
  9. Full-Time
    Marketing Manager RStudio – Posted by bill@rstudio.com
    Anywhere
    14 Jul 2017
  10. Full-Time
    Solutions Engineer RStudio – Posted by nwstephens
    Anywhere
    14 Jul 2017
  11. Full-Time
    Lead Data Scientist Golden Rat Studios – Posted by goldenrat
    Los Angeles, California, United States
    13 Jul 2017
  12. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    12 Jul 2017
  13. Freelance
    R/Shiny and JavaScript Developer Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB) – Posted by vbremerich
    Anywhere
    10 Jul 2017
  14. Part-Time
    Data Science Bootcamp Mentor Thinkful – Posted by baronowicz
    Anywhere
    9 Jul 2017
  15. Part-Time
    Call for Help: R/Shiny Developer Fantasy Football Analytics – Posted by val.pinskiy
    Anywhere
    9 Jul 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).

Source:: R News

Machine Learning Explained: Dimensionality Reduction

By EnhanceDataScience

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

Dealing with a lot of dimensions can be painful for machine learning algorithms. High dimensionality will increase the computational complexity, increase the risk of overfitting (as your algorithm has more degrees of freedom) and increase the sparsity of the data. Hence, dimensionality reduction projects the data into a space with fewer dimensions to limit these phenomena.
In this post, we will first try to get an intuition of what dimensionality reduction is, then we will focus on the most widely used techniques.

Spaces and variables with many dimensions

Let’s say we own a shop and want to collect some data on our clients. We can collect their ages, how frequently they come to our shop, how much they spend on average and when they last came to our shop. Hence each of our clients can be represented by four variables (age, frequency, spending and last purchase date) and can be seen as a point in a four-dimensional space. From here on, variables and dimensions will have the same meaning.

Now, let’s add some complexity and some variables by using images. How many dimensions are used to represent the image below?

The image is 8 by 8 pixels, hence 64 pixels, and each of these pixels is represented by a quantitative variable: 64 variables are needed!
Even bigger, how many variables do you need to represent the picture below?

ExempleImageDimension

Its resolution is 640 by 426 and you need three color channels (Red, Green, and Blue). So this picture requires no fewer than … 640*426*3 = 817,920 variables. That’s how you go from four to almost one million dimensions (and you can go well above!)

Dimensionality reduction in everyday life

Before looking at any algorithm, everyday life provides us with a great example of dimensionality reduction.

ShadowProjection

Each of these people can be represented as a point in a 3-dimensional space. As a gross approximation, each person fits in a 50*50*200 (cm) box. With a resolution of 1 cm and three color channels, each person would be represented by 50*50*200*3 = 1,500,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200 = 10,000 variables.
The number of variables was divided by 150! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.

Why should you reduce the number of dimensions?

Dimensionality reduction has several advantages from a machine learning point of view.

  • Since your model has fewer degrees of freedom, the likelihood of overfitting is lower. The model will generalize more easily on new data.
  • If you use feature selection or linear methods (such as PCA), the reduction will promote the most important variables which will improve the interpretability of your model.
  • Most feature extraction techniques are unsupervised. You can train your autoencoder or fit your PCA on unlabeled data. This can be really helpful if you have a lot of unlabeled data (for instance text or images) and labeling is time-consuming and expensive.

Features selection as a basic reduction

The most obvious way to reduce dimensionality is to remove some dimensions and to select the most suitable variables for the problem.

Here are some ways to select variables (a short R sketch of two of these routes follows the list):

  • Greedy algorithms which add and remove variables until some criterion is met. For instance, stepwise regression with forward selection will, at each step, add the variable which improves the fit most significantly.
  • Shrinking and penalization methods, which add a cost for having too many variables. For instance, L1 regularization will cut some variables’ coefficients to zero. Regularization limits the space in which the coefficients can live.
  • Selection of models based on criteria that take the number of dimensions into account, such as the adjusted R², AIC or BIC. Contrary to regularization, the model is not trained to optimize these criteria; the criteria are computed after fitting to compare several models.
  • Filtering of variables using correlation, VIF or some “distance measure” between the features.
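
To make the first two routes concrete, here is a small sketch on the built-in mtcars data (neither method is being recommended over the other):

# Greedy/stepwise selection driven by AIC
data(mtcars)
full <- lm(mpg ~ ., data = mtcars)
step_fit <- step(full, direction = "both", trace = 0)

# L1 (lasso) regularization with glmnet: some coefficients are shrunk to exactly zero
library(glmnet)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
lasso <- cv.glmnet(x, y, alpha = 1)
coef(lasso, s = "lambda.min")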

Features Engineering and extraction to reduce dimensionality

While feature selection is efficient, it is blunt: variables are either kept or removed. However, a removed variable may be useful over some interval or for some modalities, while a kept variable may be partly redundant. Feature extraction (or engineering) seeks to keep only the intervals or modalities of the features which contain information.

PCA and Kernel PCA

Principal component analysis (or PCA) is a linear transformation of the data which looks for the axes along which the data have the most variance. PCA will create new variables which are linear combinations of the original ones; these new variables are orthogonal (i.e. their correlations equal zero). PCA can be seen as a rotation of the initial space to find more suitable axes to express the variability in the data.
Once the new variables have been created, you can select the most important ones. The threshold is up to you and depends on how much variance you want to keep. You can check the tutorial below to see a working R example:

R Basics: PCA with R
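
For a quick taste, a minimal base-R sketch on the iris measurements (keeping two components is an arbitrary cut-off, not a rule):

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)             # proportion of variance explained by each component
reduced <- pca$x[, 1:2]  # keep only the first two principal components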

Since PCA is linear, it mostly works on linearly separable data. Hence, if you want to perform classification on other data (Like the donut below), linear PCA will probably make you lose a lot of information.

Example of Kernel PCA from Scikit-Learn

Exemple Donut Scikit Learn

On the other hand, kernel PCA can work on nonlinearly separable data. The trick is simple:

  • Before applying PCA, a function is applied to project the initial space into a bigger space where the data are separable. The function can be a polynomial, radial or some activation function.
  • Then a PCA is applied in this new space. (NB: It is not exactly PCA but close to it).

Hence you can easily separate the two rings of the donut, which will improve the performance of classifiers. Kernel PCA raises two problems: you need to find the right kernel, and the new variables are hard for humans to interpret.
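
Here is a hedged sketch with the kernlab package (the radial-basis kernel and its sigma value are illustrative choices, not tuned settings):

library(kernlab)
kp <- kpca(~ ., data = iris[, 1:4], kernel = "rbfdot",
           kpar = list(sigma = 0.1), features = 2)
head(rotated(kp))  # the data projected onto the first two kernel components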

LDA and Kernel LDA

Linear discriminant analysis is similar to PCA but is supervised (while PCA does not require labels). The goal of the LDA is not to maximize variance but to create linear surfaces to separate the different groups. The new features will be axes that separate the data when the data are projected on them.

As with PCA, you can also apply the kernel trick to apply LDA to non-linearly separable data.
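
A short sketch of plain (non-kernel) LDA with the MASS package; note that, unlike PCA, the class labels are required:

library(MASS)
lda_fit <- lda(Species ~ ., data = iris)
head(predict(lda_fit)$x)  # the discriminant axes LD1 and LD2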

ICA and Kernel ICA

Independent component analysis aims at creating independent variables from the original variables.
The typical example is the cocktail party problem: you are in a room with lots of different people having different conversations and your goal is to separate the different conversations. If you have a lot of microphones in the room, each of them will record a linear combination of all the conversations (some kind of noise). The goal of ICA is to disentangle these conversations from the noise.
This can be seen as dimensionality reduction: if you have 200 microphones but only ten independent conversations, you should be able to represent them with ten independent variables from the ICA.
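
A toy sketch with the fastICA package (the two source signals and the mixing matrix below are made up purely for illustration):

library(fastICA)
set.seed(1)
S <- cbind(sin((1:500) / 10), runif(500))  # two independent source signals
A <- matrix(c(1, 1, -1, 2), 2, 2)          # arbitrary mixing matrix
X <- S %*% A                               # what the "microphones" record
ica <- fastICA(X, n.comp = 2)
head(ica$S)                                # estimated independent components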

Autoencoder

The autoencoder is a powerful method to reduce the dimensionality of data. It is composed of a neural network (it can be feed-forward, convolutional or recurrent; most architectures can be adapted into an autoencoder) which tries to learn to reproduce its input. For instance, an autoencoder trained on images will try to reconstruct these images.

In addition to this, the autoencoder has a bottleneck: the number of neurons in the autoencoder’s hidden layers is smaller than the number of input variables. Hence, the autoencoder has to learn a compressed representation of the data where the number of dimensions is the number of neurons in the smallest hidden layer.

The main advantages of autoencoders are their non-linearity and their flexibility.
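
A hedged sketch with the keras R package (the layer sizes, epochs and the use of iris are arbitrary illustration choices, not a recommended architecture):

library(keras)
x <- as.matrix(scale(iris[, 1:4]))
autoenc <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = 4) %>%
  layer_dense(units = 2, name = "bottleneck") %>%   # the compressed representation
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 4)                            # reconstruct the 4 inputs
autoenc %>% compile(loss = "mse", optimizer = "adam")
autoenc %>% fit(x, x, epochs = 50, verbose = 0)     # learn to reproduce the input

# extract the 2-dimensional codes from the bottleneck layer
encoder <- keras_model(inputs = autoenc$input,
                       outputs = get_layer(autoenc, "bottleneck")$output)
reduced <- predict(encoder, x)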

T-SNE

T-SNE, or t-distributed stochastic neighbor embedding, is mainly used to embed high-dimensional data in a 2D or 3D space. That is the technique we used to fit and visualize the Twitter data in our analysis of the French election. The main idea behind T-SNE is that nearby points in the original space should be close in the low-dimensional space, while distant points should also be distant in the smaller space.
T-SNE is highly non-linear, originally non-parametric, dependent on the random seed, and does not preserve distances. Hence, I mainly use it to plot high-dimensional data and to visualise clusters and similarities.

NB: There is also a parametric variant which seems less widely used than the original T-SNE.
https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf
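
A minimal sketch with the Rtsne package (the perplexity value is just a common default, not a recommendation):

library(Rtsne)
keep <- !duplicated(iris[, 1:4])          # Rtsne refuses duplicate rows
tsne <- Rtsne(as.matrix(iris[keep, 1:4]), dims = 2, perplexity = 30)
plot(tsne$Y, col = iris$Species[keep], xlab = "", ylab = "")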

Here ends our presentation of the most widely used dimensionality reduction techniques.

The post Machine Learning Explained: Dimensionality Reduction appeared first on Enhance Data Science.

To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Google Vision API in R – RoogleVision

By Scott Stoltzman

plot of chunk unnamed-chunk-2

(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

Using the Google Vision API in R

Utilizing RoogleVision

After doing my post last month on OpenCV and face detection, I started looking into other algorithms used for pattern detection in images. As it turns out, Google has done a phenomenal job with their Vision API. The amount of information it can spit back to you from simply sending it a picture is absolutely incredible.

Also, it’s 100% free! I believe that includes 1000 images per month. Amazing!

In this post I’m going to walk you through the absolute basics of accessing the power of the Google Vision API using the RoogleVision package in R.

As always, we’ll start off loading some libraries. I wrote some extra notation around where you can install them within the code.

# Normal Libraries
library(tidyverse)

# devtools::install_github("flovv/RoogleVision")
library(RoogleVision)
library(jsonlite) # to import credentials

# For image processing
# source("http://bioconductor.org/biocLite.R")
# biocLite("EBImage")
library(EBImage)

# For Latitude Longitude Map
library(leaflet)

Google Authentication

In order to use the API, you have to authenticate. There is plenty of documentation out there about how to setup an account, create a project, download credentials, etc. Head over to Google Cloud Console if you don’t have an account already.

# Credentials file I downloaded from the cloud console
creds = fromJSON('credentials.json')

# Google Authentication - Use Your Credentials
# options("googleAuthR.client_id" = "xxx.apps.googleusercontent.com")
# options("googleAuthR.client_secret" = "")

options("googleAuthR.client_id" = creds$installed$client_id)
options("googleAuthR.client_secret" = creds$installed$client_secret)
options("googleAuthR.scopes.selected" = c("https://www.googleapis.com/auth/cloud-platform"))
googleAuthR::gar_auth()

Now You’re Ready to Go

The function getGoogleVisionResponse takes three arguments:

  1. imagePath
  2. feature
  3. numResults

Numbers 1 and 3 are self-explanatory; “feature” has 5 options:

  • LABEL_DETECTION
  • LANDMARK_DETECTION
  • FACE_DETECTION
  • LOGO_DETECTION
  • TEXT_DETECTION

These are self-explanatory but it’s nice to see each one in action.

As a side note: there are also other features that the API has which aren’t included (yet) in the RoogleVision package, such as “Safe Search”, which identifies inappropriate content, and “Properties”, which identifies dominant colors and aspect ratios. A few others can be found at the Cloud Vision website.


Label Detection

This is used to help determine content within the photo. It can basically add a level of metadata around the image.

Here is a photo of our dog when we hiked up to Audubon Peak in Colorado:

dog_mountain_label = getGoogleVisionResponse('dog_mountain.jpg',
                                              feature = 'LABEL_DETECTION')
head(dog_mountain_label)
##            mid           description     score
## 1     /m/09d_r              mountain 0.9188690
## 2 /g/11jxkqbpp mountainous landforms 0.9009549
## 3    /m/023bbt            wilderness 0.8733696
## 4     /m/0kpmf             dog breed 0.8398435
## 5    /m/0d4djn            dog hiking 0.8352048

All 5 responses were incredibly accurate! The “score” that is returned is how confident the Google Vision algorithms are, so there’s a 91.9% chance a mountain is prominent in this photo. I like “dog hiking” the best – considering that’s what we were doing at the time. Kind of a little bit too accurate…


Landmark Detection

This is a feature designed to specifically pick out a recognizable landmark! It provides the position in the image along with the geolocation of the landmark (in longitude and latitude).

My wife and I took this selfie at Linderhof Castle in Bavaria, Germany.

us_castle 

plot of chunk unnamed-chunk-4

The response from the Google Vision API was spot on. It returned “Linderhof Palace” as the description. It also provided a score (I reduced the resolution of the image which hurt the score), a boundingPoly field and locations.

  • Bounding Poly – gives x,y coordinates for a polygon around the landmark in the image
  • Locations – provides longitude,latitude coordinates
us_landmark = getGoogleVisionResponse('us_castle_2.jpg',
                                      feature = 'LANDMARK_DETECTION')
head(us_landmark)
##         mid      description     score
## 1 /m/066h19 Linderhof Palace 0.4665011
##                               vertices          locations
## 1 25, 382, 382, 25, 178, 178, 659, 659 47.57127, 10.96072

I plotted the polygon over the image using the coordinates returned. It does a great job (certainly not perfect) of getting the castle identified. It’s a bit tough to say what the actual “landmark” would be in this case, due to the fact that the fountains, stairs and grounds are certainly important and are a key part of the castle.
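
One way such an overlay can be drawn (a sketch assuming the EBImage package loaded earlier and the vertex coordinates from the boundingPoly output above; the colour and line width are arbitrary):

img <- readImage('us_castle_2.jpg')
display(img, method = "raster")  # draw the photo into the current plot device
xs <- c(25, 382, 382, 25)        # x coordinates from the boundingPoly vertices
ys <- c(178, 178, 659, 659)      # y coordinates from the boundingPoly vertices
polygon(xs, ys, border = "green", lwd = 3)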

us_castle 

plot of chunk unnamed-chunk-6

Turning to the locations – I plotted this using the leaflet library. If you haven’t used leaflet, start doing so immediately. I’m a huge fan of it due to speed and simplicity. There are a lot of customization options available as well that you can check out.

The location was spot on! While it isn’t a shock to me that Google could provide the location of “Linderhof Castle” – it is amazing to me that I don’t have to write a web crawler search function to find it myself! That’s just one of many little luxuries they have built into this API.

latt = us_landmark$locations[[1]][[1]][[1]]
lon = us_landmark$locations[[1]][[1]][[2]]
m = leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  setView(lng = lon, lat = latt, zoom = 5) %>%
  addMarkers(lng = lon, lat = latt)
m


Face Detection

My last blog post showed the OpenCV package utilizing the haar cascade algorithm in action. I didn’t dig into Google’s algorithms to figure out what is under the hood, but it provides similar results. However, rather than layering in each subsequent “find the eyes” and “find the mouth” and …etc… it returns more than you ever needed to know.

  • Bounding Poly = highest level polygon
  • FD Bounding Poly = polygon surrounding each face
  • Landmarks = (funny name) includes each feature of the face (left eye, right eye, etc.)
  • Roll Angle, Pan Angle, Tilt Angle = all of the different angles you’d need per face
  • Confidence (detection and landmarking) = how certain the algorithm is that it’s accurate
  • Joy, sorrow, anger, surprise, under exposed, blurred, headwear likelihoods = how likely it is that each face contains that emotion or characteristic

The likelihoods are another amazing piece of information returned! I have run about 20 images through this API and every single one has been accurate – very impressive!

I wanted to showcase the face detection and headwear first. Here’s a picture of my wife and I at “The Bean” in Chicago (side note: it’s awesome! I thought it was going to be really silly, but you can really have a lot of fun with all of the angles and reflections):

us_hats_pic 

plot of chunk unnamed-chunk-8

us_hats = getGoogleVisionResponse('us_hats.jpg',
                                      feature = 'FACE_DETECTION')
head(us_hats)
##                                 vertices
## 1 295, 410, 410, 295, 164, 164, 297, 297
## 2 353, 455, 455, 353, 261, 261, 381, 381
##                                 vertices
## 1 327, 402, 402, 327, 206, 206, 280, 280
## 2 368, 439, 439, 368, 298, 298, 370, 370
##
## landmarks ...
##   rollAngle panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1  7.103324 23.46835 -2.816312           0.9877176             0.7072066
## 2  2.510939 -1.17956 -7.393063           0.9997375             0.7268016
##   joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
## 1   VERY_LIKELY    VERY_UNLIKELY   VERY_UNLIKELY      VERY_UNLIKELY
## 2   VERY_LIKELY    VERY_UNLIKELY   VERY_UNLIKELY      VERY_UNLIKELY
##   underExposedLikelihood blurredLikelihood headwearLikelihood
## 1          VERY_UNLIKELY     VERY_UNLIKELY        VERY_LIKELY
## 2          VERY_UNLIKELY     VERY_UNLIKELY        VERY_LIKELY
us_hats_pic 

plot of chunk unnamed-chunk-10

Here’s a shot that should be familiar (copied directly from my last blog) – and I wanted to highlight the different features that can be detected. Look at how many points are perfectly placed:

my_face_pic 

plot of chunk unnamed-chunk-11

my_face = getGoogleVisionResponse('my_face.jpg',
                                      feature = 'FACE_DETECTION')
head(my_face)
##                               vertices
## 1 456, 877, 877, 456, NA, NA, 473, 473
##                               vertices
## 1 515, 813, 813, 515, 98, 98, 395, 395
## landmarks ...
##    rollAngle  panAngle tiltAngle detectionConfidence landmarkingConfidence
## 1 -0.6375801 -2.120439  5.706552            0.996818             0.8222974
##   joyLikelihood sorrowLikelihood angerLikelihood surpriseLikelihood
## 1   VERY_LIKELY    VERY_UNLIKELY   VERY_UNLIKELY      VERY_UNLIKELY
##   underExposedLikelihood blurredLikelihood headwearLikelihood
## 1          VERY_UNLIKELY     VERY_UNLIKELY      VERY_UNLIKELY
head(my_face$landmarks)
## [[1]]
##                            type position.x position.y    position.z
## 1                      LEFT_EYE   598.7636   192.1949  -0.001859295
## 2                     RIGHT_EYE   723.1612   192.4955  -4.805475700
## 3          LEFT_OF_LEFT_EYEBROW   556.1954   165.2836  15.825399000
## 4         RIGHT_OF_LEFT_EYEBROW   628.8224   159.9029 -23.345352000
## 5         LEFT_OF_RIGHT_EYEBROW   693.0257   160.6680 -25.614508000
## 6        RIGHT_OF_RIGHT_EYEBROW   767.7514   164.2806   7.637372000
## 7         MIDPOINT_BETWEEN_EYES   661.2344   185.0575 -29.068363000
## 8                      NOSE_TIP   661.9072   260.9006 -74.153710000
...
my_face_pic 

plot of chunk unnamed-chunk-14


Logo Detection

To continue along the Chicago trip, we drove by Wrigley Field and I took a really bad photo of the sign from a moving car while it was under construction. It’s a nice test image because it has a lot of different lines and writing, and the Toyota logo isn’t incredibly prominent or necessarily true to brand colors.

This call returns:

  • Description = Brand name of the logo detected
  • Score = Confidence of prediction accuracy
  • Bounding Poly = (Again) coordinates of the logo
wrigley_image 

plot of chunk unnamed-chunk-15

wrigley_logo = getGoogleVisionResponse('wrigley_text.jpg',
                                   feature = 'LOGO_DETECTION')
head(wrigley_logo)
##           mid description     score                               vertices
## 1 /g/1tk6469q      Toyota 0.3126611 435, 551, 551, 435, 449, 449, 476, 476
wrigley_image 

plot of chunk unnamed-chunk-17


Text Detection

I’ll continue using the Wrigley Field picture. There is text all over the place and it’s fun to see what is captured and what isn’t. It appears as if the curved text at the top “field” isn’t easily interpreted as text. However, the rest is caught and the words are captured.

The response sent back is a bit more difficult to interpret than the rest of the API calls – it breaks things apart by word but also returns everything as one line. Here’s what comes back:

  • Locale = language, returned as source
  • Description = the text (the first line is everything, and then the rest are individual words)
  • Bounding Poly = I’m sure you can guess by now
wrigley_text = getGoogleVisionResponse('wrigley_text.jpg',
                                   feature = 'TEXT_DETECTION')
head(wrigley_text)
##   locale
## 1     en

##                                                                                                        description
## 1 RIGLEY F\nICHICAGO CUBS\nORDER ONLINE AT GIORDANOS.COM\nTOYOTA\nMIDWEST\nFENCE\n773-722-6616\nCAUTION\nCAUTION\n
                                                                                             ORDER
##                                 vertices
## 1   55, 657, 657, 55, 210, 210, 852, 852
## 2 343, 482, 484, 345, 217, 211, 260, 266

wrigley_image 

plot of chunk unnamed-chunk-19


That’s about it for the basics of using the Google Vision API with the RoogleVision library. I highly recommend tinkering around with it a bit, especially because it won’t cost you a dime.

While I do enjoy the math under the hood and the thinking required to understand algorithms, I do think these sorts of APIs will become the way of the future for data science. Outside of specific use cases or special industries, it seems hard to imagine wanting to try and create algorithms that would be better than ones created for mass consumption. As long as they’re fast, free and accurate, I’m all about making my life easier! From the hiring perspective, I much prefer someone who can get the job done over someone who can slightly improve performance (as always, there are many cases where this doesn’t apply).

Please comment if you are utilizing any of the Google APIs for business purposes – I would love to hear about it!

As always you can find this on my GitHub

To leave a comment for the author, please follow the link and comment on their blog: R-Projects – Stoltzmaniac.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

sparklyr 0.6

By Javier Luraschi

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce a new release of the sparklyr package, available in CRAN today! sparklyr 0.6 introduces new features to:

  • Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.
  • Connect to External Data Sources using spark_read_source(), spark_write_source(), spark_read_jdbc() and spark_write_jdbc().
  • Use the Latest Frameworks including dplyr 0.7, DBI 0.7, RStudio 1.1 and Spark 2.2.

and several improvements across:

  • Spark Connections add a new Databricks connection that enables using sparklyr in Databricks through mode="databricks", add support for Yarn Cluster through master="yarn-cluster" and connection speed was also improved.
  • Dataframes add support for sdf_pivot(), sdf_broadcast(), cbind(), rbind(), sdf_separate_column(), sdf_bind_cols(), sdf_bind_rows(), sdf_repartition() and sdf_num_partitions().
  • Machine Learning adds support for multinomial regression in ml_logistic_regression(), weights.column for GLM, ml_model_data() and a new ft_count_vectorizer() function for ml_lda().
  • Many other improvements: initial support for broom over ml_linear_regression() and ml_generalized_linear_regression(); dplyr support for %like%, %rlike% and %regexp%; sparklyr extensions now support download_scalac() to help you install the required Scala compilers while developing extensions; Hive database management was simplified with tbl_change_db() and src_databases() to query and switch between Hive databases; and RStudio started a joint effort with Microsoft to support a cross-platform Spark installer under github.com/rstudio/spark-install.

Additional changes and improvements can be found in the sparklyr NEWS file.

Updated documentation and examples are available at spark.rstudio.com. For questions or feedback, please feel free to open a sparklyr github issue or a sparklyr stackoverflow question.

Distributed R

sparklyr 0.6 provides support for executing distributed R code through spark_apply(). For instance, after connecting and copying some data:

library(sparklyr)
sc <- spark_connect(master = "local")  # a local master is assumed for this sketch
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

We can apply an arbitrary R function, say jitter(), to each column over each row as follows:

iris_tbl %>% spark_apply(function(e) sapply(e[,1:4], jitter))
## # Source:   table [?? x 4]
## # Database: spark_connection
##    Sepal_Length Sepal_Width Petal_Length Petal_Width
##           
##  1     5.102223    3.507372     1.406654   0.1990680
##  2     4.900148    3.002006     1.396052   0.2002922
##  3     4.699807    3.204126     1.282652   0.2023850
##  4     4.618854    3.084675     1.508538   0.2119644
##  5     4.985322    3.596079     1.388837   0.1846146
##  6     5.381947    3.881051     1.686600   0.3896673
##  7     4.613713    3.400265     1.404120   0.2954829
##  8     4.995116    3.408897     1.493193   0.1945901
##  9     4.418538    2.916306     1.392230   0.1976161
## 10     4.891340    3.096591     1.498078   0.1174069
## # ... with 140 more rows

One can also group by columns to apply an operation over each group of rows, say, to perform linear regression over each group as follows:

spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Width ~ Petal_Length, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species"
)
## # Source:   table [?? x 6]
## # Database: spark_connection
##      Species         term    estimate  std.error  statistic      p.value
##        
## 1 versicolor  (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
## 2 versicolor Petal_Length  0.33105360 0.03750041  8.8279995 1.271916e-11
## 3  virginica  (Intercept)  1.13603130 0.37936622  2.9945505 4.336312e-03
## 4  virginica Petal_Length  0.16029696 0.06800119  2.3572668 2.253577e-02
## 5     setosa  (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
## 6     setosa Petal_Length  0.20124509 0.08263253  2.4354220 1.863892e-02

Packages can be used since they are automatically distributed to the worker nodes; however, using spark_apply() requires R to be installed over each worker node. Please refer to Distributing R Computations for additional information and examples.

External Data Sources

sparklyr 0.6 adds support for connecting Spark to databases. This feature is useful if your Spark environment is separate from your data environment, or if you use Spark to access multiple data sources. You can use spark_read_source() and spark_write_source() with any data connector available in Spark Packages. Alternatively, you can use spark_read_jdbc() and spark_write_jdbc() and a JDBC driver with almost any data source.

For example, you can connect to Cassandra using spark_read_source(). Notice that the Cassandra connector version needs to match the Spark version as defined in their version compatibility section.

config 

To connect to MySQL, one can download the MySQL connector and use spark_read_jdbc() as follows:

config ",
  dbtable = "person"))

See also crassy, a sparklyr extension being developed to read data from Cassandra with ease.

Dataframe Functions

sparklyr 0.6 includes many improvements for working with DataFrames. Here are some additional highlights.

# two small example tables copied to Spark (values match the outputs below)
x_tbl <- sdf_copy_to(sc, data.frame(a = 1:3, b = 2:4), overwrite = TRUE)
y_tbl <- sdf_copy_to(sc, data.frame(b = 3:5, c = LETTERS[1:3]), overwrite = TRUE)

Pivoting DataFrames

It is now possible to pivot (i.e. cross tabulate) one or more columns using sdf_pivot().

sdf_pivot(y_tbl, b ~ c, list(b = "count"))
## # Source:   table [?? x 4]
## # Database: spark_connection
##       b     A     B     C
##   
## 1     4   NaN     1   NaN
## 2     3     1   NaN   NaN
## 3     5   NaN   NaN     1

Binding Rows and Columns

Binding DataFrames by rows and columns is supported through sdf_bind_rows() and sdf_bind_cols():

sdf_bind_rows(x_tbl, y_tbl)
## # Source:   table [?? x 3]
## # Database: spark_connection
##       a     b     c
##   
## 1     1     2  
## 2     2     3  
## 3     3     4  
## 4   NaN     3     A
## 5   NaN     4     B
## 6   NaN     5     C
sdf_bind_cols(x_tbl, y_tbl)
## # Source:   lazy query [?? x 4]
## # Database: spark_connection
##       a   b.x   b.y     c
##   
## 1     1     2     3     A
## 2     3     4     5     C
## 3     2     3     4     B

Separating Columns

Separate lists into columns with ease. This is especially useful when working with model predictions that are returned as lists instead of scalars. In this example, each element in the probability column contains two items. We can use sdf_separate_column() to isolate the item that corresponds to the probability that vs equals one.

library(magrittr)

mtcars[, c("vs", "mpg")] %>%
  sdf_copy_to(sc, .) %>% 
  ml_logistic_regression("vs", "mpg") %>%
  sdf_predict() %>%
  sdf_separate_column("probability", list("P[vs=1]" = 2))
## # Source:   table [?? x 7]
## # Database: spark_connection
##       vs   mpg id58fb64e07a38 rawPrediction probability prediction
##    
##  1     0  21.0              0               1
##  2     0  21.0              1               1
##  3     1  22.8              2               1
##  4     1  21.4              3               1
##  5     0  18.7              4               0
##  6     1  18.1              5               0
##  7     0  14.3              6               0
##  8     1  24.4              7               1
##  9     1  22.8              8               1
## 10     1  19.2              9               0
## # ... with 22 more rows, and 1 more variables: `P[vs=1]` 

Machine Learning

Multinomial Regression

sparklyr 0.6 adds support for multinomial regression for Spark 2.1.0 or higher:

iris_tbl %>%
  ml_logistic_regression("Species", features = c("Sepal_Length", "Sepal_Width"))
## Call: Species ~ Sepal_Length + Sepal_Width
## 
## Coefficients:
##      (Intercept) Sepal_Length Sepal_Width
## [1,]   -201.5540     73.19269   -59.83942
## [2,]   -214.6001     75.09506   -59.43476
## [3,]    416.1541   -148.28775   119.27418

Improved Text Mining with LDA

ft_tokenizer() was introduced in sparklyr 0.5 but sparklyr 0.6 introduces ft_count_vectorizer() and vocabulary.only to simplify LDA:

library(janeaustenr)
# lines_tbl: a Spark table holding the first lines of austen_books()
lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
## An LDA model fit on 1 features
## 
## Topics Matrix:
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.8996952 0.8569472 1.1249431 0.9366354
## [2,] 0.9815787 1.1721218 1.0795771 0.9990090
## [3,] 1.1738678 0.8600233 0.9864862 0.9573433
## [4,] 0.8951603 1.0730703 0.9562389 0.8899160
## [5,] 1.0528748 1.0291708 1.0699833 0.8731401
## [6,] 1.1857015 1.0441299 1.0987713 1.1247574
## 
## Estimated Document Concentration:
## [1] 13.52071 13.52172 13.52060 13.51963

The vocabulary can be printed with:

lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features", vocabulary.only = TRUE)
## [1] "jane"        "sense"       "austen"      "by"          "sensibility"
## [6] "and"

That’s all for now, disconnecting:

spark_disconnect(sc)
To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Data visualization with googleVis exercises part 9

By Euthymios Kasvikis

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Histogram & Calendar chart

This is part 9 of our series and we are going to explore the features of two interesting types of charts that googleVis provides: histogram and calendar charts.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

To run this example we will first create an experimental data frame with:
Hist=data.frame(A=rpois(100, 10),
B=rpois(100, 20),
C=rpois(100, 30))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.

Histogram

It is quite simple to create a Histogram with googleVis. We will use the “Hist” data frame we just created. You can see the variables of your data frame with head().
Look at the example below to create a simple histogram:
HistC <- gvisHistogram(Hist)
plot(HistC)

Exercise 1

Create a list named “HistC” and pass to it the “Hist” data frame as a histogram. HINT: Use gvisHistogram().

Exercise 2

Plot the histogram. HINT: Use plot().

Options

To add a legend to your chart you can use:
options=list(
legend="{ position: 'top' }")

Exercise 3

Add a legend to the bottom of your histogram and plot it. HINT: Use list().

To decide the colours of your bars you can use:
options=list(
colors="['black', 'green', 'yellow']")

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 4

Change the colours of the histogram’s bars to red, green and blue and plot it. HINT: Use colors.

To set the dimensions of your histogram you can use:
options=list(
width=400, height=400)

Exercise 5

Set width of your histogram to 500, its height to 400 and plot it.

Calendar chart

It is quite simple to create a Calendar Chart with googleVis. We will use the “Cairo” data set. You can see the variables of “Cairo” with head().
Look at the example below to create a simple calendar chart:
CalC <- gvisCalendar(Cairo)
plot(CalC)

Exercise 6

Create a list named “CalC” and pass to it the “Cairo” data set as a calendar chart. HINT: Use gvisCalendar().

Exercise 7

Plot the calendar chart. HINT: Use plot().

Options

You can add title to your chart and set the dimensions with:
options=list(
title="Title",
height=400)

Exercise 8

Add a title to your calendar chart, set height to 500 and plot it. HINT: Use list().

You can change the features of your labels with:
options=list(calendar="{yearLabel: { fontName: 'Times-Roman',
fontSize: 26, color: 'black', bold: false}}")

Exercise 9

Add labels to your chart ,set the font of your labels to “Times-Roman”, their size to 30, their color to black, make them bold and plot the chart.

To find more options about the cells you can use:
cellSize: 15,
cellColor: { stroke: 'red', strokeOpacity: 0.5 },
focusedCellColor: {stroke:'red'}

Exercise 10

Set the size of the cells to 10, the focused color to green and plot the chart.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Matching, Optimal Transport and Statistical Tests

By arthur charpentier

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

To explain the “optimal transport” problem, we usually start with Gaspard Monge’s “Mémoire sur la théorie des déblais et des remblais”, where he considered the problem of transporting a given distribution of matter (a pile of sand, for instance) into another one (an excavation, for instance). This problem is usually formulated using distributions, and we seek the “optimal” transport from one distribution to the other. The formulation in the context of distributions was given in the 1940s by Leonid Kantorovich, e.g. from the distribution on the left to the distribution on the right.

Consider now the context of finite sets of points. We want to transport mass from points $\{A_1,\cdots,A_4\}$ to points $\{B_1,\cdots,B_4\}$. It is a complicated combinatorial problem. For 4 points, there are only 24 possible transfers to consider, but the number exceeds 20 billion with 15 points (on each side). For instance, the following one is usually seen as inefficient

while the following is usually seen as much better

Of course, it depends on the cost of the transport, which depends on the distance between the origin and the destination. That cost is usually either linear or quadratic.
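
As a small illustration (with my own toy points, not the figures above), the transport package can compute such an optimal matching directly:

library(transport)
set.seed(1)
A <- pp(matrix(runif(8), 4, 2))  # 4 origin points with equal weights
B <- pp(matrix(runif(8), 4, 2))  # 4 destination points
plan <- transport(A, B, p = 2)   # optimal matching under quadratic cost
plan                             # which origin is sent to which destination
wasserstein(A, B, p = 2)         # the associated Wasserstein distance of order 2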

There are many applications of optimal transport in economics, see e.g. Alfred Galichon’s book Optimal Transport Methods in Economics. And there are also applications in statistics, which is what I saw while discussing with Pierre when I was in Boston in June. For instance, suppose we want to test whether two samples were drawn from the same distribution,

set.seed(13)
npoints
mu1
mu2
Sigma1
Sigma2
Sigma2[2,1]
Sigma1
Sigma2
library(mnormt)
X1
X2
plot(X1[,1], X1[,2], col = "blue")
points(X2[,1], X2[,2], col = "red")

Here we use a parametric model to generate our sample (as always), and we might think of a parametric test (testing whether mean and variance parameters of the two distributions are equal).

or we might prefer a nonparametric test. The idea Pierre mentioned was based on optimal transport. Consider some quadratic loss

ground_p
p
w1
w2
C
library(transport)
a

then it is possible to match points in the two samples

nonzero <- which(a$mass > 0)        # matched pairs (positive mass)
from_indices <- a$from[nonzero]
to_indices <- a$to[nonzero]
for (i in seq_along(from_indices)){
  segments(X1[from_indices[i],1], X1[from_indices[i],2], X2[to_indices[i],1], X2[to_indices[i],2])
}

Here we can observe two things. The total cost can be seen as rather large

> cost=function(a,X1,X2){
    nonzero = which(a$mass > 0)
    naa = a[nonzero,]
    d = function(i) (X1[naa$from[i],1]-X2[naa$to[i],1])^2+(X1[naa$from[i],2]-X2[naa$to[i],2])^2
    sum(Vectorize(d)(1:npoints))
  }
> cost(a,X1,X2)
[1] 9.372472

and the direction of the transport is always (more or less) the same

> angle=function(a,X1,X2){
    nonzero = which(a$mass > 0)
    naa = a[nonzero,]
    d = function(i) (X1[naa$from[i],2]-X2[naa$to[i],2])/(X1[naa$from[i],1]-X2[naa$to[i],1])
    atan(Vectorize(d)(1:npoints))
  }
> mean(angle(a,X1,X2))
[1] -0.3266797

> library(plotrix)
> ag=(angle(a,X1,X2)/pi)*180
> ag[ag<0]=ag[ag<0]+360
> dag=hist(ag,breaks=seq(0,361,by=1)-.5)
> polar.plot(dag$counts,seq(0,360,by=1),main="Test Polar Plot",lwd=3,line.col=4)

(actually, the following plot has been obtained by generating a thousand samples of size 25)

In order to have a decent test, we need to see what happens under the null assumption, i.e. when both samples are drawn from the same distribution.

Here is the optimal matching

Here is the distribution of the total cost, when drawing a thousand samples,

VC=rep(NA,1000)
VA=rep(NA,1000*npoints)
for(s in 1:1000){
  # two samples drawn from the same distribution (the null)
  X1a = rmnorm(npoints, mean = mu1, varcov = Sigma1)
  X1b = rmnorm(npoints, mean = mu1, varcov = Sigma1)
  ground_p = 2
  p = 2
  w1 = rep(1/npoints,npoints)
  w2 = rep(1/npoints,npoints)
  C = as.matrix(dist(rbind(X1a, X1b)))[1:npoints, npoints + (1:npoints)]^ground_p
  ab = transport(w1, w2, costm = C^(p/ground_p))
  VC[s] = cost(ab,X1a,X1b)
  VA[s*npoints-(0:(npoints-1))] = angle(ab,X1a,X1b)
}
plot(density(VC))
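One way to turn this null distribution into an explicit test is a Monte Carlo p-value (a minimal sketch, reusing the cost() function and the simulated values stored in VC):

# Monte Carlo p-value: how often does a null draw produce a total cost at least
# as large as the one observed on the original pair of samples?
obs_cost <- cost(a, X1, X2)
(1 + sum(VC >= obs_cost)) / (1 + length(VC))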

So the cost of 9.37 obtained initially was not that high. Observe that, when drawing from the same distribution, there is no longer any clear pattern in the directions of the optimal transport

ag=(VA/pi)*180
ag[ag<0]=ag[ag<0]+360
dag=hist(ag,breaks=seq(0,361,by=1)-.5)
polar.plot(dag$counts,seq(0,360,by=1),main="Test Polar Plot",lwd=3,line.col=4)

Nice, isn’t it? I guess I will spend some time next year working on those transport algorithms, since we have great R packages, and hundreds of applications in economics…

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Scripting for data analysis (with R)

By mrtnj

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Course materials (GitHub)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be replaced with more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don't want them to feel that they're taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I've made the switch to dplyr and tidyr (and maybe even purrr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for Data Science as course literature. The readings are now scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternative flavours of the course include a longer version expanding on the same topics. I don't think one should cram more content in, but I'd like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Vectors
Data frames
Functions
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Lists
Objects
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.
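A minimal sketch of what such a simulation-based design analysis could look like (the two-group design, effect size, sample size and test are just placeholders):

# Power analysis by simulation: a toy two-group comparison
simulate_p_value <- function(n_per_group, effect_size){
  group <- rep(c("control", "treatment"), each = n_per_group)
  y <- rnorm(2 * n_per_group, mean = ifelse(group == "treatment", effect_size, 0))
  t.test(y ~ group)$p.value
}

p_effect <- replicate(1000, simulate_p_value(20, effect_size = 0.5))
p_null   <- replicate(1000, simulate_p_value(20, effect_size = 0))
mean(p_effect < 0.05)   # estimated power
mean(p_null < 0.05)     # estimated false positive rate, should be near 0.05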


To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Understanding Overhead Issues in Parallel Computation

By matloff

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In my talk at useR! earlier this month, I emphasized the fact that a major impediment to obtaining good speed from parallelizing an algorithm is systems overhead of various kinds, including the following (a toy illustration appears after the list):

  • Contention for memory/network.
  • Bandwidth limits — CPU/memory, CPU/network, CPU/GPU.
  • Cache coherency problems.
  • Contention for I/O ports.
  • OS and/or R limits on number of sockets (network connections).
  • Serialization.
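As a toy illustration of the point (mine, not from the talk): when the per-task computation is very cheap, the cost of shipping data to the workers and collecting the results can easily exceed any gain from running in parallel.

# Toy benchmark: parallelizing a trivially cheap per-element operation.
# The parallel version is often slower here, because serialization and
# communication with the workers cost more than the work itself.
library(parallel)
x <- as.list(1:2e5)

system.time(lapply(x, sqrt))         # serial
cl <- makeCluster(2)
system.time(parLapply(cl, x, sqrt))  # parallel, dominated by overhead
stopCluster(cl)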

During the Q&A at the end, one person in the audience asked how R programmers without a computer science background might acquire this information. A similar question was posed here today by a reader on this blog, to which I replied,

That question was asked in my talk. I answered by saying that I have an introduction to such things in my book, but that this is not enough. One builds this knowledge in haphazard ways, e.g. by search terms like “cache miss” and “network latency” on the Web, and above all, by giving it careful thought and reasoning things out. (When Nobel laureate Richard Feynman was a kid, someone said in awe, “He fixes radios by thinking!”)

Join an R Users Group, if there is one in your area. (And if not, then start one!) Talk about these things with them (though if you follow my above advice, you may find you soon know more than they do).

The book I was referring to was Parallel Computing for Data Science: With Examples in R, C++ and CUDA (Chapman & Hall/CRC, The R Series, June 4, 2015).

I have decided that the topic of system overhead issues in parallel computation is important enough for me to place Chapter 2 on the Web, which I have now done. Enjoy. I’d be happy to answer your questions (of a general nature, not on your specific code).

We are continuing to add more features to our R parallel computation package, partools. Watch this space for news!

By the way, the useR! 2017 videos are now on the Web, including my talk on parallel computing.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News