## densratio: New R Package for Density Ratio Estimation

By hoxo_m

(This article was first published on HOXO-M – anonymous data analyst group in Japan, and kindly contributed to R-bloggers)

## 1. Overview

Density ratio estimation is the following problem: given two data samples \$x\$ and \$y\$ drawn from unknown distributions \$p(x)\$ and \$q(y)\$ respectively, estimate \$\$ w(x) = \frac{p(x)}{q(x)} \$\$ where \$x\$ and \$y\$ are \$d\$-dimensional real vectors.

The estimated density ratio function \$w(x)\$ can be used in many applications, such as inlier-based outlier detection [1] and covariate shift adaptation [2]. Other useful applications of density ratio estimation are summarized by Sugiyama et al. (2012) [3].

The densratio package provides a function `densratio()` whose returned object contains `compute_density_ratio()`, a function that estimates the density ratio.

For example,

```
set.seed(3)
x <- rnorm(200, mean = 1, sd = 1/8)
y <- rnorm(200, mean = 1, sd = 1/2)

library(densratio)
result <- densratio(x, y)
result
```

```
##
## Call:
## densratio(x = x, y = y, method = "uLSIF")
##
## Kernel Information:
##   Kernel type:  Gaussian RBF
##   Number of kernels:  100
##   Bandwidth(sigma):  0.1
##   Centers:  num [1:100, 1] 1.007 0.752 0.917 0.824 0.7 ...
##
## Kernel Weights(alpha):
##   num [1:100] 0.4044 0.0479 0.1736 0.125 0.0597 ...
##
## The Function to Estimate Density Ratio:
##   compute_density_ratio()
```

In this case, the true density ratio \$w(x)\$ is known, so we can compare \$w(x)\$ with the estimated density ratio \$\hat{w}(x)\$.

```
true_density_ratio <- function(x) dnorm(x, 1, 1/8) / dnorm(x, 1, 1/2)
estimated_density_ratio <- result$compute_density_ratio

plot(true_density_ratio, xlim = c(-1, 3), lwd = 2, col = "red",
     xlab = "x", ylab = "Density Ratio")
plot(estimated_density_ratio, xlim = c(-1, 3), lwd = 2, col = "green", add = TRUE)
legend("topright", legend = c(expression(w(x)), expression(hat(w)(x))),
       col = 2:3, lty = 1, lwd = 2, pch = NA)
```

## 2. How to Install

You can install the densratio package from CRAN.

``install.packages("densratio")``

You can also install the package from GitHub.

```
install.packages("devtools") # if you have not installed the devtools package
devtools::install_github("hoxo-m/densratio")
```

The source code for the densratio package is available on GitHub at https://github.com/hoxo-m/densratio.

## 3. Details

### 3.1. Basics

The package provides `densratio()`, whose result contains the function to estimate the density ratio.

For data samples `x` and `y`,

```
library(densratio)
result <- densratio(x, y)
```

In this case, `result$compute_density_ratio()` computes the estimated density ratio.

### 3.2. Methods

`densratio()` has a `method` parameter that accepts either `"uLSIF"` or `"KLIEP"`.

• uLSIF (unconstrained Least-Squares Importance Fitting) is the default method. This algorithm estimates density ratio by minimizing the squared loss. You can find more information in Hido et al. (2011) [1].

• KLIEP (Kullback-Leibler Importance Estimation Procedure) is the other available method. This algorithm estimates the density ratio by minimizing the Kullback-Leibler divergence. You can find more information in Sugiyama et al. (2007) [2].

Both methods assume that the density ratio is represented by the linear model \$\$ w(x) = \alpha_1 K(x, c_1) + \alpha_2 K(x, c_2) + \cdots + \alpha_b K(x, c_b) \$\$ where \$\$ K(x, c) = \exp\left(\frac{-\|x - c\|^2}{2 \sigma^2}\right) \$\$ is the Gaussian RBF kernel.
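The linear model above can be sketched directly in a few lines of base R; the centers, weights, and bandwidth below are made-up values for illustration, not ones the package would choose:

```r
# Gaussian RBF kernel: K(x, c) = exp(-||x - c||^2 / (2 * sigma^2))
gaussian_rbf <- function(x, c, sigma) exp(-sum((x - c)^2) / (2 * sigma^2))

# Linear density ratio model: w(x) = alpha_1 K(x, c_1) + ... + alpha_b K(x, c_b)
w_hat <- function(x, centers, alpha, sigma) {
  sum(alpha * sapply(centers, gaussian_rbf, x = x, sigma = sigma))
}

# Made-up one-dimensional parameters (b = 3 kernels)
centers <- c(0.8, 1.0, 1.2)
alpha <- c(0.2, 0.5, 0.3)
w_hat(1.0, centers, alpha, sigma = 0.1)
```

The real package learns `alpha` and selects `sigma` by cross validation, as described below.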

`densratio()` performs two main jobs:

• First, it decides the kernel bandwidth \$\sigma\$ by cross validation;
• Second, it optimizes the kernel weights \$\alpha\$.

As a result, you obtain `compute_density_ratio()`.

### 3.3. Result and Parameter Settings

`densratio()` prints its result as follows:

```
##
## Call:
## densratio(x = x, y = y, method = "uLSIF")
##
## Kernel Information:
##   Kernel type:  Gaussian RBF
##   Number of kernels:  100
##   Bandwidth(sigma):  0.1
##   Centers:  num [1:100, 1] 1.007 0.752 0.917 0.824 0.7 ...
##
## Kernel Weights(alpha):
##   num [1:100] 0.4044 0.0479 0.1736 0.125 0.0597 ...
##
## Regularization Parameter(lambda):
##
## The Function to Estimate Density Ratio:
##   compute_density_ratio()
```
• Kernel type is fixed to the Gaussian RBF.
• The number of kernels is the number of kernels in the linear model. You can change it with the `kernel_num` parameter; the default is `kernel_num = 100`.
• Bandwidth(sigma) is the Gaussian kernel bandwidth. By default, `sigma = "auto"`, which means the algorithm automatically selects the optimal value by cross validation. If you set `sigma` to a number, that value is used. If you set it to a numeric vector, the algorithm selects the optimal value among them by cross validation.
• Centers are the centers of the Gaussian kernels in the linear model. They are selected at random from the data sample `x`, which underlies the numerator distribution \$p(x)\$. You can find all of the values in `result$kernel_info$centers`.
• Kernel weights are the \$\alpha\$ parameters in the linear model, optimized by the algorithm. You can find all of the values in `result$alpha`.
• The function to estimate the density ratio is named `compute_density_ratio()`.
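For example, to use 50 kernels and let cross validation pick the bandwidth from a vector of candidates, a call could look like this (the candidate bandwidths here are illustrative values, not recommendations):

```r
library(densratio)

set.seed(3)
x <- rnorm(200, mean = 1, sd = 1/8)
y <- rnorm(200, mean = 1, sd = 1/2)

# sigma: candidate bandwidths for cross validation (illustrative values)
# kernel_num: number of Gaussian kernels in the linear model
result <- densratio(x, y, sigma = c(0.05, 0.1, 0.2), kernel_num = 50)
w <- result$compute_density_ratio(x)
```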

## 4. Multi Dimensional Data Samples

In the examples above, the input data samples `x` and `y` were one-dimensional. `densratio()` also accepts multi-dimensional data samples as `matrix` objects.

For example,

```
library(densratio)
library(mvtnorm)

set.seed(71)
x <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/8, 2))
y <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/2, 2))

result <- densratio(x, y)
result
```

```
##
## Call:
## densratio(x = x, y = y, method = "uLSIF")
##
## Kernel Information:
##   Kernel type:  Gaussian RBF
##   Number of kernels:  100
##   Bandwidth(sigma):  0.316
##   Centers:  num [1:100, 1:2] 1.178 0.863 1.453 0.961 0.831 ...
##
## Kernel Weights(alpha):
##   num [1:100] 0.145 0.128 0.138 0.187 0.303 ...
##
## Regularization Parameter(lambda):  0.1
##
## The Function to Estimate Density Ratio:
##   compute_density_ratio()
```

Also in this case, we can compare the true density ratio with the estimated density ratio.

```
true_density_ratio <- function(x) {
  dmvnorm(x, mean = c(1, 1), sigma = diag(1/8, 2)) /
    dmvnorm(x, mean = c(1, 1), sigma = diag(1/2, 2))
}
estimated_density_ratio <- result$compute_density_ratio

N <- 20
range <- seq(0, 2, length.out = N)
input <- expand.grid(range, range)
z_true <- matrix(true_density_ratio(input), nrow = N)
z_hat <- matrix(estimated_density_ratio(input), nrow = N)

par(mfrow = c(1, 2))
contour(range, range, z_true, main = "True Density Ratio")
contour(range, range, z_hat, main = "Estimated Density Ratio")
```

The dimensions of `x` and `y` must be the same.

## 5. References

[1] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems 2011.

[2] Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P. & Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2007.

[3] Sugiyama, M., Suzuki, T. & Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press 2012.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## What’s new on CRAN: March 2016

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Packages continue to flood into CRAN at a rate that challenges the sanity of anyone trying to keep up with what's new. So far this month, more than 190 packages have been added. Here is my view of what's interesting in this March madness.

The launch_tutorial() function from the RtutoR package by Anup Nair launches a Shiny-based interactive R tutorial that, so far, includes sections on basic operations on a data set, data manipulation, loops and functions, and basic model development. The following screen shows the page for selecting columns from a data set. Notice that the example code offers two different dplyr based alternatives. The interface is far from perfect, but it’s quite workable. Interactive tutorials launched directly from the command line may very well be the next generation of R documentation.

It also looks like the idea of using an R package to launch a shiny application may indicate a trend. The lavaan.shiny package by William Kyle Hamilton, also new this month, contains a single function to launch an interactive tutorial on latent variable analysis based on the lavaan package.

Time series aficionados will want to have a look at the dCovTS package from Pitsillou and Fokianos, which implements the distance covariance and correlation metrics for univariate and multivariate time series. These are relatively new metrics published by Z. Zhou in a 2012 paper in which he adapted the distance correlation metric developed by Székely et al. to measure non-linear dependence in time series. The following plot shows the data, ACF, PACF and Auto-Distance Correlation Function (ADCF) for a time series of monthly deaths from bronchitis, emphysema and asthma for males in the UK between 1974 and 1979.

The ADCF plot produced by the call `ADCFplot(mdeaths, method = "Wild", b = 100)` uses the "Wild Bootstrap", a relatively new re-sampling technique for stationary time series.

If you are working with generalized linear mixed models you may be interested in two new packages that provide a few enhancements for lme4. glmmsr by Helen Ogden provides some alternatives to the Laplace method for approximating likelihood functions (The vignette does a good job of explaining the new alternatives) and GLMMRR from Fox, Klotzke and Veen fits GLMM models to binary, randomized response data and provides Cauchit, Log-log, Logistic and Probit link functions.

Machine learning enthusiasts may find a few new packages interesting. The MultivariateRandomForest package by Raziur Rahman contains functions to fit multivariate Random Forest models and make predictions. The hclust2() function in Gagolewski, Bartoszuk and Cena's genie package clusters data using the Gini index; hclust2() is a hierarchical clustering technique that is billed as being outlier resistant. The package kmlShape by Genolini and Guichard contains functions to do hierarchical clustering on longitudinal data using the Fréchet distance metric to group trajectories. The following plot shows clusters identified for the artificial data generated in example 2 for the kmlShape() function.

deepboost from Marcous and Sandbank provides an interface to Google's Deep Boosting algorithm as described in this paper by Cortes et al. It provides functions for training, evaluating, predicting and hyperparameter optimization using grid search and cross validation.

The last package I'll mention today is rEDM from Ye, Clark and Deyle, which brings empirical dynamic modeling (EDM) to R. EDM uses time series data to reconstruct the state space of a dynamic system using Takens' Theorem (1981), which implies that the reconstruction can be accomplished using lags of a time series for the unknown or unobserved variables. The vignette makes a nice case for why attractors and chaos belong in R.


## Another Route to Jupyter Notebooks – Azure Machine Learning

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

In much the same way that the IBM DataScientist Workbench seeks to provide some level of integration between analysis tools such as Jupyter notebooks and data access and storage, Azure Machine Learning studio also provides a suite of tools for accessing and working with data in one location. Microsoft’s offering is new to me, but it crossed my radar with the announcement that they have added native R kernel support, as well as Python 2 and 3, to their Jupyter notebooks: Jupyter Notebooks with R in Azure ML Studio.

Guest workspaces are available for free (I’m not sure if this is once only, or whether you can keep going back?) but there is also a free workspace if you have a (free) Microsoft account.

Once inside, you are provided with a range of tools – the one I'm interested in to begin with is the notebook:

Select a kernel:

Give your new notebook a name, and it will launch into a new browser tab:

You can also arrange notebooks within separate project folders. For example, create a project:

and then add notebooks to it:

When creating a new notebook, you may have noted an option to View More in Gallery. The gallery includes examples of a range of project components, including example notebooks:

Thinking about things like the MyBinder app, which lets you launch a notebook in a container from a Github account, it would be nice to see additional buttons being made available to let folk run notebooks in Azure Machine Learning, or the Data Scientist Workbench.

It’s also worth noting how user tools – such as notebooks – seem to be being provided for free with a limited amount of compute and storage resource behind them as a way of recruiting users into platforms where they might then start to pay for more compute power.

From a course delivery perspective, I’m often unclear as to whether we can tell students to sign up for such services as part of a course or whether that breaks the service terms? (Some providers, such as Wakari, make it clear that “[f]or classes, projects, and long-term use, we strongly encourage a paid plan or Wakari Enterprise. Special academic pricing is available.”) It seems unfair that we should require students to sign up for accounts on a “free” service in their own name as part of our offering for a couple of reasons at least: first, we have no control over what happens on the service; second, it seems that it’s a commercial transaction that should be formalised in some way, even if only to agree that we can (will?) send our students to that particular service exclusively. Another possibility is that we say students should make their own service available, whether by installing software themselves or finding an online provider for themselves.

On the other hand, trying to get online services provided at course scale in a timely fashion within an HEI seems to be all but impossible, at least until such a time as the indie edtech providers such as Reclaim Hosting start to move even more into end-user app provision either at the individual level, or affordable class level (with an SLA)…


## Updated R & BLAS Timings

By Avraham

(This article was first published on R – Strange Attractors, and kindly contributed to R-bloggers)

With the recent releases of R 3.2.4 and OpenBLAS 2.17, I decided it was time to re-benchmark R speed. I’ve settled on a particular set of tests, based on my experience as well as some of Simon Urbanek’s work which I separated into two groups: those focusing on BLAS-heavy operations and those which do not. I’ve posted the code I use to its own page, but I’ll copy it below for convenience:

```
set.seed(4987)
library(microbenchmark)
library(Matrix)
A <- matrix(rnorm(1e6, 1e3, 1e2), ncol = 1e3)
B <- matrix(rnorm(1e6, 1e3, 1e2), ncol = 1e3)
A <- crossprod(A, A)
A <- A * 1000 / mean(A)
colnames(A) <- colnames(B) <- NULL
options(scipen=4, digits = 3)

BLAS <- microbenchmark(
sort(c(as.vector(A), as.vector(B))),
det(A),
A %*% B,
t(A) %*% B,
crossprod(A, B),
solve(A),
solve(A, t(B)),
solve(B),
chol(A),
chol(B, pivot = TRUE),
qr(A, LAPACK = TRUE),
svd(A),
eigen(A, symmetric = TRUE),
eigen(A, symmetric = FALSE),
eigen(B, symmetric = FALSE),
lu(A),
fft(A),
Hilbert(3000),
toeplitz(A[1:500, 1]),
princomp(A),
times=25L, unit='ms', control = list(order = 'block')
)

NotBLAS <- microbenchmark(
A + 2,
A - 2,
A * 2,
A / 2,
A * 0.5,
A ^ 2,
sqrt(A[1:10000]),
sin(A[1:10000]),
A + B,
A - B,
A * B,
A / B,
A[1:1e5] %% B[1:1e5],
A[1:1e5] %/% B[1:1e5],
times = 5000L, unit='ms', control = list(order = 'block')
)
```

My machine is an i7-2600K overclocked to 4.65GHz with 16GB RAM, running Windows 7 64-bit. I also used RStudio version 0.99.893 for each of these tests.

I ran the tests on four versions of R:

• Plain vanilla R version 3.2.4 Revised (2016-03-16 r70336) installed from the binary on CRAN
• The same version of R as #1, but with Rblas.dll replaced by one based on OpenBLAS 2.17 compiled specifically for the SandyBridge CPU (as part of the #3)
• R compiled from source using the gcc 4.9.3 toolchain, passing `-march=native` to EOPTS and linking to a multi-threaded (4 max) SandyBridge-specific OpenBLAS 2.17
• Same as #3, but activating link-time optimization for core R and the BLAS

This time I was a bit wiser and saved the results to RDS objects so that I could combine them in one R session and compare them. I'll post the raw output at the bottom of the page, but I think it's more intuitive to look at graphical comparisons. The takeaway: install a fast BLAS if you can! The other optimization options had minor effects, sometimes better and sometimes worse, but using a fast BLAS made a significant difference. Adding a `Platform` variable to the four sets of `BLAS` and `NotBLAS` outputs and then combining them in one session allowed for the generation of comparison plots. The call for these graphs, for those following along at home, is:

```
library(ggplot2)
library(scales)  # provides the comma() label formatter

ggplot(BLAS) + geom_boxplot(aes(y = time / 1e6, color = Platform, x = expr)) +
  scale_y_continuous(labels = comma) + ylab('milliseconds') + coord_flip()
ggplot(BLAS) + geom_jitter(aes(y = time / 1e6, color = Platform, x = expr), alpha = 0.3) +
  scale_y_continuous(labels = comma) + ylab('milliseconds') + coord_flip()
ggplot(NotBLAS) + geom_boxplot(aes(y = time / 1e6, color = Platform, x = expr)) +
  scale_y_continuous(labels = comma) + ylab('milliseconds') + coord_flip()
```
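The `Platform` tagging and combining step mentioned above can be sketched with stand-in data (microbenchmark results behave like data frames with `expr` and `time` columns, the latter in nanoseconds; the values below are made up, not my benchmark numbers):

```r
# Stand-ins for two saved microbenchmark results that would be loaded
# with readRDS(); each has an expr column and times in nanoseconds
vanilla <- data.frame(expr = "det(A)", time = c(135e6, 136e6))
openblas <- data.frame(expr = "det(A)", time = c(14e6, 15e6))

# Tag each result set with a Platform variable, then stack them for ggplot
vanilla$Platform <- "vanilla R"
openblas$Platform <- "R + OpenBLAS"
BLAS <- rbind(vanilla, openblas)
```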

You may also have to right-click and select "View Image" to get a better picture. First is a boxplot comparison for the BLAS-related functions:

As there are only 25 samples for each function, a jitter plot may be a bit simpler:

It is clear that just adding the BLAS makes a noticeable difference for matrix-related functionality. The other optimizations tend to provide some benefit, but nothing significant.

The non-BLAS related functions, understandably, don’t respond the same way to the BLAS. As I ran 5000 samples per expression, a jitter plot would not be that clear, even with a low alpha.

What is seen here, in my opinion, is that using the local architecture sometimes improves performance, such as for division, and other times retards it slightly, such as for `sin` and `sqrt`. I think the latter two depend on the optimization level as well; default `R` calls for `-O3` for C code. The magnitude of the improvements seems to be greater than the change in those cases where there is an impediment, at least when eyeballing the plot. If I recall from previous tests, strangely, it may be slightly more efficient to pass `-mtune=native`, at least for my test suite. I'm not sure why, as `-march` implies `-mtune`. I'd have to recompile `R` yet again to test it, though. Also, with 5000 iterations per expression, garbage collection has to run on occasion, which accounts for those outlier points.
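For anyone wanting to try the flags discussed above, they would go into `MkRules.local` in the Windows R source tree before building (a hedged sketch; the exact variables available depend on the R version's `MkRules.dist`):

```
# MkRules.local fragment (hypothetical): extra compiler options for
# building R on Windows; EOPTS is appended to the optimization flags
EOPTS = -O3 -march=native
```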

In summation, if you deal with heavy calculation in R, specifically matrix-related work, you will notice a significant speed improvement by replacing the vanilla `Rblas.dll` with a faster one. Compiling or tuning for a specific architecture may show some small increases, but does not deliver the same kind of gains.

While it's actually easier now to compile OpenBLAS for R, and in R, on Windows, my instructions are a bit dated, and so I'll have to update those eventually. I have considered hosting some pre-compiled Rblas files, as Dr. Ei-ji Nakama does, but I've held back for two reasons: 1) I have to ensure I apply the proper licenses, and 2) I'm a bit concerned about someone suing or the like if, for whatever reason, a BLAS didn't work properly. I have disclaimers and all, but you never know 8-).

In any event, I’m always interested in hearing comments about your experience with compiling R or a fast BLAS for Windows, and especially if you found any of my previous posts helpful. Thanks!

### Raw Benchmark Output

```
R version 3.2.4 Revised (2016-03-16 r70336) (pure vanilla)
Unit: milliseconds

BLAS
expr     Min      LQ  Median      UQ     Max    Mean    SD      CV     n
(fctr)   (dbl)   (dbl)   (dbl)   (dbl)   (dbl)   (dbl) (dbl)   (dbl) (int)
1  sort(c(as.vector(A), as.vector(B)))  264.56  266.56  267.13  270.50  327.81  270.39 12.15 0.04492    25
2                               det(A)  134.83  135.77  136.47  137.74  206.93  139.83 14.12 0.10095    25
3                              A %*% B  465.64  473.39  492.22  504.79  546.33  493.55 22.62 0.04583    25
4                           t(A) %*% B  468.96  501.26  506.00  535.00  564.67  514.08 24.71 0.04807    25
5                      crossprod(A, B)  737.52  746.47  759.38  768.71  813.29  762.19 18.66 0.02449    25
6                             solve(A)  607.80  615.18  621.48  639.45  679.81  628.14 18.80 0.02992    25
7                       solve(A, t(B))  847.49  853.39  855.29  859.28  881.07  858.57  8.89 0.01036    25
8                             solve(B)  617.06  621.25  622.45  624.65  683.30  625.33 12.56 0.02008    25
9                              chol(A)  116.61  117.19  117.40  118.45  123.19  118.46  1.99 0.01676    25
10               chol(B, pivot = TRUE)    2.38    2.45    2.52    2.59    6.91    3.15  1.51 0.48091    25
11                qr(A, LAPACK = TRUE)  423.32  424.46  425.25  426.35  432.92  425.62  2.07 0.00487    25
12                              svd(A) 2219.86 2242.51 2265.10 2294.88 2372.62 2277.14 45.47 0.01997    25
13          eigen(A, symmetric = TRUE)  939.63  946.36  952.16  963.85 1020.42  959.00 19.78 0.02063    25
14         eigen(A, symmetric = FALSE) 3623.95 3650.20 3657.46 3675.48 3740.09 3662.55 28.50 0.00778    25
15         eigen(B, symmetric = FALSE) 4072.00 4127.98 4175.12 4228.25 4344.13 4183.00 77.04 0.01842    25
16                               lu(A)  137.73  142.06  144.51  147.26  210.54  156.60 26.47 0.16901    25
17                              fft(A)  112.27  116.98  119.89  123.29  135.01  120.63  5.20 0.04308    25
18                       Hilbert(3000)  134.83  196.37  197.50  199.45  386.30  201.86 46.23 0.22905    25
19               toeplitz(A[1:500, 1])    4.92    5.06    5.09    5.45   10.49    5.80  1.72 0.29759    25
20                         princomp(A) 1834.20 1855.31 1903.78 1947.02 2127.45 1923.42 81.44 0.04234    25

NotBLAS
expr   Min    LQ Median    UQ   Max  Mean     SD    CV     n
(fctr) (dbl) (dbl)  (dbl) (dbl) (dbl) (dbl)  (dbl) (dbl) (int)
1                      A + 2 1.719 1.891  1.922 2.012 74.96 2.542 2.5880 1.018  5000
2                      A - 2 1.723 1.910  1.965 2.066 72.86 2.628 2.6389 1.004  5000
3                      A * 2 1.725 1.900  1.948 2.049 80.89 2.604 2.6526 1.019  5000
4                        A/2 3.014 3.162  3.179 3.205 74.44 3.776 2.5372 0.672  5000
5                    A * 0.5 1.718 1.884  1.908 1.971 74.80 2.538 2.5780 1.016  5000
6                        A^2 1.718 1.896  1.939 2.036 73.41 2.587 2.6111 1.009  5000
7           sqrt(A[1:10000]) 0.110 0.111  0.111 0.112  5.30 0.116 0.0833 0.717  5000
8            sin(A[1:10000]) 0.364 0.365  0.365 0.366  1.35 0.368 0.0429 0.117  5000
9                      A + B 1.306 1.384  1.448 1.542 68.83 1.582 1.8367 1.161  5000
10                     A - B 1.308 1.355  1.432 1.539 67.87 1.571 1.8412 1.172  5000
11                     A * B 1.310 1.410  1.471 1.571 69.73 1.610 1.8548 1.152  5000
12                       A/B 4.767 4.779  4.789 4.853 66.93 4.946 1.7605 0.356  5000
13  A[1:100000]%%B[1:100000] 2.528 2.544  2.558 2.583 65.80 2.639 1.2633 0.479  5000
14 A[1:100000]%/%B[1:100000] 2.271 2.296  2.318 2.370 65.33 2.423 1.2646 0.522  5000
----------
R version 3.2.4 Revised (2016-03-16 r70336) (pure vanilla)
With SandyBridge-specific OpenBLAS (2.17)
Unit: milliseconds

BLAS
expr     Min      LQ  Median      UQ    Max    Mean    SD     CV     n
(fctr)   (dbl)   (dbl)   (dbl)   (dbl)  (dbl)   (dbl) (dbl)  (dbl) (int)
1  sort(c(as.vector(A), as.vector(B)))  264.10  265.06  265.77  266.93  323.4  268.39 11.52 0.0429    25
2                               det(A)   14.31   14.57   15.26   16.67   20.0   15.82  1.43 0.0903    25
3                              A %*% B   19.25   21.34   26.83   28.49   84.1   27.78 12.68 0.4563    25
4                           t(A) %*% B   26.46   30.04   32.20   38.76   43.6   34.25  5.37 0.1567    25
5                      crossprod(A, B)   17.63   21.14   22.82   27.48   35.5   24.68  5.10 0.2066    25
6                             solve(A)   40.67   45.92   49.58   51.48  107.1   51.29 12.37 0.2412    25
7                       solve(A, t(B))   46.32   50.88   51.63   56.40   63.2   53.00  4.24 0.0799    25
8                             solve(B)   47.09   51.22   51.72   56.44  118.6   56.32 13.71 0.2435    25
9                              chol(A)    6.59    6.65    6.77    7.01   10.6    7.38  1.39 0.1890    25
10               chol(B, pivot = TRUE)    2.36    2.47    2.50    2.58    6.8    3.25  1.58 0.4851    25
11                qr(A, LAPACK = TRUE)   70.19   71.59   72.47   74.69   79.1   72.96  2.22 0.0304    25
12                              svd(A)  375.74  378.56  389.32  415.33  444.7  399.66 26.19 0.0655    25
13          eigen(A, symmetric = TRUE)  171.72  175.04  176.94  177.97  239.8  179.80 13.20 0.0734    25
14         eigen(A, symmetric = FALSE)  849.09  859.77  872.76  891.57  915.7  876.86 20.89 0.0238    25
15         eigen(B, symmetric = FALSE) 1011.80 1065.85 1070.08 1075.52 1085.1 1065.76 18.14 0.0170    25
16                               lu(A)   17.91   18.91   20.78   22.62   82.8   32.68 25.43 0.7782    25
17                              fft(A)  110.52  111.17  113.16  113.34  113.8  112.33  1.18 0.0105    25
18                       Hilbert(3000)  128.99  194.32  194.70  195.01  377.4  196.80 45.44 0.2309    25
19               toeplitz(A[1:500, 1])    4.77    4.96    5.03    5.04   10.0    5.58  1.64 0.2937    25
20                         princomp(A)  285.96  302.37  316.33  329.08  400.5  322.00 29.13 0.0905    25

NotBLAS
expr   Min    LQ Median    UQ   Max  Mean     SD    CV     n
(fctr) (dbl) (dbl)  (dbl) (dbl) (dbl) (dbl)  (dbl) (dbl) (int)
1                      A + 2 1.712 1.887  1.904 1.938 72.64 2.514 2.6240 1.044  5000
2                      A - 2 1.723 1.893  1.913 1.962 68.52 2.540 2.5268 0.995  5000
3                      A * 2 1.721 1.894  1.918 1.974 68.97 2.543 2.5326 0.996  5000
4                        A/2 3.016 3.173  3.192 3.229 73.36 3.804 2.5315 0.665  5000
5                    A * 0.5 1.724 1.900  1.939 2.049 73.70 2.590 2.6048 1.006  5000
6                        A^2 1.722 1.890  1.910 1.974 71.55 2.543 2.5479 1.002  5000
7           sqrt(A[1:10000]) 0.110 0.111  0.111 0.112  5.59 0.116 0.0876 0.753  5000
8            sin(A[1:10000]) 0.364 0.365  0.366 0.366  1.32 0.370 0.0408 0.110  5000
9                      A + B 1.310 1.336  1.378 1.447 62.35 1.512 1.7318 1.146  5000
10                     A - B 1.310 1.345  1.411 1.491 63.70 1.542 1.7465 1.133  5000
11                     A * B 1.312 1.349  1.421 1.501 65.96 1.550 1.7753 1.145  5000
12                       A/B 4.766 4.781  4.804 4.959 73.53 4.992 1.8146 0.364  5000
13  A[1:100000]%%B[1:100000] 2.526 2.541  2.550 2.591 63.56 2.638 1.2305 0.467  5000
14 A[1:100000]%/%B[1:100000] 2.272 2.294  2.310 2.361 66.49 2.399 1.2681 0.529  5000
----------
R version 3.2.4 Patched (2016-03-28 r70390) -- "Very Secure Dishes"
Compiled with -march=native
With SandyBridge-specific OpenBLAS (2.17)

BLAS
expr    Min      LQ  Median      UQ     Max    Mean    SD     CV     n
(fctr)  (dbl)   (dbl)   (dbl)   (dbl)   (dbl)   (dbl) (dbl)  (dbl) (int)
1  sort(c(as.vector(A), as.vector(B))) 266.79  267.24  268.46  271.46  329.37  272.04 12.36 0.0454    25
2                               det(A)  14.24   14.43   15.12   16.94   18.63   15.69  1.43 0.0911    25
3                              A %*% B  20.95   24.25   28.79   31.89   34.96   28.26  4.21 0.1490    25
4                           t(A) %*% B  26.70   29.53   33.74   41.83  103.68   39.71 18.48 0.4652    25
5                      crossprod(A, B)  17.65   19.46   28.94   31.99   36.97   26.53  6.56 0.2474    25
6                             solve(A)  41.47   45.97   49.44   52.75   56.85   49.38  4.19 0.0849    25
7                       solve(A, t(B))  46.34   51.00   53.22   55.86  113.34   55.53 12.45 0.2242    25
8                             solve(B)  47.90   52.74   54.88   57.05  126.30   58.01 15.05 0.2595    25
9                              chol(A)   6.57    6.72    6.93    7.43   12.90    7.74  1.84 0.2377    25
10               chol(B, pivot = TRUE)   2.37    2.47    2.49    2.53    6.26    3.22  1.53 0.4755    25
11                qr(A, LAPACK = TRUE)  70.84   71.54   73.10   74.91   78.28   73.59  2.19 0.0298    25
12                              svd(A) 375.93  384.60  399.36  454.94  563.73  421.43 49.29 0.1170    25
13          eigen(A, symmetric = TRUE) 170.29  174.23  177.88  185.33  244.08  182.23 14.50 0.0796    25
14         eigen(A, symmetric = FALSE) 839.89  851.24  861.85  874.42  961.61  870.93 31.10 0.0357    25
15         eigen(B, symmetric = FALSE) 985.55 1045.25 1097.99 1142.28 1200.26 1093.09 56.17 0.0514    25
16                               lu(A)  18.25   20.03   21.52   23.14   91.58   34.15 27.10 0.7936    25
17                              fft(A) 106.88  110.80  112.50  119.96  132.55  115.51  6.66 0.0576    25
18                       Hilbert(3000) 133.10  197.69  201.83  204.57  411.96  205.26 50.69 0.2470    25
19               toeplitz(A[1:500, 1])   5.04    5.20    5.35    5.49   10.66    5.94  1.65 0.2773    25
20                         princomp(A) 326.08  336.14  341.65  349.99  425.50  354.56 31.37 0.0885    25

NotBLAS
expr   Min    LQ Median    UQ   Max  Mean     SD     CV     n
(fctr) (dbl) (dbl)  (dbl) (dbl) (dbl) (dbl)  (dbl)  (dbl) (int)
1                      A + 2 1.731 1.890  1.911 1.956 73.12 2.538 2.6108 1.0286  5000
2                      A - 2 1.726 1.884  1.903 1.943 70.36 2.527 2.5691 1.0168  5000
3                      A * 2 1.727 1.891  1.910 1.945 70.49 2.531 2.5708 1.0156  5000
4                        A/2 2.040 2.193  2.210 2.236 71.35 2.816 2.5552 0.9075  5000
5                    A * 0.5 1.726 1.902  1.939 2.016 74.95 2.591 2.6501 1.0227  5000
6                        A^2 1.727 1.886  1.907 1.966 71.32 2.541 2.5889 1.0188  5000
7           sqrt(A[1:10000]) 0.509 0.516  0.516 0.517  4.36 0.521 0.0670 0.1285  5000
8            sin(A[1:10000]) 0.715 0.720  0.721 0.722  1.72 0.725 0.0416 0.0573  5000
9                      A + B 1.331 1.356  1.378 1.459 65.85 1.522 1.8010 1.1831  5000
10                     A - B 1.331 1.357  1.383 1.461 64.68 1.527 1.7970 1.1765  5000
11                     A * B 1.332 1.354  1.372 1.431 65.46 1.513 1.8009 1.1905  5000
12                       A/B 2.412 2.423  2.432 2.460 68.39 2.580 1.8492 0.7168  5000
13  A[1:100000]%%B[1:100000] 2.917 2.999  3.005 3.027 67.03 3.084 1.2797 0.4150  5000
14 A[1:100000]%/%B[1:100000] 2.683 2.759  2.769 2.816 67.74 2.861 1.2999 0.4543  5000
----------
R version 3.2.4 Patched (2016-03-28 r70390) -- "Very Secure Dishes"
Compiled with -march=native and partial LTO (R core, blas, and lapack only)
With SandyBridge-specific OpenBLAS (2.17)

BLAS
expr    Min     LQ  Median      UQ     Max    Mean    SD     CV     n
(fctr)  (dbl)  (dbl)   (dbl)   (dbl)   (dbl)   (dbl) (dbl)  (dbl) (int)
1  sort(c(as.vector(A), as.vector(B))) 263.84 265.83  266.39  268.22  323.52  269.49 11.48 0.0426    25
2                               det(A)  16.55  17.15   17.72   19.73   23.69   18.55  1.82 0.0982    25
3                              A %*% B  19.55  21.52   25.02   33.33   96.59   29.20 15.17 0.5195    25
4                           t(A) %*% B  27.71  33.26   35.59   38.03   45.20   35.47  4.33 0.1220    25
5                      crossprod(A, B)  17.81  21.49   25.98   28.99   34.36   25.94  4.49 0.1730    25
6                             solve(A)  48.38  51.94   54.40   57.04  112.31   56.72 12.03 0.2122    25
7                       solve(A, t(B))  54.93  56.65   60.59   63.24   72.91   61.05  5.25 0.0859    25
8                             solve(B)  55.50  58.85   60.88   62.99  121.31   63.46 12.50 0.1969    25
9                              chol(A)   8.57   8.69    8.80    8.98   12.21    9.40  1.28 0.1361    25
10               chol(B, pivot = TRUE)   2.35   2.44    2.51    2.62    6.87    3.18  1.60 0.5035    25
11                qr(A, LAPACK = TRUE)  71.16  72.38   73.57   74.85   83.43   74.41  2.93 0.0393    25
12                              svd(A) 376.02 380.71  392.98  437.93  490.97  407.28 34.08 0.0837    25
13          eigen(A, symmetric = TRUE) 173.96 176.57  179.35  181.38  259.73  183.29 16.79 0.0916    25
14         eigen(A, symmetric = FALSE) 852.68 856.26  860.10  864.31  926.66  865.16 16.63 0.0192    25
15         eigen(B, symmetric = FALSE) 987.24 993.14 1001.10 1020.62 1079.71 1010.79 24.32 0.0241    25
16                               lu(A)  19.18  20.01   21.76   24.21   26.51   22.11  2.44 0.1103    25
17                              fft(A) 106.65 107.09  108.58  110.35  116.92  108.98  2.36 0.0216    25
18                       Hilbert(3000) 125.60 191.73  192.18  194.44  314.63  195.15 33.81 0.1732    25
19               toeplitz(A[1:500, 1])   4.99   5.18    5.21    5.24   10.22    5.79  1.62 0.2801    25
20                         princomp(A) 285.97 296.12  300.58  317.75  370.74  315.50 30.97 0.0982    25

NotBLAS
expr   Min    LQ Median    UQ   Max  Mean     SD     CV     n
(fctr) (dbl) (dbl)  (dbl) (dbl) (dbl) (dbl)  (dbl)  (dbl) (int)
1                      A + 2 1.828 1.995  2.022 2.091 74.52 2.615 2.4756 0.9468  5000
2                      A - 2 1.807 1.973  1.995 2.064 70.60 2.636 2.5624 0.9721  5000
3                      A * 2 1.827 1.991  2.009 2.054 71.57 2.637 2.5618 0.9714  5000
4                        A/2 2.147 2.293  2.309 2.335 68.18 2.912 2.4859 0.8537  5000
5                    A * 0.5 1.817 1.990  2.007 2.049 68.35 2.628 2.5134 0.9564  5000
6                        A^2 1.795 1.964  1.980 2.011 67.96 2.596 2.5082 0.9660  5000
7           sqrt(A[1:10000]) 0.531 0.535  0.536 0.536  2.51 0.540 0.0474 0.0878  5000
8            sin(A[1:10000]) 0.719 0.726  0.726 0.727  1.70 0.728 0.0413 0.0566  5000
9                      A + B 1.326 1.355  1.391 1.545 66.25 1.690 1.8104 1.0713  5000
10                     A - B 1.321 1.365  1.437 1.602 68.45 1.726 1.8618 1.0788  5000
11                     A * B 1.327 1.366  1.439 1.586 63.71 1.719 1.8044 1.0499  5000
12                       A/B 2.435 2.463  2.530 2.560 65.87 2.804 1.8057 0.6439  5000
13  A[1:100000]%%B[1:100000] 2.987 3.002  3.007 3.025 63.89 3.092 1.2231 0.3956  5000
14 A[1:100000]%/%B[1:100000] 2.718 2.732  2.736 2.745 63.70 2.809 1.2202 0.4344  5000
```

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Real-time model scoring for streaming data – a prototype based on Oracle Stream Explorer and Oracle R Enterprise

(This article was first published on Oracle R Technologies, and kindly contributed to R-bloggers)

Whether applied to manufacturing, financial services, energy, transportation, retail, government, security or other domains, real-time analytics is an umbrella term which covers a broad spectrum of capabilities (data integration, analytics, business intelligence) built on streaming input from multiple channels. Examples of such channels are: sensor data, log data, market data, click streams, social media and monitoring imagery.

Key metrics separating real-time analytics from more traditional, batch, off-line analytics are latency and availability. At one end of the analytics spectrum are complex, long-running batch analyses with slow response time and low availability requirements. At the other end are real-time, lightweight analytic applications with fast response time (O[ms]) and high availability (99.99..%). Another distinction is between the capability for responding to individual events and/or ordered sequences of events versus the capability for handling only event collections in micro batches without preservation of their ordered characteristics. The complexity of the analysis performed on the real-time data is also a big differentiator: capabilities range from simple filtering and aggregations to complex predictive procedures. The level of integration between the model generation and the model scoring functionalities also needs to be considered for real-time applications. Machine learning algorithms specially designed for online model building exist and are offered by some streaming data platforms, but their number is small. Practical solutions can be built by combining an off-line model generation platform with a data streaming platform augmented with scoring capabilities.

In this blog we describe a new prototype for real-time analytics integrating two components: Oracle Stream Explorer (OSX) and Oracle R Enterprise (ORE). Examples of target applications for this type of integration are: equipment monitoring through sensors, anomaly detection, and failure prediction for large systems made of a high number of components.

The basic architecture is illustrated below:

ORE is used for model building, in batch mode, at low frequency, and OSX handles the high frequency streams and pushes data toward a scoring application, performs predictions in real time and returns results to consumer applications connected to the output channels.

OSX is a middleware platform for developing streaming data applications. These applications monitor and process large amounts of streaming data in real time, from a multitude of sources like sensors, social media, financial feeds, etc. Readers unfamiliar with OSX should visit Getting Started with Event Processing for OSX.

In OSX, streaming data flows into, through, and out of an application. The applications can be created, configured and deployed with pre-built components provided with the platform or built from customized adapters and event beans. The application in this case is a custom scoring application for real time data. A thorough description of the application building process can be found in the following guide: Developing Applications for Event Processing with Oracle Stream Explorer.

In our solution prototype for streaming analytics, the model exchange between ORE and OSX is realized by converting the R models to a PMML representation. After that, JPMML – the Java Evaluator API for PMML – is leveraged for reading the model and building a custom OSX scoring application.
The end-to-end workflow is represented below:

The subsequent sections of this blog summarize the essential aspects.

### Model Generation

As previously stated, the use cases targeted by this OSX-ORE integration prototype application consist of systems made of a large number of different components. Each component type is abstracted by a different model. We leverage ORE’s Embedded R Execution capability for data and task parallelism to generate a large number of models concurrently. This is accomplished for example with ore.groupApply():

```
res <- ore.groupApply(
  X = …,
  INDEX = …,
  function(dat, frml) { mdl <- … },
  …,
  parallel = np)
```
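For readers without an Oracle stack at hand, the group-wise fitting pattern that `ore.groupApply()` parallelizes in-database can be sketched in plain R. The data set and per-group formula below are illustrative stand-ins, not the article's component data:

```r
# Plain-R sketch of the split/fit pattern that ore.groupApply() runs
# in-database and in parallel. mtcars and the per-cylinder lm() fit are
# stand-ins for the article's component data and models.
groups <- split(mtcars, mtcars$cyl)   # one data subset per component type
models <- lapply(groups, function(dat) {
  lm(mpg ~ wt, data = dat)            # fit one model per group
})
length(models)                        # one fitted model per group
```

`ore.groupApply()` performs the same split-apply step, but pushes each group's fit into the database and runs `np` fits concurrently.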

### Model representation in PMML

The model transfer between the model generator and the scoring engine is enabled by conversion to a PMML representation. PMML is a mature, XML-based standard for model exchange. A model in PMML format is represented by a collection of XML elements, or PMML components, which completely describe the modeling flow. For example, the Data Dictionary component contains the definitions for all fields used by the model (attribute types, value ranges, etc.); the Data Transformations component describes the mapping functions between the raw data and its desired form for the modeling algorithms; and the Mining Schema component assigns the active and target variables and enumerates the policies for missing data, outliers, and so on. Besides the specifications for the data mining algorithms and the accompanying pre- and post-processing steps, PMML can also describe more complex modeling concepts like model composition, model hierarchies, model verification and field scoping – to find out more about PMML’s structure and functionality go to General Structure. PMML representations have been standardized for several classes of data mining algorithms. Details are available at the same location.
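To make the component names above concrete, here is a minimal hand-written PMML fragment. The field names are hypothetical, and a real exported model contains much more:

```
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative fragment only"/>
  <DataDictionary numberOfFields="2">
    <!-- Data Dictionary: every field's type and allowed values -->
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="failed" optype="categorical" dataType="string">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
  </DataDictionary>
  <!-- A model element would follow, with its MiningSchema assigning
       "temperature" as active and "failed" as target -->
</PMML>
```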

### PMML in R

In R the conversion/translation to PMML formats is enabled through the pmml package. The following algorithms are supported:

arules (arules)
coxph (survival)
glm (stats)
glmnet (glmnet)
hclust (stats)
kmeans (stats)
ksvm (kernlab)
lm (stats)
multinom (nnet)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)
rfsrc (randomForestSRC)
rpart (rpart)
svm (e1071)

The r2pmml package offers complementary support for

gbm (gbm)
train (caret)

and a much better (performance-wise) conversion to PMML for randomForest. Check the details at converting_randomforest.

The conversion to PMML is done via the pmml() generic function, which dispatches the appropriate method for the supplied model depending on its class.

```
library(pmml)
mdl <- randomForest(…)
pmmdl <- pmml(mdl)
```

### Exporting the PMML model

In the current prototype, the PMML model is exported to the streaming platform as a physical XML file. A better solution is to leverage R’s serialization interface, which supports a rich set of connections through pipes, URLs, sockets, etc.
The pmml objects can also be saved in ORE datastores within the database, and specific policies can be implemented to control access and usage.

```
write(toString(pmmdl), file = "..")
serialize(pmmdl, connection)
ore.save(pmmdl, name = dsname, grantable = TRUE)
ore.grant(name = dsname, type = "datastore", user = …)
```

### OSX Applications and the Event Processing Network (EPN)

The OSX workflow, implemented as an OSX application, consists of three logical steps: the PMML model is imported into OSX, a scoring application is created, and scoring is performed on the input streams.

In OSX, applications are modeled as data flow graphs named Event Processing Networks (EPNs). Data flows into, through, and out of EPNs. When raw data flows into an OSX application it is first converted into events. Events flow through the different stages of the application, where they are processed according to the specifics of the application. At the end, events are converted back to data in a format suitable for consumption by downstream applications.

The EPN for our prototype is basic:

Streaming data flows from the Input Adapters through Input Channels, reaches the Scoring Processor where the prediction is performed, flows through the Output Channel to an Output Adapter and exits the application in the desired form. In our demo application the data is streamed out of a CSV file into the Input Adapter. The top adapters (left & right) on the EPN diagram represent connections to the Stream Explorer User Interface (UI). Their purpose is to demonstrate options for controlling the scoring process (for example, changing the model while the application is still running) and for visualizing the predictions.

### The JPMML-Evaluator

The Scoring Processor was implemented by leveraging JPMML, the open-source Java Evaluator API for PMML. The methods of this library allow, among other things, pre-processing the active & target fields according to the DataDictionary and MiningSchema elements, evaluating the model for several classes of algorithms, and post-processing the results according to the Targets element.

JPMML offers support for:

Association Rules
Cluster Models
Regression
General Regression
k-Nearest Neighbors
Naïve Bayes
Neural Networks
Tree Models
Support Vector Machines
Ensemble Models

which covers most of the models that can be converted to PMML from R using the pmml() method, except for time series, sequence rules & text models.

### The Scoring Processor

The Scoring Processor (see EPN) is implemented as a Java class with methods that automate scoring based on the PMML model. The important steps of this automation are enumerated below:

• The PMML schema is loaded from the XML document.

• An instance of the Model Evaluator is created. In the example below we assume that we don’t know what type of model we are dealing with, so the instantiation is delegated to an instance of a ModelEvaluatorFactory class:

ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();

ModelEvaluator evaluator = modelEvaluatorFactory.newModelManager(pmml);

• This Model Evaluator instance is queried for the field definitions. For the active fields:

List<FieldName> activeModelFields = evaluator.getActiveFields();

• The subsequent data preparation performs several tasks: value conversions between the Java type system and the PMML type system, validation of these values according to the specifications in the Data Field element, handling of invalid, missing values and outliers as per the Mining Field element.
FieldValue activeValue = evaluator.prepare(activeField, inputValue);
pmmlArguments.put(activeField, activeValue);

• The prediction is executed next:

Map<FieldName, ?> results = evaluator.evaluate(pmmlArguments);

• Once this is done, the mapping from the scoring results & other fields to output events is performed. This needs to differentiate between the cases where the target values are Java primitive values or something else.

FieldName targetName = evaluator.getTargetField();
Object targetValue = results.get(targetName);
if (targetValue instanceof Computable) { …

More details about this approach can be found at JPMML-Evaluator: Preparing arguments for evaluation and Java (JPMML) Prediction using R PMML model.

The key aspect is that the JPMML Evaluator API provides the functionality for implementing the Scoring Processor independently of the actual model being used. The active variables, mappings, assignments, and predictor invocations are figured out automatically from the PMML representation. This approach allows flexibility for the scoring application. Suppose that several PMML models have been generated off-line for the same system component, equipment part, etc. Then, for example, an n-variable logistic model could be replaced by an m-variable decision tree model via the UI control by just pointing a Scoring Processor variable to the new PMML object. Moreover, the substitution can be executed via signal events sent through the UI Application Control (upper left of the EPN) without stopping and restarting the scoring application. This is practical because the real-time data keeps flowing in!

### Tested models

The R algorithms listed below were tested and identical results were obtained for predictions based on the OSX PMML/JPMML scoring application and predictions in R.

lm (stats)
glm (stats)
rpart (rpart)
naiveBayes (e1071)
nnet (nnet)
randomForest (randomForest)

The prototype is new, and other algorithms are currently being tested. Details will follow in a subsequent post.

### Acknowledgment

The OSX-ORE PMML/JPMML-based prototype for real-time scoring was developed together with Mauricio Arango | A-Team Cloud Solutions Architects. The work was presented at BIWA 2016.


## pacman Ver 0.4.1 Release

(This article was first published on R – TRinker’s R Blog, and kindly contributed to R-bloggers)

It was just over a year ago that Dason Kurkiewicz and I released pacman to CRAN. We have been developing the package on GitHub in the past 14 months and are pleased to announce these changes have made their way to CRAN in version 0.4.1.

## What Does pacman Do?

pacman is an R package management tool that combines the functionality of base library-related functions into intuitively named functions. The package is ideally added to .Rprofile to streamline workflow: it reduces the time spent recalling obscurely named functions, reduces code, and integrates base functions so that multiple actions can be performed at once.

We really wish people would use pacman when sharing code (blog posts and help lists/boards). The reason is selfish. Often one is trying out code and the poster has ten new packages in use that we don’t have. This means we have to stop the act of trying out the poster’s code to install the packages being used. To add insult to injury, multiple library calls make the script less readable.

Sing it with us…

Imagine there’s no install.packages
It’s easy if you try
No multiple library calls
Above us only sky
Imagine all the coders
Using pacman today…

Skip to the bottom where we demo what this coders’ utopia looks like.

## What’s New in Version 0.4.1?

Here are a few of the most notable highlights.

• Support for Bioconductor packages added, compliments of Keith Hughitt.
• `p_boot` added to generate a string for the standard pacman script header that, when added to scripts, will ensure pacman is installed before attempting to use it. pacman will attempt to copy this string (standard script header) to the clipboard for easy cut and paste.
• `p_install_version_gh` and `p_load_current_gh` added as partners to `p_install_version` for GitHub packages. Thanks to Steve Simpson for this.

## Example Use

We will examine pacman‘s popularity over the last 14 months. We implore readers to make the package even more widely used by including it in scripts posted online.

This script uses pacman to allow the user to check for, install, and load the four required packages all with two easy lines of code. The first line (compliments of `p_boot`) just makes sure pacman is installed. The second checks for, installs, and loads the packages. It’s pretty nice to just run a script, isn’t it?

```
if (!require("pacman")) install.packages("pacman")
pacman::p_load(cranlogs, dplyr, ggplot2, scales)  # check for, install, and load

package <- "pacman"
color <- "#26B8A6"
hjust <- -.069
start <- "2015-02-01"

lvls <- format(seq(as.Date(start), to = Sys.Date(), by = "month"), format = "%b %y")

# daily CRAN download counts via cranlogs
dat <- cran_downloads(packages = package, from = start, to = Sys.Date()) %>%
tbl_df() %>%
select(-package) %>%
mutate(
date = as.POSIXct(date),
month = factor(format(date, format = "%b %y"), levels = lvls)
) %>%
na.omit() %>%
rename(timestamp = date)

aggregated <- dat %>%
group_by(month) %>%
summarize(n=sum(count), mean=mean(count), sd=sd(count)) %>%
filter(n > 0)

aggregated  %>%
ggplot(aes(month, n, group=1)) +
geom_path(size=4, color=color) +
geom_point(size=8, color=color) +
geom_point(size=3, color="white") +
theme_bw() +
#ggplot2::annotate("segment", x=-Inf, xend=-Inf, y=-Inf, yend=Inf, color="grey70") +
labs(
x = NULL #"Year-Month"
) +
theme(
text=element_text(color="grey50"),
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.line = element_line(),
axis.ticks.x = element_line(color='grey70'),
axis.ticks.y = element_blank(),
plot.title = element_text(hjust = hjust, face="bold", color='grey50')
) +
scale_y_continuous(
expand = c(0, 0),
limits = c(0, max(aggregated$n)*1.15),
labels = comma
)

```

The script is customizable for any package. Here we view a few more packages’ usage (some of the ones I’ve been enjoying as of late). Oh and you can download all of these packages via:

```
if (!require("pacman")) install.packages("pacman")
```


## Jupyter Notebooks with R in Azure Machine Learning Studio

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

Earlier today Microsoft announced that Jupyter Notebooks are now available with the R Kernel as a service in Azure Machine Learning (ML) Studio.

I wrote about Jupyter Notebooks in September 2015 (Using R with Jupyter Notebooks), where I noted some of the great benefits of using notebooks:

• Jupyter is an easy to use and convenient way of mixing code and text in the same document.
• Unlike other reporting systems like RMarkdown and LaTeX, Jupyter notebooks are interactive – you can run the code snippets directly in the document.
• This makes it easy to share and publish code samples as well as reports.

This new announcement is that Jupyter Notebooks with R are now a feature of the Azure Machine Learning Studio. This means:

1. You can create and edit R notebooks directly in the Azure ML cloud, without having to create virtual machines.
2. Using the AzureML R package, you can interact between R notebooks and your Azure ML environment. Specifically, you can:
• Publish and consume web services
3. You can easily publish your notebook in the Cortana Intelligence Gallery, either privately or publicly.

## See notebooks in action

This video has a gentle introduction on how to use Jupyter Notebooks:

## See some sample notebooks

We have created some examples that demonstrate the possibilities:

You can find these notebooks, and others, in the Cortana Intelligence Gallery:

Published notebooks in the Cortana Intelligence Gallery

## How to try the notebooks in Azure ML

You can try the notebook service, as well as all of the Azure ML Studio, using a free guest account.

1. Go to http://studio.azureml.net/
• This will give you the option to use a guest account, a free account or a standard workspace
2. For completely anonymous access, choose “Guest Workspace”
• However, note that in a guest workspace you won’t be able to save your work.
• In a “Free Workspace” you can save your work and publish notebooks.
3. Take the option to do the quick tour, if you wish, otherwise skip this step
4. Click the Notebook tab on the left
5. You now have a range of sample notebooks to choose from. You can select either a new R notebook, or one of the sample notebooks.
• Notice that some of the notebooks are Python and some are R.
• You can tell the difference by looking for the small “R” icon in the top right hand corner of each tile
6. Click on “Blank R Notebook”, or open one of the other R sample notebooks
• Code away!
• To run code in a cell, press Shift + Enter

## Credits

The Jupyter project is an open source project with many contributors. Thank you to the Jupyter team, as well as the IRKernel team who did the integration work between Jupyter and R.

## Tell us what you think

You can help us prioritize features and otherwise tell us what you think by completing this one-minute survey.

The announcement blog post contains more detail about the features of notebooks and the integration with Azure ML Studio.


## Introducing a Weekly R / Python / JS / etc Vis Challenge!

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Per a suggestion, I’m going to try to find a neat data set (prbly one from @jsvine) to feature each week and toss up some sample code (99% of the time prbly in R) and offer up a vis challenge. Just reply in the comments with a link to a gist/repo/rpub/blog/etc (or post directly, though inserting code requires some markup that you can ping me abt) post containing the code & vis with a brief explanation. I’ll gather up everything into a new github organization I made for this contest. You can also submit a PR right to this week’s repo.

Winners get a free digital copy of Data-Driven Security, and if you win more than once I’ll come up with other stuff to give away (either an Amazon gift card, a book or something Captain America related).

Submissions should include a story/angle/question you were trying to answer, any notes or “gotchas” that the code/comments don’t explain and a [beautiful] vis. You can use whatever language or tool (even Excel or ugh Tableau), but you’ll have to describe what you did step-by-step for the GUI tools or record a video, since the main point of this contest is to help folks learn about asking questions, munging data and making visualizations. Excel & Tableau lock that knowledge in and Tableau even locks that data in.

### Droning on and on

Today’s data source comes from this week’s Data Is Plural newsletter and is all about drones. @jsvine linked to the main FAA site for drone sightings and there’s enough ways to slice the data that it should make for some interesting story angles.

I will remove one of those angles with a simple bar chart of unmanned aircraft (UAS) sightings by week, using an FAA site color for the bars. I wanted to see if there were any overt visual patterns in the time of year or if the registration requirement at the end of 2015 caused any changes (I didn’t crunch the numbers to see if there were any actual patterns that could be found statistically, but that’s something y’all can do). I’m now curious as to what caused the “spike” in August/September 2015; the report text may have that data.

I’ve put this week’s example code & data into the 52 vis repo for this week.

```
library(ggplot2)
library(ggalt)
library(ggthemes)
library(readxl)
library(dplyr)
library(hrbrmisc)
library(grid)

# get copies of the data locally
URL1 <- "http://www.faa.gov/uas/media/UAS_Sightings_report_21Aug-31Jan.xlsx"
URL2 <- "http://www.faa.gov/uas/media/UASEventsNov2014-Aug2015.xls"
fil1 <- basename(URL1)
fil2 <- basename(URL2)
if (!file.exists(fil1)) download.file(URL1, fil1)
if (!file.exists(fil2)) download.file(URL2, fil2)

# read it in
xl1 <- read_excel(fil1)
xl2 <- read_excel(fil2)

# munge it a bit so we can play with it by various calendrical options
drones <- setNames(bind_rows(xl2[,1:3], xl1[,c(1,3,4)]),
                   c("ts", "city", "state"))
drones <- mutate(drones,
                 year=format(ts, "%Y"),
                 year_mon=format(ts, "%Y%m"),
                 ymd=as.Date(ts),
                 yw=format(ts, "%Y%V"))

# let's see them by week
by_week <- mutate(count(drones, yw), wk=as.Date(sprintf("%s1", yw), "%Y%U%u")-7)

# this looks like bad data but I didn't investigate it too much
by_week <- arrange(filter(by_week, wk>=as.Date("2014-11-10")), wk)

# plot
gg <- ggplot(by_week, aes(wk, n))
gg <- gg + geom_bar(stat="identity", fill="#937206")
gg <- gg + annotate("text", by_week$wk[1], 49, label="# reports",
                    hjust=0, vjust=1, family="Cabin-Italic", size=3)
gg <- gg + scale_x_date(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + labs(y=NULL,
                title="Weekly U.S. UAS (drone) sightings",
                subtitle="As reported to the Federal Aviation Administration",
                caption="Data from: http://www.faa.gov/uas/law_enforcement/uas_sighting_reports/")
gg <- gg + theme_hrbrmstr(grid="Y", axis="X")
gg <- gg + theme(axis.title.x=element_text(margin=margin(t=-6)))
gg
```

### Fin

I’ll still keep up a weekly vis from the Data Is Plural weekly collection even if this whole contest thing doesn’t take root with folks. You can never have too many examples for budding data folks to review.


## An awesome RStudio addin for selecting colours, and another for adding marginal density plots to ggplot2

(This article was first published on Dean Attali’s R Blog, and kindly contributed to R-bloggers)

TL;DR: There’s a colour picker addin in shinyjs and a ggplot2 marginal plots addin in ggExtra.

Any R user who hasn’t been spending the past 2 months under a rock should know by now about RStudio’s new exciting features: addins and gadgets.

(In case you don’t know, here’s a summary: Gadgets are simply Shiny apps that return a value, and are therefore meant to be used by the programmer to assign a value to a variable rather than by an end-user. Addins are a way to add your own clickable menus in RStudio that call R functions upon being clicked.)

From the moment I saw the announcement about gadgets and addins, I was excited to try them out. Unfortunately, I’ve been pretty tied up with wrapping up my thesis, so I didn’t get too much play time. I’m happy to finally announce the two addin+gadget pairs I’ve been working on: a colour picker and a tool to add marginal density plots to ggplot2.

## Colour picker

A colour picker gadget and addin are available in the `shinyjs` package.

Screenshot:

Some of you may already know that `shinyjs` provides a `colourInput()` function, which is a Shiny input widget that lets you select a colour (demo). The idea of the colour picker gadget is to extend this idea of a colour input and provide R developers with an easy way to select colours. It’s perfect if you want to choose a colour/a vector of colours to use for a plot or for anything else that requires a careful selection of colours.

You can either run the colour picker as an addin or as a gadget. To access it as an addin, click on the RStudio Addins menu and select Colour Picker. To access the colour picker gadget, run the `colourPicker()` function. When running the gadget, you can provide a positive integer as an argument (e.g. `colourPicker(3)`), which will cause the colour picker to initialize with placeholders for 3 colours.

### Features

• By default, the colour picker lets you select one colour. You can add/remove colours by using the buttons (plus sign to add another colour placeholder, garbage icon to remove the selected colour).

• The colours returned can either be HEX values (e.g. “#FF0000”) or R colour names (e.g. “red”). If you opt to get the R colour names, you will still get the HEX value for any colour that does not have a corresponding R colour.

• Each colour can be selected in one of three ways: 1. picking any arbitrary colour from a “colour wheel” (it’s actually a square though); 2. providing an arbitrary colour and selecting an R colour that’s very similar to the colour you provided; 3. picking from the complete list of all available R colours.
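The second option can be sketched in a few lines of base R. This illustrates the “nearest R colour” idea only; it is not shinyjs’s actual implementation:

```r
# Find the built-in R colour name closest to a given hex colour, using
# squared RGB distance. Illustrative only; not shinyjs's own code.
nearest_r_colour <- function(hex) {
  target <- as.vector(col2rgb(hex))          # 3-element RGB vector
  candidates <- col2rgb(colors())            # 3 x n matrix of all named colours
  dists <- colSums((candidates - target)^2)  # squared distance per colour
  colors()[which.min(dists)]
}
nearest_r_colour("#FF0001")  # a named colour within a hair of pure red
```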

• When accessed as an addin (through the Addins menu), the result will be injected into the current script in RStudio as valid code that produces a vector of colours.

• When run as a gadget, the result is a vector of colours, so you can assign it to a variable, e.g. `mycols <- colourPicker(5)`.

Here’s a GIF demo of the colour picker in action:

## Add marginal density/histogram plots to ggplot2

This gadget and addin are available in the `ggExtra` package.

Screenshot:

The flagship function of the `ggExtra` package is `ggMarginal()`, which is used to add marginal density/histograms to ggplot2 plots (demo). Now with the help of this addin, you can do this interactively by setting all the different parameters and seeing how the plot is affected in real-time.

You can either run the marginal plot builder as an addin or as a gadget. To access it as an addin, highlight the code for a plot and then select ggplot2 Marginal Plots from the RStudio Addins menu. This will embed the resulting code for the marginal plots right into your script. Alternatively, you can call `ggMarginalGadget()` with a ggplot2 plot, and the return value will be a plot object. For example, if you have the following code in your script

```r
library(ggplot2)
library(ggExtra)
myplot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
```

You can select the text `myplot` and run the addin from the menu, select the entire ggplot expression and run the addin, or call `mynewplot <- ggMarginalGadget(myplot)`.

Disclaimer: The UI that is available for building Shiny gadgets is very limited, and there are very few resources and examples out there for building gadgets, so some of the UI code is a little bit hacky. If you’re having issues with the UI of one of these gadgets, please do let me know. If you have any other feedback, I’d also love to hear about it!


## How to check Likert scale summaries for plausibility

(This article was first published on BayesFactor: Software for Bayesian inference, and kindly contributed to R-bloggers)

Suppose you are reading a paper that uses Likert scale responses. The paper reports the mean, standard deviation, and number of responses. If we are — for some reason — suspicious of a paper, we might ask, “Are these summary statistics possible for this number of responses, for this Likert scale?” Someone asked me this recently, so I wrote some simple code to help check. In this blog post, I outline how the code works.

Suppose we are reading a paper that uses an 8-category Likert scale response (0-7) and they report that for 100 responses, the mean response was .5, and the standard deviation of the responses was 5. I have intentionally chosen these numbers to be impossible: for the mean to be .5, the responses have to be on average near the bottom of the scale. But if the responses are near the bottom of the scale, the standard deviation also has to be very low, because there is a bound at 0. A standard deviation of 5 is much too large for that mean.
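To see why, note that for responses bounded in [a, b] with mean m, the population variance can be at most (m - a)(b - m), attained when every response sits at one of the two endpoints; the sample variance picks up an extra factor of n/(n - 1). A quick check of this bound (my own sketch, not code from the post):

```r
# Upper bound on the sample SD of n responses bounded in [a, b] with mean m:
# population variance is at most (m - a) * (b - m), attained when all
# responses sit at the endpoints; sample variance adds a factor n / (n - 1).
max_sd <- function(m, a, b, n) sqrt((m - a) * (b - m) * n / (n - 1))

# Reported: n = 100 responses on a 0-7 scale, mean .5, SD 5
max_sd(0.5, 0, 7, 100)  # about 1.81 -- far below the reported SD of 5
```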

Another possible inconsistency arises due to the discreteness of Likert scales. For 10 Likert responses on a 3-point scale (0-2), the mean must be a multiple of .1. This, in turn, imposes a complicated constraint on the standard deviation.
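To make the discreteness constraint concrete, here is a small enumeration (my own illustration, not code from the post) showing that every achievable mean for 10 responses on a 0-2 scale is a multiple of .1:

```r
# All response patterns (n0, n1, n2) with n0 + n1 + n2 = 10
patterns <- expand.grid(n0 = 0:10, n1 = 0:10, n2 = 0:10)
patterns <- patterns[rowSums(patterns) == 10, ]

# The mean of each pattern is (n1 + 2 * n2) / 10, so it is always
# an integer number of tenths
means <- (patterns$n1 + 2 * patterns$n2) / 10
length(unique(means))  # 21 achievable means: 0.0, 0.1, ..., 2.0
```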

Checking whether a response pattern is possible may be simple for low N, but it gets complex as N increases. Suppose we are wondering, for a 6-point Likert scale (0-5), whether N=94, M=.83, and SD=1.21 are possible summary statistics. There are several ways we could go about this using R.

The code for both methods is available here: https://gist.github.com/richarddmorey/787d7b7547a736a49c3f

### Option 1: Brute force

In the brute force method, we create all possible response patterns (defined as the number of responses in each response category) and then check them, finding the ones closest to our desired summary statistics. The code linked above has two functions. The first, count.ns, counts the total number of response patterns for a given Likert scale and N. For the summaries above, this will be

```
> count.ns(N, nlev)
[1] 71523144
```

or about 72 million response patterns. Brute force isn’t pretty, but this job is possible in a minute or two on a modern PC. A note: on some older Windows computers this may exhaust your memory.
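count.ns itself is defined in the linked gist, but the count is just the standard "stars and bars" number of ways to distribute N responses among K categories, which we can verify directly (this one-liner is my own, not the gist's code):

```r
# Number of ways to place N indistinguishable responses into K categories
# ("stars and bars"): choose(N + K - 1, K - 1)
count_patterns <- function(N, K) choose(N + K - 1, K - 1)

count_patterns(94, 6)  # 71523144, matching count.ns above
```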
The function get.ns will compute all possible response patterns and put them in a matrix. This may take a bit of time (on my Macbook Pro it takes about a minute and a half):

```
> x = get.ns(94, nlev=6)
> x[1:10,]
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    0    0    0    0    0   94
 [2,]    0    0    0    0    1   93
 [3,]    0    0    0    0    2   92
 [4,]    0    0    0    0    3   91
 [5,]    0    0    0    0    4   90
 [6,]    0    0    0    0    5   89
 [7,]    0    0    0    0    6   88
 [8,]    0    0    0    0    7   87
 [9,]    0    0    0    0    8   86
[10,]    0    0    0    0    9   85
```

As you can see, x now contains the possible response patterns for N=94 responses, with K=6 Likert response categories. Above I show the first 10 patterns.
All that remains now is to compute the mean and standard deviation of each response pattern and compare it to the target. We can use the sum of the squared deviations from the target mean and standard deviation to sort the possible solutions. Note that if we wanted both deviations to be less than .005 (to account for rounding to the nearest .01) then we would want solutions that are no greater than 2 times .005^2 = .00005 in their summed squared error.
The code linked above does all the sorting and places the solutions in an object called res. After running the code, the first 10 rows of res are:
```
> res[1:10,]
      resp0 resp1 resp2 resp3 resp4 resp5      mean sum x^2 std. dev.    sum error
 [1,]    50    30     0     9     4     1 0.8297872     200  1.206063 1.554821e-05
 [2,]    54    22     0    17     0     1 0.8297872     200  1.206063 1.554821e-05
 [3,]    56    18     1    18     1     0 0.8297872     200  1.206063 1.554821e-05
 [4,]    57    15     4    17     1     0 0.8297872     200  1.206063 1.554821e-05
 [5,]    59     3    27     0     4     1 0.8297872     200  1.206063 1.554821e-05
 [6,]    59    10     7    18     0     0 0.8297872     200  1.206063 1.554821e-05
 [7,]    60     1    27     2     3     1 0.8297872     200  1.206063 1.554821e-05
 [8,]    60     7    10    17     0     0 0.8297872     200  1.206063 1.554821e-05
 [9,]    41    46     1     0     0     6 0.8297872     200  1.206063 1.554821e-05
[10,]    43    42     2     1     1     5 0.8297872     200  1.206063 1.554821e-05
```
The first six columns contain the possible response pattern; the next three columns contain the summary statistics; the final column contains the sum of the error. We wished to match M=.83 and SD=1.21, and there are many solutions that yield these summary statistics. There is no reason to be suspicious of these summaries.
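The summary columns of res are easy to reproduce by hand. Taking the first row's response pattern (a base-R check I am adding for illustration; it is not part of the gist):

```r
# First response pattern from res: counts of responses 0 through 5
counts <- c(50, 30, 0, 9, 4, 1)
responses <- rep(0:5, times = counts)

mean(responses)   # 0.8297872
sd(responses)     # 1.206063 (sample standard deviation)
sum(responses^2)  # 200, the "sum x^2" column

# Summed squared error against the targets M = .83 and SD = 1.21
(mean(responses) - 0.83)^2 + (sd(responses) - 1.21)^2  # 1.554821e-05
```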

### Option 2: Linear Inverse Models

The brute force solution above does the job, but it is slow and memory intensive. If we had a 7-point Likert scale, the number of possible response patterns would be about 1.2 billion; add more categories or responses, and the time and memory required by the brute force method become prohibitive. We can instead use an approximate method, linear inverse models, to get close.
The idea of linear inverse models is that we minimize a quadratic function subject to some constraints. In our case, suppose we would like to find proportions of responses that yield summary statistics as close as possible to our given ones. We have some reasonable constraints:
1. Proportions must sum to 1 (equality constraint).
2. Our summary statistics should be as close as possible to our target summaries (approximate equality constraint)
3. Proportions must be between 0 and 1 (inequality constraint).
The limSolve package in R allows us to sample approximate solutions for our response proportions according to these constraints. I will not go into detail here about how to define these constraints for the limSolve package; see my code linked above and the limSolve manual. I will, however, show you the output of limSolve's xsample function, which we use to sample possible solutions:
```
> xs <- limSolve::xsample(A = A, B = B, E = E, F = F, G = G, H = H, sdB = 1)
> xs$X[1:10,]
           [,1]      [,2]       [,3]       [,4]       [,5]       [,6]
 [1,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [2,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [3,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [4,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [5,] 0.5231726 0.2862201 0.12002470 0.02459116 0.00000000 0.04599141
 [6,] 0.5085929 0.3226566 0.08151242 0.01938088 0.03202715 0.03582995
 [7,] 0.5085929 0.3226566 0.08151242 0.01938088 0.03202715 0.03582995
 [8,] 0.5085929 0.3226566 0.08151242 0.01938088 0.03202715 0.03582995
 [9,] 0.5860701 0.2541897 0.04073821 0.03497458 0.01822376 0.06580361
[10,] 0.5860701 0.2541897 0.04073821 0.03497458 0.01822376 0.06580361
```

The possible solutions for the proportions of responses in each response category are in each row of the xs$X element of the output. We need to multiply these by N=94 to see the response patterns themselves:
```
> xs$X[1:10,]*94
          [,1]     [,2]      [,3]     [,4]     [,5]     [,6]
 [1,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [2,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [3,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [4,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [5,] 49.17823 26.90469 11.282322 2.311569 0.000000 4.323192
 [6,] 47.80774 30.32972  7.662167 1.821802 3.010553 3.368016
 [7,] 47.80774 30.32972  7.662167 1.821802 3.010553 3.368016
 [8,] 47.80774 30.32972  7.662167 1.821802 3.010553 3.368016
 [9,] 55.09059 23.89383  3.829392 3.287611 1.713034 6.185539
[10,] 55.09059 23.89383  3.829392 3.287611 1.713034 6.185539
```

There are several things to notice here. First, there are some duplicates, which is to be expected from the sampling method. Second, these solutions are not integers. Since response patterns must be integers, we would have to round these and make sure they sum to N=94 before testing them to see if their summaries are acceptably close.
Take the first solution: we might round this to obtain

```
49 27 11 2 0 4
```

However, this sums to only 93, so we need to add 1 somewhere. Suppose we add it to the fourth category, to obtain

```
49 27 11 3 0 4
```

The mean response for this response pattern is almost exactly .83, our target mean. The standard deviation is 1.20, which is only .01 away from our target standard deviation.
Finally, note the similarity of these solutions to the ones we obtained by brute force. The linear inverse model method has gotten us into the neighborhood of the good solutions without the inelegance, time, and memory hogging of the brute force method. However, unlike the brute force method, it is not constrained to integer solutions, so we need to do some post-processing.
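Written out in code, the round-and-repair step looks like this (the proportions are the first sampled solution above; placing the leftover response in the fourth category follows the text, and other placements could be checked the same way):

```r
# Proportions from the first row of xs$X
p <- c(0.5231726, 0.2862201, 0.12002470, 0.02459116, 0.00000000, 0.04599141)

counts <- round(p * 94)      # 49 27 11 2 0 4 -- sums to only 93
counts[4] <- counts[4] + 1   # add the missing response to the fourth category

responses <- rep(0:5, times = counts)
c(mean = mean(responses), sd = sd(responses))  # 0.8297872 and about 1.20
```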

### Conclusion

Both the brute force and the linear inverse model solutions yield the same answer: for a 6-point Likert scale (0-5), N=94, M=.83, and SD=1.21 are not problematic summary statistics. Which of these methods one uses depends largely on the situation, but one could even combine them for a more efficient search. As a reminder, the complete code can be found here: https://gist.github.com/richarddmorey/787d7b7547a736a49c3f.