RcppAPT 0.0.3

By Thinking inside the box

A new version of RcppAPT — our interface from R to the C++ library behind the awesome apt, apt-get, apt-cache, … commands and their cache powering Debian, Ubuntu and the like — is now on CRAN.

We changed the package to require C++11 compilation as newer Debian systems with g++-6 and the current libapt-pkg-dev library cannot build under the C++98 standard which CRAN imposes (and let’s not get into why …). Once set to C++11 we have no issues. We also added more examples to the manual pages, and turned on code coverage.
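For reference, opting into C++11 in an R package is typically a one-line setting in the package sources; a sketch of the standard mechanism from Writing R Extensions, not necessarily the exact change made in RcppAPT:

# src/Makevars (and src/Makevars.win)
CXX_STD = CXX11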

A bit more information about the package is available here as well as at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Source:: R News

flea circus

By xi’an


An old riddle, found on X validated (where it asked for a Monte Carlo resolution) but originally given on Project Euler:

A 30×30 grid of squares contains 30² fleas, initially one flea per square. When a bell is rung, each flea jumps to an adjacent square at random. What is the expected number of unoccupied squares after 50 bell rings, up to six decimal places?

The debate on X validated is whether or not a Monte Carlo resolution is feasible. Up to six decimals, certainly not. But with some lower precision, certainly. Here is a rather basic R code where the 50 steps are operated on the 900 squares, rather than the 900 fleas. This saves some time by avoiding empty squares.

exprmt=function(n=10,T=50){

 mean=0
 for (t in 1:n){

   # one flea per square at the start
   board=rep(1,900)
   for (b in 1:T){

    # beard collects the fleas after the jump
    beard=rep(0,900)
    # top-left corner: only moves right (+1) or down (+30)
    if (board[1]>0){
      poz=c(0,1,0,30)
      ne=rmultinom(1,board[1],prob=(poz!=0))
      beard[1+poz]=beard[1+poz]+ne}
    #
    # all other occupied squares but the last corner: u and v are the
    # coordinates of square i, and poz zeroes out moves leaving the grid
    for (i in (2:899)[board[-1][-899]>0]){
     u=(i-1)%%30+1;v=(i-1)%/%30+1
     poz=c(-(u>1),(u<30),-30*(v>1),30*(v<30))
     ne=rmultinom(1,board[i],prob=(poz!=0))
     beard[i+poz]=beard[i+poz]+ne}
    #
    # bottom-right corner: only moves left (-1) or up (-30)
    if (board[900]>0){
     poz=c(-1,0,-30,0)
     ne=rmultinom(1,board[900],prob=(poz!=0))
     beard[900+poz]=beard[900+poz]+ne}
     board=beard}
   mean=mean+sum(board==0)}
 return(mean/n)}

The function returns an empirical average over n replications. The handling of the borderline squares is presumably awkward, since it involves adding zeros to keep the same four-move structure everywhere. Nonetheless, it produces an approximation rather close to the approximate expected value of 900/e (each square ending up with an approximately Poisson(1) number of fleas), in about 3 minutes on my laptop.

> exprmt(n=1e3)
[1] 331.082
> 900/exp(1)
[1] 331.0915

Further gains follow from considering only half of the squares, as there are two independent processes acting in parallel: fleas on the two colour classes of the checkerboard never mix. I looked at an alternative and much faster approach using the stationary distribution, namely the Multinomial(450, (2/1740, 3/1740, …, 4/1740, …, 2/1740)), with probabilities proportional to 2 in the corners, 3 on the sides, and 4 in the inside (the weights over one half of the board sum to 1740). Strictly speaking, the process has no stationary distribution, since it is periodic; but one can consider instead the subprocess indexed by even times. The occupancy frequencies do seem to agree with this stationary distribution, after defining the weights as:

# weights: 2 in the corners, 3 on the sides, 4 in the inside
inva=function(B=30){
return(c(2,rep(3,B-2),2,rep(c(3,
  rep(4,B-2),3),B-2),2,rep(3,B-2),2))}

namely

> mn=0;n=1e8 #14 clock hours!
> proz=rep(c(rep(c(0,1),15),rep(c(1,0),15)),15)*inva(30)
> for (t in 1:n)
+ mn=mn+table(rmultinom(1,450,prob=proz))[1:4]
> mn=mn/n
> mn[1]=mn[1]-450 #remove the 450 always-empty squares of the other colour class
> mn
     0      1      2      3
166.11 164.92  82.56  27.71
> exprmt(n=1e6) #55 clock hours!! (exprmt modified to return the frequencies of 0, 1, 2, 3 occupants)
[1] 165.36 165.69 82.92 27.57

my original confusion being that the Poisson approximation had not yet taken over… (Of course, computing the first frequency for the stationary distribution does not require any simulation, since it is the sum over the 450 reachable squares of the complement probabilities to the power 450, i.e., 166.1069.)
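For the record, a direct computation of that first frequency (my own sketch, not from the original post):

# keep one colour class of the checkerboard and normalise its weights
keep=rep(c(rep(c(0,1),15),rep(c(1,0),15)),15)==1
p=inva(30)[keep]/sum(inva(30)[keep]) #the weights sum to 1740
sum((1-p)^450) #166.1069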

Filed under: Books, Kids, pictures, R, Statistics Tagged: coupling, cross validated, fleas, Leonhard Euler, Markov chains, Markov random field, Monte Carlo integration, occupancy, Poisson approximation, R, random walk, stationarity

Source:: R News

PISA 2015 – how to read/process/plot the data with R

By smarterpoland


(This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

Yesterday the OECD published results and data from the PISA 2015 study (Programme for International Student Assessment). It's a very cool study: over 500 000 pupils (15-year-olds) are examined every 3 years. The raw data is publicly available, and one can easily access detailed information about pupils' academic performance along with detailed survey data from students, parents and school officials (~2 000 variables). Lots of stories to be found.

You can download the dataset in the SPSS format from this webpage. Then use the foreign package to read the sav files and the intsvy package to calculate aggregates/averages/tables/regression models (for the 2015 data you should use the GitHub version of the package).

Below you will find a short example of how to read the data, calculate weighted averages for genders/countries and plot these results with ggplot2. Here you will find other use cases for the intsvy package.

library("foreign")
library("intsvy")
library("dplyr")
library("ggplot2")
library("tidyr")

stud2015 <- read.spss("CY6_MS_CMB_STU_QQQ.sav", use.value.labels = TRUE, to.data.frame = TRUE)
genderMath <- pisa2015.mean.pv(pvlabel = "MATH", by = c("CNT", "ST004D01T"), data = stud2015)

genderMath <- genderMath[,c(1,2,4,5)]   # keep only the columns used below
genderMath %>%
  select(CNT, ST004D01T, Mean) %>%
  spread(ST004D01T, Mean) -> genderMathWide

genderMathSelected <-
  genderMathWide %>%
  filter(CNT %in% c("Austria", "Japan", "Switzerland", "Poland", "Singapore", "Finland", "Korea", "United States"))

pl <- ggplot(genderMathWide, aes(Female, Male)) +
  geom_point() +
  geom_point(data=genderMathSelected, color="red") +
  geom_text(data=genderMathSelected, aes(label=CNT), color="grey20") +
  geom_abline(slope=1, intercept = 0) + 
  geom_abline(slope=1, intercept = 20, linetype = 2, color="grey") + 
  geom_abline(slope=1, intercept = -20, linetype = 2, color="grey") +
  geom_text(x=425, y=460, label="Boys +20 points", angle=45, color="grey", size=8) + 
  geom_text(x=460, y=425, label="Girls +20 points", angle=45, color="grey", size=8) + 
  coord_fixed(xlim = c(400,565), ylim = c(400,565)) +
  theme_bw() + ggtitle("PISA 2015 in Math - Gender Gap") +
  xlab("PISA 2015 Math score for girls") +
  ylab("PISA 2015 Math score for boys")
pl   # display the plot

To leave a comment for the author, please follow the link and comment on their blog: SmarterPoland.pl » English.


Source:: R News

2016-15 Automating R Demonstration Videos

By pmur002

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This document describes a proof-of-concept for producing R demonstration videos in a fully-automated manner. The “script” for the video consists of a text file containing code chunks paired with text commentary. The video is produced by running the code while recording a screen capture, using text-to-speech software to record audio of the commentary, then combining video and audio with appropriate timings and pauses.

Paul Murrell

Download

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.


Source:: R News

Microsoft R Server 9.0 now available

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Server 9.0, Microsoft’s R distribution with added big-data, in-database, and integration capabilities, was released today and is now available for download to MSDN subscribers. This latest release is built on Microsoft R Open 3.3.2, and adds new machine-learning capabilities, new ways to integrate R into applications, and additional big-data support for Spark 2.0.

This release includes a brand new R package for machine learning: MicrosoftML. This package provides state-of-the-art, fast and scalable machine learning algorithms for common data science tasks including featurization, classification and regression. Some of the functions provided include (a short usage sketch follows the list):

  • Fast linear and logistic model functions based on the Stochastic Dual Coordinate Ascent method;
  • Fast Forests, a random forest and quantile regression forest implementation based on FastRank, an efficient implementation of the MART gradient boosting algorithm;
  • A neural network algorithm with support for custom, multilayer network topologies and GPU acceleration;
  • One-class anomaly detection based on support vector machines.
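To give a flavor of the API, here is a minimal sketch; the function names come from the MicrosoftML documentation, while the txns data frame and its columns are made up for illustration:

library(MicrosoftML)
# a made-up transactions data frame for illustration
txns <- data.frame(amount = rlnorm(1000), hour = sample(0:23, 1000, replace = TRUE))
# one-class SVM anomaly detection; one-class models take a formula with no response
model <- rxOneClassSvm(~ amount + hour, data = txns)
# higher predicted scores flag more anomalous rows
scores <- rxPredict(model, data = txns)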

You can learn more about MicrosoftML at this live webinar on Wednesday, December 14.

The RevoScaleR package, which provides big-data support for Microsoft R Server, has been updated to support Spark 2.0. (This is in addition to existing support for in-database computations with SQL Server and Teradata.) New functions allow you to connect to a Hive, Parquet or Spark DataFrame data source, and execute data manipulation and predictive modeling tasks directly on the data as it resides within Spark.
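For instance, a hedged sketch of the new data sources (constructor names per the RevoScaleR 9.0 documentation; the Hive table and Parquet path are placeholders):

library(RevoScaleR)
cc <- rxSparkConnect()                               # Spark compute context
hive <- RxHiveData(query = "select * from mytable")  # hypothetical Hive table
pq <- RxParquetData(file = "/share/claims.parquet")  # hypothetical Parquet file
rxSummary(~., data = pq)                             # runs inside Spark
rxSparkDisconnect(cc)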

Microsoft R Server also sports new capabilities to integrate R functions into other applications. (This is the result of enhancements to the former DeployR project, which is now part of Microsoft R Server.) Microsoft R Server can now host R functions exposed as web services: data scientists can publish R functions to the R Server, which can then be accessed from any application. Application developers can integrate those R functions into their applications using easy-to-consume Swagger-based APIs from any programming language.

Also released today is the free Microsoft R Client 3.3.2. This is a desktop edition of Microsoft R Server 9.0, designed for local data science development and remote execution on local servers and in the cloud. It includes the same MicrosoftML package as Microsoft R Server. The RevoScaleR package in Microsoft R Client is limited in data size when computing locally, but can shift computations to a remote Microsoft R Server when working with larger data sets. It also includes the new mrsdeploy package, so you can remote-execute any R code on the remote R Server, or even publish R functions as a Web service.
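Publishing a function as a web service looks roughly like this (a sketch based on the mrsdeploy documentation of the time; the endpoint, credentials and toy function are placeholders):

library(mrsdeploy)
remoteLogin("http://localhost:12800", username = "admin", password = "***")
# a toy function to expose; the 'answer' variable maps to the declared output
add_one <- function(x) {
  answer <- x + 1
}
api <- publishService("addOne", code = add_one,
                      inputs = list(x = "numeric"),
                      outputs = list(answer = "numeric"),
                      v = "v1.0.0")

The published service can then be consumed from any language through its Swagger definition.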

For a complete list of changes, see What’s New in R Server 9.0.1. And for more on the release of Microsoft R Server 9.0, check out the blog post linked below.

Cortana Intelligence and Machine Learning Blog: Introducing Microsoft R Server 9.0

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

November Package Picks

By Joseph Rickert

by Joseph Rickert

November was a prolific month for R developers: 189 new packages landed on CRAN. I have selected more than a quarter of them for this post, but I haven't listed everything that is worth a look. My November 2016 picks are organized into six categories: Biotech (4 picks), Data (6 picks), Machine Learning (9 picks), Statistics (9 picks), Time Series (4 picks) and Utilities (20 picks). The relatively large number of Utilities packages listed seriously over-represents this category. However, I have included so many to emphasize the cumulative impact of developers working to improve the R ecosystem at a fairly low level. Also, I believe that these sorts of packages are relatively difficult to discover.

Biotech

The packages listed under this heading support analyses in biostatistics, genetics and medicine.

  • esaddle v0.0.2: Provides functions for fitting the Extended Empirical Saddlepoint (EES) density. The vignette provides examples.
  • incidence v1.0.1: Provides functions and classes to compute, manipulate and visualize incidence from dated events for defined time intervals. There are three vignettes including this overview.
  • speaq2 v0.1.0: Provides wavelet-based tools for the analysis of NMR spectra. The vignette shows how to process a data set.
  • starmie v0.1.2: Provides data structures and methods for manipulating the output of genetic population structuring algorithms. There is a Basic Usage vignette, as well as one on Admixture models.

Data

The packages here provide access to data through various methods.

  • ALA4R v1.5.3: Provides an interface to the Atlas of Living Australia (ALA) that allows users to access and visualize data on Australian plants and animals. The vignette shows how to use it.
  • BatchGetSymbols v1.0: Makes it easy to download a large amount of trade data from Yahoo or Google Finance. There is a brief vignette.
  • elasticsearchr v0.1.0: Provides a lightweight interface to Elasticsearch, a NoSQL search engine and column store database. The vignette provides details.
  • hansard v0.2.5: Provides functions for downloading data using the UK Parliament API. The vignette describes how to access information on individual members of parliament, briefings and more.
  • isdparser v0.1.0: Provides tools for parsing NOAA Integrated Surface Database (ISD) files. The vignette gives the basics on reading and parsing the files.
  • RBMRB v2.0: Provides an interface to the Biological Magnetic Resonance Data Bank (BMRB), along with tools for NMR images. Look here for documentation.

Machine Learning

The packages listed here are geared towards machine-learning applications.

  • bcROCsurface v1.0-1: Offers functions to compute bias-corrected estimates of ROC curves. There is a guide.
  • BiBitR v0.1.0: Wraps the Java BiBit biclustering algorithm for extracting bit-patterns from binary data sets. See the Bioinformatics paper for details.
  • cleanNLP v0.24: Provides a Tidy Data model based on dplyr for converting a textual corpus into a set of normalized tables. The underlying NLP pipeline is based on Stanford’s CoreNLP library.
  • ffstream v0.1-5: Provides an implementation of the adaptive forgetting factor algorithm for estimating the mean and variance of a data stream in order to detect multiple changepoints. The details are in the vignette.
  • FTRLProximal v0.1.2: Implements the Follow-The-Regularized-Leader (FTRL) Proximal algorithm for online training of large-scale regression models using a mixture of L1 and L2 regularization.
  • IDmining v1.0.0: Contains functions for mining large high-dimensional data sets using the Intrinsic Dimension technique. This paper describes the idea.
  • mltools v0.1.0: Provides a collection of machine learning helper functions for exploratory analysis. The README file provides some details.
  • OpenML v1.1: Provides an interface to the OpenML online machine-learning platform. The vignette provides an example of how to use it.
  • rucrdtw v0.1.1: Provides R bindings for functions from the UCR Suite (Rakthanmanon et al. 2012), which enables ultrafast subsequence search under both Dynamic Time Warping and Euclidean Distance. The vignette shows how to use the package.

Statistics

The packages listed under this heading mostly offer algorithms to support statistical analyses. Notable are queuecomputer, which implements a discrete event simulation, and regtools, which could have also been listed under the Machine Learning heading.

  • bayesplot v1.0.0: Provides plotting functions for posterior analysis, model checking, and MCMC diagnostics. There are vignettes for MCMC diagnostics, plotting MCMC draws, and graphical posterior checks.
  • eMLEloglin v1.0.1: Provides functions for fitting log-linear models to sparse contingency tables. See the user manual for the math.
  • POT v1.1-6: Implements functions to perform Peaks Over Threshold analysis, useful in Extreme Value Theory. The vignette explains the math.
  • queuecomputer v0.5.1: Provides computationally efficient solutions for simulating queues with arbitrary arrival and service times. There is a vignette describing how to use the package and one showing how to simulate M/M/k queues.
  • regtools v1.0.1: Provides novel tools for linear and nonlinear regression, and nonparametric regression and classification. The vignette contains examples.
  • revdbayes v1.0.0: Provides functions for the Bayesian analysis of extreme value models. The vignette contains several interesting examples and references.
  • slim v0.1.0: Provides functions to fit singular linear models to longitudinal data. The theory is described in this Biometrika paper, and the vignette provides examples.
  • varband v0.9.0: Implements the variable banding procedure described in a paper by Yu and Bien for modeling local dependence and estimating precision matrices. The vignette shows how to use the package. A plot in the original post shows the sparsity patterns of the true model and of the sample covariance matrix for one of the examples.
  • xyz v0.1: Implements an algorithm by Thanei, Meinshausen and Shah for finding strong interactions in almost linear time. The vignette contains an example.

Time Series

The packages listed here explicitly call out time series applications.

  • GeomComb v1.0: Provides an eigenvector-based method for combining time series forecasts.
  • ptest v1.0-8: Implements p-value computations for testing periodicity in short time series. The vignette provides examples and references.
  • tsdisagg2 v0.1.0: Provides functions to disaggregate low frequency time series data to higher frequency series. The vignette describes the math and provides references.
  • zoocat v0.2.0: Extends the zoo class and provides tools for manipulating multivariate time series data. The vignette contains an example.

Utilities

The packages listed here are a varied collection of convenience utilities, package extensions, gateways to other software, and low-level computing functions. Notable are flock and subprocess, which feel like systems-level programming.

  • batchtools v0.9.0: As a successor of the packages 'BatchJobs' and 'BatchExperiments', this package provides a parallel implementation of the Map function for high-performance computing systems managed by schedulers such as 'IBM Spectrum LSF', 'OpenLava', and others. There are four vignettes, including this one on error handling.
  • benchr v0.1.0: Provides infrastructure to accurately measure and compare the execution times of R expressions. Usage is described here.
  • bindr v0.1: Provides an interface for creating active binding where the bound function accepts additional arguments. Usage is described here.
  • bytescircle v1.0: Shows statistics about the bytes contained in a file as a circle graph of deviations from the mean, in sigma increments. A plot in the vignette shows byte values mapped as an Archimedean spiral, where each byte value is represented as a colored circle whose size indicates its deviation from the mean.
  • crul v0.1.0: Implements a simple HTTP client for making HTTP requests. Look at the GitHub README file for information on where to start.
  • datapasta v1.0.0: Provides three addins for copying and pasting tables and vectors from Excel, Jupyter, and websites into the RStudio editor. The vignette provides an example.
  • debugme v1.0.1: Offers functions to specify debugging messages as special string constants, and control package debugging via environment variables. Look here for an example.
  • errorizer v0.1.1: Creates "errorized" versions of existing R functions with enhanced capabilities for logging and error handling. The vignette provides an example and describes the limitations of the method.
  • fauxpas v0.1.0: Provides methods for general-purpose HTTP error handling. Integrates with packages crul, curl and httr. Look here for crul and curl examples.
  • flock v0.7: Nitty-gritty package that implements synchronization between R processes using file locks.
  • ggforce v0.1.1: Offers new stats and geoms to be used with ggplot2. The vignette provides several examples.
  • ggstance v0.3: An extension to ggplot2 that provides flipped components and horizontal versions of stats and geoms. The package README file contains examples.
  • naptime v1.2.0: Provides a “near drop-in” replacement for base::Sys.sleep() that allows for more control of delays. The vignette explains the why and how of napping.
  • packagedocs v0.4.0: Should make writing package vignettes a little easier by providing functions for building websites of Package Documentation. See the quick start manual and package reference manual.
  • reactR v0.1.0: Provides an interface to the React Javascript library for building user interfaces. There is an introduction.
  • rly v1.0.1: Another nitty-gritty package, it provides an R implementation of the parsing tools lex and yacc. See the README file for examples.
  • spark.sas7bdat v1.0: Allows R users to read SAS data files into Spark. The vignette indicates how to read SAS data in parallel.
  • startup v0.3.0: Provides new directories to specify R’s startup configuration that makes it possible to keep private / secret variables separate from other environment variables. See the README file.
  • subprocess v0.7.4 : Brings systems-level programming to R, with the capability to create, interact with and control the life cycle of child processes. The vignette shows how.
  • tesseract v1.2: Allows text to be extracted from an image. This is an OCR engine with unicode (UTF-8) support that can recognize over 100 languages. See the README file.

Source:: R News

Secret Santa Picker 2 using R

By wszafranski

(This article was first published on The Practical R, and kindly contributed to R-bloggers)

Last year I made a blog post about a Secret Santa picker HERE, but using it required quite a bit of messing around with the code. So this year I decided to improve the whole thing by making it a function rather than a script. The function takes two inputs: a vector of names and the number of names. The code for the function is listed first, followed by the code for the script used to call the function.

Here’s the function. Nothing needs to be changed in this code for it to run properly.

# make a function
secret_santa <- function(npeople, names){

  # this 'flag' determines whether we stay in
  # or leave the while loop below
  flag = "bad"

  # first list of names (the givers)
  namelist1 = matrix(names, ncol = 1, nrow = npeople)

  while (flag == "bad"){

    # names to choose from (the receivers)
    namelist2 = matrix(ncol = 1, nrow = npeople, NA)

    for (i in 1:npeople){
      # shuffle the names that remain available
      if (i == 1){
        xx2 = sample(names, (npeople-i+1), replace = FALSE)
      } else {
        xx2 = sample(xx2, (npeople-i+1), replace = FALSE)
      }

      if (i == npeople & xx2[1] == namelist1[i,1]){
        # the last person drew themselves: restart the whole draw
        flag = "bad"
      } else if (xx2[1] != namelist1[i,1]){
        namelist2[i,1] = xx2[1]
        flag = "good"
      } else {
        # first candidate is the giver themselves: take the next one
        namelist2[i,1] = xx2[2]
        flag = "good"
      }

      # set up the new vector with one less name
      used = which(xx2 == namelist2[i])
      xx2[used] = "zzzzz"
      xx2 = sort(xx2)[1:(npeople-i)]
    }

    # add "has" to the matrix
    has = matrix(ncol = 1, nrow = npeople, "has")

    # build the final matrix: giver, "has", receiver
    final = cbind(namelist1, has, namelist2)
  }
  final
}

Save this function as “secret-santa-function.R” and we’ll call it from our script. Okay, now let’s make our script.

# call the function from the script
source("secret-santa-function.R")

### Function input
### make a list of names
names = c("James","Nick","Emily","Natasha","Bob", "Teddy")
n = length(names)

#call the function
output <-secret_santa(n, names)
output 

The vector of names is the only input needed; in the case above it's names = c("James","Nick","Emily","Natasha","Bob","Teddy"). The other variable the function needs is the number of names, which is computed automatically with the length function. That's it, you're done. Call the function from the script and you've got your names.
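The result is a character matrix with one row per giver; one possible draw for the names above looks like:

     [,1]      [,2]  [,3]
[1,] "James"   "has" "Nick"
[2,] "Nick"    "has" "Emily"
[3,] "Emily"   "has" "Natasha"
[4,] "Natasha" "has" "Bob"
[5,] "Bob"     "has" "Teddy"
[6,] "Teddy"   "has" "James"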

To leave a comment for the author, please follow the link and comment on their blog: The Practical R.


Source:: R News

Pipe Dream

By R-SquareD


(This article was first published on R-SquareD, and kindly contributed to R-bloggers)

Plusses and Arrows and Percents, oh my!

Do you continually substitute “%>%” for “+” when switching between data wrangling and data visualization? I’ve got just the solution for you!

Count me as one of those people that continually use a pipe instead of a plus, and vice-versa, when I'm writing a lot of code. Sir Hadley has basically shut the door on ever switching ggplot to using magrittr pipes, and I don't blame him. But he can't stop me from doing whatever the heck I want.

In the following code, I took the sample ggplot code from the help and modified it to use magrittr.

library(magrittr)
library(ggplot2)
library(dplyr)
# pipe-friendly wrappers: each takes the plot object as its first
# argument and returns it with the layer added
geom_point_p = function(p, ...) {
  return(
    p + geom_point(...)
  )
}

geom_errorbar_p = function(p, ...) {
  return(
    p + geom_errorbar(...)
  )
}

df = data.frame(
  gp = factor(rep(letters[1:3], each = 10)),
  y = rnorm(30)
)
ds = plyr::ddply(df, "gp", plyr::summarise, mean = mean(y), sd = sd(y))

# The summary data frame ds is used to plot larger red points on top
# of the raw data. Note that we don't need to supply `data` or `mapping`
# in each layer because the defaults from ggplot() are used.
ggplot(df, aes(gp, y)) %>%
  geom_point_p() %>%
  geom_point_p(data = ds, aes(y = mean), colour = 'red', size = 3)


# Same plot as above, declaring only the data frame in ggplot().
# Note how the x and y aesthetics must now be declared in
# each geom_point() layer.
ggplot(df) %>%
  geom_point_p(aes(gp, y)) %>%
  geom_point_p(data = ds, aes(gp, mean), colour = 'red', size = 3)


# Alternatively we can fully specify the plot in each layer. This
# is not useful here, but can be more clear when working with complex
# multi-dataset graphics
ggplot() %>%
  geom_point_p(data = df, aes(gp, y)) %>%
  geom_point_p(data = ds, aes(gp, mean), colour = 'red', size = 3) %>%
  geom_errorbar_p(
    data = ds,
    aes(gp, mean, ymin = mean - sd, ymax = mean + sd),
    colour = 'red',
    width = 0.4
  )


I may have to roll this into a package at some point.
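If I do, a small factory function could generate these wrappers instead of writing each one by hand; a quick sketch of the idea (pipe_layer is a name I just made up):

# given any ggplot2 layer constructor, return a pipe-friendly version
pipe_layer = function(geom) {
  function(p, ...) p + geom(...)
}
geom_line_p = pipe_layer(ggplot2::geom_line)
geom_smooth_p = pipe_layer(ggplot2::geom_smooth)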

To leave a comment for the author, please follow the link and comment on their blog: R-SquareD.


Source:: R News

Using replyr::let to Parameterize dplyr Expressions

By Nina Zumel


Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.

Enter let, from our new package replyr.

replyr::let implements a mapping from the “symbolic” names used in a dplyr expression to the names of the actual columns in a data frame. This allows you to encapsulate complex dplyr expressions without the use of the lazyeval package, which is the currently recommended way to manage dplyr‘s use of non-standard evaluation. Thus, you could write the function to create the table above as:

# to install replyr: 
# devtools::install_github('WinVector/replyr')

library(dplyr)
library(replyr)  

#
# calculate mean +/- sd intervals and
#           median +/- 1/2 IQR intervals
# for arbitrary data frame column, with optional grouping
#
dist_intervals = function(dframe, colname, groupcolname=NULL) {
  mapping = list(col=colname)
  if(!is.null(groupcolname)) {
    dframe %>% group_by_(groupcolname) -> dframe
  }
  # let() builds a macro in which 'col' is replaced by the actual
  # column name; the trailing () then executes that macro
  let(alias=mapping,
      expr={
        dframe %>% summarize(sdlower = mean(col)-sd(col),
                             mean = mean(col),
                             sdupper = mean(col) + sd(col),
                             iqrlower = median(col)-0.5*IQR(col),
                             median = median(col),
                             iqrupper = median(col)+0.5*IQR(col))
      })()
}

The mapping is specified as a list of assignments symname=colname, where symname is the name used in the dplyr expression, and colname is the name (as a string) of the corresponding column in the data frame. We can now call our dist_intervals on the iris dataset:

dist_intervals(iris, "Sepal.Length")

   sdlower     mean  sdupper iqrlower median iqrupper
1 5.015267 5.843333 6.671399     5.15    5.8     6.45

dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

dist_intervals(iris, "Petal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
      <fctr>    <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>
1     setosa 1.288336 1.462 1.635664   1.4125   1.50   1.5875
2 versicolor 3.790089 4.260 4.729911   4.0500   4.35   4.6500
3  virginica 5.000105 5.552 6.103895   5.1625   5.55   5.9375

The implementation of let is adapted from gtools::strmacro by Gregory R. Warnes. Its primary purpose is for wrapping dplyr, but you can use it to parameterize other functions that take their arguments via non-standard evaluation, like ggplot2 functions — in other words, you can use replyr::let instead of ggplot2::aes_string, if you are feeling perverse. Because let creates a macro, you have to avoid variable collisions (for example, remapping x in ggplot2 will clobber both sides of aes(x=x)), and you should remember that any side effects of the expression will escape let‘s execution environment.
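For instance, a sketch of the ggplot2 remark above (xcol is a symbolic name I picked, deliberately different from the aesthetic name x to avoid the collision just described):

library(ggplot2)
library(replyr)
# substitute the column name for xcol, then execute the macro
let(alias = list(xcol = "Sepal.Length"),
    expr = {
      ggplot(iris, aes(x = xcol)) + geom_histogram(bins = 20)
    })()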

The replyr package is available on github. Its goal is to supply uniform dplyr-based methods for manipulating data frames and tbls both locally and on remote (dplyr-supported) back ends. This is a new package, and it is still going through growing pains as we figure out the best ways to implement desired functionality. We welcome suggestions for new functions, and more efficient or more general ways to implement the functionality that we supply.

Source:: R News

the incredible accuracy of Stirling’s approximation

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

The last riddle from the Riddler [last before The Election] summed up to finding the probability of a Binomial B(2N,½) draw ending up at the very middle, N, which is

℘ = C(2N,N) 2^(-2N) = (2N)!/(N!)² 2^(-2N)

If one uses the standard Stirling approximation to the factorial function,

log(N!)≈Nlog(N) – N + ½log(2πN)

the approximation to ℘ is 1/√(πN), which is not perfect for small values of N. Introducing the second-order Stirling approximation,

log(N!)≈Nlog(N) – N + ½log(2πN) + 1/(12N)

the approximation becomes

℘ ≈ exp(-1/(8N))/√(πN)

which fits almost exactly from the start. This accuracy was already pointed out by William Feller, in Section II.9 of An Introduction to Probability Theory and Its Applications.
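A quick numerical check of both approximations (my own sketch, not from the original post):

N=1:10
exact=choose(2*N,N)/4^N          #℘ = C(2N,N) 2^(-2N)
first=1/sqrt(pi*N)               #first-order Stirling
second=exp(-1/(8*N))/sqrt(pi*N)  #second-order Stirling
round(cbind(N,exact,first,second),6)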


To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.


Source:: R News