rfoaas 0.1.6

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

After a few quiet months, a new version of rfoaas is now on CRAN. As before, it shadows the upstream release of FOAAS, from which it carried over the version number 0.1.6.

The rfoaas package provides an interface for R to the most excellent FOAAS service–which provides a modern, scalable and RESTful web service for the frequent need to tell someone to f$#@ off.

Release 0.1.6 of FOAAS builds on the initial support for filters and now adds working internationalization. This is best shown by example:

R> library(rfoaas)
R> off("Tom", "Everyone")                                      # standard upstream example
[1] "Fuck off, Tom. - Everyone"
R> off("Tom", "Everyone", language="fr")                       # now in French
[1] "Va te faire foutre, Tom. - Everyone"
R> off("Tom", "Everyone", language="de", filter="shoutcloud")  # or in German and LOUD
[1] "VERPISS DICH, TOM. - EVERYONE"
R> 

We start with a standard call to off(), add the (now fully functional upstream) language support and finally illustrate the shoutcloud.io filter added in the preceding 0.1.3 release.

As usual, CRANberries provides a diff to the previous CRAN release. Questions, comments etc should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


Paper Helicopter Experiment, part III

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

As the final part of my paper helicopter experiment analysis (following parts I and II), this post looks at a transformation of the response and at model selection.

As can be seen, larger values seem to have larger errors, and this is not really corrected very much by a log transformation. To examine this a bit more, the Box-Cox transformation is used. From there it seems a square root is almost optimal, but a log or no transformation should also work. It was decided to use a square root transformation.
Given the square root transformation, the residual error should not be lower than 0.02, since that is what the replications show. On the other hand, a residual error much higher than 0.02 is a clear sign of underfitting.
Analysis of Variance Table

Response: sqrt(Time)
Df Sum Sq Mean Sq F value Pr(>F)
allvl 3 1.84481 0.61494 26 2.707e-05 ***
Residuals 11 0.26016 0.02365
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

Model selection

Given the desired residual variance, the model that is linear in the variables is not sufficient.
Analysis of Variance Table

Response: sTime
Df Sum Sq Mean Sq F value Pr(>F)
RotorLength 1 3.6578 3.6578 18.4625 0.0001257 ***
RotorWidth 1 1.0120 1.0120 5.1078 0.0299644 *
BodyLength 1 0.1352 0.1352 0.6823 0.4142439
FootLength 1 0.2719 0.2719 1.3725 0.2490708
FoldLength 1 0.0060 0.0060 0.0302 0.8629331
FoldWidth 1 0.0189 0.0189 0.0953 0.7592922
PaperWeight 1 0.6528 0.6528 3.2951 0.0778251 .
DirectionOfFold 1 0.4952 0.4952 2.4994 0.1226372
Residuals 36 7.1324 0.1981
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

Adding interactions and quadratic effects via stepwise regression did not improve much.

Analysis of Variance Table

Response: sTime
Df Sum Sq Mean Sq F value Pr(>F)
RotorLength 1 3.6578 3.6578 29.5262 3.971e-06 ***
RotorWidth 1 1.0120 1.0120 8.1687 0.007042 **
FootLength 1 0.3079 0.3079 2.4851 0.123676
PaperWeight 1 0.6909 0.6909 5.5769 0.023730 *
I(RotorLength^2) 1 2.2035 2.2035 17.7872 0.000159 ***
I(RotorWidth^2) 1 0.3347 0.3347 2.7018 0.108941
FootLength:PaperWeight 1 0.4291 0.4291 3.4634 0.070922 .
RotorWidth:FootLength 1 0.2865 0.2865 2.3126 0.137064
Residuals 36 4.4598 0.1239
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Just adding the quadratic effects did not help either. However, using both linear and quadratic as a starting point did give a more extensive model.
Analysis of Variance Table

Response: sTime
Df Sum Sq Mean Sq F value Pr(>F)
RotorLength 1 3.6578 3.6578 103.8434 5.350e-10 ***
RotorWidth 1 1.0120 1.0120 28.7293 1.918e-05 ***
FootLength 1 0.3079 0.3079 8.7401 0.0070780 **
FoldLength 1 0.0145 0.0145 0.4113 0.5276737
FoldWidth 1 0.0099 0.0099 0.2816 0.6007138
PaperWeight 1 0.7122 0.7122 20.2180 0.0001633 ***
DirectionOfFold 1 0.5175 0.5175 14.6902 0.0008514 ***
I(RotorLength^2) 1 1.7405 1.7405 49.4119 3.661e-07 ***
I(RotorWidth^2) 1 0.3160 0.3160 8.9709 0.0064635 **
I(FootLength^2) 1 0.1216 0.1216 3.4525 0.0760048 .
I(FoldLength^2) 1 0.0045 0.0045 0.1272 0.7245574
RotorLength:RotorWidth 1 0.4181 0.4181 11.8693 0.0022032 **
RotorLength:PaperWeight 1 0.3778 0.3778 10.7247 0.0033254 **
RotorWidth:FootLength 1 0.6021 0.6021 17.0947 0.0004026 ***
PaperWeight:DirectionOfFold 1 0.3358 0.3358 9.5339 0.0051968 **
RotorWidth:FoldLength 1 1.5984 1.5984 45.3778 7.167e-07 ***
RotorWidth:FoldWidth 1 0.3937 0.3937 11.1769 0.0028207 **
RotorWidth:PaperWeight 1 0.2029 0.2029 5.7593 0.0248924 *
RotorWidth:DirectionOfFold 1 0.0870 0.0870 2.4695 0.1297310
RotorLength:FootLength 1 0.0687 0.0687 1.9517 0.1757410
FootLength:PaperWeight 1 0.0732 0.0732 2.0781 0.1629080
Residuals 23 0.8102 0.0352
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

This model is quite extensive. For prediction purposes I would probably drop a few terms. For instance, FootLength:PaperWeight could be removed; this would lessen the fit slightly yet may improve predictions, since its p-value is close to 0.15. As it currently stands the model does have some issues. For instance, quite a few points have high leverage; a quick check is sketched below.
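
A quick way to check this is to look at the hat values of the final model; the snippet below is just a sketch using base R on the s2 object fitted in the code at the end of this post, not part of the original analysis.

# flag points above the common 2*p/n rule of thumb for high leverage
lev <- hatvalues(s2)
p <- length(coef(s2))
n <- nrow(h3)
which(lev > 2 * p / n)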

Conclusion

The paper helicopter needs quite a complex model to fit all effects on flying time. This somewhat validates the complex models found in part 1.

Code used

library(dplyr)
library(car)
# the data below is whitespace-separated as printed, so the default separator works
h3 <- read.table(header=TRUE,text='
RotorLength RotorWidth BodyLength FootLength FoldLength FoldWidth PaperWeight DirectionOfFold Time
5.5 3 1.5 0 5 1.5 light against 11.8
5.5 3 1.5 2.5 11 2.5 heavy against 8.29
5.5 3 5.5 0 11 2.5 light with 9
5.5 3 5.5 2.5 5 1.5 heavy with 7.21
5.5 5 1.5 0 11 1.5 heavy with 6.65
5.5 5 1.5 2.5 5 2.5 light with 10.26
5.5 5 5.5 0 5 2.5 heavy against 7.98
5.5 5 5.5 2.5 11 1.5 light against 8.06
11.5 3 1.5 0 5 2.5 heavy with 9.2
11.5 3 1.5 2.5 11 1.5 light with 19.35
11.5 3 5.5 0 11 1.5 heavy against 12.08
11.5 3 5.5 2.5 5 2.5 light against 20.5
11.5 5 1.5 0 11 2.5 light against 13.58
11.5 5 1.5 2.5 5 1.5 heavy against 7.47
11.5 5 5.5 0 5 1.5 light with 9.79
11.5 5 5.5 2.5 11 2.5 heavy with 9.2
8.5 4 3.5 1.25 8 2 light against 10.52
8.5 4 3.5 1.25 8 2 light against 10.81
8.5 4 3.5 1.25 8 2 light against 10.89
8.5 4 3.5 1.25 8 2 heavy against 15.91
8.5 4 3.5 1.25 8 2 heavy against 16.08
8.5 4 3.5 1.25 8 2 heavy against 13.88
8.5 4 2 2 6 2 light against 12.99
9.5 3.61 2 2 6 2 light against 15.22
10.5 3.22 2 2 6 2 light against 16.34
11.5 2.83 2 2 6 1.5 light against 18.78
12.5 2.44 2 2 6 1.5 light against 17.39
13.5 2.05 2 2 6 1.5 light against 7.24
10.5 2.44 2 1.5 6 1.5 light against 13.65
12.5 2.44 2 1.5 6 1.5 light against 13.74
10.5 3.22 2 1.5 6 1.5 light against 15.48
12.5 3.22 2 1.5 6 1.5 light against 13.53
11.5 2.83 2 1.5 6 1.5 light against 17.38
11.5 2.83 2 1.5 6 1.5 light against 16.35
11.5 2.83 2 1.5 6 1.5 light against 16.41
10.08 2.83 2 1.5 6 1.5 light against 12.51
12.91 2.83 2 1.5 6 1.5 light against 15.17
11.5 2.28 2 1.5 6 1.5 light against 14.86
11.5 3.38 2 1.5 6 1.5 light against 11.85
11.18 2.94 2 2 6 1.5 light against 15.54
11.18 2.94 2 2 6 1.5 light against 16.4
11.18 2.94 2 2 6 1.5 light against 19.67
11.18 2.94 2 2 6 1.5 light against 19.41
11.18 2.94 2 2 6 1.5 light against 18.55
11.18 2.94 2 2 6 1.5 light against 17.29
')
names(h3)

h3 <- h3 %>%
mutate(.,
FRL=factor(format(RotorLength,digits=2)),
FRW=factor(format(RotorWidth,digits=2)),
FBL=factor(format(BodyLength,digits=2)),
FFt=factor(format(FootLength,digits=2)),
FFd=factor(format(FoldLength,digits=2)),
FFW=factor(format(FoldWidth,digits=2)),
allvl=interaction(FRL,FRW,FBL,FFt,FFd,FFW,PaperWeight,DirectionOfFold,drop=TRUE)
)

# first line reconstructed from a garbled original: tabulate replicates per setting
h4 <- xtabs(~allvl,data=h3) %>%
as.data.frame %>%
filter(.,Freq>1) %>%
merge(.,h3) %>%
select(.,RotorLength,
RotorWidth,BodyLength,FootLength,
FoldLength,FoldWidth,PaperWeight,
DirectionOfFold,allvl,Time,Freq) %>%
print
lm(Time~allvl,data=h4) %>% anova

par(mfrow=c(1,2))
aov(Time~allvl,data=h3) %>% residualPlot(.,main='Untransformed')
aov(log10(Time)~allvl,data=h3) %>% residualPlot(.,main='Log10 Transform')

lm(Time ~ RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold,
data=h3) %>%
boxCox(.)
dev.off()

lm(sqrt(Time)~allvl,data=h4) %>% anova

h3 <- mutate(h3,sTime=sqrt(Time))

lm(sTime ~ RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold,
data=h3) %>%
anova

s1 <- lm(sTime ~
RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold ,
data=h3) %>%
step(.,scope=~(RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold)*
(RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth)+
I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
PaperWeight*DirectionOfFold)
anova(s1)

s1 <- lm(sTime ~
RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold ,
data=h3) %>%
step(.,scope=~(RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold)*
(RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth)+
I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
PaperWeight*DirectionOfFold)

anova(s1)
s2 <- lm(sTime ~
RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold +
I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) ,
data=h3) %>%
step(.,scope=~(RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth +
PaperWeight + DirectionOfFold)*
(RotorLength + RotorWidth + BodyLength +
FootLength + FoldLength + FoldWidth)+
I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
PaperWeight*DirectionOfFold)

anova(s2)
par(mfrow=c(2,2))
plot(s2)
To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.


A new R package for detecting unusual time series

By Rob J Hyndman


(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

The anomalous package provides some tools to detect unusual time series in a large collection of time series. This is joint work with Earo Wang (an honours student at Monash) and Nikolay Laptev (from Yahoo Labs). Yahoo is interested in detecting unusual patterns in server metrics.

The basic idea is to measure a range of features of the time series (such as strength of seasonality, an index of spikiness, first order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the 2-dimensional space of the first two principal component scores.

We use two methods to identify outliers.

  1. A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking from most outlying (lowest density) to least outlying (highest density); a minimal sketch of this idea is given after the list.
  2. A series of α-convex hulls are computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking, with the most outlying having separated at the highest value of α, and the remaining outliers separating at decreasing values of α.
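
To make the density-ranking idea in the first method concrete, here is a minimal sketch in base R plus MASS; it assumes a numeric matrix called features with one row per series, and it is not the anomalous package's own API.

library(MASS)   # for kde2d()

rank_by_density <- function(features) {
  pc <- prcomp(features, scale. = TRUE)           # PCA of the feature matrix
  scores <- pc$x[, 1:2]                           # first two PC scores
  kd <- kde2d(scores[, 1], scores[, 2], n = 50)   # bivariate kernel density estimate
  # look up the density at each observation via its nearest grid cell
  ix <- pmax(findInterval(scores[, 1], kd$x), 1)
  iy <- pmax(findInterval(scores[, 2], kd$y), 1)
  dens <- kd$z[cbind(ix, iy)]
  order(dens)                                     # most outlying (lowest density) first
}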

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.

The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.

There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R.


Lessons learned in high-performance R

By tony.fischetti@gmail.com

Comparing performance of different HP methods

(This article was first published on On the lambda » R, and kindly contributed to R-bloggers)

On this blog, I’ve had a long-running investigation/demonstration of how to make an “embarrassingly parallel” but computationally intractable (on commodity hardware, at least) R problem more performant by using parallel computation and Rcpp.

The example problem is to find the mean distance between every airport in the United States. This silly example was chosen because it exhibits polynomial growth in running time as a function of the number of airports and, thus, quickly becomes intractable without sampling. It is also easy to parallelize.

The first post used the (now-deprecated in favor of ‘parallel’) multicore package to achieve a substantial speedup. The second post used Rcpp to achieve a statistically significant but, functionally, trivial speedup by replacing the inner loop (the distance calculation using the Haversine formula) with a version written in C++ using Rcpp. Though I was disappointed in the results, it should be noted that porting the function to C++ took virtually no extra work.

By necessity, I’ve learned a lot more about high-performance R since writing those two posts (part of this is by trying to make my own R package as performant as possible). In particular, I did the Rcpp version all wrong, and I’d like to rectify that in this post. I also compare the running times of approaches that use both parallelism and Rcpp.

Lesson 1: use Rcpp correctly
The biggest lesson I learned is that it isn’t sufficient to just replace inner loops with C++ code; the repeated transfer of data from R to C++ comes with a lot of overhead. By actually coding the loop in C++, the speedups to be had are often astounding.

In this example, the pure R version, which takes a matrix of longitude/latitude pairs and computes the mean distance between all combinations, looks like this…

library(magrittr)  # provides the %>% pipe used below
# haversine() is the plain-R Haversine distance function from the earlier posts (not shown here)
just.R <- function(dframe){
  numrows <- nrow(dframe)
  combns <- combn(1:nrow(dframe), 2)
  numcombs <- ncol(combns)
  combns %>%
  {mapply(function(x,y){
          haversine(dframe[x,1], dframe[x,2],
                    dframe[y,1], dframe[y,2]) },
          .[1,], .[2,])} %>%
  sum %>%
  (function(x) x/(numrows*(numrows-1)/2))
}

The naive usage of Rcpp (and the one I used in the second blog post on this topic) simply replaces the call to “haversine” with a call to “haversine_cpp”, which is written in C++. Again, a small speedup was obtained, but it was functionally trivial.

The better solution is to completely replace the combinations/”mapply” construct with a C++ version. Mine looks like this…

#include <Rcpp.h>
// haversine_cpp() is the C++ Haversine helper from the earlier post (not shown here);
// the attribute below lets Rcpp::sourceCpp() expose all_cpp() to R
// [[Rcpp::export]]
double all_cpp(Rcpp::NumericMatrix& mat){
    int nrow = mat.nrow();
    int numcomps = nrow*(nrow-1)/2;
    double running_sum = 0;
    for( int i = 0; i < nrow; i++ ){
        for( int j = i+1; j < nrow; j++){
            running_sum += haversine_cpp(mat(i,0), mat(i,1),
                                         mat(j,0), mat(j,1));
        }
    }
    return running_sum / numcomps;
}

The difference is incredible…

library(rbenchmark)  # provides benchmark(); air.locs and R.calling.cpp.naive() come from the earlier posts
res <- benchmark(R.calling.cpp.naive(air.locs[,-1]),
                 just.R(air.locs[,-1]),
                 all_cpp(as.matrix(air.locs[,-1])),
                 columns = c("test", "replications", "elapsed", "relative"),
                                  order="relative", replications=10)
res
#                                   test replications elapsed relative
# 3  all_cpp(as.matrix(air.locs[, -1]))           10   0.021    1.000
# 1 R.calling.cpp.naive(air.locs[, -1])           10  14.419  686.619
# 2              just.R(air.locs[, -1])           10  15.068  717.524

The properly written solution in Rcpp is 718 times faster than the native R version and 687 times faster than the naive Rcpp solution (using 200 airports).

Lesson 2: Use mclapply/mcmapply
In the first blog post, I used a messy solution that explicitly called two parallel processes. I’ve learned that using mclapply/mcmapply is a lot cleaner and easier to integrate into idiomatic/functional R routines. In order to parallelize the native R version above, all I had to do was replace the call to “mapply” with a call to “mcmapply” and set the number of cores (now I have a 4-core machine!).
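
For concreteness, here is a hedged sketch of what that replacement looks like, not the exact code from the post; it assumes the haversine() function from the earlier posts and a Unix-like OS, since the mc* functions rely on forking.

library(parallel)

# same shape as just.R above, with mcmapply() standing in for mapply()
just.R.parallel <- function(dframe, cores = 4){
  numrows <- nrow(dframe)
  combns <- combn(1:numrows, 2)
  dists <- mcmapply(function(x, y){
             haversine(dframe[x,1], dframe[x,2],
                       dframe[y,1], dframe[y,2]) },
             combns[1,], combns[2,], mc.cores = cores)
  sum(dists) / (numrows*(numrows-1)/2)
}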

Here are the benchmarks:

                                           test replications elapsed relative
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  10.433    1.000
4              just.R.parallel(air.locs[, -1])           10  11.809    1.132
1          R.calling.cpp.naive(air.locs[, -1])           10  15.855    1.520
3                       just.R(air.locs[, -1])           10  17.221    1.651

Lesson 3: Smelly combinations of Rcpp and parallelism are sometimes counterproductive

Because of the nature of the problem and the way I chose to solve it, the solution that uses Rcpp correctly is not easily parallelizable. I wrote some *extremely* smelly code that used explicit parallelism to run the proper Rcpp solution in a parallel fashion; the results were interesting:

                                          test replications elapsed relative
5           all_cpp(as.matrix(air.locs[, -1]))           10   0.023    1.000
4              just.R.parallel(air.locs[, -1])           10  11.515  500.652
6             all.cpp.parallel(air.locs[, -1])           10  14.027  609.870
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  17.580  764.348
1          R.calling.cpp.naive(air.locs[, -1])           10  21.215  922.391
3                       just.R(air.locs[, -1])           10  32.907 1430.739

The parallelized proper Rcpp solution (all.cpp.parallel) was outperformed by the parallelized native R version. Further, the parallelized native R version was much easier to write and was idiomatic R.

How does it scale?

Two quick things…

  • The “all_cpp” solution doesn’t appear to exhibit polynomial growth; it does, it’s just so much faster than the rest that it looks completely flat
  • It’s hard to tell, but that’s “just.R.parallel” that is tied with “R.calling.cpp.naive.parallel”

Too long, didn’t read:
If you know C++, try using Rcpp (but correctly). If you don’t, try multicore versions of lapply and mapply, if applicable, for great good. If it’s fast enough, leave well enough alone.

PS: I way overstated how “intractable” this problem is. According to my curve fitting, the vanilla R solution would take somewhere between 2.5 and 3.5 hours. The fastest version of these methods, the non-parallelized proper Rcpp one, took 9 seconds to run. In case you were wondering, the answer is 1,869.7 km (1,161 miles). The geometric mean might have been more meaningful in this case, though.


To leave a comment for the author, please follow the link and comment on his blog: On the lambda » R.


Reproducibility breakout session at USCOTS

By mine

(This article was first published on Citizen-Statistician » R Project, and kindly contributed to R-bloggers)

Somehow almost an entire academic year went by without a blog post; I must have been busy… It’s time to get back in the saddle! (I’m using the classical definition of this idiom here, “doing something you stopped doing for a period of time”, not the urban dictionary definition, “when you are back to doing what you do best”, as I really don’t think writing blog posts is what I do best…)

One of the exciting things I took part in during the year was the NSF supported Reproducible Science Hackathon held at NESCent in Durham back in December.

I wrote here a while back about making reproducibility a central focus of students’ first introduction to data analysis, which is an ongoing effort in my intro stats course. The hackathon was a great opportunity to think about promoting reproducibility to a much wider audience than intro stat students — wider with respect to statistical background, computational skills, and discipline. The goal of the hackathon was to develop a two day workshop for reproducible research, or more specifically, reproducible data analysis and computation. Materials from the hackathon can be found here and are all CC0 licensed.

If this happened in December, why am I talking about this now? I was at USCOTS these last few days, and led a breakout session with Nick Horton on reproducibility, building on some of the materials we developed at the hackathon and framing them for a stat ed audience. The main goals of the session were

  1. to introduce statistics educators to RMarkdown via hands-on exercises and to promote it as a tool for reproducible data analysis, and
  2. to demonstrate that with the right exercises and the right amount of scaffolding it is possible (and in fact easier!) to teach R through the use of RMarkdown, and hence train new researchers whose only data analysis workflow is a reproducible one (a minimal R Markdown skeleton is sketched after this list).
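
As a small illustration of what such an exercise can look like, here is a minimal R Markdown skeleton; the file name, data set, and variable are made up for this example and are not taken from the workshop materials.

---
title: "My first reproducible analysis"
output: html_document
---

```{r load-data}
# hypothetical data file and variable, for illustration only
births <- read.csv("births.csv")
summary(births$weight)
```

```{r plot-weights}
hist(births$weight, main = "Birth weights")
```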

In the talk I also briefly discussed further tips for documentation and organization, as well as for getting started with version control tools like GitHub. Slides from my talk can be found here and all source code for the talk is here.

There was lots of discussion at USCOTS this year about incorporating more analysis of messy and complex data and more research into the undergraduate statistics curriculum. I hope that there will be an effort to not just do “more” with data in the classroom, but also do “better” with it, especially given that tools that easily lend themselves to best practices in reproducible data analysis (RMarkdown being one such example) are now more accessible than ever.

To leave a comment for the author, please follow the link and comment on his blog: Citizen-Statistician » R Project.


Cracking Safe Cracker with R

By tylerrinker

(This article was first published on TRinker’s R Blog » R, and kindly contributed to R-bloggers)

My wife got me a Safe Cracker 40 puzzle a while back. I believe I misplaced the solution some time back. The company, Creative Crafthouse, stands behind their products. They had amazing customer service and promptly supplied me with a solution. I’d supply the actual wheels as a cutout paper version but this is their property so this blog will be more enjoyable if you buy yourself a Safe Cracker 40 as well (I have no affiliation with the company, just enjoy their products and they have great customer service). Here’s what the puzzle looks like:

There are 16 columns of 4 rows. The goal is to line up the dials so that all columns sum to 40. It is somewhat difficult to explain how the puzzle moves, but each dial controls two rows. The outer row of the dial is notched and only covers every other cell of the row below. The outermost row does not have a notched row covering it. I believe there are 16^4 = 65536 possible combinations. I think it’s best to understand the logic by watching the video:

I enjoy puzzles, but after a year I hadn’t solved it. This one begged for a computer solution, and so I decided to use R to brute-force the solution a bit. To me the computer challenge was pretty fun in itself.

Here are the dials. The NAs represent the notches in the notched dials. I used a list structure because it helped me sort things out. Anything in the same list moves together, though the elements are not in the same row. Row a is the outermost wheel. Both b and b_i make up the next row, and so on.

L1 <- list(#outer
    a = c(2, 15, 23, 19, 3, 2, 3, 27, 20, 11, 27, 10, 19, 10, 13, 10),
    b = c(22, 9, 5, 10, 5, 1, 24, 2, 10, 9, 7, 3, 12, 24, 10, 9)
)
L2 <- list(
    b_i = c(16, NA, 17, NA, 2, NA, 2, NA, 10, NA, 15, NA, 6, NA, 9, NA),
    c = c(11, 27, 14, 5, 5, 7, 8, 24, 8, 3, 6, 15, 22, 6, 1, 1)
)
L3 <- list(
    c_j = c(10, NA, 2,  NA, 22, NA, 2,  NA, 17, NA, 15, NA, 14, NA, 5, NA),
    d = c( 1,  6,  10, 6,  10, 2,  6,  10, 4,  1,  5,  5,  4,  8,  6,  3) #inner wheel
)
L4 <- list(#inner wheel
    d_k = c(6, NA, 13, NA, 3, NA, 3, NA, 6, NA, 10, NA, 10, NA, 10, NA)
)

This is a brute force method but is still pretty quick. I made a shift function to treat vectors like circles or in this case dials. Here’s a demo of shift moving the vector one rotation to the right.

"A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

results in:

"J" "A" "B" "C" "D" "E" "F" "G" "H" "I" 

I use some indexing of the NAs to overwrite the notched dials onto each of the top three rows.

shift <- function(x, n){
    if (n == 0) return(x)
    c(x[(n+1):length(x)], x[1:n])
}

dat <- NULL
m <- FALSE

for (i in 0:15){ 
    for (j in 0:15){
        for (k in 0:15){

            # Column 1
            c1 <- L1[[1]]  

            # Column 2
            c2 <- L1[[2]]  
            c2b <- shift(L2[[1]], i)
            c2[!is.na(c2b)]<- na.omit(c2b)

            # Column 3
            c3 <- shift(L2[[2]], i)
            c3b <- shift(L3[[1]], j)
            c3[!is.na(c3b)]<- na.omit(c3b)

            # Column 4
            c4 <- shift(L3[[2]], j)
            c4b <- shift(L4[[1]], k)
            c4[!is.na(c4b)]<- na.omit(c4b)

            ## Check and see if all rows add up to 40
            m <- all(rowSums(data.frame(c1, c2, c3, c4)) %in% 40)

            ## If all rows are 40 print the solution and assign to dat
            if (m){
                assign("dat", data.frame(c1, c2, c3, c4), envir=.GlobalEnv)
                print(data.frame(c1, c2, c3, c4))
                break
            }
            if (m) break
        }    
        if (m) break
    }
    if (m) break
}

Here’s the solution:

   c1 c2 c3 c4
1   2  6 22 10
2  15  9  6 10
3  23  9  2  6
4  19 10  1 10
5   3 16 17  4
6   2  1 27 10
7   3 17 15  5
8  27  2  5  6
9  20  2 14  4
10 11  9  7 13
11 27  2  5  6
12 10  3 24  3
13 19 10 10  1
14 10 24  3  3
15 13 15  2 10
16 10  9 15  6

We can check dat (I wrote the solution to the global environment) with rowSums:

 rowSums(dat)
 [1] 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40

A fun exercise for me. If anyone has a more efficient and/or less code-intensive solution I’d love to hear about it.

To leave a comment for the author, please follow the link and comment on his blog: TRinker’s R Blog » R.


Modeling Contagion Using Airline Networks in R

By Benevolent Planner


(This article was first published on Stable Markets » R, and kindly contributed to R-bloggers)

I first became interested in networks when reading Matthew O. Jackson’s 2010 paper describing networks in economics. At some point during the 2014 Ebola outbreak, I became interested in how the disease could actually come to the U.S. I was caught up with work/classes at the time, but decided to use airline flight data to at least explore the question.

This is the same dataset I used here. The datasource is given in that post.

I assumed that the disease had a single origin (Liberia) and wanted to explore the question of how fast the disease could travel to the U.S.

A visualization of the network can be seen below. Each node is a country and each edge represents an existing airline route from one country to another. Flights that take off and land in the same country are omitted to avoid clutter.

Each vertex is a country and each edge represents an existing airline route between two countries. Flights beginning and ending in the same country are not represented for clarity.

Communities and Homophily

I used a spinglass algorithm to detect “communities” of countries, i.e. sets of countries with many flights between themselves, but few flights between them and countries not in the set. Roughly speaking, the algorithm tended to group countries on the same continent together. However, this is not always the case. For example, France was placed in the same community as several African countries, due to its close relationships with its former colonies. Roughly speaking, this network seems to exhibit homophily – countries on the same continent tend to be connected more with each other than with countries off their continent.

Liberia, the US, and Degree Distribution

The labels are unclear in the plot, but the United States and Liberia are in two separate communities, which may lead us to believe that Ebola spreading from Liberia to the US would be unlikely. In fact, the degrees (the number of countries a given country is connected to) of the two countries differ greatly, which would also support this intuition. The US is connected to 186 other countries, whereas Liberia is connected to only 12. The full degree distribution is shown below. It roughly follows a power law, which, according to Wikipedia, is what we should expect. Note that the approximation is asymptotic, which could be why this finite sample is off. According to the degree distribution, half of all countries are connected to 27 or fewer other countries. Liberia falls far below this median and the US falls far above it.


Degree distribution of airline connections and the power law. If a network’s degree distribution follows a power law, we say it is a “scale-free” network.

Small Worlds

Let’s zoom in and look at Liberia’s second-degree connections:

Liberia’s airline connections. Sierra Leone and Cote d’Ivoire have no direct connections to the US, so their connections are not shown.

Even though they’re in two different communities, Liberia and the US are only two degrees of separation away. This is generally case for all countries. If, for each node, we calculated the shortest path between it and every other node, the average shortest distance would be about 2 (specifically 2.3) hops. This is called the small world phenomenon. On average, every country is 2 hops away from every other country. Many networks exhibit this phenomenon largely due to “hubs” – countries (more generally, nodes) with lots of connections to other countries. For example, you can imagine that Charles de Gaulle airport in France is a hub, which links countries in the US, Eastern Europe, Asia, and Africa. The existence of these hubs makes it possible to get from one country to another with very few transfers.

Contagion

The close-up network above shows that if Ebola were to spread to the US, it might be through Nigeria, Ghana, Morocco, or Belgium. If we knew the proportion of flights that went from Liberia to each of these countries and from each of these countries to the US, we could estimate the probability of Ebola spreading along each route. I don’t have time to do this, although it’s certainly possible given the data; a rough sketch of the idea is given below.
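
As a rough sketch of that calculation (not part of the original analysis), the two leg proportions can be multiplied under an independence assumption, reusing the lib and fin objects built in the code at the end of this post; fin’s columns are renamed here for readability.

# combine Liberia-to-intermediate and intermediate-to-US flight proportions
names(fin) <- c('via', 'total_flights', 'to_us_flights', 'p_to_us')
two_leg <- merge(lib[, c('to', 'prob')], fin, by.x = 'to', by.y = 'via')
two_leg$p_route <- two_leg$prob * two_leg$p_to_us   # naive independence assumption
two_leg[order(-two_leg$p_route), c('to', 'p_route')]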

Of course, this is a huge simplification for many reasons. Even though Sierra Leone, for example, doesn’t have a direct connection to the US, it could be connected to other countries that are connected to the US. Such a route could have a very high proportion of flights ending up in the US. So we would need to account for this factor.

There are also several epidemiological parameters that could change how quickly the disease spreads. For example, the length of time from infection to symptoms is important. If the infected don’t show symptoms until a week after infection, then they can’t be screened and contained as easily. They could infect many others before showing symptoms.

The deadliness of the disease is also important. If patients die within hours of being infected, then the disease can’t spread very far. To take the extreme, consider a patient who dies within a second of being infected. Then there’s very little time for him to infect others.

Finally, we assumed a single origin. If the disease is already present in several countries when we run this analysis, then we would have to factor in multiple origins.

routes=read.table('.../routes.dat',sep=',')
ports=read.table('.../airports.dat',sep=',')

library(igraph)

# for each flight, get the country of the airport the plane took off from and landed at.
ports=ports[,c('V4','V5')]
names(ports)=c('country','airport')

routes=routes[,c('V3','V5')]
names(routes)=c('from','to')

m=merge(routes,ports,all.x=TRUE,by.x=c('from'),by.y=c('airport'))
names(m)[3]=c('from_c')
m=merge(m,ports,all.x=TRUE,by.x=c('to'),by.y=c('airport'))
names(m)[4]=c('to_c')

m$count=1
# create a unique country to country from/to route ID
m$id=paste(m$from_c,m$to_c,sep=',')

# see which routes are flown most frequently
a=aggregate(m$count,list(m$id),sum)
names(a)=c('id','flights')
a$fr=substr(a$id,1,regexpr(',',a$id)-1)
a$to=substr(a$id,regexpr(',',a$id)+1,100)
a=a[,2:4]

a$perc=(a$flights/sum(a$flights))*100

# create directed network graph
a=a[!(a[,2]==a[,3]),]
mat=as.matrix(a[,2:3])

g=graph.data.frame(mat, directed = T)

edges=get.edgelist(g)
deg=degree(g,directed=TRUE)
vv=V(g)

# use spinglass algo to detect community
set.seed(9)
sgc = spinglass.community(g)
V(g)$membership=sgc$membership
table(V(g)$membership)

V(g)[membership==1]$color = 'pink'
V(g)[membership==2]$color = 'darkblue'
V(g)[membership==3]$color = 'darkred'
V(g)[membership==4]$color = 'purple'
V(g)[membership==5]$color = 'darkgreen'

plot(g,
     main='Airline Routes Connecting Countries',
     vertex.size=5,
     edge.arrow.size=.1,
     edge.arrow.width=.1,
     vertex.label=ifelse(V(g)$name %in% c('Liberia','United States'),V(g)$name,''),
     vertex.label.color='black')
legend('bottomright',fill=c('darkgreen','darkblue', 'darkred', 'pink', 'purple'),
       c('Africa', 'Europe', 'Asia/Middle East', 'Kiribati, Marshall Islands, Nauru', 'Americas'),
       bty='n')

# plot degree distribution
dplot=degree.distribution(g,cumulative = TRUE)

plot(dplot,type='l',xlab='Degree',ylab='Frequency',main='Degree Distribution of Airline Network',lty=1)
lines((1:length(dplot))^(-.7),type='l',lty=2)
legend('topright',lty=c(1,2),c('Degree Distribution','Power Law with x^(-.7)'),bty='n')

# explore membership...regional patterns exist
cc=cbind(V(g)$name,V(g)$membership)
tt=cc[cc[,2]==5,]

# explore connections from Liberia to the United States
m=mat[mat[,1]=='Liberia',]

t=mat[mat[,1] %in% m[,2],]
tt=t[t[,2]=='United States',]

# assess probabilities

lib=a[a$fr=='Liberia',]
lib$prob=lib$flights/sum(lib$flights)
  # most probable route from Liberia is via Ghana

vec=c(tt[,1],'Liberia')
names(vec)=NULL

g2=graph.data.frame(mat[(mat[,1] %in% vec & mat[,2] == 'United States') | (mat[,1]=='Liberia'),], directed = T)
V(g2)$color=c('darkblue','darkgreen','darkgreen','darkgreen','darkgreen','purple','darkgreen','darkgreen')

plot(g2,
     main='Airline Connections from Liberia to the United States',
     vertex.size=5,
     edge.arrow.size=1,
     edge.arrow.width=.5,
     vertex.label.color='black')
legend('bottomright',fill=c('darkgreen','darkblue','purple'),
       c('Africa', 'Europe', 'Americas'),
       bty='n')


aa=a[a$fr %in% tt[,1],]
sum=aggregate(aa$flights,list(aa$fr),sum)

bb=a[a$fr %in% tt[,1] & a$to=='United States',]

fin=data.frame(bb$fr,sum$x,bb$flights,bb$flights/sum$x)

s=shortest.paths(g)
mean(s)

To leave a comment for the author, please follow the link and comment on his blog: Stable Markets » R.


Top of the Heap: How Much Can We Learn from Partial Rankings?

By Joel Cadwell

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

The recommendation system gives you a long list of alternatives, but the consumer clicks on only a handful: the most appealing first, then the second best, and so on until they stop, with all the remaining alternatives receiving the same rating as not interesting enough to learn more about. As a result, we know the preference order for only the most preferred. Survey research may duplicate this process by providing many choices and asking only for your top three selections – your first, second and third picks. This post will demonstrate that by identifying clusters with different ranking schemes (mixtures of rankings), we can learn a good deal about consumer preferences across all the alternatives from observing only a limited ordering of the most desired (partial top-K rankings).

However, we need to remain open to the possibility that our sample is not homogeneous but contains mixtures of varying ranking schemes. To be clear, the reason for focusing on the top-K rankings is that we have included so many alternatives that no one person will be able to respond to all of them. For example, the individual is shown a screen filled with movies, songs, products or attributes and asked to pick the best of the list in order of preference. Awareness and familiarity will focus attention on some subset, but not the same subset for everyone. We should recall that N-K of the possible options will not be selected and thus be given zeros. Consequently, with individuals in rows and alternatives as columns, no one should be surprised to discover that the data matrix is sparse.

To see how all this works in practice, we begin by generating complete ranking data using the simulISR( ) function from the R package Rankcluster. The graphic above, borrowed from Wikipedia, illustrates the Insertion Sort Ranking (ISR) process that Rankcluster employs to simulate rankings. We start with eight objects in random order and sort them one at a time in a series of paired comparisons. However, the simulation function from Rankcluster allows us to introduce heterogeneity by setting a dispersion parameter called pi. That is, we can generate a sample of individuals sharing a common ranking scheme, yet with somewhat different observed rankings from the addition of an error component.

As an example, everyone intends to move #7 to be between #6 and #8, but some proportion of the sample may make “mistakes,” with that proportion controlled by pi. Of course, the error could represent an overlap in the values associated with #6 and #7 or #7 and #8, so that sometimes one looks better and other times it seems the reverse (sensory discrimination). Regardless, we do not generate a set of duplicate rankings. Instead, we have a group of ranks distributed about a true rank. The details can be found in their technical paper.

You will need to install the Rankcluster and NMF packages in order to run the following R code.

# Rankcluster needed to simulate rankings
library(Rankcluster)

# 100 respondents with pi=0.90
# who rank 20 objects from 1 to 20
rank1<-simulISR(100, 0.90, 1:20)

# 100 respondents with pi=0.90
# who rank 20 object in reverse order
rank2<-simulISR(100, 0.90, 20:1)

# check the mean rankings
apply(rank1, 2, mean)
apply(rank2, 2, mean)

# row bind the two ranking schemes
rank<-rbind(rank1,rank2)

# set ranks 6 to 20 to be 0s
top_rank<-rank
top_rank[rank>5]<-0

# reverse score so that the
# scores now represent intensity
focus<-6-top_rank
focus[focus==6]<-0

# use R package NMF to uncover
# mixtures with different rankings
library(NMF)
fit<-nmf(focus, 2, method="lee", nrun=20)

# the columns of h transposed
# represent the ranking schemes
h<-coef(fit)
round(t(h))

# w contains the membership weights
w<-basis(fit)

# hard clustering
type<-max.col(w)

# validates the mixture model
table(type,c(rep(1,100),rep(2,100)))


We begin with the simulISR( ) function simulating two sets of 100 rankings each. The function takes three arguments: the number of rankings to be generated, the value of pi, and the rankings listed for each object. The sequence 1:20 in the first ranking scheme indicates that there will be 20 objects ordered from first to last. Similarly, the sequence 20:1 in the second ranking scheme inputs 20 objects ranked in reverse from last to first. We concatenate the data produced by the two ranking schemes and set three-quarters of the rankings to 0, as if only the top-5 rankings were provided. Finally, the scale is reversed so that the non-negative values suggest greater intensity, with five as the highest score.

The R package NMF performs the nonnegative matrix factorization with the number of latent features set to two, the number of ranking schemes generating the data. I ask that you read an earlier post for the specifics of how to use the R package NMF to factor partial top-K rankings. More generally though, we are inputting a sparse data matrix with zeros filling 75% of the space. We are trying to reproduce that data (labeled V in the diagram below) by multiplying two matrices. One has a row for every respondent (w in the R code), and the other has a column for every object that was ranked (h in the R code). What links these two matrices is the number of latent features, which in this case happens also to be two because we simulated and concatenated two ranking schemes.

Let us say that we placed 20 bottles of wine along a shelf so that the cheapest was in the first position on the left and the most expensive was last on the shelf on the far right. These are actual wines, so most would agree that the higher priced bottles tended to be of higher quality. Then, our two ranking schemes could be called “price sensitivity” and “demanding palette” (feel free to substitute less positive labels if you prefer). If one could only be Price Sensitive or Demanding Palette and nothing in between, then you would expect precisely 1 to 20 and 20 to 1 rankings for everyone in each segment, respectively, assuming perfect knowledge and execution. In practice, some of our drinkers may be unaware that #16 received a higher rating than #17, or they may simply give it the wrong rank. This is encoded in our pi parameter (pi=0.90 in this simulation). Still, if I knew your group membership and the bottle’s position, I could predict your ranking with some degree of accuracy varying with pi.

Nonnegative matrix factorization (NMF) seeks to recover the latent features separating the wines and the latent feature membership for each drinker from the data matrix, which you recall does not contain complete rankings but only the partial top-K. Since I did not set the seed, your results will be similar, but not identical, to the following decomposition.

h (transposed), one row per object:

        Demanding Palette   Price Sensitivity
C1              0                 368
C2              0                 258
C3              0                 145
C4              4                 111
C5             18                  68
C6             49                  80
C7             33                  59
C8             49                  61
C9             45                  50
C10           112                  31
C11            81                  30
C12            63                   9
C13            79                  25
C14            67                  18
C15            65                  28
C16            79                  28
C17            85                  14
C18            93                   5
C19           215                   0
C20           376                   0

w, one row per respondent (middle rows omitted):

        Demanding Palette   Price Sensitivity
R1          0.00000             0.01317
R2          0.00100             0.00881
R3          0.00040             0.00980
R4          0.00105             0.00541
R5          0.00000             0.01322
R6          0.00000             0.01207
R7          0.00291             0.00541
R8          0.00361             0.00416
R9          0.00242             0.01001
...
R193        0.01256             0.00000
R194        0.00366             0.00205
R195        0.01001             0.00030
R196        0.00980             0.00000
R197        0.00711             0.00028
R198        0.00928             0.00065
R199        0.01087             0.00000
R200        0.01043             0.00000
The 20 columns from transposed h are presented first, and then the first few rows followed by the last rows from w. These coefficients will reproduce the data matrix, which contains numbers from 0 to 5. For instance, the reproduced score for the first respondent for the first object is 0*0.00000 + 368*0.01317 = 4.84656, or almost 5, suggesting that they most prefer the cheapest wine. In a similar fashion, the last row, R200, gives greater weight to the first column, and the first column seems to prefer the higher end of the wine continuum.
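
As a quick check of this arithmetic, the product of the basis and coefficient matrices gives the full reconstruction; this is just a sketch using the fit object from the code above.

V_hat <- basis(fit) %*% coef(fit)   # the NMF reconstruction, w %*% h
c(reconstructed = V_hat[1, 1], observed = focus[1, 1])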

Clearly, there are some discrepancies toward the middle of the wine rankings, yet the ends are anchored. This makes sense given that we have data only on the top-5 rankings. Our knowledge of the ten objects in the middle comes solely from the misclassification when making pairwise comparisons set by pi=0.90. In the aggregate we seem to be able to see some differentiation even when we did not gather any individual data after the Kth position. Hopefully, C1 represents wine in a box and C20 is a famous vintage from an old village with a long wine history, making our interpretation of the latent features easier for us.

When I run this type of analysis with actual marketing data, I typically uncover many more latent features and find respondents with sizable membership weightings spread across several of those latent features. Preference for wine is based on more than a price-quality tradeoff, so we would expect to see other latent features accounting for the top-5 rankings (e.g., the reds versus the whites). The likelihood that an object makes it into the top-5 selection is a decreasing function of its rank order across the entire range of options, so that we might anticipate some differentiation even when the measurement is as coarse as a partial ranking. NMF will discover that decomposition and reproduce the original rankings, as I have shown with the above example. It seems that there is much we can learn from partial rankings.

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Chapter 4 of Modeling data with functional programming in R is out

By Brian Lee Yung Rowe

(This article was first published on Cartesian Faith » R, and kindly contributed to R-bloggers)

This chapter is on what I call fold-vectorization. In some languages, it’s called reduce, though the concept is the same. Fold implements binary iterated function application, where elements of a sequence are passed along with an accumulator to a function. This process repeats such that each successive element is paired with the previous result of the function application. The version of fold discussed in the book appears in my lambda.tools package.
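
As a minimal illustration of the idea, using base R's Reduce( ) rather than the fold( ) function from lambda.tools that the chapter actually covers:

# each element is paired with the accumulated result of the previous application
Reduce(function(acc, x) acc + x, 1:5, init = 0)
# [1] 15

# keeping the intermediate accumulator values gives the cumulative sum
Reduce(`+`, 1:5, accumulate = TRUE)
# [1]  1  3  6 10 15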

There are numerous applications of fold that span basic algebraic functions, such as the cumulative sum and product, to applying the derivative, to implementing Markov Chains. This chapter first delves into the mathematical motivation for fold. It then returns to the ebola.sitrep package and discusses the use of fold for applying transformations to data frames. Some examples work off the parsed file, which resides in the data directory of the project. The chapter finishes up by reviewing some methods of numerical analysis related to function approximation and series to further illustrate the use of fold in implementing mathematical relationships.

As usual, feedback and comments are welcome. I’m mostly interested in conceptual issues, although grammatical and typographical issues are also fair game.

To leave a comment for the author, please follow the link and comment on his blog: Cartesian Faith » R.


Update on Snowdoop, a MapReduce Alternative

By matloff

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In blog posts a few months ago, I proposed an alternative to MapReduce, e.g. to Hadoop, which I called “Snowdoop.” I pointed out that systems like Hadoop and Spark are very difficult to install and configure, are either too primitive (Hadoop) or too abstract (Spark) to program, and above all, are SLOW. Spark is of course a great improvement on Hadoop, but still suffers from these problems to various extents.

The idea of Snowdoop is to

  • retain the idea of Hadoop/Spark to work on top of distributed file systems (“move the computation to the data rather than vice versa”)
  • work purely in R, using familiar constructs
  • avoid using Java or any other external language for infrastructure
  • sort data only if the application requires it

I originally proposed Snowdoop just as a concept, saying that I would slowly develop it into an actual package. I later put the beginnings of a Snowdoop package in a broader package, partools, which also contains other parallel computation utilities, such as a debugging aid for the cluster portion of R’s parallel package (which I like to call Snow, as it came from the old snow package).

I remain convinced that Snowdoop is a much more appropriate tool for many R users who are currently using Hadoop or Spark. The latter two, especially Spark, may be a much superior approach for those with very large clusters, thus with a need for built-in fault tolerance (Snowdoop provides none on its own), but even the Hadoop Wiki indicates that many MapReduce users actually work on clusters of very modest size.

So, in the last few weeks, I’ve added quite a bit to Snowdoop, and have run more timing tests. The latter are still very preliminary, but they continue to be very promising. In this blog post, I’ll give an extended example of usage of the latest version, which you can obtain from GitHub. (I do have partools on CRAN as well, but have not updated that yet.)

The data set in this example will be the household power usage data set from UCI. Most people would not consider this “Big Data,” with only about 2 million rows and 9 columns, but it’s certainly no toy data set, and it will serve well for illustration purposes.

But first, an overview of partools:

  • distributed approach, either persistent (distributed files) or quasi-persistent (distributed objects at the cluster nodes, in memory but re-accessed repeatedly)
  • most Snowdoop-specific function names have the form file*
  • most in-memory functions have names distrib*
  • misc. functions, e.g. debugging aid and “Software Alchemy”

Note that partools, therefore, is more than just Snowdoop. One need not use distributed files, and simply use the distrib* functions as handy ways to simplify one’s parallel code.

So, here is a session with the household power data. I’m running on a 16-core machine, using 8 of the cores. For convenience, I changed the file name to hpc.txt. We first create the cluster, and initialize partools, e.g. assign an ID number to each cluster node:

 
> library(parallel)   # makeCluster() and clusterEvalQ()
> library(partools)
> cls <- makeCluster(8)
> setclsinfo(cls) # partools call

Next we split the file into chunks, using the partools function filesplit() (done only once, not shown here). This creates files hpc.txt.1, hpc.txt.2 and so on (in this case, all on the same disk). Now have each cluster node read in its chunk:

> system.time(clusterEvalQ(cls,hp <-    read.table(filechunkname("hpc.txt",1),
header=TRUE,sep=";",stringsAsFactors=FALSE)))
 user system elapsed
 9.468 0.270 13.815

(Make note of that time.) The partools function filechunkname() finds the proper file chunk name for the calling cluster node, based on the latter’s ID. We now have a distributed data frame, named hp at each cluster node.

The package includes a function distribagg(), a distributed analog of R’s aggregate() function. Here is an example of use, first converting the character variables to numeric:

> clusterEvalQ(cls,for (i in 3:9) hp[,i] <- as.numeric(hp[,i]))
> system.time(hpoutdis <-distribagg(cls,"x=hp[,3:6],
   by=list(hp[,7],hp[,8])","max",2))
 user system elapsed
 0.194 0.017 9.918

As you can see, the second and third arguments to distribagg() are those of aggregate(), in string form.

Now let’s compare to the serial version:

> system.time(hp <- read.table("hpc.txt",header=TRUE,sep=";",
stringsAsFactors=FALSE))
 user system elapsed
 22.343 0.175 22.529
> for (i in 3:9) hp[,i] <- as.numeric(hp[,i])
> system.time(hpout <- aggregate(x=hp[,3:6],by=list(hp[,7],hp[,8]),FUN=max))
 user system elapsed
 76.348 0.184 76.552


So, the computation using distribagg() was almost 8 times faster than the serial aggregate() (76.6 vs. 9.9 seconds elapsed), a good speedup for 8 cluster nodes. Even the input from disk was noticeably faster (22.5 vs. 13.8 seconds), in spite of the file chunks being on the same disk and going through the same operating system.

To leave a comment for the author, please follow the link and comment on his blog: Mad (Data) Scientist.
