RcppAnnoy 0.0.3

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Hours after the initial blog post announcing the first release of the new package RcppAnnoy, Qiang Kou sent us a very nice pull request adding mmap support in Windows.

So a new release with Windows support is now on CRAN, and Windows binaries should be available by this evening as usual.

To recap, RcppAnnoy wraps the small, fast, and lightweight C++ template header library Annoy written by Erik Bernhardsson for use at Spotify. RcppAnnoy uses Rcpp Modules to offer the exact same functionality as the Python module wrapped around Annoy.
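
To give a flavor of that interface, here is a minimal usage sketch (the class and method names mirror the Annoy Python API; consult the package documentation for the authoritative signatures):

library(RcppAnnoy)

d <- 16                                 # dimension of the vectors
a <- new(AnnoyEuclidean, d)             # index using Euclidean distance
for (i in 0:99) a$addItem(i, rnorm(d))  # items are indexed from zero
a$build(10)                             # build a forest of 10 trees
a$getNNsByItem(0, 5)                    # five nearest neighbours of item 0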

Courtesy of CRANberries, there is also a diffstat report for this release. More detailed information is on the RcppAnnoy page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

MilanoR meeting is coming

By MilanoR

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

MilanoR staff is happy to announce the next MilanoR meeting on December 18, 2014.
Please stay connected: further details will be published soon!

To leave a comment for the author, please follow the link and comment on his blog: MilanoR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Hi

By aschinchon


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Why do some mathematicians wear a white coat? Are they afraid to be splashed by an integral? (Read on Twitter)

If you run into someone wearing a white coat who tells you something like

e raised to minus 3 by zero point five plus x squared plus y squared between two plus e raised to minus x squared minus y squared between two by cosine of four by x

do not be afraid: it is just a harmless mathematician waving to you. Look at this:
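
In symbols, the function being spoken (and coded below) is:

$$f(x, y) = e^{-3\left((0.5+x)^2 + y^2/2\right)} + e^{-x^2 - y^2/2}\cos(4x)$$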

This is the code to draw these mathematical greetings:

levelpersp=function(x, y, z, colors=heat.colors, ...) {
  ## getting the value of the midpoint
  zz=(z[-1,-1] + z[-1,-ncol(z)] + z[-nrow(z),-1] + z[-nrow(z),-ncol(z)])/4
  ## calculating the breaks
  breaks=hist(zz, plot=FALSE)$breaks
  ## cutting up zz
  cols=colors(length(breaks)-1)
  zzz=cut(zz, breaks=breaks, labels=cols)
  ## plotting
  persp(x, y, z, col=as.character(zzz), ...)
}
x=seq(-5, 5, length= 30);y=x
f=function(x,y) {exp(-3*((0.5+x)^2+y^2/2))+exp(-x^2-y^2/2)*cos(4*x)}
z=outer(x, y, f)
z[z>.001]=.001;z[z<0]=z[1,1]
levelpersp(x, y, z, theta = 30, phi = 55, expand = 0.5, axes=FALSE, box=FALSE, shade=.25)

To leave a comment for the author, please follow the link and comment on his blog: Ripples.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Video presentation on Revolution R Open and DeployR Open

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The video replay from last week’s Introducing Revolution R Open webinar is now available, and I’ve embedded it below for anyone who may have missed the live presentation.

In addition to providing an overview of Revolution R Open, the presentation also introduces other open source projects from Revolution Analytics. If you’ve ever wanted to embed R into another application, check out the demo of DeployR Open beginning at the 21-minute mark.

There are lots of links in the slides, which you can follow in the presentation on SlideShare. If you want to get started with Revolution R Open, visit our MRAN website (which has just been upgraded to provide even more detailed package info — see the knitr entry for an example).

Revolution Analytics Webinars: Introducing Revolution R Open: Enhanced, Open Source R distribution from Revolution Analytics

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Example 2014.13: Statistics doesn’t have to be so hard! Resampling in R and SAS

By Nick Horton

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

R

We’ll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book).

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)

ds = data.frame(y = c(beer, water),
                x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))

> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)

TRUE FALSE
0.1 99.9

The do() operator evaluates the expression on the right hand side a specified number of times. In this case we shuffle (or permute) the group indicators.

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results, plotted with the lattice graphics package that mosaic loads by default, demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal were true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in both the numerator and the denominator: (1 + 1)/(999 + 1) = 0.002.
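
For completeness, a small sketch of that calculation in code (using the result column that do() produced above):

extreme <- sum(abs(nulldist$result) >= abs(obsdiff))  # permutations at least as extreme
(extreme + 1) / (999 + 1)                             # 0.002 for the run shown above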

For those not invested in the mosaic package, base R functions can be used to perform this analysis. We present a version here that begins after making the data vectors.

alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])

> obsdiff
[1] -4.377778

The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.

resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) -
  mean(alldata[resample_labels=="water"])

resample_diff
[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.

resamp_means = function(data, labs){
  resample_labels = sample(labs)
  resample_diff = mean(data[resample_labels=="beer"]) -
    mean(data[resample_labels=="water"])
  return(resample_diff)
}

nulldist = replicate(9999,resamp_means(alldata,labels))

hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) that are at least as extreme in absolute value as the observed statistic:

alldiffs = c(obsdiff, nulldist)
p = sum(abs(alldiffs) >= abs(obsdiff)) / 10000

SAS

The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the “line hold” double trailing at sign (@@) and the infile datalines statement, which allows us to specify a delimiter (other than a space) when reading data directly in a data step. This lets us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ implied variable that SAS creates but does not save with the data.

The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.

data bites;
if _n_ le 18 then drink="water";
else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;

proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;

proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;

data om3;
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;

proc print data = om3; var obsdiff; run;

Obs obsdiff
1 4.37778

(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don’t understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.

data rerand;
set bites;
randorder = uniform(0);
run;

proc sort data = rerand; by randorder; run;

data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
else redrink = "beer";
run;

proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;

proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;

data rrdiff;
set rerand4;
rrdiff = beer-water;
run;

proc print data = rrdiff; var rrdiff; run;

Obs rrdiff
1 -1.73778

One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff values into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option samples groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of the by replicate statement, to make a data set with columns for each group mean and a row for each replicate.

proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;

proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;

proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;

To get a p-value and make a histogram, we use the macro variable created earlier.

data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;

proc means data = ssresamp4 sum; var exceeds; run;

The MEANS Procedure
Analysis Variable : exceeds
Sum
9.0000000

proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);
run;

The p-value is 0.001 (= (9+1)/10000).

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on his blog: SAS and R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

2 New (Draft) Chapters for Intro Fish Science with R

By dogle

(This article was first published on fishR » R, and kindly contributed to R-bloggers)

I have added two new drafts of chapters for the Introductory Fisheries Science with R book. Both chapters are still rough, but feedback would be greatly appreciated.

The first chapter is an introduction to basic data manipulations, largely using the (relatively) new dplyr package. Sections in this chapter include “Reading Data into R” (largely from CSV files, with mention of other files), “Basic Data Manipulations” (including filtering, selecting variables or individuals, renaming variables, creating new variables, creating length categorization variables, recoding factor variables, and sorting), “Joining Data Frames”, “Re-Arranging Data Frames” (wide and long formats), and “New Data Frames from Aggregation.”
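
As a flavor of what those manipulations look like (a made-up sketch, not an excerpt from the chapter), the dplyr verbs map directly onto the section titles:

library(dplyr)

# hypothetical fish data: species and total length (mm)
fish <- data.frame(species = c("Bluegill", "Walleye", "Bluegill"),
                   tl = c(152, 410, 178))
fish %>%
  filter(species == "Bluegill") %>%  # filtering individuals
  mutate(tl_in = tl / 25.4) %>%      # creating new variables
  arrange(tl)                        # sorting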

The second chapter is related to condition metrics and includes a very basic introduction to one-way analysis of variance.

Both chapters precipitated changes to the FSA package, which are detailed in the NEWS files. Thus, it is best to download the very latest version of FSA.

Let me know if you have questions, comments, or suggestions for improvements. Thanks.


To leave a comment for the author, please follow the link and comment on his blog: fishR » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Simulating the Rule of Five

By “Jay Jacobs (@jayjacobs)”

[Figure: ruleof5 plot]

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

SIRAcon, the annual conference for the Society of Information Risk Analysts, was held September 9th and 10th. I was fortunate enough to not only attend, but also to speak (on how to stop worrying and love math) and to moderate a fantastic panel with Ali Samad-Khan, Jack Jones and Doug Hubbard. It was a fantastic 2-day conference, and not just because of the speakers. The small venue and schedule enabled multiple networking opportunities and side conversations. One of my conversations was with Doug Hubbard, where the Rule of Five from his first book came up.

The “Rule of Five” states: “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.” And even though the math on it is fairly easy, I found it much more fun to think through how to simulate it. There may be times when the math is too complex (relative term, I know), and creating a simulation may be the only option.

Let’s think through this. First, we will need a population and the median of that population. We will draw our population from a normal distribution, but it could be any type of distribution; since we are working with the median, it doesn’t matter.

set.seed(1)  # make it repeatable
pop <- rnorm(1000)  # generate the "population"
med <- median(pop) # calculate the median

Next, we will want to repeatedly draw 5 random samples from the population and check whether the median of the population is in the range (between the minimum and maximum) of the 5 samples. According to the setup, we should expect 93.75% of the samples to contain the median.

ssize <- 100000 # how many trials to run
matched <- sapply(seq(ssize), function(i) {
  rg <- range(sample(pop, 5)) # get the range for 5 random samples
  ifelse(med>=rg[1] & med<=rg[2], TRUE, FALSE) # test if median in range
})
sum(matched)/ssize # proportion matched

## [1] 0.9383

That’s pretty close to 93.75%, but what if we broke this up across multiple populations (and sizes) and used many iterations? Let’s create a simulation where we generate populations of varying sizes, capture a whole bunch of these, and then plot the output.

# first set up a function to do the sampling
pickfive <- function(popsize, ssize) {
  pop <- rnorm(popsize)
  med <- median(pop)
  matched <- sapply(seq(ssize), function(i) {
     rg <- range(sample(pop, 5))
     ifelse(med>=rg[1] & med<=rg[2], TRUE, FALSE)
  })
  sum(matched)/ssize
}
# test it
pickfive(popsize=1000, ssize=100000)

## [1] 0.9381

Now we can create a loop to call the function over and over again.

# 1,000 to 1 million by a thousand
set.seed(1)
possible <- seq(1000, 1000000, by=1000)
output <- sapply(possible, pickfive, 5000) # takes a while
print(mean(output))

## [1] 0.9377

And we can visualize the variations in the samples:
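
The plot itself appears in the original post; a minimal base-graphics sketch to reproduce it, with the theoretical 93.75% marked in red, might look like:

hist(output, col="cyan",
     main="Rule of Five across simulated populations",
     xlab="proportion of 5-sample ranges containing the median")
abline(v=0.9375, col="red", lwd=2)  # theoretical Rule of Five value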

There are so many interesting concepts being touched on here. First, this is a great example of the Law of Large Numbers: the more iterations we do, the closer our estimate gets to reality. Also, the results will form a normal distribution around the true answer, no matter what the underlying population looks like; that’s why taking the mean works. Finally, this also hints at a concept underlying many machine learning algorithms: combining multiple ‘weak predictors’ is more accurate than one (or a handful of) strong predictors. By taking the mean of all our outputs (even the samples that are way off) we are using all the samples to derive a more accurate estimate of the true value.

Oh yeah, the math…

The math here is so much simpler than all the stuff I did above (but so much less exciting). Since we are talking about the median, 50% of the samples should fall above and 50% below the median value. This sets up a coin-toss analogy: if we say heads is above the median and tails is below, what is the probability of flipping either 5 heads or 5 tails in a row? It is the same thing: we want to know the probability that all 5 of our samples will be above (or below) the median. The math is simply 50% to the power of 5 for getting 5 heads (or 5 tails) in a row, so we can calculate it once and double it (once for heads, once for tails). Then we want to know the probability of not getting such a run, so we subtract it from 1.

# probability of getting 5 heads in a row
prob_of_heads <- 0.5 ^ 5  # == 0.03125

# probability of getting 5 heads or 5 tails in a row
prob_of_heads_or_tails <- 2 * prob_of_heads # == 0.0625

# probability of NOT getting 5 heads or tails in a row
rule_of_five <- (1 - prob_of_heads_or_tails)  # yup, 0.9375 or 93.75%

Or, if we want to wrap this whole discussion up into a single line of code:

print(1 - 2 * 0.5^5)

## [1] 0.9375

I told ya the math was easy! But simulations like this are fun to do and come in handy if you don’t know (or can’t remember) the math. Plus, with this, you should now be able to go simulate the “Urn of Mystery” (chapter 3, 3rd edition of How to Measure Anything). Or better yet, use this skill of simulation to finally prove to yourself that the infamous Monty Hall problem actually works out like the math says it should.
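
As a parting sketch (not from the original post): since switching wins exactly when the first pick was wrong, the Monty Hall simulation fits in a few lines and converges to the textbook 2/3.

set.seed(1)
n <- 100000
prize <- sample(1:3, n, replace=TRUE)  # door hiding the prize
pick  <- sample(1:3, n, replace=TRUE)  # contestant's first pick
mean(pick != prize)                    # win rate when always switching, ~2/3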

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Moving The Earth (well, Alaska & Hawaii) With R

By hrbrmstr

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

In a previous post we looked at how to use D3 TopoJSON files with R and make some very D3-esque maps. I mentioned that one thing missing was moving Alaska & Hawaii a bit closer to the continental United States and this post shows you how to do that.

The D3 folks have it easy. They just use the built in modified Albers composite projection. We R folk have to roll up our sleeves a bit (but not much) to do the same. Thankfully, we can do most of the work with the elide (“ih lied”) function from the maptools package.

We’ll start with some package imports:

library(maptools)
library(mapproj)
library(rgeos)
library(rgdal)
library(RColorBrewer)
library(ggplot2)
 
# for theme_map
devtools::source_gist("33baa3a79c5cfef0f6df")

I’m using a GeoJSON file that I made from the 2013 US Census shapefile. I prefer GeoJSON mostly due to it being a single file and the easy conversion to TopoJSON if I ever need to use the same map in a D3 context (I work with information security data most of the time, so I rarely have to use maps at all for the day job). I simplified the polygons a bit (passing -simplify 0.01 to ogr2ogr) to reduce processing time.

We read in that file, transform it to an equal-area projection (the CRS below is a Lambert azimuthal equal-area projection, which serves the same purpose here as D3’s Albers), and join the polygon ids to the shapefile data frame:

# https://www.census.gov/geo/maps-data/data/cbf/cbf_counties.html
# read U.S. counties moderately-simplified GeoJSON file
us <- readOGR(dsn="data/us.geojson", layer="OGRGeoJSON")
 
# convert it to a Lambert azimuthal equal-area projection
us_aea <- spTransform(us, CRS("+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"))
us_aea@data$id <- rownames(us_aea@data)

Now, to move Alaska & Hawaii, we have to:

  • extract them from the main shapefile data frame
  • perform rotation, scaling and shifting on them
  • ensure they have the right projection set
  • merge them back into the original spatial data frame

The elide function has parameters for all the direct shape munging, so we’ll just do that for both states. I took a peek at the D3 source code for the Albers projection to get a feel for the parameters. You can tweak those if you want them in other positions or feel the urge to change the Alaska rotation angle.

# extract, then rotate, shrink & move alaska (and reset projection)
# need to use state IDs via # https://www.census.gov/geo/reference/ansi_statetables.html
alaska <- us_aea[us_aea$STATEFP=="02",]
alaska <- elide(alaska, rotate=-50)
alaska <- elide(alaska, scale=max(apply(bbox(alaska), 1, diff)) / 2.3)
alaska <- elide(alaska, shift=c(-2100000, -2500000))
proj4string(alaska) <- proj4string(us_aea)
 
# extract, then rotate & shift hawaii
hawaii <- us_aea[us_aea$STATEFP=="15",]
hawaii <- elide(hawaii, rotate=-35)
hawaii <- elide(hawaii, shift=c(5400000, -1400000))
proj4string(hawaii) <- proj4string(us_aea)
 
# remove old states and put new ones back in; note the different order
# we're also removing puerto rico in this example but you can move it
# between texas and florida via similar methods to the ones we just used
us_aea <- us_aea[!us_aea$STATEFP %in% c("02", "15", "72"),]
us_aea <- rbind(us_aea, alaska, hawaii)

Rather than just show the resultant plain county map, we’ll add some data to it. The first example uses US drought data (from November 11th, 2014). Drought conditions are pretty severe in some states, but we’ll just show areas that have any type of drought (levels D0-D4). The color ramp shows the % of drought coverage in each county (you’ll need a browser that can display SVGs to see the maps):

# get some data to show...perhaps drought data via:
# http://droughtmonitor.unl.edu/MapsAndData/GISData.aspx
droughts <- read.csv("data/dm_export_county_20141111.csv")
droughts$id <- sprintf("%05d", as.numeric(as.character(droughts$FIPS)))
droughts$total <- with(droughts, (D0+D1+D2+D3+D4)/5)
 
# get ready for ggplotting it... this takes a cpl seconds
map <- fortify(us_aea, region="GEOID")
 
# plot it
gg <- ggplot()
gg <- gg + geom_map(data=map, map=map,
                    aes(x=long, y=lat, map_id=id, group=group),
                    fill="#ffffff", color="#0e0e0e", size=0.15)
gg <- gg + geom_map(data=droughts, map=map, aes(map_id=id, fill=total),
                    color="#0e0e0e", size=0.15)
gg <- gg + scale_fill_gradientn(colours=c("#ffffff", brewer.pal(n=9, name="YlOrRd")),
                                na.value="#ffffff", name="% of County")
gg <- gg + labs(title="U.S. Areas of Drought (all levels, % county coverage)")
gg <- gg + coord_equal()
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg <- gg + theme(plot.title=element_text(size=16))
gg

While that shows Alaska & Hawaii in D3-Albers-style, it would be more convincing if we actually used the FIPS county codes on Alaska & Hawaii, so we’ll riff off the previous post and make a county land-mass area choropleth (since we have the land mass area data available in the GeoJSON file):

gg <- ggplot()
gg <- gg + geom_map(data=map, map=map,
                    aes(x=long, y=lat, map_id=id, group=group),
                    fill="white", color="white", size=0.15)
gg <- gg + geom_map(data=us_aea@data, map=map, aes(map_id=GEOID, fill=log(ALAND)),
                    color="white", size=0.15)
gg <- gg + scale_fill_gradientn(colours=c(brewer.pal(n=9, name="YlGn")),
                                na.value="#ffffff", name="County LandnMass Area (log)")
gg <- gg + labs(title="U.S. County Area Choropleth (log scale)")
gg <- gg + coord_equal()
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg <- gg + theme(plot.title=element_text(size=16))
gg

Now, you have one less reason to be envious of the D3 cool kids!

The code & shapefiles are available on github.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Introducing RcppAnnoy

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new package RcppAnnoy is now on CRAN.

It wraps the small, fast, and lightweight C++ template header library Annoy written by Erik Bernhardsson for use at Spotify.

While Annoy is set up for use from Python, RcppAnnoy offers the exact same functionality from R via Rcpp.

A new page for RcppAnnoy provides some more detail, example code and further links. See a recent blog post by Erik for a performance comparison of different approximate nearest neighbours libraries for Python.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News