New Australian data on the HMD

By Rob J Hyndman

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high-quality deaths and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.

The data are continually being revised and updated. Today the Australian data have been updated to 2011. There is a time lag because lagged death registrations result in undercounts, so only data that are likely to be complete are included.

Tim Riffe from the HMD has provided the following information about the update:

  1. All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
  2. Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period, counts now usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
  3. Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.

Some of the data can be read into R using the hmd.mx and hmd.e0 functions from the demography package. Tim has his own package on GitHub that provides a more extensive interface.
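For example, a minimal sketch of downloading the updated Australian data with the demography package (this assumes you have a free HMD account; "AUS" is the HMD country code for Australia):

library(demography)

# age-specific mortality rates and period life expectancy for Australia;
# hmd.mx() and hmd.e0() require your HMD account details
aus.mx <- hmd.mx("AUS", "your.username", "your.password", label = "Australia")
aus.e0 <- hmd.e0("AUS", "your.username", "your.password")

plot(aus.mx, series = "total")   # log mortality rates by age, one curve per year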

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R.


How About a “Snowdoop” Package?

By matloff

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop. This eventually spread to the R world, with sophisticated packages being developed such as rmr to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System. But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package, sparkr. Spark is much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations. Recently distributedR has also been released, again with the goal of using R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here: Do we really need all that complicated machinery? I’ll propose a much simpler alternative below, and am very curious to see what people think. (Disclaimer: I have only limited experience with Hadoop, and only a bit with SparkR.)

These packages ARE complicated. There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava. Some of this requires systems knowledge that many R users may lack. And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though they are coding in R.

Here is a possible alternative: Simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and just for fun, call the proposed package Snowdoop. I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file.

(It’s assumed here that the reader is familiar with the basics of Snow. If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files, words.1 and words.2. In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

 Natural language support but running in an English locale

and words.2 containing

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Here is our code:

 
library(parallel)

# give each node in the cluster cls an ID number, stored in that
# node's global environment as myid
assignids <- function(cls) {
   clusterApply(cls, seq_along(cls),
      function(i) myid <<- i)
}

# each node executes this function: read this node's chunk of the
# file, cache it via superassignment, and return its word count
getwords <- function(basename) {
   fname <- paste(basename, ".", myid, sep = "")
   mywords <<- scan(fname, what = "")
   length(mywords)
}

# manager: assign IDs, ship getwords() to the workers, then collect
# and sum the per-chunk counts
wordcount <- function(cls, basename) {
   assignids(cls)
   clusterExport(cls, "getwords")
   counts <- clusterCall(cls, getwords, basename)
   sum(unlist(counts))
}
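
A quick usage sketch (the two-worker cluster is just for illustration; words.1 and words.2 are assumed to be in the working directory):

cls <- makeCluster(2)
wordcount(cls, "words")   # total word count across words.1 and words.2
stopCluster(cls)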

This couldn’t be simpler. Yet it does what we want:

  • parallel computation on chunks of a distributed file, on independently-running nodes
  • automated “caching” (use the R <<- operator with the output of scan() above)
  • no configuration or platform worries
  • ordinary R programming

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package. It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.
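
To give a flavour of what such routines might look like, here is a minimal sketch of a chunked-file creator; the name filechunk() and its round-robin line assignment are my own illustration, not part of any existing package:

# split a text file into nchunks files named basename.1, basename.2, ...,
# ready for the getwords() scheme above
filechunk <- function(infile, basename, nchunks) {
   lns <- readLines(infile)
   grp <- rep(1:nchunks, length.out = length(lns))   # round-robin by line
   for (i in 1:nchunks) {
      writeLines(lns[grp == i], paste(basename, ".", i, sep = ""))
   }
}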

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine. What do you think?

To leave a comment for the author, please follow the link and comment on his blog: Mad (Data) Scientist.


Confidence vs. Credibility Intervals

By arthur charpentier

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Tomorrow, for the final lecture of the Mathematical Statistics course, I will try to illustrate – using Monte Carlo simulations – the difference between classical statistics and the Bayesian approach.

The (simple) way I see it is the following:

  • for frequentists, a probability is a measure of the frequency of repeated events, so the interpretation is that parameters are fixed (but unknown) and data are random
  • for Bayesians, a probability is a measure of the degree of certainty about values, so the interpretation is that parameters are random and data are fixed

Or, to quote Frequentism and Bayesianism: A Python-driven Primer, a Bayesian statistician would say “given our observed data, there is a 95% probability that the true value of $\theta$ falls within the credible region”, while a frequentist statistician would say “there is a 95% probability that when I compute a confidence interval from data of this sort, the true value of $\theta$ will fall within it”.

To get more intuition about those quotes, consider a simple problem with Bernoulli trials: insurance claims. We want to derive a confidence interval for the probability of claiming a loss. There were $n = 1047$ policies, and 159 claims.

Consider the standard (frequentist) confidence interval. What does it mean to say that

$$\overline{x} \pm 1.96\,\sqrt{\frac{\overline{x}(1-\overline{x})}{n}}$$

is the (asymptotic) 95% confidence interval? The way I see it is very simple. Let us generate some samples of size $n$, with the same probability as the empirical one, i.e. $\widehat{\theta}$ (which is the meaning of “from data of this sort”). For each sample, compute the confidence interval with the relationship above. It is a 95% confidence interval because, in 95% of the scenarios, the empirical value lies in the confidence interval. From a computational point of view, the idea is the following,

> xbar <- 159
> n <- 1047
> ns <- 100
> M=matrix(rbinom(n*ns,size=1,prob=xbar/n),nrow=n)

I generate 100 samples of size $n$. For each sample, I compute the mean and the confidence interval, from the previous relationship

> fIC=function(x) mean(x)+c(-1,1)*1.96*sqrt(mean(x)*(1-mean(x)))/sqrt(n)
> IC=t(apply(M,2,fIC))
> MN=apply(M,2,mean)

Then we plot all those confidence intervals, in red when they do not contain the empirical mean,

> k=(xbar/n<IC[,1])|(xbar/n>IC[,2])
> plot(MN,1:ns,xlim=range(IC),axes=FALSE,
+ xlab="",ylab="",pch=19,cex=.7,
+ col=c("blue","red")[1+k])
> axis(1)
> segments(IC[,1],1:ns,IC[,2],1:ns,
+ col=c("blue","red")[1+k])
> abline(v=xbar/n)

Now, what about the Bayesian credible interval? Assume that the prior distribution for the probability of claiming a loss is a Beta distribution, $\mathcal{B}(\alpha,\beta)$. We’ve seen in the course that, since the Beta distribution is the conjugate prior of the Bernoulli distribution, the posterior distribution will also be Beta. More precisely, it is

$$\mathcal{B}\left(\alpha+\sum x_i,\ \beta+n-\sum x_i\right)$$

Based on that property, the credible interval is obtained from quantiles of that (posterior) distribution,

> u=seq(.1,.2,length=501)
> v=dbeta(u,1+xbar,1+n-xbar)
> plot(u,v,axes=FALSE,type="l")
> I=u<qbeta(.025,1+xbar,1+n-xbar)
> polygon(c(u[I],rev(u[I])),c(v[I],
+ rep(0,sum(I))),col="red",density=30,border=NA)
> I=u>qbeta(.975,1+xbar,1+n-xbar)
> polygon(c(u[I],rev(u[I])),c(v[I],
+ rep(0,sum(I))),col="red",density=30,border=NA)
> axis(1)
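
For reference, the bounds of the 95% credible interval are just posterior quantiles; with the flat Beta(1,1) prior used above, they can be computed directly,

> qbeta(c(.025,.975),1+xbar,1+n-xbar)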

What does it mean, here, to have a 95% credible interval? Well, this time, we do not draw samples using the empirical mean, but using possible probabilities drawn from the posterior distribution (given the observations),

> pk <- rbeta(ns,1+xbar,1+n-xbar)

In green, below, we can visualize the histogram of those values

> hist(pk,prob=TRUE,col="light green",
+ border="white",axes=FALSE,
+ main="",xlab="",ylab="",lwd=3,xlim=c(.12,.18))

And here again, let us generate samples, and compute the empirical probabilities,

> M=matrix(rbinom(n*ns,size=1,prob=rep(pk,
+ each=n)),nrow=n)
> MN=apply(M,2,mean)

Here, there is a 95% chance that those empirical means lie within the credible interval, defined using quantiles of the posterior distribution. We can actually visualize all those means: in black, the mean used to generate the sample, and then, in blue or red, the averages obtained on those simulated samples,

> abline(v=qbeta(c(.025,.975),1+xbar,1+
+ n-xbar),col="red",lty=2)
> points(pk,seq(1,40,length=ns),pch=19,cex=.7)
> k=(MN<qbeta(.025,1+xbar,1+n-xbar))|
+ (MN>qbeta(.975,1+xbar,1+n-xbar))
> points(MN,seq(1,40,length=ns),
+ pch=19,cex=.7,col=c("blue","red")[1+k])
> segments(MN,seq(1,40,length=ns),
+ pk,seq(1,40,length=ns),col="grey")

More details and examples on Bayesian statistics, seen through the eyes of a (probably) non-Bayesian statistician, can be found in my slides from my talk in London last summer.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


Happy Thanksgiving | More Examples of XML + rvest with SVG

By klr

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

I did not intend for this little experiment to become a post, but I think the code builds nicely on the XML + rvest combination (also see yesterday’s post) for working with XML/HTML/SVG documents in R. It all started when I was playing on my iPhone in the Sketchbook app and drew a really bad turkey. Even though the turkey was bad, I thought it would be fun to combine it with vivus.js. However,

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.


Slightly Advanced rvest with Help from htmltools + XML + pipeR

By klr

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

Hadley Wickham’s post “rvest: easy web scraping with R” introduces the fine new package rvest very well. For those now yearning for a slightly more advanced example with a little help from pipeR + htmltools + XML, I thought this might fill that yearning. The code grabs CSS information by running the fancy new site cssstats.com on my blog site. With the background colors, it makes and labels some

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.


MilanoR meeting: 18th December

By MilanoR


(This article was first published on MilanoR, and kindly contributed to R-bloggers)

MilanoR staff is happy to announce the next MilanoR meeting.

When

Thursday, December 18, 2014

from 6 to 8 pm

Agenda

Welcome Presentation
by Nicola Sturaro
Consultant at Quantide

Shine your Rdata: multi-source approach in media analysis for telco industry
by Giorgio Suighi (Head Of Analytics), Carlo Bonini (Data Scientist) and Paolo Della Torre (ROI Manager), MEC

The second speaker will be announced soon. If you follow R blogs or tweets, maybe you already know his/her name. Otherwise, you will have to wait until Monday. Stay connected!

Where

Fiori Oscuri Bistrot & Bar
Via Fiori Oscuri, 3 – Milano (Zona Brera)

Buffet

Our sponsors will provide the buffet after the meeting.
MilanoR meeting is sponsored by
Quantide

MilanoR is a free event, open to all R users and enthusiasts and to those who wish to learn more about R. Places are limited, so if you would like to attend the MilanoR meeting, please register below. (If you’re reading this post from a news feed, e.g. R-bloggers, please visit the original post on the MilanoR website to see the form and subscribe to the event.)


To leave a comment for the author, please follow the link and comment on his blog: MilanoR.


The beautiful R charts in London: The Information Capital

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you’ve lived in or simply love London, a wonderful new book for your coffee-table is London: The Information Capital. In 100 beautifully-rendered charts, the book explores the data that underlies the city and its residents. To create most of these charts, geographer James Cheshire and designer Oliver Uberti relied on programs written in R. Using the R programming language not only created beautiful results, it saved time: “a couple of lines of code in R saved a day of manually drawing lines”.

Take for example From Home To Work, the graphic illustrating the typical London-area commute. R’s ggplot2 package was used to draw the individual segments as transparent lines, which, when overlaid, build up the overall picture of commuter flows around cities and towns. The R graphic was then imported into Adobe Illustrator to set the color palette and add annotations. (FlowingData’s Nathan Yau uses a similar process.)
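
To give a flavour of the technique, here is a minimal sketch of the overlaid-transparent-segment idea using made-up origin/destination points (the real graphic, of course, was built from actual commuting data):

library(ggplot2)

set.seed(1)
commutes <- data.frame(x = rnorm(5000), y = rnorm(5000),
                       xend = rnorm(5000), yend = rnorm(5000))

# thousands of nearly transparent segments on a dark background;
# busy corridors emerge where the segments pile up
ggplot(commutes) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend),
               alpha = 0.03, colour = "white") +
  theme_void() +
  theme(panel.background = element_rect(fill = "black"))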

Another example is the chart below of cycle routes in London. (We reported on an earlier version of this chart back in 2012.) As the authors note, “hundreds of thousands of line segments are plotted here, making the graphic an excellent illustration of R’s power to plot large volumes of data.”

You can learn more from the authors about how R was used to create the graphics in London: The Information Capital and see several more examples at the link below. And if you’d like a copy, you can buy the book here.

London: The Information Capital / Our Process: The Coder and Designer

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Extracting NOAA sea surface temperatures with ncdf4

By Luke Miller

(This article was first published on lukemiller.org » R-project, and kindly contributed to R-bloggers)

I’ve written previously about some example R scripts I created to extract sea surface temperature data from NOAA’s Optimum Interpolation Sea Surface Temperature (OISST) products. If you want daily global sea surface temperatures on a 0.25×0.25° grid, they gather those into 1-year files available at http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html. If you want weekly average SST values on a 1×1° […]
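
As a hedged illustration of the general approach with ncdf4 (the file name below is illustrative, and the "lon"/"lat"/"sst" variable names follow the usual OISST NetCDF conventions), extracting one day's global field from a downloaded 1-year file looks roughly like this:

library(ncdf4)

# open a downloaded high-resolution OISST file (name is an assumption)
nc <- nc_open("sst.day.mean.2014.v2.nc")

lons <- ncvar_get(nc, "lon")
lats <- ncvar_get(nc, "lat")

# read only the first time slice: all longitudes, all latitudes, 1 day
sst1 <- ncvar_get(nc, "sst", start = c(1, 1, 1), count = c(-1, -1, 1))

nc_close(nc)

image(lons, lats, sst1)   # quick look at the global field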

To leave a comment for the author, please follow the link and comment on his blog: lukemiller.org » R-project.


2014-03 The gridGraphics Package

By pmur002

(This article was first published on Stat Tech » R, and kindly contributed to R-bloggers)

The gridGraphics package provides a function, grid.echo(), that can be used to convert a plot drawn with the graphics package to the same result drawn using grid. This provides access to a variety of grid tools for making customisations and additions to the plot that are not possible with the graphics package.
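
A minimal sketch of how it is used (grob names vary by plot, so grid.ls() is used here to see what grid.echo() created):

library(gridGraphics)
library(grid)

plot(1:10)     # an ordinary graphics-package scatterplot
grid.echo()    # redraw the same plot using grid

grid.ls()      # list the grid grobs that now make up the plot
# individual grobs can then be customised with grid tools such as grid.edit()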

Paul Murrell

Download.

To leave a comment for the author, please follow the link and comment on his blog: Stat Tech » R.


RStudio shortcuts (Windows) – for cleaner and faster coding

By tcamm

RStudio shortcuts

(This article was first published on TC » R, and kindly contributed to R-bloggers)

RStudio has a number of keyboard shortcuts that make for cleaner and faster coding. I put all the Windows shortcuts that I use onto a single page so that I can pin them next to my computer.

(PDF)

Some favourites of mine are:

Using code sections/chunks – Use Ctrl+Shift+R to insert a code section and a popup box will appear for you to name that section. Ctrl+Alt+T runs the current code section. When you are done working on a code section you can ‘fold’ it up to improve the readability of your file (Alt+L is fold current code section, Alt+O is fold all sections).

Re-running code quickly – Ctrl+Shift+P re-runs the region of code that was previously executed, including any changes you have made since then.

Deleting/moving stuff faster – Ctrl+D deletes an entire line. Ctrl+Backspace deletes the current word, as in most word-processing software. Alt+Up/Down moves a line of code up or down in the editor, while Shift+Alt+Up/Down copies lines up/down.

Switch between plots – To toggle between plots use Ctrl+Shift+PgUp/PgDn (It’s a lot faster than using the arrows above the plots!)

To leave a comment for the author, please follow the link and comment on his blog: TC » R.
