How I made every tech company that I may ever want to work for in the future hate me, or “GO R Consortium!”

By richierocks

(This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers)

It turns out that when people tell you things, you should listen. Like when Joe Rickert of Microsoft says “this is not news, please don’t repeat what I’m about to say”, you should maybe take note and keep your mouth shut.

I’m not quite sure how I missed that, but I did. So on Sunday night I wrote a blog post about what happened at the R Summit. And last night Gavin Simpson (@acfagls) tweeted to ask what this R Consortium was that I’d mentioned in my post. I responded with what Joe had said: this is an organisation, contributed to by some big tech companies that work with R, designed to fund R infrastructure projects. I also mentioned a conversation I’d overheard about a possible replacement for R-forge built on GitHub, which I guessed might have been related. This was talk in a bar, so I hadn’t assumed it was top secret or even true, and I made it clear I was only repeating gossip.

It turns out that despite me deleting the tweet and editing my blog post, gossip spreads rather quickly on Twitter (who’d have thought), and consequently the news ended up on Computerworld. It could have been worse; I could have ended up on InfoWorld.

Anyway, I spent this evening apologising to Joe Rickert and all the R Consortium members that I could find.

Fortunately, the R Consortium announced itself to the public today. And if we can move on from my idiocy, I’d like to explain why I think that the R Consortium is a big deal.

R infrastructure, by which I mean the tools that you use to write R code, publish it, and consume code written by others, has traditionally been the responsibility of R-Core. R-Core, as well as developing R itself, maintains CRAN and the mailing lists, not to mention a good number of packages. In all my interactions with R-Core I’ve been very impressed. They are, however, limited by the fact that there are only 21 of them, which means that the user community outnumbers them by five orders of magnitude. There’s just a fundamental manpower bottleneck in what they can do.

In recent years, RStudio, OpenAnalytics, Revolution Analytics (now part of Microsoft) and TIBCO have been working on creating better IDEs for R. (Three of those are part of the R Consortium; I’m not sure whether OpenAnalytics intends to join or not.) GitHub and Bitbucket, while not R-specific, have taken over the code management side of things. A load of projects have sprung up to get R running in places it was never designed to go (I’m thinking of Renjin for R-in-Google-App-Engine, and the projects for running R inside Oracle/MonetDB/SQL Server databases, but there are many more).

For publishing R documents, knitr has taken over the world. As well as RStudio’s RPubs facility, O’Reilly’s Atlas software lets you write in Markdown or AsciiDoc, meaning you can knit a book. I know, I’ve done it.

The trouble is, many of these projects are run by small teams in individual companies, and there hasn’t been a way to grow them into bigger projects. The costs of finding out what users want, and of communicating between groups, were too high.

R Consortium solves this in two ways. Firstly, it involves many of the big corporate players in R. (The R Foundation also gets at least one seat, I believe.) Having all these companies paying to sit at the same table increases the chance that they’ll speak to each other. From their point of view, they save costs by not having to implement everything themselves; for everyone else, we have the benefit of these projects being made publicly available.

The other genius move is to get ideas from the community about what to build. R has suffered a little bit from the open source “if you want something, build it yourself” attitude, so having a place where you can ask other people to build things for you sounds good.

I have really high hopes for the R Consortium, and I’ll be following what they do closely. Assuming I haven’t been blacklisted by them all!*

*Please don’t let me have been blacklisted by them all.


To leave a comment for the author, please follow the link and comment on his blog: 4D Pie Charts » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Chronicles from useR! – day 0

By Enrico Tonini

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Today my Quantide colleague Nicola Sturaro and I came to Aalborg, Denmark, for the useR! conference.

We landed late, so we could only attend the welcome reception in the evening, not the afternoon tutorials as well. Still, it was a nice chance to meet other useRs, taste Danish food and drink beer.

Tomorrow the conference will really get going and we will keep you posted. Furthermore, Quantide will be the official sponsor of tomorrow morning’s coffee break. You can’t miss it!


Notes from the Kölner R meeting, 26 June 2015

By Markus Gesmann

(This article was first published on mages’ blog, and kindly contributed to R-bloggers)

Last Friday the Cologne R user group came together for the 14th time, and for the first time we met at Startplatz, a start-up incubator venue. The venue was excellent, not only did they provide us with a much larger room, but also with the whole infrastructure, including table-football and drinks. Many thanks to Kirill for organising all of this!

Photo: Günter Faes

We had two excellent advanced talks. Both were very informative and well presented.

Data Science at the Command Line

Kirill Pomogajko showed us how he uses various command line tools to pre-process log-files for further analysis with R.

Photo: Günter Faes

Imagine you have several servers that generate large data sets with no standard delimiters, like the example below.

At first glance the columns appear to be separated by a blank, but the second column contains strings such as “Air Force”. Furthermore, some columns have missing data and another uses speech marks. Thus, it’s messy and difficult to read into R.

To solve the problem Kirill developed a Makefile that uses tools such as scp, sed and awk to download and clean the server files.
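Kirill’s actual Makefile isn’t reproduced here, but the flavour of the sed/awk step can be sketched. The log lines, filenames and field layout below are invented for illustration: a quoted field containing a space, and “-” for missing values.

```shell
# Hypothetical messy log: whitespace-delimited, but the 2nd field may
# contain a space when quoted, and '-' marks a missing value.
printf '%s\n' \
  '2015-06-26 "Air Force" 200 -' \
  '2015-06-26 Navy 404 1.2' > messy.log

# sed joins the quoted field into one token (space -> underscore),
# then awk strips the quotes and emits a clean CSV for R to read.
sed -E 's/"([^"]*) ([^"]*)"/"\1_\2"/g' messy.log |
  awk '{ gsub(/"/, ""); print $1 "," $2 "," $3 "," $4 }' > clean.csv

cat clean.csv
# -> 2015-06-26,Air_Force,200,-
#    2015-06-26,Navy,404,1.2
```

From here, `read.csv("clean.csv")` on the R side is straightforward; a Makefile simply chains steps like this (plus `scp` for the download) with proper dependencies.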

Kirill’s tutorial files are available via GitHub.

An Introduction to RStan and the Stan Modelling Language

Paul Viefers gave a great introduction to Stan and RStan, with a focus on explaining the differences from other MCMC packages such as JAGS.

Photo: Günter Faes

Stan is a probabilistic programming language for Bayesian inference. One of the major challenges in Bayesian analysis is that often there is no analytical solution for the posterior distribution. Hence, the posterior distribution is approximated via simulations, such as Gibbs sampling in JAGS. Stan, on the other hand, uses Hamiltonian Monte Carlo (HMC), an algorithm that is more subtle in proposing jumps, exploiting more of the posterior’s structure by translating the problem into a Hamiltonian mechanics framework.

Paul ended his talk by walking us through the various building blocks of a Stan script, using a hierarchical logistic regression example.
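For readers who haven’t seen Stan before, those building blocks look roughly like this. This is a toy hierarchical logistic regression, not Paul’s example; all variable names and priors are invented for illustration:

```stan
data {
  int<lower=1> N;                  // observations
  int<lower=1> J;                  // groups
  int<lower=1, upper=J> group[N];  // group membership per observation
  vector[N] x;                     // predictor
  int<lower=0, upper=1> y[N];      // binary outcome
}
parameters {
  real mu_alpha;                   // population-level intercept
  real<lower=0> sigma_alpha;       // between-group spread
  vector[J] alpha;                 // group-level intercepts
  real beta;                       // common slope
}
model {
  mu_alpha ~ normal(0, 5);
  sigma_alpha ~ cauchy(0, 2.5);
  alpha ~ normal(mu_alpha, sigma_alpha);
  beta ~ normal(0, 5);
  y ~ bernoulli_logit(alpha[group] + beta * x);
}
```

From R, a script like this is compiled and sampled with RStan’s `stan()` function, which returns the posterior draws.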

You can access Paul’s slides on Dropbox.

Drinks and Networking

No Cologne R user group meeting is complete without Kölsch and networking. In the end some of us ended up in a fancy burger place.

Next Kölner R meeting

The next meeting will be scheduled in September. Details will be published on our Meetup site. Thanks again to Revolution Analytics for their sponsorship.

This post was originally published on mages’ blog.


PRESS RELEASE: Mango Solutions and The R Consortium

By Mango Solutions


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Mango Solutions announces its key involvement with the newly launched R Consortium to support millions of users around the world

Mango Solutions, the leading data science company based in Europe with offices in the UK and China, is pleased to announce that it is one of the founding organisations of the recently launched R Consortium.

The R Consortium was formed by a number of interested technology firms including Microsoft, Oracle and HP and has been guided by The Linux Foundation, the nonprofit organization dedicated to accelerating the growth of Linux and collaborative development.

The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.

Mango Solutions Ltd provides complex analysis solutions, consulting, training, and application development for some of the largest companies in the world. Founded and based in the UK in 2002, the company offers a number of services and products for data science, including validation of open-source software for regulated industries.

Matt Aldridge, CEO, Mango Solutions said “Mango has been helping customers to leverage R in a commercial environment for over a decade. The R Consortium represents a vital step change, enabling more enterprises to adopt this excellent technology, and Mango are proud and excited to be involved in this exciting new chapter as a founder member.”

Other founding companies and organizations of the R Consortium include The R Foundation, RStudio, TIBCO Software Inc., Alteryx, and Ketchum Trading.

The R user community is vibrant with local user groups organized all over the world. The R Consortium will work with this user community and the R Foundation to amplify and focus the impact of this global community in order to advance the project for all users and developers, including millions more in the coming years.

“This is a great opportunity to harness the power of the thriving R user community around the globe and advance the R language for everyone,” said John Chambers on behalf of the R Foundation Board. “The R Consortium will provide vital funding support for R services and development, made possible by the Linux Foundation’s proven track record of bringing large-scale communities together. We are looking forward to working with both organizations.”


Announcing the R Consortium

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The R community has grown explosively over the past few years, both in terms of the number of R users and the number of companies who rely on R as their data science platform. To serve the needs of this rapidly growing community, and to continue the success of the R Project as a whole, representatives from the R Foundation and from industry have joined forces to create the R Consortium, a new collaborative project of the Linux Foundation.

The R Consortium is a 501(c)6 non-profit organization dedicated to the support and growth of the R user community. The R Consortium will work with and provide support to the R Foundation and other organizations developing, maintaining and distributing R software, and provide a unifying framework for the R user community. It is funded by the contributions of its members and governed by its by-laws. The founding members include the R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, HP, Mango Solutions, Google, Ketchum Trading and Oracle.

While the R Foundation continues its role as the maintainer of the core R language engine, the R Consortium will initiate projects to help the user community make even better use of R, and to help the R developer community further extend R via packages and other ancillary software projects. Projects already proposed include: building and maintaining mirrors for downloading R; testing and quality assurance platforms; financial support for the annual useR! Conference; and promotion and support of worldwide user groups. In general, the Consortium will seek the input of its members and the R Community at large for projects that foster the continuing growth of R and the community of people that drives its evolution.

On a personal note, I am very proud that Microsoft is one of the founding Platinum members of the R Consortium — the highest level of commitment. R is strategic to Microsoft, and is being integrated into Microsoft’s data platforms to provide R’s built-in advanced analytics functionality and access to community-developed extensions like CRAN packages. (You can learn more about R at Microsoft in this presentation.) Joseph Sirosh, Corporate Vice President of Machine Learning at Microsoft, affirms the commitment by saying:

Our efforts to build R into more Microsoft products and services, combined with our contribution to the R Consortium as a Platinum Member, gives me confidence that we’re helping today’s data scientists and business leaders to drive innovation and advances in the field of data science with R.

Many have been working behind the scenes for many months to make the R Consortium a reality. Special thanks go to John Chambers of the R Foundation for his support and participation on the board, and to the team at the Linux Foundation whose experience with Linux and other open source projects has been invaluable. The R Consortium begins its mission today, and you can keep up-to-date with its activities at www.r-consortium.org. (I’ll also share news here on the blog.) The R Consortium is here for the R user community, so share your suggestions at the website or in the comments of this post.

R Consortium Press Releases: Linux Foundation Announces R Consortium to Support 2 Million Users Around the World


Accelerating R: RStudio and the new R Consortium

By jjallaire

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

To paraphrase Yogi Berra, “Predicting is hard, especially about the future”. In 1993, when Ross Ihaka and Robert Gentleman first started working on R, who would have predicted that it would be used by millions in a world that increasingly rewards data literacy? It’s impossible to know where R will go in the next 20 years, but at RStudio we’re working hard to make sure the future is bright.

Today, we’re excited to announce our participation in the R Consortium, a new 501(c)6 nonprofit organization. The R Consortium is a collaboration between the R Foundation, RStudio, Microsoft, TIBCO, Google, Oracle, HP and others. It’s chartered to fund and inspire ideas that will enable R to become an even better platform for science, research, and industry. The R Consortium complements the R Foundation by providing a convenient funding vehicle for the many commercial beneficiaries of R to give back to the community, and will provide the resources to embark on ambitious new projects to make R even better.

We believe the R Consortium is critically important to the future of R and despite our small size, we chose to join it at the highest contributor level (alongside Microsoft). Open source is a key component of our mission and giving back to the community is extremely important to us.

The community of R users and developers has a big stake in the language and its long-term success. We all want free and open source R to continue thriving and growing for the next 20 years and beyond. The fact that so many of the technology industry’s largest companies are willing to stand behind R as part of the consortium is remarkable, and we think it bodes incredibly well for the future of R.


Exploring SparkR

By Alvaro “Blag” Tejada Galindo

(This article was first published on Blag’s bag of rants, and kindly contributed to R-bloggers)

A colleague from work asked me to investigate Spark and R. So the most obvious thing to do was to investigate SparkR -;)

I installed Scala, Hadoop, Spark and SparkR…not sure Hadoop is needed for this…but I wanted to have the full picture -:)

Anyway…I came across a piece of code that reads lines from a file and counts how many lines contain an “a” and how many contain a “b”…

For this code I used the lyrics of Girls Not Grey by AFI

SparkR.R
library(SparkR)

start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
logData <- SparkR:::textFile(sc, logFile)
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken


0.3167355 seconds…pretty fast…I wonder how regular R will behave?

PlainR.R
library("stringr")

start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
logfile<-read.table(logFile,header = F, fill = T)
logfile<-apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df<-data.frame(lines=logfile)
a<-sum(apply(df,1,function(x) grepl("a",x)))
b<-sum(apply(df,1,function(x) grepl("b",x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Nice…0.01522398 seconds…wait…what? Isn’t Spark supposed to be pretty fast? Well…I remembered that I read somewhere that Spark shines with big files…

Well…I prepared a file with 5 columns and 1 million records…let’s see how that goes…

SparkR.R
library(SparkR)

start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logData <- SparkR:::textFile(sc, logFile)
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

26.45734 seconds for a million records? Nice job -:) Let’s see if plain R wins again…
PlainR.R
library("stringr")

start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logfile<-read.csv(logFile,header = F)
logfile<-apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df<-data.frame(lines=logfile)
a<-sum(apply(df,1,function(x) grepl("a",x)))
b<-sum(apply(df,1,function(x) grepl("b",x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
48.31641 seconds? Looks like Spark was almost twice as fast this time…and this is a pretty simple example…I’m sure that as complexity grows…the gap gets even bigger…
And sure…I know that a lot of people can take my plain R code and make it even faster than Spark…but…this is my blog…not theirs -;)
I will come back as soon as I learn more about SparkR -:D
Greetings,
Blag.
Development Culture.


My Yahoo talk is now online

By Rob J Hyndman

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

Last week I gave a talk in the Yahoo! Big Thinkers series. The video of the talk is now online and embedded below.


Get by with a little (R) help from your friends (at GitHub)

By hrbrmstr


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

@JennyBryan posted her slides from the 2015 R Summit and they are a must-read for instructors and even general stats/R folk. She’s one of the foremost experts in R+GitHub, and her personal and class workflows provide solid patterns worth emulating.

One thing she has mentioned a few times—and included in her R Summit talk—is the idea that you can lean on GitHub when official examples of a function are “kind of thin”. She uses a search for vapply as an example, showing how to search for uses of vapply in CRAN (there’s a read-only CRAN mirror on GitHub) and in GitHub R code in general.
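Her vapply example is apt: the official help really is terse, while the function itself is just sapply with a declared result type. A quick illustration in base R, using nothing beyond the built-in mtcars data:

```r
# FUN.VALUE declares the type and length of each result, so vapply
# fails fast instead of silently returning a list like sapply can.
vapply(mtcars[c("mpg", "hp")], mean, FUN.VALUE = numeric(1))
#       mpg        hp
#  20.09062 146.68750

# A result of the wrong shape is an immediate error, not a surprise
# downstream:
# vapply(1:3, function(x) c(x, x), numeric(1))  # errors: length 2, not 1
```

Seeing how CRAN packages pick their `FUN.VALUE` templates in the wild is exactly the kind of thing her GitHub search turns up.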

I remember throwing together a small function to kick up a browser from R for those URLs (in a response to one of her tweets), but realized this morning (after reading her slides last night) that it’s possible to not leave RStudio to get these GitHub search results (or, at least the first page of results). So, I threw together this gist which, when sourced, provides a ghelp function. This is the code:

ghelp <- function(topic, in_cran=TRUE) {
 
  require(htmltools) # for getting HTML to the viewer
  require(rvest)     # for scraping & munging HTML
 
  # github search URL base
  base_ext_url <- "https://github.com/search?utf8=%%E2%%9C%%93&q=%s+extension%%3AR"
  ext_url <- sprintf(base_ext_url, topic)
 
  # if searching with user:cran (the default) add that to the URL  
  if (in_cran) ext_url <- paste(ext_url, "+user%3Acran", sep="", collapse="")
 
  # at the time of writing, "rvest" and "xml2" are undergoing some changes, so
  # accommodate those of us who are on the bleeding edge of the hadleyverse
  # either way, we are just extracting out the results <div> for viewing in 
  # the viewer pane (it works in plain ol' R, too)
  if (packageVersion("rvest") < "0.2.0.9000") { 
    require(XML)
    pg <- html(ext_url)
    res_div <- paste(capture.output(html_node(pg, "div#code_search_results")), collapse="")
  } else {
    require(xml2)
    pg <- read_html(ext_url)
    res_div <- as.character(html_nodes(pg, "div#code_search_results"))
  }
 
  # clean up the HTML a bit   
  res_div <- gsub('How are these search results? <a href="/contact">Tell us!</a>', '', res_div)
  # include a link to the results at the top of the viewer
  res_div <- gsub('href="/', 'href="http://github.com/', res_div)
  # build the viewer page, getting CSS from github-proper and hiding some cruft
  for_view <- sprintf('<html><head><link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github/index-4157068649cead58a7dd42dc9c0f2dc5b01bcc77921bc077b357e48be23aa237.css" media="all" rel="stylesheet" /><style>body{padding:20px}</style></head><body><a href="%s">Show on GitHub</a><hr noshade size=1/>%s</body></html>', ext_url, res_div)
  # this makes it show in the viewer (or browser if you're using plain R)
  html_print(HTML(for_view))
 
}

Now, when you type ghelp("vapply"), you’ll get the search results in the viewer pane (and similarly with ghelp("vapply", in_cran=FALSE)). Clicking the top link will take you to the results page on GitHub (in your default web browser), and all the other links will pop out to a browser as well.

If you’re the trusting type, you can devtools::source_gist('32e9c140129d7d51db52') or just add this to your R startup functions (or add it to your personal helper package).

There’s definitely room for some CSS hacking and it would be fairly straightforward to get all the search results into the viewer by following the pagination links and stitching them all together (an exercise left to the reader).


Generalized Linear Mixed Models: the FAQ

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Mixed models (which include random effects, essentially parameters drawn from a random distribution) are tricky beasts. Throw non-Normal distributions into the mix for Generalized Linear Mixed Models (GLMMs), or go non-linear, and things get trickier still. It was a new field of Statistics when I was working on the Oswald package for S-PLUS, and even 20 years later some major questions have yet to be fully answered (like, how do you calculate the degrees of freedom for a significance test?).

These days lme4, nlme and MCMCglmm are the go-to R packages for mixed models, and if you’re using them you likely have questions. The r-sig-mixed-models FAQ is a good compendium of answers, and includes plenty of references for further reading. You can also join in the discussions on mixed models at the r-sig-mixed-models mailing list.
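For anyone who hasn’t tried these packages, the flavour of lme4 is easy to show with the sleepstudy data it ships with (a sketch, assuming lme4 is installed; the glmer line uses invented names purely to show the generalized form):

```r
library(lme4)

# Random intercept and slope per subject: reaction time as a function
# of days of sleep deprivation, with subject-level variation in both.
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)

# The GLMM version swaps in a family, e.g. a hypothetical logistic model:
# glmer(y ~ x + (1 | group), data = d, family = binomial)
```

The FAQ covers what to do after this point, including the thorny significance-testing question mentioned above.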
