Jug: Easily Create R APIs

By FishyOperations

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

Jug stands for Just Unified Galloping. Okay, okay, it’s just a play on words coming from a Flask (Python) background.

Jug is my attempt to create a simple small web framework that allows you to turn your (existing) R functions into an API. Having the wonderful httpuv package at my disposal made this very easy for me.

So, how does this work?

Let’s say I have the following function:

say_hello_to<-function(name) paste("Hello", name)

Sometimes you would be in a situation where you want to send a GET request to a server and let a function (let’s say the one above) return it’s result. Thus, I want to expose the above function to allow HTTP GET requests to acces it. Using Jug I could do something like this:


jug() %>%
  gett("/", decorate(say_hello_to)) %>%
Serving the jug at

However, when I run this code and post a GET request to the URL, the following happens:

$ curl
ERROR: argument "name" is missing, with no default

Obviously, because the function say_hello_to requires the parameter name. A second attempt has better results:

$ curl
Hello Bart

However, besides exposing exsting fuctions, Jug is quite flexible. It has a middleware system, which allows to build up a (http) web server layer by layer. For example:


jug() %>%
  postt("/", function(req, res){
  }) %>%
  gett("/", function(req, res){
    paste("Hello",my_first_name, my_last_name)
  }) %>%

$ curl --data "first_name=Bart&last_name=Blitzers"
$ curl
Hello Bart Blitzers

So, enough for now. More on this later…

A (very) early version can be found at github.com/Bart6114/jug.

To leave a comment for the author, please follow the link and comment on their blog: FishyOperations.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R User Groups Highlight R Creativity

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

I have been a big fan of R user groups since I attended my first meeting. There is just something about the vibe of being around people excited about what they are doing that feels good. From a speaker’s perspective, presenting at an R user Group meeting must be the rough equivalent of doing “stand-up” at a club where you know mostly everyone and you are pretty sure people are going to like your material. So while user groups don’t necessarily ignite R creativity (people don’t do their best work just to present at an R User group meeting), they do help to shine the spotlight on some really good stuff.

I attend all of the Bay Area useR group meetings, and quite a few other R related events throughout the year, but I only get to experience a small fraction of what is going on in the R world. In the spirit of sharing the “wish I was there” feeling, here are a few recent user group presentations from around the globe that look like they were informative, entertaining and motivating.

Tommy O’Dell gave “Welcome to dply” talk to the Western Australia R Group (WARG) on September 10th. This is a very good presentation until near the very end when it becomes an absolutely great presentation!! Apparently, motivated by a desire to use dplyr with R 2.12, an older R version of R not supported by dplyr, Tommy deconstructed the dplyr “magic” to write his own package, rdplyr. This is a wonderful example of how curiosity and open source can open up many possibilities. The following slide comes from the section where Tommy explains some of the problems he encountered and how he worked through them.

On the 16th of September, Kevin Little gave a talk to MadR about how he recovered after “hitting the wall” in failed first attempt to interface to the SurveyMonkey API using the Rmonkey package. Kevin’s description of how he worked through the process which included wading into some JSON scripting is a motivational case study. Kevin wrote a blog post that provides background for the project and has made his slides available here.

Also in September Jim Porzak, a long-time contributor to the San Francisco Bay Area R community, described a detailed customer segmentation analysis in a presentation to BARUG. The following slide examines the stability of the clusters.

Finally, there is a small treasure trove of relatively recent work at the BaselR presentations page. These include a presentation from Aimee Gott on the Mango Solutions development environment and one from Anne Kuemmel on using simulations to calculate confidence intervals in pharma applications. Also have a look at Daniel Sabanes Bove’s presentation on using R to produce Microsoft PowerPoint presentations, and some thoughtful advice from Reinhold Koch on how to go about creating a lively R community within your company.




Let us all adopt this mindset!!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

ChIPseq data mining with ChIPseeker

By ygc

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

ChIP-seq is rapidly becoming a common technique and there are a large number of dataset available in the public domain. Results from individual experiments provide a limited understanding of chromatin interactions, as there is many factors cooperate to regulate transcription. Unlike other tools that designed for single dataset, ChIPseeker is designed for comparing profiles of ChIP-seq datasets at different levels.

We provide functions to compare profiles of peaks binding to TSS regions, annotation, and enriched functional profiles. More importantly, ChIPseeker incorporates statistical testing of co-occurrence of different ChIP-seq datasets and can be used to identify co-factors.

?View Code RSPLUS

> library(ChIPseeker)
> ff=getSampleFiles()
> x = enrichPeakOverlap(ff[[5]], unlist(ff[1:4]), nShuffle=10000, pAdjustMethod="BH", chainFile=NULL)
>> permutation test of peak overlap...		 2015-09-24 14:23:43
  |======================================================================| 100%
> x
ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz
                                                      tSample qLen tLen N_OL
ARmo_0M                       GSM1174480_ARmo_0M_peaks.bed.gz 1663  812    0
ARmo_1nM                     GSM1174481_ARmo_1nM_peaks.bed.gz 1663 2296    8
ARmo_100nM                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663 1359    3
CBX6_BF    GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663 1331  968
               pvalue   p.adjust
ARmo_0M    0.88901110 0.88901110
ARmo_1nM   0.15118488 0.30236976
ARmo_100nM 0.37296270 0.49728360
CBX6_BF    0.00009999 0.00039996

The enrichment analysis of peak overlap is based on permutation test. nShuffle of random ChIP data were generated to estimate the background null distribution of the overlap and p-value is then calculated by the probability of observing more extreme overlap. Multiple comparison correction is also incorporated.

The most exciting feature in ChIPseeker is that it collected more than 18,000 bed file information from GEO database and make this co-factor inference available to the community. With these datasets, we can compare our own dataset to those deposited in GEO to identify co-occurrence binding proteins that maybe cooperated with the one we are interested in. Hypothesis can be generated by this inference and serve as a starting point for further study.


G Yu, LG Wang, QY He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015, 31(14):2382-2383. PMID:25765347.

Related Posts

To leave a comment for the author, please follow the link and comment on their blog: YGC » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

rleafmap: R Markdown in interactive popups

By franzk


(This article was first published on Piece of K » R_EN, and kindly contributed to R-bloggers)

This is the second ” big feature coming with branch 0.2 of rleafmap (now on CRAN!). With this new version you can write popups content in R Markdown which will be processed when you generate the map. This can be useful to format popups using markdown syntax (if you need more control remind that popups can also be formatted with html tags). More interesting with R Markdown is the possibility to include outputs of R code chunks. Thus, results, simulations and plots can be easily embedded within popups content.

To activate R Markdown for a layer you just have to pass the R Markdown code to the popup argument and set popup.rmd to TRUE.

The chunkerize function

You can write your R code chunks manually but you can also use the function chunkerize which tries to make your life simpler. This function has two purposes:

  • It turns a function and its arguments into an R code chunk.
  • It can decompose the elements of a list and use them as values for a given argument.

The latter may be useful when you have different features on your map and you want to execute the same function for each of them but with different data as input. This can be done by tagging the list containing the data with a star character (*). See the following example:

First we load the packages we need:


Then we load the data: a shapefile of regions and a dataframe giving the evolution of the population for each region.

reg <- readShapePoly("regions-20140306-100m")
reg <- reg1
pop <- read.csv("population.txt", sep = "t", row.names = 1)
pop <- pop[rev(rownames(pop)), sort(names(pop))]

We prepare some colors for the map and a legend:

gcol <- rev(heat.colors(5))
gcut <- cut(as.numeric(pop["2012", ]),
            breaks = c(0, 2000000, 3000000, 4000000, 8000000, 12000000))
reg.col <- gcol[as.numeric(gcut)]
reg.leg <- layerLegend(style = "polygons",
                       title = "Population",
                       labels = levels(gcut),
                       fill.col = gcol)

We prepare the data for chunkerize: pop is already a list, since it is a dataframe but for clarity we turn it into a simple list.

Year <- as.numeric(rownames(pop))
L <- as.list(pop)
L2 <- as.list(names(L))

Now is the trick. We create a chunk based on a plot function. We provide 5 arguments (names and values). Each arg.values is going to be recycled, except L and L2 which are tagged with a star. In that case, each element of these lists is going to be used as values.

popup <- chunkerize(FUN = plot,
                    arg.names = c("x", "y", "type", "ylab", "main"),
                    arg.values = c("Year", "*L", "'b'", "'Population'", "*L2"))

Now we just have to create the layers and compile the map:

cdbpos.bm <- basemap("cartodb.positron.nolab")
reg.map <- spLayer(reg,
                   fill.col = reg.col,
                   legend = reg.leg,
                   popup = popup,
                   popup.rmd = TRUE)
writeMap(cdbpos.bm, reg.map)

Et voilà ! There is a problem with Rstudio and Firefox. It happens that popups do not appear where they should on the first click. It works fine on Chromium.

To leave a comment for the author, please follow the link and comment on their blog: Piece of K » R_EN.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Understanding empirical Bayes estimation (using baseball statistics)

By David Robinson


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Which of these two proportions is higher: 4 out of 10, or 300 out of 1000? This sounds like a silly question. Obviously , which is greater than .

But suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One has achieved 4 hits in 10 chances, the other 300 hits in 1000 chances. While the first player has a higher proportion of hits, it’s not a lot of evidence: a typical player tends to achieve a hit around 27% of the time, and this player’s could be due to luck. The second player, on the other hand, has a lot of evidence that he’s an above-average batter.

This post isn’t really about baseball, I’m just using it as an illustrative example. (I actually know very little about sabermetrics. If you want a more technical version of this post, check out this great paper). This post is, rather, about a very useful statistical method for estimating a large number of proportions, called empirical Bayes estimation. It’s to help you with data that looks like this:

Success Total
11 104
82 1351
2 26
0 40
1203 7592
5 166

A lot of data takes the form of these success/total counts, where you want to estimate a “proportion of success” for each instance. Each row might represent:

  • An ad you’re running: Which of your ads have higher clickthrough rates, and which have lower? (Note that I’m not talking about running an A/B test comparing two options, but rather about ranking and analyzing a large list of choices.)
  • A user on your site: In my work at Stack Overflow, I might look at what fraction of a user’s visits are to Javascript questions, to guess whether they are a web developer. In another application, you might consider how often a user decides to read an article they browse over, or to purchase a product they’ve clicked on.

When you work with pairs of successes/totals like this, you tend to get tripped up by the uncertainty in low counts. does not mean the same thing as ; nor does mean the same thing as . One approach is to filter out all cases that don’t meet some minimum, but this isn’t always an option: you’re throwing away useful information.

I previously wrote a post about one approach to this problem, using the same analogy: Understanding the beta distribution (using baseball statistics). Using the beta distribution to represent your prior expectations, and updating based on the new evidence, can help make your estimate more accurate and practical. Here I’ll demonstrate the related method of empirical Bayes estimation, where the beta distribution is used to improve a large set of estimates. What’s great about this method is that as long as you have a lot of examples, you don’t need to bring in prior expectations.

Here I’ll apply empirical Bayes estimation to a baseball dataset, with the goal of improving our estimate of each player’s batting average. I’ll focus on the intuition of this approach, but will also show the R code for running this analysis yourself. (So that the post doesn’t get cluttered, I don’t show the code for the graphs and tables, only the estimation itself. But you can find all this post’s code here)

Working with batting averages

In my original post about the beta distribution, I made some vague guesses about the distribution of batting averages across history, but here we’ll work with real data. We’ll use the Batting dataset from the excellent Lahman package. We’ll prepare and clean the data a little first, using dplyr and tidyr:


career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# use names along with the player IDs
career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") %>%

Above, we filtered out pitchers (generally the weakest batters, who should be analyzed separately). We summarized each player across multiple years to get their career Hits (H) and At Bats (AB), and batting average. Finally, we added first and last names to the dataset, so we could work with them rather than an identifier:

## Source: local data frame [9,256 x 4]
##                 name     H    AB average
##                (chr) (int) (int)   (dbl)
## 1         Hank Aaron  3771 12364  0.3050
## 2       Tommie Aaron   216   944  0.2288
## 3          Andy Abad     2    21  0.0952
## 4        John Abadie    11    49  0.2245
## 5     Ed Abbaticchio   772  3044  0.2536
## 6        Fred Abbott   107   513  0.2086
## 7        Jeff Abbott   157   596  0.2634
## 8        Kurt Abbott   523  2044  0.2559
## 9         Ody Abbott    13    70  0.1857
## 10 Frank Abercrombie     0     4  0.0000
## ..               ...   ...   ...     ...

I wonder who the best batters in history were. Well, here are the ones with the highest batting average:

name H AB average
Jeff Banister 1 1 1
Doc Bass 1 1 1
Steve Biras 2 2 1
C. B. Burns 1 1 1
Jackie Gallagher 1 1 1

Err, that’s not really what I was looking for. These aren’t the best batters, they’re just the batters who went up once or twice and got lucky. How about the worst batters?

name H AB average
Frank Abercrombie 0 4 0
Horace Allen 0 7 0
Pete Allen 0 4 0
Walter Alston 0 1 0
Bill Andrus 0 9 0

Also not what I was looking for. That “average” is a really crummy estimate. Let’s make a better one.

Step 1: Estimate a prior from all your data

Let’s look at the distribution of batting averages across players.

(For the sake of estimating the prior distribution, I’ve filtered out all players that have fewer than 500 at-bats, since we’ll get a better estimate from the less noisy cases. I show a more principled approach in the Appendix).

The first step of empirical Bayes estimation is to estimate a beta prior using this data. Estimating priors from the data you’re currently analyzing is not the typical Bayesian approach- usually you decide on your priors ahead of time. There’s a lot of debate and discussion about when and where it’s appropriate to use empirical Bayesian methods, but it basically comes down to how many observations we have: if we have a lot, we can get a good estimate that doesn’t depend much on any one individual. Empirical Bayes is an approximation to more exact Bayesian methods- and with the amount of data we have, it’s a very good approximation.

So far, a beta distribution looks like a pretty appropriate choice based on the above histogram. (What would make it a bad choice? Well, suppose the histogram had two peaks, or three, instead of one. Then we might need a mixture of Betas, or an even more complicated model). So we know we want to fit the following model:

We just need to pick and , which we call “hyper-parameters” of our model. There are many methods in R for fitting a probability distribution to data (optim, mle, bbmle, etc). You don’t even have to use maximum likelihood: you could use the mean and variance, called the “method of moments”. But we’ll use the fitdistr function from MASS.

# just like the graph, we have to filter for the players we actually
# have a decent estimate of
career_filtered <- career %>%
    filter(AB >= 500)

m <- MASS::fitdistr(career_filtered$average, dbeta,
                    start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

This comes up with and . How well does this fit the data?


Not bad! Not perfect, but something we can work with.

Step 2: Use that distribution as a prior for each individual estimate

Now when we look at any individual to estimate their batting average, we’ll start with our overall prior, and update based on the individual evidence. I went over this process in detail in the original Beta distribution post: it’s as simple as adding to the number of hits, and to the total number of at-bats.

For example, consider our hypothetical batter from the introduction that went up 1000 times, and got 300 hits. We would estimate his batting average as:

How about the batter who went up only 10 times, and got 4 hits. We would estimate his batting average as:

Thus, even though , we would guess that the batter is better than the batter!

Performing this calculation for all the batters is simple enough:

career_eb <- career %>%
    mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))


Now we can ask: who are the best batters by this improved estimate?

name H AB average eb_estimate
Rogers Hornsby 2930 8173 0.358 0.355
Shoeless Joe Jackson 1772 4981 0.356 0.350
Ed Delahanty 2596 7505 0.346 0.343
Billy Hamilton 2158 6268 0.344 0.340
Harry Heilmann 2660 7787 0.342 0.339

Who are the worst batters?

name H AB average eb_estimate
Bill Bergen 516 3028 0.170 0.178
Ray Oyler 221 1265 0.175 0.191
John Vukovich 90 559 0.161 0.196
John Humphries 52 364 0.143 0.196
George Baker 74 474 0.156 0.196

Notice that in each of these cases, empirical Bayes didn’t simply pick the players who had 1 or 2 at-bats. It found players who batted well, or poorly, across a long career. What a load off our minds: we can start using these empirical Bayes estimates in downstream analyses and algorithms, and not worry that we’re accidentally letting or cases ruin everything.

Overall, let’s see how empirical Bayes changed all of the batting average estimates:


The horizontal dashed red line marks – that’s what we would guess someone’s batting average was if we had no evidence at all. Notice that points above that line tend to move down towards it, while points below it move up.

The diagonal red line marks . Points that lie close to it are the ones that didn’t get shrunk at all by empirical Bayes. Notice that they’re the ones with the highest number of at-bats (the brightest blue): they have enough evidence that we’re willing to believe the naive batting average estimate.

This is why this process is sometimes called shrinkage: we’ve moved all our estimates towards the average. How much it moves these estimates depends on how much evidence we have: if we have very little evidence (4 hits out of 10) we move it a lot, if we have a lot of evidence (300 hits out of 1000) we move it only a little. That’s shrinkage in a nutshell: Extraordinary outliers require extraordinary evidence.

Conclusion: So easy it feels like cheating

Recall that there were two steps in empirical Bayes estimation:

  1. Estimate the overall distribution of your data.
  2. Use that distribution as your prior for estimating each average.

Step 1 can be done once, “offline”- analyze all your data and come up with some estimates of your overall distribution. Step 2 is done for each new observation you’re considering. You might be estimating the success of a post or an ad, or classifying the behavior of a user in terms of how often they make a particular choice.

And because we’re using the beta and the binomial, consider how easy that second step is. All we did was add one number to the successes, and add another number to the total. You can build that into your production system with a single line of code that takes nanoseconds to run.

// We hired a Data Scientist to analyze our Big Data
// and all we got was this lousy line of code.
float estimate = (successes + 78.7) / (total + 303.5);

That really is so simple that it feels like cheating- like the kind of “fudge factor” you might throw into your code, with the intention of coming back to it later to do some real Machine Learning.

I bring this up to disprove the notion that statistical sophistication necessarily means dealing with complicated, burdensome algorithms. This Bayesian approach is based on sound principles, but it’s still easy to implement. Conversely, next time you think “I only have time to implement a dumb hack,” remember that you can use methods like these: it’s a way to choose your fudge factor. Some dumb hacks are better than others!

But when anyone asks what you did, remember to call it “empirical Bayesian shrinkage towards a Beta prior.” We statisticians have to keep up appearances.

Appendix: How could we make this more complicated?

We’ve made some enormous simplifications in this post. For one thing, we assumed all batting averages are drawn from a single distribution. In reality, we’d expect that it depends on some known factors. For instance, the distribution of batting averages has changed over time:


Ideally, we’d want to estimate a different Beta prior for each decade. Similarly, we could estimate separate priors for each team, a separate prior for pitchers, and so on. One useful approach to this is Bayesian hierarchical modeling (as used in, for example, this study of SAT scores across different schools).

Also, as alluded to above, we shouldn’t be estimating the distribution of batting averages. Really, we should use all of our data to estimate the distribution, but give more consideration to the players with a higher number of at-bats. This can be done by fitting a beta-binomial distribution. For instance, we can use the dbetabinom.ab function from VGAM, and the mle function:


# negative log likelihood of data given alpha; beta
ll <- function(alpha, beta) {
  -sum(dbetabinom.ab(career$H, career$AB, alpha, beta, log = TRUE))

m <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B")
## alpha  beta 
##    75   222

We end up getting almost the same prior, which is reassuring!

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News