Playing around with #rstats twitter data

By tony.fischetti@gmail.com

(This article was first published on On the lambda » R, and kindly contributed to R-bloggers)

As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the “#rstats” hashtag, which denotes that the tweeter is tweeting about R.

As a warning, I don’t know much about how these data were collected: whether tweets were sampled at random times during the day or whether collection was biased toward particular times and, therefore, locations. I wouldn’t read too much into this.

Most common co-occurring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occurring hashtags, I used the following perl script:

#!/usr/bin/perl

while(<>){
    chomp;
    $_ = lc($_);
    $_ =~ s/#rstats//g;
    my @matches;
    push @matches, /(#\w+)/g;
    print join("\n", @matches), "\n" if @matches;
}

which uses the regular expression “(#\w+)” to search for hashtags after removing “#rstats” from every tweet.
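As an aside, roughly the same extraction (and the counting done in the next step) could be carried out without leaving R. This is just a sketch, not the author's code, and it assumes the raw tweets sit one per line in the data/R-hashtag-data.txt file used below:

# Rough R equivalent of the perl + sort + uniq pipeline below
tweets <- tolower(readLines("data/R-hashtag-data.txt"))
tweets <- gsub("#rstats", "", tweets)                    # drop the query hashtag
tags   <- regmatches(tweets, gregexpr("#\\w+", tweets))  # extract remaining hashtags
tag.counts <- sort(table(unlist(tags)), decreasing = TRUE)
head(tag.counts, 10)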

On the unix command-line, I put these other hashtags into a file and sorted via these commands:

cat data/R-hashtag-data.txt | ./PERL_SCRIPT_ABOVE.pl | tee other-hashtags.txt

sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt

After running these commands, I get a counted list of co-occurring hashtags, sorted in descending order. The top 10 co-occurring hashtags were as follows (you can see the rest here):

5258 #datascience
1665 #python
1625 #bigdata
1542 #r
1451 #dataviz
1360 #ggplot2
 852 #statistics
 783 #dplyr
 749 #machinelearning
 743 #analytics

Neat-o. The presence of “#python” and “#ggplot2” in the top 10 made me wonder what the top 10 programming-language and R-package related hashtags were. Here they are, respectively:

1665 #python
 423 #d3js (plus 72 for #d3) (plus 2 for #js)
 343 #sas
 312 #julialang (plus 43 for #julia)
 240 #fsharp
 140 #spss  (plus 7 for #ibmspss)
 102 #stata
  75 #matlab
  55 #sql
  38 #java

1360 #ggplot2  (plus 298 for ggplot)  (plus 6 for #gglot2) (plus 4 for #ggpot)
 783 #dplyr
 663 #shiny
 557 #rcpp (plus 22 for rcpp11)
 251 #knitr
 156 #magrittr
 105 #lme4
  93 #ggvis   (plus 11 for #ggivs)
  65 #datatable
  46 #rneo4j

You can view the full list here and here.

I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about–at least as far as hashtags-applied-to-a-package go.

#strangehashtags
Before moving on to the next section, I wanted to share my favorite co-occurring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.

Most prolific #rstats tweeters
One of the first things I did with these data is a simple aggregation and sort to find the tweeters that used the hashtag most often:

library(dplyr)
THE_DATA %>%
  group_by(User) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) -> prolific.rstats.tweeters

Here are the top 10 (you can see the rest here).

@Rbloggers	1081
@hadleywickham	498
@timelyportfolio	427
@recology_	419
@revodavid	210
@chlalanne	209
@adolfoalvarez	199
@RLangTip	175
@jmgomez	160

Nothing terribly surprising here.

Normalizing by total tweets
In a twitter discussion about these data, a twitter friend, Tim Hopper, posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would come out ahead if you control for total tweet volume. I wondered how this sorting would look.

Answering this question gave me an excuse to use Hadley Wickham’s new package, rvest (I literally just got why the package is named as much while typing this out), which makes web scraping easier, in part by leveraging the expressive power of the magrittr package.

To get the total number of tweets for a particular tweeter, I wrote the following function:

library(rvest)
library(magrittr)
get.num.tweets <- function(handle){
  tryCatch({
    unraw <- function(raw_str){
      raw_str <- gsub(",", "", raw_str)   # remove commas if any
      if(grepl("K", raw_str)){
        return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
      }
      return(as.numeric(raw_str))
    }
    html(paste0("http://twitter.com/", sub("@", "", handle))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text() %>%
      unraw
    },
    error=function(cond){return(NA)})
}

The real logic (and beauty) of which is contained only in the last few lines:

    html(paste0("http://twitter.com/", sub("@", "", TWITTER_HANDLE))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text()

The CSS element that houses the number of total tweets from a useR’s twitter page was found easily using SelectorGadget.

After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I’m considerate), I divided the number of #rstats tweets by the total number of tweets to come up with a normalized value.
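For completeness, the scraping loop and the normalization can be sketched roughly as follows. The politeness delay, the column names, and the reuse of the prolific.rstats.tweeters table from above are my assumptions, not necessarily the author's exact code:

tweeters <- prolific.rstats.tweeters
tweeters$num.of.tweets <- sapply(tweeters$User, function(handle){
  Sys.sleep(2)                  # wait between requests
  get.num.tweets(handle)
})

tweeters$ratio <- tweeters$count / tweeters$num.of.tweets
tweeters <- tweeters[order(-tweeters$ratio), ]
head(tweeters, 10)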

The top 10 tweeteRs were as follows:

              User count num.of.tweets     ratio 
1     @medzihorsky     9            28 0.3214286 
2        @statworx     5            16 0.3125000 
3    @LearnRinaDay   114           404 0.2821782 
4  @RforExcelUsers     4            15 0.2666667 
5     @showmeshiny    27           102 0.2647059 
6           @tcrug     6            25 0.2400000 
7   @DailyRpackage   155           666 0.2327327 
8   @R_Programming    49           250 0.1960000 
9        @hexadata     8            41 0.1951220 
10     @Deep_RHelp    11            58 0.1896552 

In case you were wondering, Trey Causey still “won” by a long shot:

> tweeters[which(tweeters$User=="@tdhopper"),]   
Source: local data frame [1 x 4]                 
                                                 
       User count num.of.tweets        ratio     
1 @tdhopper     8         26700 0.0002996255     
> tweeters[which(tweeters$User=="@treycausey"),] 
Source: local data frame [1 x 4]                 
                                                 
         User count num.of.tweets      ratio     
1 @treycausey    50         28700 0.00174216

Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using number of #rstats tweets as a proxy for who likes R the most or who are the biggest R “thought leaders” (whatever that is). Most tweets about R don’t use the #rstats hashtag, anyway.

Again, I wouldn’t read too much into this 🙂

Tools in Tandem – SQL and ggplot. But is it Really R?

By Tony Hirst

(This article was first published on OUseful.Info, the blog… » Rstats, and kindly contributed to R-bloggers)

Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it.

So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:

#Set up a connection to a local copy of the ergast database
library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Run a query
q='SELECT code, grid, year, COUNT(l.lap) AS Laps 
    FROM (SELECT grid, raceId, driverId from results) rg,
        lapTimes l, races r, drivers d 
    WHERE rg.raceId=l.raceId AND d.driverId=l.driverId
          AND rg.driverId=l.driverId AND l.position=1 AND r.raceId=l.raceId 
    GROUP BY grid, driverRef, year 
    ORDER BY year'

driverlapsledfromgridposition=dbGetQuery(ergastdb,q)

In this case, the data is a table that shows, for each year, a count of laps led by each driver from each grid position in the corresponding races (null values are not reported). The data grabbed from the database is passed into a dataframe in a relatively tidy format, from which we can easily generate a visualisation.

The chart I have opted for is a text plot faceted by year:

The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to minimise overlap. The horizontal grid is actually a logarithmic scale, which “stretches out” the positions at the front of the grid (grid positions 1 and 2) compared to positions lower down the grid, where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lie along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The x-axis is further expanded to prevent labels being obscured or overflowing off the left hand side of the plotting area. The clean black and white theme finishes off the chart.

library(ggplot2)
g = ggplot(driverlapsledfromgridposition)
g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed')
g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps), angle=45))
g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size=FALSE)
g + scale_x_log10(expand=c(0,0.3)) + theme_bw()

There are still a few problems with this graphic, however. The y-axis labels are currently in alphabetical order; the chart would perhaps be more informative if they were ordered to reflect championship rankings, for example.
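One way to do that is to re-level the code factor before plotting. Championship rank isn't in this particular query, so as a stand-in the sketch below orders drivers by their total laps led; that ordering is my assumption, not the author's choice:

# Order the y-axis by total laps led (a stand-in for championship rank)
lap.totals <- aggregate(Laps ~ code, data = driverlapsledfromgridposition, sum)
driverlapsledfromgridposition$code <- factor(
  driverlapsledfromgridposition$code,
  levels = lap.totals$code[order(lap.totals$Laps)]
)

Rebuilding the plot with the same ggplot2 code then stacks the most prolific lap-leaders at the top of each facet.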

However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as a medium within which this activity has taken place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot library.

So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an R based analysis?

For more details, see the “Laps Completed and Laps Led” draft chapter and the Wrangling F1 Data With R book.

Scalable Machine Learning for Big Data Using R and H2O

By Daniel Emaasit

(This article was first published on Data Science Las Vegas (DSLV) » R, and kindly contributed to R-bloggers)

Part I

Part II

H2O is an open source parallel processing engine for machine learning on Big Data. This prediction engine is designed by H2O, a Mountain View-based startup that has implemented a number of impressive statistical and machine learning algorithms to run on HDFS, S3, SQL and NoSQL.

We were honored to have Tom Kraljevic (Vice President of Engineering at H2O) demonstrate how this prediction engine is suited for machine learning on Big Data from within R. Yes, that’s right, from within R. Most R users will attest to running into memory issues when crunching millions or billions of data records. That’s what H2O is designed to address. So it was no surprise that most of the R users in attendance, including myself, were impressed when Tom said:

“R tells H2O to perform a task…and then H2O returns the result back to R, which is a tiny result….but you never actually transfer the data to R…That’s the magic behind the scalability of H2O with R.”

This feature appealed to me. The data never flows through R! R requires a reference object to the H2O instance because it uses a REST API to send functions to H2O. Data sets are not transmitted directly through the REST API. Instead, the user sends a command (for example, an HDFS path to the data set) either through the browser or via the REST API to ingest data from disk.

You can find the slides to this presentation by clicking here or copy and paste the following URL into your web browser (https://github.com/h2oai/h2o-meetups/tree/master/2015_02_23_Scalable_ML_Using_R). You can also watch Tom’s presentation in a series of two videos shown above.

Another takeaway from this meetup was that H2O provides a combination of extraordinary math, backed by some of the most knowledgeable experts in machine learning: Stanford professors Trevor Hastie, Rob Tibshirani and Stephen Boyd. It is also easy to use within R: their package is available on CRAN, and you can get started by launching and initializing H2O from within R using a few lines of code.

View this code snippet on GitHub.
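For readers who just want the flavour of that workflow, a minimal sketch looks something like the following. This is not the snippet from the talk; the file path and column names are placeholders, and argument names vary a little across h2o package versions:

library(h2o)
h2o.init()                      # launch (or connect to) a local H2O instance

# H2O ingests the data itself; R only holds a reference to the H2OFrame
flights <- h2o.importFile("hdfs://namenode/data/flights.csv")

model <- h2o.glm(x = c("Origin", "Dest", "Distance"),
                 y = "IsDepDelayed",
                 training_frame = flights,
                 family = "binomial")
summary(model)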

The post Scalable Machine Learning for Big Data Using R and H2O appeared first on Data Science Las Vegas (DSLV).

RcppEigen 0.3.2.4.0

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of RcppEigen is now on CRAN and in Debian. It synchronizes the Eigen code with the 3.2.4 upstream release, and updates the RcppEigen.package.skeleton() package creation helper to use the kitten() function from pkgKitten for enhanced package creation.
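For reference, the updated helper is invoked like any other package skeleton function; a minimal example (the package name here is arbitrary):

library(RcppEigen)
# Creates a skeleton package linking against RcppEigen; per the NEWS entry,
# pkgKitten's kitten() is used for the scaffolding when that package is installed.
RcppEigen.package.skeleton("demoEigenPackage")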

The NEWS file entry follows.

Changes in RcppEigen version 0.3.2.4.0 (2015-02-23)

  • Updated to version 3.2.4 of Eigen

  • Update RcppEigen.package.skeleton() to use pkgKitten if available

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Data Science/Statistics/R @Google

By Szilard Pafka

(This article was first published on Data Science Los Angeles » R, and kindly contributed to R-bloggers)

This meetup will be hosted by Google and we’ll have Peter Lipman and Pete Meyer talk about the data science/statistics projects they have been working on at Google.

Human Evaluation Comparisons: Common Framework with Applications

We designed a study to determine whether being on device influenced a quality evaluator’s perception of ads quality. Here, we present a common framework, developed into an internal R package, that was used to analyze the study.

Bio: Peter Lipman received his Ph.D in Biostatistics from Harvard University in 2011. He joined Google in 2013 after working in the pharmaceutical industry as a clinical trial statistician in early-stage oncology research. Peter works in the LAX office on the Ads Human Evaluation team, helping internal clients measure the quality of ads in the Google network using human computation.

Tracking Changes in Evaluators’ Weekly Work Schedules

The quality evaluators who provide ads quality evaluations determine their own schedules, with the downside that they may sometimes log in and find no work available. We investigated using messaging to help them adjust their schedules to ensure work was available. We describe using circular earthmover distributions for identifying and measuring changes in weekly work schedules, along with some negative binomial models for comparing the number of ratings done by groups of raters during specific blocks of time.

Bio: Pete Meyer received his Ph.D. in Statistics from the University of Chicago in 1993. He joined Google after 14 years as a biostatistician in the Dept. of Preventive Medicine at Rush University Medical Center in Chicago. He has worked on a variety of Google ads quality teams over the last 7 years and is currently on the Ads Human Evaluation team in the Google L.A. office.

Date: March 10, 2015 (Tuesday)

Timeline:
– 6:15pm food/bev & networking
– 7:00pm talks start promptly

You must have a confirmed RSVP, and please arrive by 6:55pm at the latest. Please RSVP for this meetup here on Eventbrite. You must have an RSVP and an ID with matching names, as Google security will be checking IDs at the gate.

No more free spots? To increase your chance to get in next time, consider signing up to the DataScience.LA mailing list (we announce all new meetups on that mailing list first).

Venue: Google, 340 Main Street, Venice, CA 90291

Career NBA: The Road Least Traveled

By Patrick Rhodes

(This article was first published on Graph of the Week, and kindly contributed to R-bloggers)

The bell rings – time to go to practice.
It’s a long, lonely road to the NBA

It’s quite rare for a high school athlete to receive a sports scholarship to even a single college, much less multiple schools. As we’ll come to see, Jarnell is quite the statistical outlier in the world of basketball. Most do not play beyond high school. Those that do rarely possess the world-class talent to play in the NBA (National Basketball Association). That being said, what are Jarnell’s chances of making a career playing in the NBA?

This article was written by Patrick Rhodes – the author of “Graph of the Week” – for Statistics Views and published on February 27, 2015. Read the rest of this article there.

Does Balancing Classes Improve Classifier Performance?

By Nina Zumel

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.

On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.

So let’s run a small experiment to investigate this question.

Experimental Setup

We used the ISOLET dataset, available at the UCI Machine Learning repository. The task is to recognize spoken letters. The training set consists of 120 speakers, each of whom uttered the letters A-Z twice; 617 features were extracted from the utterances. The test set is another 30 speakers, each of whom also uttered A-Z twice.

Our chosen task was to identify the letter “n”. This target class has a native prevalence of about 3.8% in both test and training, and is to be identified out of several other distinct co-existing populations. This is similar to a fraud detection situation, where a specific rare event has to be picked out of a population of disparate “innocent” events.

We trained our models against a training set where the target was present at its native prevalence; against training sets where the target prevalence was enriched by resampling to twice, five times, and ten times its native prevalence; and against a training set where the target prevalence was enriched to 50%. This replicates some plausible enrichment scenarios: enriching the rare class by a large multiplier, or simply balancing the classes. All training sets were the same size (N=2000). We then ran each model against the same test set (with the target variable at its native prevalence) to evaluate model performance. We used a threshold of 50% to assign class labels (that is, we labeled the data by the most probable label). To get a more stable estimate of how enrichment affected performance, we ran this loop ten times and averaged the results for each model type.

We tried three model types:

  • cv.glmnet from R package glmnet: Regularized logistic regression, with alpha=0 (L2 regularization, or ridge). cv.glmnet chooses the regularization penalty by cross-validation.
  • randomForest from R package randomForest: Random forest with the default settings (500 trees, nvar/3, or about 205 variables drawn at each node).
  • ksvm from R package kernlab: Soft-margin SVM with the radial basis kernel and C=1. (A rough sketch of all three fitting calls follows this list.)
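For concreteness, the three fitting calls look roughly like this. trainX and trainY are placeholder names for the ISOLET feature matrix and the TRUE/FALSE target; they are not from the original post:

library(glmnet)
library(randomForest)
library(kernlab)

# ridge-regularized logistic regression, penalty chosen by cross-validation
glm.model <- cv.glmnet(trainX, as.factor(trainY), family = "binomial", alpha = 0)

# random forest with 500 trees (the post quotes mtry of roughly nvar/3)
rf.model  <- randomForest(trainX, as.factor(trainY), ntree = 500)

# soft-margin SVM with the radial basis kernel and C = 1
svm.model <- ksvm(trainX, as.factor(trainY), kernel = "rbfdot", C = 1, prob.model = TRUE)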

Since there are many ways to resample the data for enrichment, here’s how I did it. The target variable is assumed to be TRUE/FALSE, with TRUE as the class of interest (the rare one). dataf is the data frame of training data, N is the desired size of the enriched training set, and prevalence is the desired target prevalence.

makePrevalence = function(dataf, target, 
                          prevalence, N) {
  # indices of T/F
  tset_ix = which(dataf[[target]])
  others_ix = which(!dataf[[target]])
  
  ntarget = round(N*prevalence)
  
  heads = sample(tset_ix, size=ntarget, 
                 replace=TRUE)
  tails = sample(others_ix, size=(N-ntarget), 
                 replace=TRUE)
  
  dataf[c(heads, tails),]
}
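Example usage, assuming the ISOLET training frame is called train and has a logical column isN marking the letter “n” (both names are mine, not the original post's):

train.native   <- makePrevalence(train, "isN", prevalence = 0.038, N = 2000)
train.balanced <- makePrevalence(train, "isN", prevalence = 0.5,   N = 2000)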

Training at the Native Target Prevalence

Before we run the full experiment, let’s look at how each of these three modeling approaches does when we fit models the obvious way — where the training and test sets have the same distribution:

## [1] "Metrics on training data"
## accuracy precision   recall specificity         label
##   0.9985 1.0000000 0.961039     1.00000      logistic
##   1.0000 1.0000000 1.000000     1.00000 random forest
##   0.9975 0.9736842 0.961039     0.99896           svm
## [1] "Metrics on test data"
##  accuracy precision    recall specificity         label
## 0.9807569 0.7777778 0.7000000   0.9919947      logistic
## 0.9717768 1.0000000 0.2666667   1.0000000 random forest
## 0.9846055 0.7903226 0.8166667   0.9913276           svm

We looked at four metrics. Accuracy is simply the fraction of datums classified correctly. Precision is the fraction of datums classified as positive that really were; equivalently, it’s an estimate of the conditional probability of a datum being in the positive class, given that it was classified as positive. Recall (also called sensitivity or the true positive rate) is the fraction of positive datums in the population that were correctly identified. Specificity is the true negative rate, or one minus the false positive rate: the fraction of negative datums correctly identified as such.
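Stated as code, these four metrics can be computed from predicted and true class labels like so (this helper is mine, not from the original post; pred and truth are logical vectors):

classifier.metrics <- function(pred, truth) {
  tp <- sum(pred & truth);   fp <- sum(pred & !truth)
  tn <- sum(!pred & !truth); fn <- sum(!pred & truth)
  c(accuracy    = (tp + tn) / length(truth),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),
    specificity = tn / (tn + fp))
}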

As the table above shows, random forest did perfectly on the training data, and the other two did quite well, too, with nearly perfect precision/specificity and high recall. However, random forest’s recall plummeted on the hold-out set, to 27%. The other two models degraded as well (logistic regression more than SVM), but still manage to retain decent recall, along with good precision and specificity. Random forest also has the lowest accuracy on the test set (although 97% still looks pretty good — another reason why accuracy is not always a good metric to evaluate classifiers on. In fact, since the target prevalence in the data set is only about 3.8%, a model that always returned FALSE would have an accuracy of 96.2%!).

One could argue that if precision is the goal, then random forest is still in the running. However, remember that the goal here is to identify a rare event. In many such situations (like fraud detection) one would expect that high recall is the most important goal, as long as precision/specificity are still reasonable.

Let’s see if enriching the target class prevalence during training improves things.

How Enriching the Training Data Changes Model Performance

First, let’s look at accuracy.

The x-axis is the prevalence of the target in the training data; the y-axis gives the accuracy of the model on the test set (with the target at its native prevalence), averaged over ten draws of the training set. The error bars are the bootstrap estimate of the 98% confidence interval around the mean, and the values for the individual runs appear as transparent dots at each value. The dashed horizontal line represents the accuracy of a model trained at the target class’s true prevalence, which we’ll call the model’s baseline performance. Logistic regression degraded the most dramatically of the three models as target prevalence increased. SVM degraded only slightly. Random forest improved, although its best performance (when training at about 19% prevalence, or five times native prevalence) is only slightly better than SVM’s baseline performance, and its performance at 50% prevalence is worse than the baseline performance of the other two classifiers.

Logistic regression’s degradation should be no surprise. Logistic regression optimizes deviance, which is strongly distributional; in fact, logistic regression (without regularization) preserves the marginal probabilities of the training data. Since logistic regression is so well calibrated to the training distribution, changes in the distribution will naturally affect model performance.

The observation that SVM’s accuracy stayed very stable is consistent with my surmise that SVM’s training procedure is not strongly dependent on the class distributions.

Now let’s look at precision:

All of the models degraded on precision, random forest the most dramatically (since it started at a higher baseline), SVM the least. SVM and logistic regression were comparable at baseline.

Let’s look at recall:

Enrichment improved the recall of all the classifiers, random forest most dramatically, although its best performance, at 50% enrichment, is not really any better than SVM’s baseline recall. Again, SVM’s recall moved the least.

Finally, let’s look at specificity:

Enrichment degraded all models’ specificity (i.e. they all make more false positives), logistic regression’s the most dramatically, SVM’s the least.

The Verdict

Based on this experiment, I would say that balancing the classes, or enrichment in general, is of limited value if your goal is to apply class labels. It did improve the performance of random forest, but mostly because random forest was a rather poor choice for this problem in the first place (It would be interesting to do a more comprehensive study of the effect of target prevalence on random forest. Does it often perform poorly with rare classes?).

Enrichment is not a good idea for logistic regression models. If you must do some enrichment, then these results suggest that SVM is the safest classifier to use, and even then you probably want to limit the amount of enrichment to less than five times the target class’s native prevalence — certainly a far cry from balancing the classes, if the target class is very rare.

The Inevitable Caveats

The first caveat is that we only looked at one data set, only three modeling algorithms, and only one specific implementation of each of these algorithms. A more thorough study of this question would consider far more datasets, and more modeling algorithms and implementations thereof.

The second caveat is that we were specifically supplying class labels, using a threshold. I didn’t show it here, but one of the notable issues with the random forest model when it was applied to hold-out was that it no longer scored the datums along the full range of 0-1 (which it did, on the training data); it generally maxed out at around 0.6 or 0.7. This possibly makes using 0.5 as the threshold suboptimal. The following graph was produced with a model trained with the target class at native prevalence, and evaluated on our test set.

The x-axis corresponds to different thresholds for setting class labels, ranging between 0.25 (more permissive about marking datums as positive) and 0.75 (less permissive about marking datums as positive). You can see that the random forest model (which didn’t score anything in the test set higher than 0.65) would have better accuracy with a lower threshold (about 0.3). The other two models have fairly close to optimal accuracy at the default threshold of 0.5. So perhaps it’s not fair to look at classifier performance without tuning the thresholds. However, if you’re tuning a model that was trained on enriched data, you still have to calibrate the threshold on un-enriched data — in which case, you might as well train on un-enriched data, too. In the case of this random forest model, its best accuracy (at threshold=0.3) is about as good as random forest’s accuracy when trained on a balanced data set, again suggesting that balancing the training set doesn’t contribute much. Tuning the threshold may be enough.
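A threshold scan of the kind behind that graph can be sketched as follows, with scores being the predicted probabilities on the test set and truth the true labels (both names are placeholders, not the original post's code):

thresholds <- seq(0.25, 0.75, by = 0.01)
accuracy.by.threshold <- sapply(thresholds, function(thresh) {
  mean((scores > thresh) == truth)
})
thresholds[which.max(accuracy.by.threshold)]   # best-accuracy threshold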

However, suppose we don’t need to assign class labels? Suppose we only need the score to sort the datums, hoping to sort most of the items of interest to the top? This could be the case when prioritizing transactions to be investigated as fraudulent. The exact fraud score of a questionable transaction might not matter — only that it’s higher than the score of non-fraudulent events. In this case, would enrichment or class balancing help? I didn’t try it (mostly because I didn’t think of it until halfway through writing this), but I suspect not.

Conclusions

  • Balancing class prevalence before training a classifier does not across-the-board improve classifier performance.
  • In fact, it is contraindicated for logistic regression models.
  • Balancing classes or enriching target class prevalence may improve random forest classifiers.
  • But random forest models may not be the best choice for very unbalanced classes.
  • If target class enrichment is necessary (perhaps because of data scarcity issues), SVM may be the safest choice for modeling.

A knitr document of our experiment, along with the accompanying R markdown file, can be downloaded here, along with a copy of the ISOLET data.

John Chambers Statistical Software Award 2015

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In 1998 John M. Chambers (now a member of R-core) won the ACM Software System Award for the S Language, which (in the words of the committee) “forever altered how people analyze, visualize, and manipulate data”. John graciously donated the prize money to support budding researchers in statistical computing: his Statistical Software Award has been granted annually since 2000.

For the 2015 award, an individual or a team can apply:

Teams of up to 3 people can participate in the competition, with the cash award being split among team members. The travel allowance will be given to just one individual in the team, who will be presented the award at JSM. To be eligible, the team must have designed and implemented a piece of statistical software. The individual within the team indicated to receive the travel allowance must have begun the development while a student, and must either currently be a student, or have completed all requirements for her/his last degree after January 1, 2014.

Most of the previous winners have used R in their submissions. For more details on the award, which carries a cash award of $1000, plus a substantial allowance for travel to the annual Joint Statistical Meetings (JSM), follow the link below. The deadline for submissions has been extended to 5:00pm EST, Friday, March 6, 2015 (the date in the link has not been updated).

John M. Chambers Statistical Software Award 2015: Announcement

RcppArmadillo 0.4.650.1.1 (and also 0.4.650.2.0)

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new Armadillo release 4.650.0 was released by Conrad a few days ago. Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to Matlab.

It turned out that this release had one shortcoming with respect to the C++11 RNG initializations in the R use case (where we need to protect users from the C++98 RNG deemed unsuitable by the CRAN gatekeepers). This led to upstream release 4.650.1, which we wrapped into RcppArmadillo 0.4.650.1.1. As before, this was tested against all 107 reverse dependencies of RcppArmadillo on the CRAN repo.

This version is now on CRAN, and was just uploaded to Debian. Its changes are summarized below based on the NEWS.Rd file.

Changes in RcppArmadillo version 0.4.650.1.1 (2015-02-25)

  • Upgraded to Armadillo release Version 4.650.1 (“Intravenous Caffeine Injector”)

    • added randg() for generating random values from gamma distributions (C++11 only)

    • added .head_rows() and .tail_rows() to submatrix views

    • added .head_cols() and .tail_cols() to submatrix views

    • expanded eigs_sym() to optionally calculate eigenvalues with smallest/largest algebraic values

    • fixes for handling of sparse matrices

  • Applied small correction to main header file to set up C++11 RNG whether or not the alternate RNG (based on R, our default) is used

Now, it turns out that another small fix was needed for the corner case of a submatrix within a submatrix, i.e. V.subvec(1,10).tail(5). I decided not to re-release this to CRAN given the CRAN Repository Policy preference for releases “no more than every 1–2 months”.

But fear not, for we now have drat. I created a drat package repository in the RcppCore account (so as not to put a larger package into my main drat repository, which is often used via a fork to initialize a drat). So now, with these two simple commands

## if needed, first install 'drat' via:   install.packages("drat")
drat:::add("RcppCore")
update.packages()

you will get the newest RcppArmadillo via this drat package repository. And of course install.packages("RcppArmadillo") would also work, but takes longer to type 🙂

Lastly, courtesy of CRANberries, there is also a diffstat report for the most recent CRAN release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Compiling CoffeeScript in R with the js package

By Jeroen Ooms

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

A new release of the js package has made its way to CRAN. This version adds support for compiling CoffeeScript. Along with the uglify and jshint tools already in there, the package now provides a very complete suite for compiling, validating, reformatting, optimizing and analyzing JavaScript code in R.

Coffee Script

According to its website, CoffeeScript is a little language that compiles into JavaScript. It is an attempt to expose the good parts of JavaScript in a simple way. The coffee_compile function binds to the CoffeeScript compiler. A hello world example from the package vignette:

# Hello world
cat(coffee_compile("square = (x) -> x * x"))

This outputs the following JavaScript code:

(function() {
  var square;

  square = function(x) {
    return x * x;
  };

}).call(this);

Or to compile without the closure:

# Hello world
cat(coffee_compile("square = (x) -> x * x", bare = TRUE))
var square;

square = function(x) {
  return x * x;
};

The package vignette includes some more examples.

Why coffee script?

Coffee script is not some sort of widget factory or other “use JavaScript without learning JavaScript” tool kit. From the website:

The golden rule of CoffeeScript is: “It’s just JavaScript”. The code compiles one-to-one into the equivalent JS, and there is no interpretation at runtime. You can use any existing JavaScript library seamlessly from CoffeeScript (and vice-versa). The compiled output is readable and pretty-printed, will work in every JavaScript runtime, and tends to run as fast or faster than the equivalent handwritten JavaScript.

CoffeeScript is popular among web developers for writing JavaScript applications using a syntax that is more readable and less error prone, but without being constrained by some sort of framework. CoffeeScript is often used in conjunction with an HTML templating engine such as jade (see rjade) and a CSS pre-processor such as Less or SASS or Stylus.

Together, these tools are helpful in organizing and maintaining non-trivial web applications. Given the recent mass adoption of HTML/JavaScript based widgets and visualisation in the R community, they can be a valuable addition to the R developer tool kit as well.
