Shiny Developer Conference

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Really enjoying RStudio's Shiny Developer Conference | Stanford University | January 2016.

Winston Chang just demonstrated profvis, and it is really slick. You can profile code just by wrapping it in a profvis({}) block, and the results are exported as interactive HTML widgets.

For example, running the R code below:

if(!('profvis' %in% rownames(installed.packages()))) {
  devtools::install_github('rstudio/profvis')
}
library('profvis')

nrow = 10000
ncol = 1000
data <- as.data.frame(matrix(rnorm(nrow*ncol),
                             nrow=nrow,ncol=ncol))

profvis({
  d <- data
  means <- apply(d, 2, mean)        # column means
  for(i in seq_along(means)) {
    d[[i]] <- d[[i]] - means[[i]]   # centre each column
  }
})

Produces an interactive version of the following profile information:

Definitely check it out!

Many other great presentations, this one is just particularly easy to share.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Pitfall of XML package: issues specific to cp932 locale, Japanese Shift-JIS, on Windows

By tomizono

(This article was first published on R – ЯтомизоnoR, and kindly contributed to R-bloggers)

The CRAN package XML has problems parsing HTML pages encoded in cp932 (Shift-JIS). In this report, I will show these issues and also solutions that are workable on the user side.

I found the issues occur at least on both Windows 7 and 10 with the Japanese language. Though other versions and languages were not checked, the issues may be common on Windows worldwide with non-European multibyte languages encoded in national locales rather than UTF-8.

Versions on my machines:

Windows 7 + R 3.2.3 + XML 3.98-1.3
Mac OS X 10.9.5 + R 3.2.0 + XML 3.98-1.3

Locales:

Windows
> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"
Mac
> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"

1. incident

# Mac
library(XML)
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
t1 <- as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
        skip.rows=2:48, encoding='shift-jis')[1,1]
      )
> t1 # good
[1] "最新の観測情報  (2016年1月17日  8時)"

I wrote the small R script above when I improved my PM 2.5 script in the previous article. It was working on my Mac, but not on the Windows PC at my office.

Of course, a small change was needed for Windows to handle the difference in locales.

# Windows
t2 <- iconv(as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      ), from='utf-8', to='shift-jis')
> t2 # bad
[1] NA

It completely failed.

I found that this problem occurs depending on the text in the HTML. So we must know when and how to avoid the error. This report shows the solutions; technical details will be shown in the next report.

2. solutions

2-1. No-Break Space (U+00A0)

Unicode character No-Break Space (U+00A0):
    \xc2\xa0 in UTF-8
    &nbsp;, &#160; or &#xa0; in HTML

When a Shift-JIS encoded HTML page contains U+00A0 as an HTML entity, such as &nbsp;, the XML package runs into an issue. Strictly speaking, it does not originate from the XML package but from the function iconv: iconv returns NA when it tries to convert U+00A0 into Shift-JIS. But we must be aware of this issue when using the XML package, because HTML pages routinely come with the familiar &nbsp; entity.
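The iconv behaviour can be reproduced in isolation. A minimal illustration (the encoding name 'shift-jis' is assumed to be accepted by your iconv implementation; exact behaviour may vary by platform):

nbsp <- "\u00a0"                               # No-Break Space, U+00A0
iconv(nbsp, from = 'utf-8', to = 'shift-jis')  # returns NA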

A solution is to use the option sub= in the function iconv, which converts unknown characters into a specified string instead of returning NA.

The replacement can be specified as, for example, sub='', sub=' ' or sub='byte'.
# Windows
t3 <- iconv(as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      ), from='utf-8', to='shift-jis', sub=' ')
> t3 # bad
The result is a broken string and not shown.  

This can be a solution to the U+00A0 issue in Shift-JIS encoded pages. But unfortunately, the above t3 still fails, because there is another issue on that HTML page.

2-2. trim

The option trim= is commonly used in the text functions of the XML package, such as readHTMLTable and xmlValue. With trim=TRUE, the node text is returned with whitespace characters such as \t or \r removed from both ends. This option is very useful when handling HTML pages, because they usually have plenty of spaces and line feeds.

But trim=TRUE is not safe when a Shift-JIS encoded HTML page is read on a Windows PC with the Shift-JIS (cp932) locale. This issue is serious: the text string is completely destroyed.

Additionally, we must be aware of the default value of this option; trim=FALSE for xmlValue, and trim=TRUE for readHTMLTable.

A solution is to use trim=FALSE and to remove spaces with function gsub after we get a pure string.

# Windows
t4 <- gsub('\\s', '', iconv(
        readHTMLTable(src, which=4, trim=F, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      , from='utf-8', to='shift-jis', sub=' '))
> t4 # good
[1] "最新の観測情報(2016年1月17日8時)"

The regular expression used with gsub is safe with respect to the platform locale.

More precisely, t4 above is not the same as the result of trim=TRUE: that regular expression removes all spaces in the sentence, although this does not matter in Japanese.

We may want to improve this as:

gsub('(^\\s+)|(\\s+$)', '', x)
# Windows
t5 <- gsub('(^\\s+)|(\\s+$)', '', iconv(
        readHTMLTable(src, which=4, trim=F, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      , from='utf-8', to='shift-jis', sub=' '))
> t5 # very good
[1] "最新の観測情報 (2016年1月17日 8時)"

Finally, both issues are solved, and we have a script that works on Windows.

Strictly speaking, t1 and t5 are different: the spaces in t5 are U+0020, while those in t1 are U+00A0.

To leave a comment for the author, please follow the link and comment on their blog: R – ЯтомизоnoR.


Hillary Clinton’s Biggest 2016 Rival: Herself

By Francis Smart

(This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers)
In a recent post I noted that despite Bernie Sanders doing better in many important indicators, Obama 2008 received 3x more media coverage than Sanders 2016.

Reasonably, a reader of my blog noted that not all coverage was equal, that a presidential hopeful might be happier having no coverage than negative coverage. So I decided to do some textual analysis of the headlines comparing Sanders and Clinton in 2016 and Obama and Clinton in 2008.

I looked at 4200 headlines mentioning either Obama in 2007/08, Sanders in 2015/16, or Clinton in 2007/08 or 2015/16, scraped from major news sources: Google News, Yahoo News, Fox, New York Times, Huffington Post, and NPR (from January 1st, 2007 to January 2008, and January 1st, 2015 to January 2016).

First I constructed word clouds for the Clinton and Sanders race.

Figure 1: Hillary Clinton’s 2015/2016 headline word cloud, excluding “hillary” and “clinton” as terms when constructing the cloud.
Figure 2: Bernie Sanders headline word cloud. Excluding “bernie” when constructing the cloud.
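For readers who want to build similar figures, here is a minimal sketch (not the author's actual code; the 'headlines' vector and the excluded terms are placeholders) using the tm and wordcloud packages:

library(tm)
library(wordcloud)

headlines <- c("Clinton faces new questions over private email server",
               "Clinton foundation donations draw fresh scrutiny")  # hypothetical sample

corpus <- VCorpus(VectorSource(tolower(headlines)))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "hillary", "clinton"))

freq <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus))), decreasing = TRUE)
wordcloud(names(freq), freq, min.freq = 1)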

Comparing Figure 1 and Figure 2, there appear to be some pretty significant differences. First off, the most frequent term in Figure 2 is “Clinton”, followed by a lot of general terms. “Black” appears for the black vote, since there is some concern that Bernie can’t get the black vote, perhaps combined with some high-profile black political activists endorsing him.

Figure 1, though, is a world of difference. Almost every major word is a scandal: email and emails, Benghazi, private server, and foundation. Each references either the email scandal, in which Clinton set up a potentially illegal private server to house her official emails while Secretary of State; Benghazi, the affair in which diplomats died as a result of terrorist action, which many have blamed on Hillary Clinton; or the alleged unethical misuse of Clinton Foundation funds as a slush fund for the Clinton family’s luxurious tastes. Interestingly, “Bruni”, as in Frank Bruni, a New York Times reporter who has taken some heat for his critical reporting on Hillary Clinton, also appears in the cloud.

But is this really so bad? How do these word clouds compare with those of 2007/2008?

Figure 3: The word cloud from 2007/2008 for Hillary Clinton excluding “hillary” and “clinton”.
Figure 4: The word cloud from 2007/2008 for Barack Obama excluding “obama”.

From Figures 2, 3, and 4 we can see a significant and substantive difference from Figure 1. In those figures the most newsworthy thing to report is the rivalry for the primary seat. All other issues are dwarfed. In Figure 1, scandals and criticism of Hillary Clinton abound. Looking at these word clouds, I would suspect that the Clinton camp would be happy to have the news coverage they had in the 2008 campaign rather than the coverage they are currently getting.

But are these frequency word graphs really a reasonable assessment of the media? What of the overall tone of these many articles?
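One common way to score the tone of headlines in R is the NRC emotion lexicon via the syuzhet package. The sketch below illustrates the idea (it is not necessarily the method used for Figure 5, and the 'headlines' vector is a hypothetical sample):

library(syuzhet)

headlines <- c("Clinton email controversy deepens",
               "Sanders rally draws record crowd")   # hypothetical sample

nrc <- get_nrc_sentiment(headlines)  # one row per headline, one column per emotion/polarity
colMeans(nrc)                        # average scores: anger, anticipation, ..., negative, positive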

Figure 5: Sentiment analysis of the news coverage of Clinton 2008 and 2016 and Obama 2008 and Sanders 2016. Scales have been standardized so that a positive rating indicates higher likelihood of emotion being displayed and negative rating indicates lower likelihood of emotion being displayed.

From Figure 5 we can see that headlines mentioning Sanders score the highest on the emotions: anticipation, joy, surprise, trust, and positivism. He also scores the lowest in: anger, fear, sadness, and negativity. While Clinton 2016/2008 score the highest on: anger, disgust, fear, sadness, and negativity and the lowest on: anticipation, joy, trust, and positivism.

Compared with 2008, Clinton 2016 articles appear to have less anger, anticipation, joy, trust, and fear, while also having more disgust, sadness, surprise, and negativism, as well as slightly more positivism. Overall, the prospects as gauged from the emotions engendered by the media appear to be pretty bleak for Hillary Clinton.

It is interesting to note that articles about Sanders score emotionally very similarly in direction to those about Obama, except that Sanders seems to be outperforming Obama with higher anticipation, joy, trust, and positivism, while also performing better by getting lower scores in anger, fear, sadness, and negativism. In only one indicator does Obama do better than Sanders, and that is the emotion disgust. The largest emotional difference between Obama 2008 and Sanders 2016 is that Obama articles scored the lowest on surprise while Sanders articles have scored the highest.

Overall, we must conclude that, at least in terms of the emotional tone of articles if not the volume of coverage, Sanders is doing significantly better than Hillary and even better than Obama was at this point in the 2008 presidential race.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics by Simulation.


Strategies to Speedup R Code

By Selva Prabhakaran


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

The for-loop in R can be very slow in its raw, un-optimised form, especially when dealing with larger data sets. There are a number of ways you can make your logic run fast, but you will be really surprised how fast you can actually go.
This post shows a number of approaches, including simple tweaks to the logic design, parallel processing and Rcpp, increasing the speed by several orders of magnitude, so you can comfortably process data as large as 100 million rows and more.

Let's try to improve the speed of a piece of logic that involves a for-loop and a condition-checking statement (if-else) to create a column that gets appended to the input data frame (df). The code below creates that initial input data frame.

# Create the data frame (four numeric columns; the exact sizes and distributions are illustrative)
n <- 1e5; col1 <- runif(n, 0, 2); col2 <- rnorm(n, 0, 2)
col3 <- rpois(n, 3); col4 <- rbinom(n, 5, 0.5)
df <- data.frame(col1, col2, col3, col4)

The logic we are about to optimise:
For every row of this data frame (df), check whether the sum of all values is greater than 4. If it is, a new fifth variable gets the value “greater_than_4”; otherwise, it gets “lesser_than_4”.

# Original R code: Before vectorization and pre-allocation
system.time({
  for (i in 1:nrow(df)) { # for every row
    if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) { # check if > 4
      df[i, 5] <- "greater_than_4"  # assign the output as 5th column
    } else {
      df[i, 5] <- "lesser_than_4"   # assign the output as 5th column
    }
  }
})

All the processing-time measurements below were done on a Mac OS X machine with a 2.6 GHz processor and 8 GB of RAM.

Vectorise and pre-allocate data structures

Always initialize your data structures and output variable to the required length and data type before taking them into a loop for computations. Try not to incrementally increase the size of your data inside the loop. Let's compare how vectorisation improves speed on a range of data sizes from 1,000 to 100,000 rows.

# after vectorization and pre-allocation
output <- character(nrow(df))  # pre-allocate the output vector to full length
system.time({
  for (i in 1:nrow(df)) {
    if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) {
      output[i] <- "greater_than_4"
    } else output[i] <- "lesser_than_4"
  }
  df$output <- output
})

Raw Code Vs With vectorisation:

Take statements that check for conditions (if statements) outside the loop

With the condition checking taken outside the loop, the speed is compared against the previous version that had vectorisation alone. The tests were done on data sizes ranging from 100,000 to 1,000,000 rows. The gain in speed is again dramatic.

# after vectorization and pre-allocation, taking the condition checking outside the loop
output <- character(nrow(df))
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4  # condition check outside the loop
system.time({
  for (i in 1:nrow(df)) {
    if (condition[i]) {
      output[i] <- "greater_than_4"
    } else output[i] <- "lesser_than_4"
  }
  df$output <- output
})

Condition Checking outside loops:

Run the loop only for True conditions

Another optimisation we can do here is to run the loop only for the cases where the condition is TRUE, by initialising (pre-allocating) the output vector to the default 'FALSE-state' value. The speed improvement here largely depends on the proportion of TRUE cases in your data.

The tests compared the performance of this against the previous case (2) on data sizes ranging from 1,000,000 to 10,000,000 rows. Note that we have added one more '0' here. As expected, there is a consistent and considerable improvement.

output <- rep("lesser_than_4", nrow(df))  # pre-allocate with the 'FALSE-state' default value
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
system.time({
  for (i in (1:nrow(df))[condition]) {  # run loop only for true conditions
    if (condition[i]) {
      output[i] <- "greater_than_4"
    }
  }
  df$output <- output
})

Running Loop Only On True Conditions:

Use ifelse() whenever possible

You can make this logic much simpler and faster by using the ifelse() statement. The syntax is similar to the if function in MS Excel, but the speed increase is phenomenal, especially considering that there is no vector pre-allocation here and the condition is checked in every case. Looks like this is going to be a highly preferred option to speed up simple loops.

system.time({
  output <- ifelse((df$col1 + df$col2 + df$col3 + df$col4) > 4,
                   "greater_than_4", "lesser_than_4")
  df$output <- output
})

True conditions only vs ifelse:

Using which()

By using the which() command to select the rows, we are able to achieve about one-third the speed of Rcpp.

# Thanks to Gabe Becker
system.time({
  want = which(rowSums(df) > 4)
  output = rep("less than 4", times = nrow(df))
  output[want] = "greater than 4"
}) 
# nrow = 3 Million rows (approx)
   user  system elapsed 
  0.396   0.074   0.481 

Use apply family of functions instead of for-loops

Here we use the apply() function to compute the same logic and compare it against the vectorised for-loop. The result is again faster than the raw loop by orders of magnitude, but slower than ifelse() and the version where the condition checking was done outside the loop. This can be very useful, but you will need to be a bit crafty when handling complex logic.

# apply family
system.time({
  myfunc <- function(x) {
    if (sum(x) > 4) {
      "greater_than_4"
    } else {
      "lesser_than_4"
    }
  }
  output <- apply(df[, c(1:4)], 1, FUN = myfunc)  # apply myfunc to each row
  df$output <- output
})

apply function Vs For loop in R:

Use byte-code compiled functions via cmpfun() from the compiler package, rather than the plain function itself

This may not be the best example to illustrate the effectiveness of byte code compilation, as the time taken is marginally higher than the regular form. However, for more complex functions, byte-code compilation is known to perform faster. So you should definitely give it a shot.

# byte code compilation
library(compiler)
myFuncCmp <- cmpfun(myfunc)  # byte-compile the function used with apply() above
system.time(df$output <- apply(df[, c(1:4)], 1, FUN = myFuncCmp))

apply vs for-loop vs byte code compiled functions:

Use Rcpp

Let's turn this up a notch. So far we have gained speed and capacity through various strategies and found the most optimal one to be the ifelse() statement. What if we add one more zero? Below we execute the same logic, but with Rcpp and with the data size increased to 100 million rows. We will compare the speed of Rcpp to the ifelse() method.

library(Rcpp)
sourceCpp("MyFunc.cpp")
system.time(output <- myFunc(df))  # call the compiled C++ function on the data frame

Below is the same logic executed in C++ code using the Rcpp package. Save the code below as “MyFunc.cpp” in your R session's working directory (otherwise you just have to sourceCpp() from the full file path). Note: the // [[Rcpp::export]] comment is mandatory and has to be placed just before the function that you want to execute from R.

// Source for MyFunc.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterVector myFunc(DataFrame x) {
  NumericVector col1 = as<NumericVector>(x["col1"]);
  NumericVector col2 = as<NumericVector>(x["col2"]);
  NumericVector col3 = as<NumericVector>(x["col3"]);
  NumericVector col4 = as<NumericVector>(x["col4"]);
  int n = col1.size();
  CharacterVector out(n);
  for (int i = 0; i < n; i++) {
    if (col1[i] + col2[i] + col3[i] + col4[i] > 4) {
      out[i] = "greater_than_4";
    } else {
      out[i] = "lesser_than_4";
    }
  }
  return out;
}

Rcpp speed performance against ifelse:

Use parallel processing if you have a multicore machine

Parallel processing:

# parallel processing
library(foreach)
library(doSNOW)
cl <- makeCluster(4, type = "SOCK")  # e.g. 4 workers; match this to your core count
registerDoSNOW(cl)
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
# parallelization with vectorization
system.time({
  output <- foreach(i = 1:nrow(df), .combine = c) %dopar% {
    if (condition[i]) "greater_than_4" else "lesser_than_4"
  }
})
stopCluster(cl)
df$output <- output

Remove variables and flush memory as early as possible

Remove objects with rm() as soon as they are no longer needed, especially before going into lengthy loop operations. Sometimes, calling gc() at the end of each iteration within the loops can help.
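A small sketch of the idea (the object names are illustrative):

big_tmp <- matrix(rnorm(1e7), ncol = 100)  # large intermediate object
col_means <- colMeans(big_tmp)             # keep only the summary we actually need
rm(big_tmp)                                # release the large object as early as possible
gc()                                       # optionally trigger garbage collection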

Use data structures that consume lesser memory

The data.table package is an excellent example, as it reduces memory overhead, which helps to speed up operations like merging data.

library(data.table)
dt <- data.table(df)  # create a data.table from the data frame
system.time({
  for (i in 1:nrow(dt)) {
    if (dt[i, col1] + dt[i, col2] + dt[i, col3] + dt[i, col4] > 4) {
      dt[i, col5:="greater_than_4"]  # assign the output as 5th column
    } else {
      dt[i, col5:="lesser_than_4"]  # assign the output as 5th column
    }
  }
})

Dataframe Vs Data.Table:

Speed Summary

Method: speedup, nrow(df)/time_taken = n rows per second
  • Raw: 1X, 120000/140.15 = 856.2255 rows per second (normalised to 1)
  • Vectorised: 738X, 120000/0.19 = 631578.9 rows per second
  • True conditions only: 1002X, 120000/0.14 = 857142.9 rows per second
  • ifelse: 1752X, 1200000/0.78 = 1500000 rows per second
  • which: 8806X, 2985984/0.396 = 7540364 rows per second
  • Rcpp: 13476X, 1200000/0.09 = 11538462 rows per second

The numbers above are approximate and are based on arbitrary runs. The results are not calculated for the data.table, byte-code compilation and parallelisation methods, as they will vary on a case-by-case basis depending upon how you apply them.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


The R-Podcast Episode 16: Interview with Dean Attali

By Eric

(This article was first published on The R-Podcast (Podcast), and kindly contributed to R-bloggers)

Direct from the first-ever Shiny Developer Conference, here is episode 16 of the R-Podcast! In this episode I sit down with Dean Attali for an engaging conversation about his journey to using R, his motivation for creating the innovative shinyjs package, and his perspective on teaching others about R through his support of the highly-praised Stats 545 course at UBC. In addition, you’ll hear about how his previous work prepared him well for using R, his collaboration with the RStudio team, and much more. I hope you enjoy this episode, and thanks for listening!

Direct Download: [mp3 format] [ogg format]

Episode 16 Show Notes

Dean Attali (@daattali)

Package Pick

Feedback

  • Leave a comment on this episode’s post
  • Email the show: thercast[at]gmail.com
  • Use the R-Podcast contact page
  • Leave a voicemail at +1-269-849-9780

Music Credits

To leave a comment for the author, please follow the link and comment on their blog: The R-Podcast (Podcast).


The correlation between original and replication effect sizes might be spurious

By Daniel Lakens

(This article was first published on The 20% Statistician, and kindly contributed to R-bloggers)

In the reproducibility project, original effect sizes correlated r=0.51 with the effect sizes of replications. Some researchers find this hopeful.

Less-popularised findings from the “estimating the reproducibility” paper @Eli_Finkel #SPSP2016 pic.twitter.com/8CFJMbRhi8

— Jessie Sun (@JessieSunPsych) January 28, 2016

I don’t think we should be interpreting this correlation at all, because it might very well be completely spurious. One important reason why correlations might be spurious is the presence of different subgroups, as introductory statistics textbooks explain.
When we consider the Reproducibility Project (note: I’m a co-author of the paper) we can assume there are two subsets, one subgroup consisting of experiments that examine true effects, and one subgroup consisting of experiments that examine effects that are not true. This logically implies that for one subgroup, the true effect size is 0, while for the other, the true effect size is an unknown larger value. Different means in subgroups is a classic case where spurious correlations can emerge.
I find the best way to learn to understand statistics is through simulations. So let’s simulate 100 normally distributed effect sizes from original studies that are comparable to the 100 studies included in the Reproducibility Project, and 100 effect sizes for their replications, and correlate these. We create two subgroups. Forty effect sizes will have true effects (e.g., d = 0.4). The original and replication effect sizes will be correlated (e.g., r = 0.5). Sixty of the effect sizes will have an effect size of d = 0, and a correlation between replication and original studies of r = 0. I’m not suggesting this reflects the truth of the studies in the Reproducibility Project – there’s no way to know. The parameters look sort of reasonable to me, but feel free to explore different choices for parameters by running the code yourself.
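A minimal sketch of such a simulation (not the author's exact code; the spread of 0.2 around each true effect size is an assumption) could look like this:

library(MASS)
set.seed(1)
n_true <- 40; n_null <- 60; sd_d <- 0.2   # sd_d: assumed sampling spread of observed effect sizes

# correlated original/replication effect sizes for studies with a true effect (d = 0.4, r = 0.5)
true_pairs <- mvrnorm(n_true, mu = c(0.4, 0.4),
                      Sigma = sd_d^2 * matrix(c(1, 0.5, 0.5, 1), 2))
# uncorrelated effect sizes around d = 0 for studies without a true effect
null_pairs <- cbind(rnorm(n_null, 0, sd_d), rnorm(n_null, 0, sd_d))

d <- rbind(true_pairs, null_pairs)
cor(d[, 1], d[, 2])   # pooled correlation, driven largely by the difference in subgroup means
plot(d, xlab = "original effect size", ylab = "replication effect size")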

As you see, the pattern is perfectly expected, under reasonable assumptions, when 60% of the studies are simulated to have no true effect. With a small N (100 studies gives a pretty unreliable correlation; see for yourself by running the code a few times) the spuriousness of the correlation might not be clear. So let’s simulate 100 times more studies.

Now, the spuriousness becomes clear. The two groups differ in their means, and if we calculate the correlation over the entire sample, the r = 0.51 we get is not very meaningful (I cut off original studies at d = 0, to simulate publication bias and make the graph more similar to Figure 1 in the paper, but it doesn’t matter for the current point).
So: be careful interpreting correlations when there are different subgroups. There’s no way to know what is going on. The correlation of 0.51 between effect sizes in original and replication studies might not mean anything.

To leave a comment for the author, please follow the link and comment on their blog: The 20% Statistician.


New Yorkers, municipal bikes, and the weather

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Like many modern cities, New York offers a public pick-up/drop-off bicycle service (called Citi Bike). Subscribing Citi Bike members can grab a bike from almost 500 stations scattered around the city, hop on and ride to their destination, and drop the bike at a nearby station. (Visitors to the city can also purchase day passes.) The Citi Bike program shares data with the public about the operation of the service: time and location of pick-ups and drop-offs, and basic demographic data (age and gender) of subscriber riders.

Data scientist Todd Schneider has followed up on his tour-de-force analysis of Taxi Rides in NYC with a similar analysis of the Citi Bike data. Check out the wonderful animation of bike rides on September 16 below. While the Citi Bike data doesn’t include actual trajectories (just the pick-up and drop-off locations), Todd has “interpolated” these points using Google Maps biking directions. Though these may not match actual routes (and give extra weight to roads with bike lanes), it’s nonetheless an elegant visualization of bike commuter patterns in the city.

Check out in particular the rush hours of 7-9AM and 4-6PM. September 16 was a Wednesday, but as Todd shows in the chart below, biking patterns are very different on the weekends as the focus switches from commuting to pleasure rides.

Todd also matched the biking data with NYC weather data to take a look at its effect on biking patterns. Unsurprisingly, low temperatures and rain both have a dampening effect (pun intended!) on ridership: one inch of rain deters as many riders as a 24-degree (F) drop in temperature. Surprisingly, snow doesn’t have such a dramatic effect: an inch of snow depresses ridership like a 1.4-degree drop in temperature. (However, Todd’s data doesn’t include the recent blizzard in New York, from which many Citi Bike stations are still waiting to be dug out.)

Todd conducted all of the analysis and data visualization with the R language (he shares the R code on Github). He mainly used the RPostgreSQL package for data extraction, the dplyr package for data manipulation, the ggplot2 package for graphics, and the minpack.lm package for the nonlinear least squares analysis of the weather impact.

There’s plenty more detail to the analysis, including the effects of age and gender on cycling speed. For the complete analysis and lots more interesting charts, follow the link to the blog post below.

Todd W. Schneider: A Tale of Twenty-Two Million Citi Bikes: Analyzing the NYC Bike Share System

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


2016 Prior Exposure Bayesian Data Analysis workshops for social scientists

By Thom Baguley

(This article was first published on Psychological Statistics, and kindly contributed to R-bloggers)

Mark Andrews and I launched our Prior Exposure Bayesian Data Analysis workshop series last year and are pleased to announce that bookings for the 2016 workshops 1 and 2 are now open. This is part of the ESRC Advanced Training Initiative.

Further details including booking links and details of bursaries for UK PhD students are available here. The dates are 31 March 2016 and 1 April 2016 and the workshops will be running at Nottingham Trent University (Nottingham, UK).

The first two workshops still have places free (though places are filling up quite fast). They are primarily (but not exclusively) aimed at UK social science PhD students. Last year’s workshops were attended by students from criminology, politics, demography, psychology, neuroscience, education and many other disciplines. We hope the workshops will also appeal to early career researchers and others doing quantitative social science research (but with little or no Bayesian experience).

The ESRC is supporting us with bursary funding for travel and subsistence (see the web site for details). These are open to all UK social science PhD students (not just those with ESRC funding), but funded places are limited.

We will run similar workshops in 2017 and hope to offer additional training opportunities beyond that (although the ESRC funding will end at that point).

In a change from last year we are also putting on an optional one-day R workshop before the workshops. Please email Mark Andrews or myself if you are interested in attending this (but priority will be given to students registered on one or both workshops).

P.S. The registration cost for each workshop is £10 (for postgrads) and £20 (for others) – the information is buried in the booking link but we’ll try and make that clearer … The workshops are non-profit, so this fee is to cover basic running costs (e.g., lunch), and we will try to keep costs low for subsequent workshops.

To leave a comment for the author, please follow the link and comment on their blog: Psychological Statistics.


Cricket analytics with cricketr!!!

By Tinniam V Ganesh


(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

My ebook “Cricket analytics with cricketr” has been published on Leanpub. You can now download the book (hot off the press!) in all formats to your favorite device (mobile, iPad, tablet, Kindle) from the link “Cricket analytics with cricketr”. The book has been published in the following formats:

  • PDF (for your computer)
  • EPUB (for iPad or tablets. Save the file cricketr.epub to Google Drive/Dropbox and choose “Open in” iBooks for iPad)
  • MOBI (for Kindle. For this format, I suggest that you download & install SendToKindle for PC/Mac. You can then right click the downloaded cricketr.mobi and choose SendToKindle. You will need to login to your Kindle account)

Leanpub uses a variable pricing model. I have priced the book attractively (I think!) at $4.99, with a minimum price of $0.00 (FREE!!!). Do download the book, and I hope you have many happy hours reading it.

I am including my preface in the book below

Preface
Cricket has been the “national passion” of India for decades. As a boy, I too was held in thrall by a strong cricketing passion, like many others. Cricket is a truly fascinating game! I would catch the sporting action with my friends as we crowded around a transistor that brought us live, breathless radio commentary. We also spent many hours glued to live cricket action on the early black-and-white TVs. This used to be an experience of sorts, as every now and then a part of the players’ bodies would detach itself and stretch to the sides. But it was enjoyable all the same.

Nowadays broadcast technology has improved so much that we get detailed visual analysis of how each bowler varies the swing and length of the delivery. We are also able to see the strokes of batsmen in slow motion. Similarly, computing technology has advanced by leaps and bounds, and we can analyze players in great detail with a few lines of code in languages like R, Python etc.

In 2015, I completed Machine Learning from Stanford at Coursera. I was looking around for data to play around with when it suddenly struck me that I could do some regression analysis of batting records. In the subsequent months, I took the Data Science Specialization from Johns Hopkins University, which triggered more ideas. One thing led to another and I managed to put together an R package called ‘cricketr’. I developed this package over 7 months, adding and refining functions. Finally, I managed to submit the package to CRAN. During the development of the package for the different formats of the game, I wrote a series of posts on my blog.

This book is a collection of those cricket related posts. There are 6 posts based on my R package cricketr. I have also included 2 earlier posts based on R which I wrote before I created my R package. Finally, I also include another 2 cricket posts based on Machine Learning in which I used the language Octave.

My ‘cricketr’ package is a first for cricket analytics, howzzat! And I am certain that it won’t be the last. Cricket is a wonderful pitch for statisticians, data scientists and machine learning experts. So you can expect some cool packages in the years to come.

I had a great time developing the package. I hope you have a wonderful time reading this book. Do remember to download it from “Cricket analytics with cricketr”.

Feel free to get in touch with me anytime through email included below

Tinniam V Ganesh
tvganesh.85@gmail.com
January 28, 2016

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….


FQDN (Fully Qualified Domain Names) in R

By Mango Blogger

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Steph Locke

Get the fully qualified domain name for your machine

This is just a quick post to mention how you can get your computer name with the domain it is registered in, i.e. the fully qualified domain name (FQDN), by using R.

Base R

On Windows, to get the computer name in its fully qualified form, you need to do:

paste(Sys.getenv("COMPUTERNAME"),
  Sys.getenv("USERDNSDOMAIN"),
  sep=".")
## [1] "SLOCKE.MANGO.LOCAL"

In Linux, you can use:

Sys.getenv("HOSTNAME")
## [1] ""

Each of these returns an empty string on the other OS, so you can concatenate them to get the FQDN.

getFQDN <- function() {
  fqdn <- paste0(paste(Sys.getenv("COMPUTERNAME"),
                       Sys.getenv("USERDNSDOMAIN"),
                       sep = "."),
                 Sys.getenv("HOSTNAME"))
  # On Linux, COMPUTERNAME and USERDNSDOMAIN are empty, leaving a stray leading "."
  tolower(sub("^\\.", "", fqdn))
}
getFQDN()
## [1] "slocke.mango.local"

Using iptools

Alternatively, we could use the iptools package (available from CRAN), which now also works on Windows.

library(iptools)
ip_to_hostname("127.0.0.1")
## [[1]]
## [1] "slocke.Mango.local"

For more info on iptools, check out the GitHub repo.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.
