14 (new) R jobs from around the world (for 2015-11-30)

By Tal Galili

r_jobs

This is the bi-monthly R-bloggers post (for 2015-11-30) for new R Jobs.

To post your R job on the next post

Just visit this link and post a new R job to the R community (it’s free and quick).

New R jobs

  1. Freelance
    Develop a small Shiny App
    Global Sourcing Group – Posted by Sudhir
    Machanaikanahalli
    Karnataka, India
    17 Nov 2015
  2. Full-Time
    Application Developer @ Boulder, Colorado, United States
    The Cadmus Group – Posted by sonia.brightman
    Boulder
    Colorado, United States
    17 Nov 2015
  3. Full-Time
    Data Scientist @ New York (> $100K/year)
    Cornerstone Research – Posted by mdecesar
    New York
    New York, United States
    20 Nov 2015
  4. Full-Time
    Statistician – Health Economics Outcomes Research @ Utrecht, Netherlands
    Mapi – Posted by andreas.karabis
    Utrecht
    Utrecht, Netherlands
    30 Nov 2015
  5. Full-Time
    Insights Analyst @ Auckland, New Zealand
    Experian – Posted by CaroleDuncan
    Auckland
    Auckland, New Zealand
    30 Nov 2015
  6. Freelance
    Install test + production server + development of Web API for existing R Package
    cure-alot – Posted by cure-alot
    Anywhere
    28 Nov 2015
  7. Internship
    Content Development Intern @ Cambridge, United States ($15/hour)
    DataCamp – Posted by nickc123
    Cambridge
    Massachusetts, United States
    26 Nov 2015
  8. Full-Time
    Data Scientist @ Yakum, Israel
    Intel – Posted by omrimendels
    Yakum
    Center District, Israel
    26 Nov 2015
  9. Full-Time
    Junior Data Scientist (R Focus) @ San Mateo, California, United States
    Scientific Revenue – Posted by wgrosso
    San Mateo
    California, United States
    24 Nov 2015
  10. Internship
    Trainee Data Analytics & Testing @ Unterföhring, Bayern, Germany
    ProSiebenSat.1 Digital GmbH – Posted by meinhardploner
    Unterföhring
    Bayern, Germany
    24 Nov 2015
  11. Full-Time
    Tenure-track Assistant Professor in Computational Biology @ Portland, Oregon, United States
    Oregon Health & Science University – Posted by takabaya
    Portland
    Oregon, United States
    23 Nov 2015
  12. Full-Time
    Junior Consumer Insights Analyst @ Düsseldorf, Nordrhein-Westfalen, Germany
    trivago GmbH – Posted by trivago GmbH
    Düsseldorf
    Nordrhein-Westfalen, Germany
    23 Nov 2015
  13. Internship
    SHRM Temporary – Certification @ Alexandria, Virginia, United States
    Society for Human Resource Management – Posted by LizPS
    Alexandria
    Virginia, United States
    18 Nov 2015
  14. Full-Time
    Senior Analyst @ London, England, United Kingdom
    Bupa – Posted by MCarolan
    London
    England, United Kingdom
    18 Nov 2015

Job seekers: please follow the links above to learn more and apply for your job of interest.

(In R-users.com you may see all the R jobs that are currently available)

(you may also look at previous R jobs posts).

Source:: R News

Emojis in ggplot graphics

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R user David Lawrence Miller has created an extension for R’s ggplot2 package that allows you to use emojis as plotting symbols. The emoGG package (currently only available on github) adds the geom_emoji geom to ggplot2, which uses an emoji code to identify the plotting symbol. For example:

library(ggplot2); library(emoGG)  # emoGG installs from GitHub (dill/emoGG)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_emoji(emoji = "1f337")

(You can look up emoji symbol codes using the emoji_search function.) The package is still in its early days and has some rough edges: legends aren’t yet supported, and as you can see in the chart above, varying emoji color by a data variable doesn’t work. You can, however, create charts including multiple emojis, as in this chart of the mtcars dataset using different symbols for manual and automatic transmission cars:
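As a rough sketch of that kind of multi-emoji plot (not the author's exact code; the emoji codes below are illustrative picks of the kind emoji_search would return), one geom_emoji layer per subset does the trick:

# illustrative sketch, assuming emoGG is installed from GitHub (dill/emoGG)
library(ggplot2)
library(emoGG)

ggplot(mtcars, aes(wt, mpg)) +
  geom_emoji(data = subset(mtcars, am == 0), emoji = "1f697") + # automatic transmission
  geom_emoji(data = subset(mtcars, am == 1), emoji = "1f3c1")   # manual transmission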

You can find more information about the emoGG package, including details on how to install it, at its github repository linked below.

github (dill): emoGG

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

laptop-friendly analysis of the census of 82 countries with r and monetdb

By Anthony Damico

(This article was first published on asdfree, and kindly contributed to R-bloggers)

the integrated public use microdata series international (ipumsi) has been my white whale since i started in survey research. non-demographers, perhaps think of this repository as a matryoshka varanasi-kaaba-ark of the covenant: nothing compares. the minnesota population center amassed half a billion person-level records from national statistics offices across the globe. it’s all free and ready for download, so long as you have a project idea and an institutional affiliation. so my turn to talk? because now the software needed for analysis is free as well, and markedly superior to anything that’s available for purchase. 277 censuses later, roll credits. these tutorials maniacally document every step necessary to get started:

click here to get started working with ipums international

notes: unless you plan to make severe edits to my example code, individual extracts must contain a single year and a single country and be formatted as a csv. the actual extract link can simply be copied and pasted into your r script from the url highlighted in the screenshot below. each extract should include the variables “serial”, “strata”, and “perwt” if you plan on calculating statistics to be shared anywhere beyond fingerpainting class. these census files cannot be treated as simple random samples; those three columns contain the information necessary for my scripts to handle everything correctly.
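to illustrate why those three columns matter, here is a minimal sketch (not the full scripts linked above; the csv file name and the analysis variable are hypothetical) of declaring the complex design with the survey package:

# illustrative sketch only: the file name and the 'age' column are hypothetical,
# but serial, strata, and perwt are the design columns described above
library(survey)

extract <- read.csv("ipumsi_extract_onecountry_oneyear.csv")

# declare the complex sample design so estimates and standard errors are not
# computed as if this were a simple random sample
des <- svydesign(ids = ~serial, strata = ~strata, weights = ~perwt,
                 data = extract, nest = TRUE)

svymean(~age, des, na.rm = TRUE)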

confidential to sas, spss, stata, and sudaan users: neil armstrong would give pogo sticks the same look i’m giving your softwares right now. time to reserve your spot on apollo eleven. time to transition to r.

To leave a comment for the author, please follow the link and comment on their blog: asdfree.


Source:: R News

Estimating mixed graphical models

By Jonas Haslbeck – r


(This article was first published on Jonas Haslbeck – r, and kindly contributed to R-bloggers)

Determining conditional independence relationships through undirected graphical models is a key component in the statistical analysis of complex observational data in a wide variety of disciplines. In many situations one seeks to estimate the underlying graphical model of a dataset that includes variables of different domains.

As an example, take a typical dataset in the social, behavioral and medical sciences, where one is interested in interactions, for example between gender or country (categorical), frequencies of behaviors or experiences (count) and the dose of a drug (continuous). Other examples are Internet-scale marketing data or high-throughput sequencing data.

There are methods available to estimate mixed graphical models from mixed continuous data; however, these usually have two drawbacks: first, possible information loss due to necessary transformations, and second, the inability to incorporate (nominal) categorical variables (for an overview see here). A new method implemented in the R package mgm addresses these limitations.

In the following, we use the mgm package to estimate the conditional independence network in a dataset of questionnaire responses from individuals diagnosed with Autism Spectrum Disorder. This dataset includes variables of different domains, such as age (continuous), type of housing (categorical) and number of treatments (count).

The dataset consists of the responses of 3521 individuals to a questionnaire comprising 28 variables from the continuous, count and categorical domains.

dim(data)
## [1] 3521   28

round(data[1:4, 1:5],2)
##      sex IQ agediagnosis opennessdiagwp successself
## [1,]   1  6            0              1        1.92
## [2,]   2  6            7              1        5.40
## [3,]   1  5            4              2        5.66
## [4,]   1  6            8              1        8.00

We now use our knowledge about the variables to specify the domain (or type) of each variable and the number of categories for categorical variables (for non-categorical variables we choose 1). "c", "g" and "p" stand for categorical, Gaussian and Poisson (count), respectively.

type <- c("c", "g", "g", "c", "c", "g", "c", "c", "p", "p",
          "p", "p", "p", "p", "c", "p", "c", "g", "p", "p",
          "p", "p", "g", "g", "g", "g", "g", "g", "c", "c",
          "g")

cat <- c(2, 1, 1, 3, 2, 1, 5, 3, 1, 1, 1, 1, 1, 1, 2, 1, 4,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1)

The estimation algorithm requires us to make an assumption about the highest order interaction in the true graph. Here we assume that there are at most pairwise interactions in the true graph and set d = 2. The algorithm includes an L1-penalty to obtain a sparse estimate. We can select the regularization parameter lambda using cross validation (CV) or the Extended Bayesian Information Criterion (EBIC). Here, we choose the EBIC, which is known to be a bit more conservative than CV but is computationally faster.

library(mgm)

fit <- mgmfit(data, type, cat, lambda.sel="EBIC", d=2)

The fit function returns all estimated parameters and a weighted and unweighted (binarized) adjacency matrix. Here we use the qgraph package to visualize the graph:

# group variables
group_list <- list("Demographics"=c(1,14,15,28), 
                "Psychological"=c(2,4,5,6,18,20,21),
                "Social environment" = c(7,16,17,19,26,27),
                "Medical"=c(3,8,9,10,11,12,13,22,23,24,25))

# define nice colors
group_cols <- c("#E35959","#8FC45A","#4B71B3","#E8ED61")

# plot
library(qgraph)
qgraph(fit$adj, 
       vsize=3, layout="spring", 
       edge.color = rgb(33, 33, 33, 100, maxColorValue = 255),
       color=group_cols,
       border.width=1.5,
       border.color="black",
       groups=group_list,
       nodeNames=datalist$colnames,
       legend=TRUE, 
       legend.mode="groups",
       legend.cex=.75)

A reproducible example can be found in the examples of the package, and it is explained in more detail in the corresponding paper. Here is a paper explaining the theory behind the implemented algorithm.

Computationally efficient methods for Gaussian data are implemented in the huge package and the glasso package. For binary data, there is the IsingFit package.
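For purely continuous (Gaussian) data, a minimal sketch of the huge workflow might look like this (toy simulated data, not the dataset used above):

# minimal sketch on simulated Gaussian data
library(huge)

x <- matrix(rnorm(200 * 10), nrow = 200)     # toy data: 200 cases, 10 variables
est <- huge(x, method = "glasso")            # graphical lasso solution path
sel <- huge.select(est, criterion = "ebic")  # EBIC-based model selection
sel$refit                                    # adjacency matrix of the selected graph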

Great free resources about graphical models are Chapter 17 in the freely available book The Elements of Statistical Learning and the Coursera course Probabilistic Graphical Models.

To leave a comment for the author, please follow the link and comment on their blog: Jonas Haslbeck – r.


Source:: R News

Interactive association rules exploration app

By Andrew Brooks – R


(This article was first published on Andrew Brooks – R, and kindly contributed to R-bloggers)

In a previous post, I wrote about what I use association rules for and mentioned a Shiny application I developed to explore and visualize rules. This post is about that app. The app is mainly a wrapper around the arules and arulesViz packages developed by Michael Hahsler.

Features

  • train association rules
    • interactively adjust confidence and support parameters
    • sort rules
    • sample just top rules to prevent crashes
    • post process rules by subsetting LHS or RHS to just variables/items of interest
    • suite of interest measures
  • visualize association rules
    • grouped plot, matrix plot, graph, scatterplot, parallel coordinates, item frequency
  • export association rules to CSV

How to get

Option 1: Copy the code from the arules_app.R gist

Option 2: Source the gist directly.

library('devtools')
library('shiny')
library('arules')
library('arulesViz')
source_gist(id='706a28f832a33e90283b')

Option 3: Download the Rsenal package (my personal R package with a hodgepodge of data science tools) and use the arulesApp function:

library('devtools')
install_github('brooksandrew/Rsenal')
library('Rsenal')
?Rsenal::arulesApp

How to use

arulesApp is intended to be called from the R console for interactive and exploratory use. It calls shinyApp which spins up a Shiny app without the overhead of having to worry about placing server.R and ui.R. Calling a Shiny app with a function also has the benefit of smooth passing of parameters and data objects as arguments. More on shinyApp here.
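A generic sketch of this pattern (not the actual arulesApp source) looks roughly like this:

# generic illustration: wrapping shinyApp() in a function so data and
# parameters can be passed as ordinary arguments from the console
library(shiny)

tableApp <- function(dataset, n = 10) {
  shinyApp(
    ui = fluidPage(tableOutput("head")),
    server = function(input, output) {
      output$head <- renderTable(head(dataset, n))
    }
  )
}

# tableApp(mtcars, n = 5)  # launches an app showing the first 5 rows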

arulesApp is currently highly exploratory (and highly unoptimized). Therefore it works best for quickly iterating on rule training and visualization with low-medium sized datasets. Check out Michael Hahsler’s arulesViz paper for a thorough description of how to interpret the visualizations. There is a particularly useful table on page 24 which compares and summarizes the visualization techniques.

Simply call arulesApp from the console with a data.frame or transaction set from which rules will be mined:

library('arules')  # contains the Adult and AdultUCI datasets

data('Adult') # transaction set
arulesApp(Adult, vars=40)

data('AdultUCI') # data.frame
arulesApp(AdultUCI)

Here are the arguments (a combined example call follows the list):

  • dataset data.frame, this is the dataset from which association rules will be mined. Each row is treated as a transaction. It seems to work OK when the S4 transactions class from arules is used; however, this is not thoroughly tested.
  • bin logical, TRUE will automatically discretize/bin numerical data into categorical features that can be used for association analysis.
  • vars integer, how many variables to include in initial rule mining
  • supp numeric, the support parameter for initializing visualization. Useful when it is known that a high support is needed to not crash computationally.
  • conf numeric, the confidence parameter for initializing visualization. Similarly useful when it is known that a high confidence is needed to not crash computationally.
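Putting those arguments together, a call might look like this (the support and confidence values are arbitrary examples, not recommendations):

arulesApp(AdultUCI, bin = TRUE, vars = 30, supp = 0.05, conf = 0.5)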

Screenshots

Association rules list view

Scatterplot


Graph


Grouped Plot


Parallel Coordinates


Matrix


Item frequency


Code

The full source for the app is in the arules_app.R gist linked above.

To leave a comment for the author, please follow the link and comment on their blog: Andrew Brooks – R.


Source:: R News

The hidden benefits of open-source software

By Rob J Hyndman

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

I’ve been having discussions with colleagues and university administration about the best way for universities to manage home-grown software.

The traditional business model for software is that we build software and sell it to everyone willing to pay. Very often, that leads to a software company spin-off that has little or nothing to do with the university that nurtured the development. Think MATLAB, S-Plus, Minitab, SAS and SPSS, all of which grew out of universities or research institutions. This model has repeatedly been shown to stifle research development, channel funds away from the institutions where the software was born, and add to research costs for everyone.

I argue that the open-source model is a much better approach both for research development and for university funding. Under the open-source model, we build software, and make it available for anyone to use and adapt under an appropriate licence. This approach has many benefits that are not always appreciated by university administrators.

  1. It leads to a far greater impact on international practice than anything else you can do to promote your new methodology or algorithms. Surely this is something we want to do as university-based researchers. My forecasting algorithms have had a big impact, not because I wrote a few papers in statistical journals, but because I also wrote some R packages that allow anyone to try out my methods on their own data.
  2. As a result of other researchers being able to implement your ideas easily, your work gets cited much more frequently. Anyone who uses your open source software is obliged to cite the software product and the underlying research papers that describe the methods and algorithms. Citations are used as a crude measure of prestige within universities. If you want to get promoted, having lots of citations helps. Citations also feed into university rankings.
  3. Instead of charging for the software, you can charge for a consulting service to help people use, modify and integrate your software into other systems. In my own group, this approach helps fund two post-docs and several research assistants. I think we easily generate more dollars in consulting income every year than we would ever get from software sales, while simultaneously changing the way forecasting is conceived and implemented all over the world.
  4. More skilled jobs are created as we need consultants to undertake new projects as the service grows. We are training people to tackle and solve difficult data science problems. In contrast, commercial software vendors employ people to generate sales.
  5. The consulting projects lead to new research ideas and new software tools that help fuel the ongoing research enterprise. A condition of all my consulting projects is that any new ideas and software that arise as a result can be published.
  6. This approach makes the research we do more useful. We tackle research problems that are motivated by consulting projects, and will therefore tend to be more relevant and applicable than if we just did research in isolation.
  7. When a large number of researchers follow this model, we have wonderful repositories of open-source software such as CRAN. This reduces research costs, and allows quick implementation and adaption of other people’s research ideas.

To leave a comment for the author, please follow the link and comment on their blog: Hyndsight » R.


Source:: R News

gtrends 1.3.0 now on CRAN: Google Trends in R

By Thinking inside the box

Example of gtrendsR query and plot

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Sometime earlier last year, I started to help Philippe Massicotte with his gtrendsR package—which was then still “hiding” in relative obscurity on BitBucket. I was able to assist with a few things related to internal data handling as well as package setup and package builds–but the package is really largely Philippe’s. But then we both got busy, and it wasn’t until this summer at the excellent useR! 2015 conference that we met and concluded that we really should finish the package. And we both remained busy…

Lo and behold, following a recent transfer to this GitHub repository, we finalised a number of outstanding issues. And Philippe was even kind enough to label me a co-author. And now the package is on CRAN as of yesterday. So install.packages("gtrendsR") away and enjoy!

Here is a quick demo:

## load the package, and if options() are set appropriately, connect
## alternatively, also run   gconnect("someuser", "somepassword")
library(gtrendsR)

## using the default connection, run a query for three terms
res <- gtrends(c("nhl", "nba", "nfl"))

## plot (in default mode) as time series
plot(res)

## plot via googleVis to browser
## highlighting regions (probably countries) and cities
plot(res, type = "region")
plot(res, type = "cities")

The time series (default) plot for this query came out as follows a couple of days ago:

One really nice feature of the package is the rather rich data structure. The result set for the query above is actually stored in the package and can be accessed. It contains a number of components:

R> data(sport_trend)
R> names(sport_trend)
[1] "query"     "meta"      "trend"     "regions"   "topmetros"
[6] "cities"    "searches"  "rising"    "headers"  
R>

So not only can one look at trends, but also at regions, metropolitan areas, and cities — even plot this easily via package googleVis which is accessed via options in the default plot method. Furthermore, related searches and rising queries may give leads to dynamics within the search.
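For example, using the sport_trend object shown above, those extra components can be inspected directly:

R> head(sport_trend$regions)   ## interest by region
R> head(sport_trend$cities)    ## interest by city
R> head(sport_trend$rising)    ## rising related queries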

Please use the standard GitHub issue system for bug reports, suggestions and alike.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Source:: R News

R kernel in Jupyter notebook 3

By ygc

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

I followed the post, Installing an R kernel for IPython/jupyter notebook 3 on OSX, to install jupyter with python3 and R kernels on my iMac.

I have elementaryOS on my Macbook Pro and also want to have jupyter on it. The installation process is quite similar.

Install Jupyter

sudo apt-get install python3-pip
sudo pip3 install jupyter

Then we can use the following command to start jupyter:

ipython notebook

Install IRkernel

To compile IRkernel, we should firstly have zmq lib installed.

sudo apt-get install libzmq3-dev python-zmq

In R, run the following command to install IRkernel:


install.packages(c('rzmq', 'repr', 'IRkernel', 'IRdisplay'),
                 repos = c('http://irkernel.github.io/',
                           getOption('repos')),
                 type = 'source')
IRkernel::installspec()

Now we can use R in jupyter. Inline images are a great feature, especially for demonstrations.

With many phylogenetic packages available in R and my package ggtree, R in jupyter can be a great environment for phylogenetics.
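For instance, a notebook cell along these lines (a minimal sketch; the ggtree calls are assumed from its documentation) renders the tree inline:

## minimal sketch of an inline phylogenetic plot in a jupyter cell
library(ape)     # for rtree()
library(ggtree)

set.seed(123)
tree <- rtree(10)              # simulate a random 10-tip tree
ggtree(tree) + geom_tiplab()   # the plot appears inline below the cell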


To leave a comment for the author, please follow the link and comment on their blog: YGC » R.


Source:: R News

Sixer – R package cricketr’s new Shiny avatar

By Tinniam V Ganesh


(This article was first published on Giga thoughts … » R, and kindly contributed to R-bloggers)

In this post I create a Shiny app, Sixer, based on my R package cricketr. I had developed the R package cricketr a few months back for analyzing the performances of batsmen and bowlers in all formats of the game (Test, ODI and Twenty 20). This package uses the statistics available in ESPN Cricinfo Statsguru. I had written a series of posts using the cricketr package in which I chose a few batsmen and bowlers and compared their performances. Here I have created a complete Shiny app with a lot more players and with almost all the features of the cricketr package. The motivation for creating the Shiny app was to

  • Showcase the ‘cricketr’ package and highlight its functionality
  • Perform analysis of more batsmen and bowlers
  • Allow users to interact with the package, try out its different features and functions, and check the performances of some of their favorite cricketers

a) You can try out the interactive Shiny app Sixer at – Sixer
b) The code for this Shiny app project can be cloned/forked from GitHub – Sixer
Note: Please be mindful of ESPN Cricinfo Terms of Use.
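For readers who prefer the console, the underlying cricketr workflow that Sixer wraps looks roughly like this (a sketch based on the package's documented functions; the Cricinfo profile number and file name are only illustrative):

# sketch of the cricketr workflow behind the app; the profile id and file
# name are illustrative
library(cricketr)

# download a batsman's Test career data from ESPN Cricinfo Statsguru
tendulkar <- getPlayerData(35320, dir = ".", file = "tendulkar.csv",
                           type = "batting", homeOrAway = c(1, 2),
                           result = c(1, 2, 4))

batsmanRunsFreqPerf("./tendulkar.csv", "Sachin Tendulkar")   # run-frequency ranges
batsmanMeanStrikeRate("./tendulkar.csv", "Sachin Tendulkar") # mean strike rate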

In this Shiny app I have 4 tabs which perform the following function
1. Analyze Batsman
This tab analyzes batsmen based on different functions and plots the performances of the selected batsman. There are functions that compute and display a batsman’s run-frequency ranges, Mean Strike Rate, No of 4’s, dismissals, 3-D plot of Runs scored vs Balls Faced and Minutes at crease, Contribution to wins & losses, Home-Away record etc. The analyses can be done for Test, ODI and Twenty 20 batsmen. I have included most of the Test batting giants including Tendulkar, Dravid, Sir Don Bradman, Viv Richards, Lara, Ponting etc. Similarly the ODI list includes Sehwag, de Villiers, Afridi, Maxwell etc. The Twenty20 list includes the Top 10 Twenty20 batsmen based on their ICC rankings.

2. Analyze bowler
This tab analyzes the bowling performances of bowlers: Wickets percentages, Mean Economy Rate, Wickets at different venues, Moving average of wickets etc. As earlier, I have included all the top bowlers including Warne, Muralidharan, Kumble, the famed Indian spin quartet of Bedi, Chandrasekhar, Prasanna and Venkatraghavan, the deadly West Indies trio of Marshall, Roberts and Holding, and the lethal combination of Imran Khan, Wasim Akram and Waqar Younis, besides the dangerous Dennis Lillee and Jeff Thomson. Do give the functions a try and see for yourself the performances of these individual bowlers.

3. Relative performances of batsman
This tab allows the selection of multiple batsmen (Test, ODI and Twenty 20) for comparison. There are 2 main functions: Relative Runs Frequency Performance and Relative Mean Strike Rate.

4. Relative performances of bowlers
Here we can compare the bowling performances of multiple bowlers using the functions Relative Bowling Performance and Relative Economy Rate. This can be done for Test, ODI and Twenty20 formats.
Some of my earlier posts based on the R package cricketr include
1. Introducing cricketr!: An R package for analyzing performances of cricketers
2. Taking cricketr for a spin – Part 1
3. cricketr plays the ODIs
4. cricketr adapts to the Twenty20 International
5. cricketr digs the Ashes

Do try out the interactive Sixer Shiny app – Sixer
You can clone the code from Github – Sixer

There is not much in the way of explanation. The Shiny app’s use is self-explanatory. You can choose a match type (Test, ODI or Twenty20), choose a batsman/bowler from the drop-down list and select the plot you would like to see. Here are a few sample plots:
A. Analyze batsman tab
i) Batsman – Brian Lara , Match Type – Test, Function – Mean Strike Rate
ii) Batsman – Shahid Afridi, Match Type – ODI, Function – Runs vs Balls faced
iii) Batsman – Chris Gayle, Match Type – Twenty20 Function – Moving Average
B. Analyze bowler tab

i) Bowler – B S Chandrasekhar, Match Type – Test, Function – Wickets vs Runs
ii) Bowler – Malcolm Marshall, Match Type – Test, Function – Mean Economy Rate
iii) Bowler – Sunil Narine, Match Type – Twenty 20, Function – Bowler Wicket Rate
C. Relative performance of batsman (you can select more than 1)
The plot below gives the Mean Strike Rate of batsmen. Viv Richards, Brian Lara, Sanath Jayasuriya and David Warner are among the best strikers of the ball.

Here are some of the great strikers of the ball in ODIs.
D. Relative performance of bowlers (you can select more than 1)
Finally a look at the famed Indian spin quartet. From the plot below it can be seen that B S Bedi & Venkatraghavan were more economical than Chandrasekhar and Prasanna.

But the latter have a better 4-5 wicket haul than the former two as seen in the plot below

Finally a look at the average number of balls to take a wicket by the Top 4 Twenty 20 bowlers.

Do give the Shiny app Sixer a try.

Also see
1. Literacy in India : A deepR dive.
2. Natural Language Processing: What would Shakespeare say?
3. Revisiting crimes against women in India
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Experiments with deblurring using OpenCV
6. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
7. Working with Node.js and PostgreSQL
8. A method for optimal bandwidth usage by auctioning available bandwidth using the OpenFlow Protocol
9. Latency, throughput implications for the cloud
10. A closer look at “Robot horse on a Trot! in Android”

To leave a comment for the author, please follow the link and comment on their blog: Giga thoughts … » R.


Source:: R News

Wind in Netherlands II

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

Two weeks ago I plotted how wind measurements on the edge of the North Sea changed in the past century. This week the same dataset is used for hypothesis testing.

Data

The most important thing to reiterate from the previous post is that the data come from KNMI with a comment: “These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt or the Central Netherlands Temperature.”
Data reading has slightly changed, mostly because I needed different variables. In addition, for testing I wanted some categorical variables: Month and year. For year I have chosen five chunks of 22 years; 22 was chosen since it seemed large enough and resulted in approximately equal-sized chunks. Finally, for display purposes, wind direction was categorized into 8 directions according to the compass rose (North, North-East, East, etc.).
library(circular)
library(dplyr)
library(ggplot2)
library(WRS2)

r1 <- readLines('etmgeg_235.txt')
r2 <- r1[grep('^#', r1):length(r1)]
explain <- r1[1:(grep('^#', r1) - 1)]
explain
r2 <- gsub('#', '', r2)
r3 <- read.csv(text = r2)

r4 <- mutate(r3,
    Date = as.Date(format(YYYYMMDD), format = '%Y%m%d'),
    year = floor(YYYYMMDD / 1e4),
    rDDVEC = as.circular(DDVEC, units = 'degrees', template = 'geographics'),
    # Vector mean wind direction in degrees
    # (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
    DDVECf = as.character(cut(DDVEC, breaks = c(0, seq(15, 330, 45), 361), left = TRUE,
        labels = c('N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW', 'N2'))),
    DDVECf = ifelse(DDVECf == 'N2', 'N', DDVECf),
    DDVECf = factor(DDVECf, levels = c('N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW')),
    rFHVEC = FHVEC / 10, # Vector mean windspeed (in 0.1 m/s)
    yearf = cut(year, seq(1905, 2015, 22), labels = c('05', '27', '49', '71', '93')),
    month = factor(format(Date, '%B'), levels = month.name),
    tcat = interaction(month, yearf)
  ) %>%
  select(., YYYYMMDD, Date, year, month, DDVEC, rDDVEC, DDVECf, rFHVEC, yearf, tcat)

Analysis

The circular package comes with an aov.circular() function, which can do a one-way analysis. Since I am a firm believer that direction varies according to the seasons, the presence of a time effect (the five categories) has been examined by Month. To keep the results compact, only p-values are displayed; they are all significant.
sapply(month.name,function(x) {
aa <- filter(r4,month==x)
bb <- aov.circular(aa$rDDVEC,aa$yearf,method='F.test')
format.pval(bb$p.value,digits=4,eps=1e-5)
}) %>% as.data.frame
January 4.633e-05
February < 1e-05
March < 1e-05
April < 1e-05
May < 1e-05
June 0.00121
July 0.000726
August 0.0001453
September 0.02316
October < 1e-05
November 0.0001511
December 0.003236
The associated plot shows the frequency of directions by year and Month. The advantage here is that time is on the x-axis, so changes are more easily visible.

ggplot(r4[complete.cases(r4), ], aes(x = yearf)) +
  geom_histogram() +
  facet_grid(DDVECf ~ month) +
  ggtitle('Frequency of Wind Direction')

The other part of wind is strength. Two weeks ago I saw clear differences; however, these may also be an effect of instrument or location changes. The test I am interested in here is therefore not the main effect of the year categories but rather the Month:Year interaction. In the interest of robustness I wanted to go nonparametric with this. However, since I did not find anything regarding two-factor interactions in my second edition of Hollander and Wolfe, I googled for robust interaction tests. This gave a hit on rcompanion for the WRS2 package.

t2way(rFHVEC ~ yearf + month + yearf:month,
data = r4)
value p.value
yearf 1063.0473 0.001
month 767.5687 0.001
yearf:month 169.4807 0.001

Conclusion

The data seem to show a change in wind measurements over these 110 years. This can be due to changes in the wind itself, in the measurement instrument, or in the instrument location. The statistical testing was chosen so as to counter some effects of these changes; hence it can be thought that the change is due to changes in the wind itself.

To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.


Source:: R News