forcats 0.1.0 🐈🐈🐈🐈

By hadleywickham

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m excited to announce forcats, a new package for categorical variables, or factors. Factors have a bad rap in R because they often turn up when you don’t want them. That’s because historically, factors were more convenient than character vectors, as discussed in stringsAsFactors: An unauthorized biography by Roger Peng, and stringsAsFactors = <sigh> by Thomas Lumley.

If you use packages from the tidyverse (like tibble and readr) you don’t need to worry about getting factors when you don’t want them. But factors are a useful data structure in their own right, particularly for modelling and visualisation, because they allow you to control the order of the levels. Working with factors in base R can be a little frustrating because of a handful of missing tools. The goal of forcats is to fill in those missing pieces so you can access the power of factors with a minimum of pain.

Install forcats with:

install.packages("forcats")

forcats provides two main types of tools to change either the values or the order of the levels. I’ll call out some of the most important functions below, using the included gss_cat dataset, which contains a selection of categorical variables from the General Social Survey.

library(dplyr)
library(ggplot2)
library(forcats)

gss_cat
#> # A tibble: 21,483 × 9
#>  year    marital  age  race    rincome      partyid
#>  <int>    <fctr> <int> <fctr>     <fctr>       <fctr>
#> 1 2000 Never married  26 White $8000 to 9999    Ind,near rep
#> 2 2000   Divorced  48 White $8000 to 9999 Not str republican
#> 3 2000    Widowed  67 White Not applicable    Independent
#> 4 2000 Never married  39 White Not applicable    Ind,near rep
#> 5 2000   Divorced  25 White Not applicable  Not str democrat
#> 6 2000    Married  25 White $20000 - 24999  Strong democrat
#> # ... with 2.148e+04 more rows, and 3 more variables: relig <fctr>,
#> #  denom <fctr>, tvhours <int>

Change level values

You can recode specified factor levels with fct_recode():

gss_cat %>% count(partyid)
#> # A tibble: 10 × 2
#>       partyid   n
#>        <fctr> <int>
#> 1     No answer  154
#> 2     Don't know   1
#> 3    Other party  393
#> 4 Strong republican 2314
#> 5 Not str republican 3032
#> 6    Ind,near rep 1791
#> # ... with 4 more rows

gss_cat %>%
 mutate(partyid = fct_recode(partyid,
  "Republican, strong"  = "Strong republican",
  "Republican, weak"   = "Not str republican",
  "Independent, near rep" = "Ind,near rep",
  "Independent, near dem" = "Ind,near dem",
  "Democrat, weak"    = "Not str democrat",
  "Democrat, strong"   = "Strong democrat"
 )) %>%
 count(partyid)
#> # A tibble: 10 × 2
#>         partyid   n
#>         <fctr> <int>
#> 1       No answer  154
#> 2      Don't know   1
#> 3      Other party  393
#> 4  Republican, strong 2314
#> 5   Republican, weak 3032
#> 6 Independent, near rep 1791
#> # ... with 4 more rows

Note that unmentioned levels are left as is, and the order of the levels is preserved.

fct_lump() allows you to lump the rarest (or most common) levels into a new “other” level. The default behaviour is to collapse the smallest levels into other, ensuring that it’s still the smallest level. For the religion variable, that tells us that Protestants outnumber all other religions, which is interesting, but we probably want more levels.

gss_cat %>% 
 mutate(relig = fct_lump(relig)) %>% 
 count(relig)
#> # A tibble: 2 × 2
#>    relig   n
#>    <fctr> <int>
#> 1   Other 10637
#> 2 Protestant 10846

Alternatively you can supply a number of levels to keep, n, or minimum proportion for inclusion, prop. If you use negative values, fct_lump() will change direction, and combine the most common values while preserving the rarest.

gss_cat %>% 
 mutate(relig = fct_lump(relig, n = 5)) %>% 
 count(relig)
#> # A tibble: 6 × 2
#>    relig   n
#>    <fctr> <int>
#> 1   Other  913
#> 2 Christian  689
#> 3    None 3523
#> 4   Jewish  388
#> 5  Catholic 5124
#> 6 Protestant 10846

gss_cat %>% 
 mutate(relig = fct_lump(relig, prop = -0.10)) %>% 
 count(relig)
#> # A tibble: 12 × 2
#>           relig   n
#>          <fctr> <int>
#> 1        No answer  93
#> 2       Don't know  15
#> 3 Inter-nondenominational  109
#> 4     Native american  23
#> 5        Christian  689
#> 6   Orthodox-christian  95
#> # ... with 6 more rows

Change level order

There are four simple helpers for common operations:

 • fct_relevel() is similar to stats::relevel() but allows you to move any number of levels to the front.
 • fct_inorder() orders according to the first appearance of each level.
 • fct_infreq() orders from most common to rarest.
 • fct_rev() reverses the order of levels.
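
To make the four helpers concrete, here is a minimal sketch on a toy factor (the vector f is made up for illustration and is not part of gss_cat). It only inspects the resulting level order:

```r
library(forcats)

# Toy factor: "c" appears three times, "b" twice, "a" once
f <- factor(c("b", "b", "a", "c", "c", "c"))

levels(f)                     # default alphabetical order: "a" "b" "c"
levels(fct_relevel(f, "c"))   # "c" moved to the front:     "c" "a" "b"
levels(fct_inorder(f))        # order of first appearance:  "b" "a" "c"
levels(fct_infreq(f))         # most common first:          "c" "b" "a"
levels(fct_rev(f))            # reversed:                   "c" "b" "a"
```

None of these calls changes the values in f; only the order of the levels differs.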

fct_reorder() and fct_reorder2() are useful for visualisations. fct_reorder() reorders the factor levels by another variable. This is useful when you map a categorical variable to position, as shown in the following example which shows the average number of hours spent watching television across religions.

relig <- gss_cat %>%
 group_by(relig) %>%
 summarise(
  age = mean(age, na.rm = TRUE),
  tvhours = mean(tvhours, na.rm = TRUE),
  n = n()
 )

ggplot(relig, aes(tvhours, relig)) + geom_point()
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
 geom_point()

fct_reorder2() extends the same idea to plots where a factor is mapped to another aesthetic, like colour. The defaults are designed to make legends easier to read for line plots, as shown in the following example looking at marital status by age.

by_age <- gss_cat %>%
 filter(!is.na(age)) %>%
 count(age, marital) %>%
 group_by(age) %>%   # proportions are computed within each age
 mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop)) +
 geom_line(aes(colour = marital))
ggplot(by_age, aes(age, prop)) +
 geom_line(aes(colour = fct_reorder2(marital, age, prop))) +
 labs(colour = "marital")

Learning more

You can learn more about forcats in R for Data Science, and on the forcats website.

Please let me know if you have more factor problems that forcats doesn’t help with!

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Fundamental and Technical Analysis of Shares Exercises

By Miodrag Sljukic

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises we shall explore possibilities for fundamental and technical analysis of stocks offered by the quantmod package. If you don’t have the package already installed, install it using the following code:


install.packages("quantmod")

and load it into the session using the following code:


library("quantmod")

before proceeding.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load FB (Facebook) market data from Yahoo and assign it to an xts object fb.p.

Exercise 2

Display monthly closing prices of Facebook in 2015.

Exercise 3

Plot weekly returns of FB in 2016.

Exercise 4

Plot a candlestick chart of FB in 2016.

Exercise 5

Plot a line chart of FB in 2016, and add Bollinger Bands and a Relative Strength Index to the chart.

Exercise 6

Get yesterday’s EUR/USD rate.

Exercise 7

Get financial data for FB and display it.

Exercise 8

Calculate the current ratio for FB for the years 2013, 2014 and 2015. (Tip: You can calculate the current ratio by dividing current assets by current liabilities from the balance sheet.)

Exercise 9

Based on the last closing price and the income statement for the 12 months ending on December 31st 2015, calculate the PE ratio for FB. (Tip: PE stands for Price/Earnings ratio. You calculate it as the stock price divided by the diluted normalized EPS read from the income statement.)

Exercise 10

Write a function getROA(symbol, year) which will calculate the return on assets for a given stock symbol and year. What is the ROA for FB in 2014? (Tip: ROA stands for Return on Assets. You calculate it as net income divided by total assets.)
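
The three ratios in the tips for Exercises 8 to 10 are plain divisions. Here is a minimal sketch of the arithmetic with made-up figures (these are not Facebook's financials, and the helper names are hypothetical). It does not fetch anything with quantmod, so it is not a solution to the exercises themselves:

```r
# Hypothetical helpers that mirror the tips in Exercises 8-10
current_ratio <- function(current_assets, current_liabilities) {
  current_assets / current_liabilities
}
pe_ratio <- function(price, diluted_eps) {
  price / diluted_eps
}
return_on_assets <- function(net_income, total_assets) {
  net_income / total_assets
}

# Made-up inputs, chosen only to keep the arithmetic easy to check
current_ratio(current_assets = 200, current_liabilities = 50)   # 4
pe_ratio(price = 100, diluted_eps = 4)                          # 25
return_on_assets(net_income = 30, total_assets = 400)           # 0.075
```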

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

How to Check Data Quality using R

By R programming

(This article was first published on R programming, and kindly contributed to R-bloggers)

By Milind Paradkar

Do You Use Clean Data?

Always go for clean data! Why is it that experienced traders/authors stress this point in their trading articles/books so often? As a novice trader, you might be using the freely available data from sources like Google or Yahoo finance. Do such sources provide accurate, quality data?

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 15:30 IST, thus comprising 375 minutes. For 14 trading sessions, we should have 375 × 14 = 5250 data points for each of these stocks. We wrote some simple R code to perform the check.
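
The authors' own script is among the downloads listed at the end of the post. What follows is not that script, only a minimal sketch of this kind of completeness check. It assumes a data frame with a symbol column and a POSIXct time column, and it assumes bars are stamped 09:15 through 15:29; the function names, the toy symbols and the timestamp convention are all hypothetical.

```r
# Expected 1-minute timestamps for one session.
# ASSUMPTION: the first bar is stamped 09:15 and the last 15:29 (375 bars).
session_grid <- function(day, tz = "Asia/Kolkata") {
  start <- as.POSIXct(paste(day, "09:15:00"), tz = tz)
  seq(start, by = "1 min", length.out = 375)
}

# For each symbol, return the expected timestamps that are absent from `bars`.
missing_bars <- function(bars, days, tz = "Asia/Kolkata") {
  expected <- do.call(c, lapply(days, session_grid, tz = tz))
  lapply(split(bars$time, bars$symbol), function(seen) {
    expected[!(as.numeric(expected) %in% as.numeric(seen))]
  })
}

# Toy data: two sessions, symbol "BBB" has two bars removed on purpose
days <- c("2016-08-01", "2016-08-02")
full <- do.call(c, lapply(days, session_grid))
bars <- rbind(
  data.frame(symbol = "AAA", time = full),
  data.frame(symbol = "BBB", time = full[-c(10, 400)])
)

gaps <- missing_bars(bars, days)
lengths(gaps)   # number of missing bars per symbol
gaps$BBB        # the missing timestamps themselves
```

With real data, a symbol is complete when its entry in gaps is empty; over 14 sessions the expected grid has 375 * 14 = 5250 timestamps.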

Here is our finding. Out of the 143 stocks scanned, 89 stocks had fewer than 5250 data points; that’s more than 60% of our sample set! The table shown below lists 10 of those 89 stocks.

Symbols

Let’s take the case of PAGEIND. Google finance has captured only 4348 1-minute data points for the stock, thus missing 902 points!!

Example – Missing the 1306 minute bar on 20160801

Example – Missing the 1032 minute bar on 20160802

If a trader is running an intraday strategy which generates buy/sell signals based on 1-minute bars, the strategy is bound to give some false signals.

As can be seen from the quick check above, data quality from free sources or from cheap data vendors is not always guaranteed. Many of the cheap data vendors source the data from Yahoo finance and provide it to their clients. Poor data feed is a big issue faced by many traders and you will find many traders complaining about the same on various trading forums.

Backtesting a trading strategy using such data will give false results. If you are using the data in live trading and there is a server problem with Google or Yahoo finance, it will lead to a delay in the data feed. As a trader, you don’t want to be in a position where you have an open trade, and the data feed stops or is delayed. When trading with real money, one is always advised to use quality data from reliable data vendors. After all, Data is Everything!

Next Step

If you’re a retail trader interested in learning various aspects of Algorithmic trading, check out the Executive Programme in Algorithmic Trading (EPAT). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. The course equips you with the required skillsets to be a successful trader.

Download Data Files

 • Do You Use Clean Data.rar
  • 15 Day Intraday Historical Data.zip
  • F&O Stock List.csv
  • R code – Google_Data_Quality_Check.txt
  • R code – Stock price data.txt

To leave a comment for the author, please follow the link and comment on their blog: R programming.

ggtree for microbiome data

By R on Guangchuang Yu

(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

ggtree can parse the output of many software packages, and the evolutionary evidence inferred by that software can be used directly for tree annotation. ggtree not only works as an infrastructure that brings evolutionary data inferred by commonly used software packages into R, but also serves as a general tree visualization and annotation tool for the R community, as it supports many S3/S4 objects defined by other R packages.

phyloseq for microbiome data

The phyloseq class defined in the phyloseq package was designed for microbiome data. The phyloseq package implements a plot_tree function using ggplot2. Although the function is built on ggplot2 and we can use theme, scale_color_manual, etc. for customization, the most valuable part of ggplot2, adding layers, is missing. plot_tree provides only a limited set of parameters to control the output graph, and it is hard to add layers unless the user has expertise in both phyloseq and ggplot2.

library(phyloseq)
library(ggplot2)  # attached explicitly so that ggtitle() is available

data(GlobalPatterns)
GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
GP.chl <- subset_taxa(GP, Phylum=="Chlamydiae")

plot_tree(GP.chl, color="SampleType", shape="Family", label.tips="Genus", size="Abundance") + ggtitle("tree annotation using phyloseq")

PS: If we look at the plot carefully, we will find that the legend produced by plot_tree is not correct (plot_tree maps SampleType to the color of text, which is shown in the legend, but we can’t find that mapping in the plot).

ggtree supports phyloseq object

One of the advantages of R is the community. R users develop packages that can work together and complement each other. ggtree fits the R ecosystem in phylogenetic analysis. It supports several classes defined in other R packages that are designed for storing phylogenetic trees with associated data, including phyloseq.

library(scales)
library(ggtree)
p <- ggtree(GP.chl, ladderize = FALSE) + geom_text2(aes(subset=!isTip, label=label), hjust=-.2, size=4) +
  geom_tiplab(aes(label=Genus), hjust=-.3) +
  geom_point(aes(x=x+hjust, color=SampleType, shape=Family, size=Abundance),na.rm=TRUE) +
  scale_size_continuous(trans=log_trans(5)) +
  theme(legend.position="right") + ggtitle("reproduce phyloseq by ggtree")
print(p)

With ggtree, it is more flexible to combine different layers using the grammar of graphics syntax, and more powerful, since layers can be added without limitation (i.e. we are not restricted to those predefined in the plot_tree function). As an example, I extract the barcode sequences from the tree object and use msaplot to visualize them alongside the tree.

df <- fortify(GP.chl)
barcode <- as.character(df$Barcode_full_length)
names(barcode) <- df$label
barcode <- barcode[!is.na(barcode)]
msaplot(p, Biostrings::BStringSet(barcode), width=.3, offset=.05)

PS: I am thinking about writing a tutorial through examples. If you have any interesting topic, please let me know.

Citation

G Yu, DK Smith, H Zhu, Y Guan, TTY Lam*. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. doi:10.1111/2041-210X.12628.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.

ubeR: A Package for the Uber API

By Andrew Collier

(This article was first published on R – Exegetic Analytics, and kindly contributed to R-bloggers)

Uber exposes an extensive API for interacting with their service. ubeR is an R package for working with that API, which Arthur Wu and I put together during a Hackathon at iXperience.

Installation

The package is currently hosted on GitHub. Installation is simple using the devtools package.

> devtools::install_github("DataWookie/ubeR")
> library(ubeR)

Authentication

To work with the API you’ll need to create a new application for the Rides API.

 • Set Redirect URL to http://localhost:1410/.
 • Enable the profile, places, ride_widgets, history_lite and history scopes.

With the resulting Client ID and Client Secret you’ll be ready to authenticate. I’ve stored mine as environment variables but you can just hard code them into the script for starters.

> UBER_CLIENTID = Sys.getenv("UBER_CLIENTID")
> UBER_CLIENTSECRET = Sys.getenv("UBER_CLIENTSECRET")
> 
> uber_oauth(UBER_CLIENTID, UBER_CLIENTSECRET)
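
For readers who have not used environment variables from R before, here is a minimal sketch of one way to set and check them for the current session with base R. The values shown are placeholders, not real credentials; for something persistent, the same two NAME=value pairs could instead go in a ~/.Renviron file.

```r
# Placeholder values only: substitute the Client ID and Client Secret
# issued for your own Uber application.
Sys.setenv(
  UBER_CLIENTID     = "my-client-id",
  UBER_CLIENTSECRET = "my-client-secret"
)

# Read them back, as in the snippet above
UBER_CLIENTID     <- Sys.getenv("UBER_CLIENTID")
UBER_CLIENTSECRET <- Sys.getenv("UBER_CLIENTSECRET")

# Sys.getenv() returns "" for an unset variable, so fail early if one is missing
stopifnot(nzchar(UBER_CLIENTID), nzchar(UBER_CLIENTSECRET))
```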

Identity

We can immediately use uber_me() to retrieve information about the authenticated user.

> identity <- uber_me()
> names(identity)
[1] "picture"     "first_name"   "last_name"    "uuid"      "rider_id"    
[6] "email"      "mobile_verified" "promo_code"  
> identity$first_name
[1] "Andrew"
> identity$picture
[1] "https://d1w2poirtb3as9.cloudfront.net/default.jpeg"

Clearly I haven’t made enough effort in personalising my Uber account.

Designated Places

Uber allows you to specify predefined locations for “home” and “work”. These are accessible via uber_places_get().

> uber_places_get("home")
$address
[1] "St Andrews Dr, Durban North, 4051, South Africa"

> uber_places_get("work")
$address
[1] "Dock Rd, V & A Waterfront, Cape Town, 8002, South Africa"

These addresses can be modified using uber_places_put().

History

You can access data for recent rides using uber_history().

> history <- uber_history(50, 0)
> names(history)
 [1] "status"    "distance"   "request_time" "start_time"  "end_time"   "request_id" 
 [7] "product_id"  "latitude"   "display_name" "longitude"

The response includes a wide range of fields; we’ll pick out just a few of them for closer inspection.

> head(history)[, c(2, 4:5, 9)]
 distance     start_time      end_time display_name
1  1.3140 2016-08-15 17:35:24 2016-08-15 17:48:54 New York City
2 13.6831 2016-08-11 15:29:58 2016-08-11 16:04:22   Cape Town
3  2.7314 2016-08-11 09:09:25 2016-08-11 09:23:51   Cape Town
4  3.2354 2016-08-10 19:28:41 2016-08-10 19:38:07   Cape Town
5  7.3413 2016-08-10 16:37:30 2016-08-10 17:21:16   Cape Town
6  4.3294 2016-08-10 13:38:49 2016-08-10 13:59:00   Cape Town

Product Descriptions

We can get a list of cars near to a specified location using uber_products().

> cars <- uber_products(latitude = -33.925278, longitude = 18.423889)
> names(cars)
[1] "capacity"     "product_id"    "price_details"   "image"      
[5] "cash_enabled"   "shared"      "short_description" "display_name"   
[9] "description" 
> cars[, c(1, 2, 7)]
 capacity              product_id short_description
1    4 91901472-f30d-4614-8ba7-9fcc937cebf5       uberX
2    6 419f6bdc-7307-4ea8-9bb0-2c7d852b616a      uberXL
3    4 1dd39914-a689-4b27-a59d-a74e9be559a4     UberBLACK

Information for a particular car can also be accessed.

> product <- uber_products(product_id = "91901472-f30d-4614-8ba7-9fcc937cebf5")
> names(product)
[1] "capacity"     "product_id"    "price_details"   "image"      
[5] "cash_enabled"   "shared"      "short_description" "display_name"   
[9] "description"   
> product$price_details
$service_fees
list()

$cost_per_minute
[1] 0.7

$distance_unit
[1] "km"

$minimum
[1] 20

$cost_per_distance
[1] 7

$base
[1] 5

$cancellation_fee
[1] 25

$currency_code
[1] "ZAR"
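
One plausible reading of these fields is a linear fare: a base amount plus a per-minute and a per-distance component, subject to a minimum. That formula is an assumption made here for illustration, not documented behaviour of the Uber API or of ubeR. The function name and the trip inputs in this rough sketch are hypothetical:

```r
# Values copied from product$price_details above (ZAR, distance in km)
price_details <- list(
  base              = 5,
  cost_per_minute   = 0.7,
  cost_per_distance = 7,
  minimum           = 20
)

# ASSUMPTION: fare = base + time component + distance component,
# never less than the minimum fare.
rough_fare <- function(p, minutes, km) {
  fare <- p$base + p$cost_per_minute * minutes + p$cost_per_distance * km
  max(p$minimum, fare)
}

rough_fare(price_details, minutes = 10, km = 6)  # hypothetical 10 minute, 6 km trip
rough_fare(price_details, minutes = 1,  km = 1)  # short trip: the minimum applies
```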

Estimates

It’s good to have a rough idea of how much a ride is going to cost you. What about a trip from Mouille Point to the Old Biscuit Mill?

> estimate <- uber_requests_estimate(start_latitude = -33.899656, start_longitude = 18.407663,
+                  end_latitude = -33.927443, end_longitude = 18.457557)
> estimate$trip
$distance_unit
[1] "mile"

$duration_estimate
[1] 600

$distance_estimate
[1] 4.15

> estimate$pickup_estimate
[1] 4
> estimate$price
 high_amount display_amount display_name low_amount surge_multiplier currency_code
1    5.00      5.00  Base Fare    5.00        1      ZAR
2    56.12  42.15-56.12   Distance   42.15        1      ZAR
3    8.30   6.23-8.30     Time    6.23        1      ZAR

Not quite sure why the API is returning the distance in such obscure units. (Note to self: convert those to metric equivalent in next release!) The data above are based on the car nearest to the start location. What about prices for a selection of other cars?

> estimate <- uber_estimate_price(start_latitude = -33.899656, start_longitude = 18.407663,
+           end_latitude = -33.927443, end_longitude = 18.457557)
> names(estimate)
 [1] "localized_display_name" "high_estimate"     "minimum"        "duration"
 [5] "estimate"        "distance"        "display_name"      "product_id"
 [9] "low_estimate"      "surge_multiplier"    "currency_code"     
> estimate[, c(1, 5)]
 localized_display_name estimate
1         uberX ZAR53-69
2         uberXL ZAR68-84
3       uberBLACK ZAR97-125

The time of arrival for each of those cars can be accessed via uber_estimate_time().

> uber_estimate_time(start_latitude = -33.899656, start_longitude = 18.407663)
 localized_display_name estimate display_name              product_id
1         uberX   180    uberX 91901472-f30d-4614-8ba7-9fcc937cebf5
2         uberXL   420    uberXL 419f6bdc-7307-4ea8-9bb0-2c7d852b616a
3       uberBLACK   300  uberBLACK 1dd39914-a689-4b27-a59d-a74e9be559a4

So, for example, the uberXL would be expected to arrive in 7 minutes, while the uberX would pick you up in only 3 minutes.
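
The arithmetic behind that statement is simply seconds divided by 60. Here is a small sketch in base R, with the figures retyped from the output above into a hand-built data frame (so it runs without calling the API). The mile-to-kilometre conversion for the earlier distance estimate is included as well:

```r
# Arrival estimates in seconds, copied from the uber_estimate_time() output above
eta <- data.frame(
  display_name = c("uberX", "uberXL", "uberBLACK"),
  estimate     = c(180, 420, 300)
)

eta$minutes <- eta$estimate / 60
eta[order(eta$minutes), ]          # soonest pickup first

# Distance estimate from estimate$trip, converted from miles to kilometres
distance_km <- 4.15 * 1.609344
round(distance_km, 2)              # about 6.68 km
```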

Requesting a Ride

It’s also possible to request a ride. At present these requests are directed to the Uber API Sandbox. After we have done further testing we’ll retarget the requests to the API proper.

A new ride is requested using uber_requests().

> ride <- uber_requests(start_address = "37 Beach Road, Mouille Point, Cape Town",
+            end_address = "100 St Georges Mall, Cape Town City Centre, Cape Town")

Let’s find out the details of the result.

> names(ride)
 [1] "status"      "destination"   "product_id"    "request_id"
 [5] "driver"      "pickup"      "eta"       "location"
 [9] "vehicle"     "surge_multiplier" "shared"   
> ride$pickup
$latitude
[1] -33.9

$longitude
[1] 18.406
> ride$destination
$latitude
[1] -33.924

$longitude
[1] 18.42

Information about the currently requested ride can be accessed using uber_requests_current(). If we decide to walk instead, then it’s also possible to cancel the pickup.

> uber_requests_current_delete()

Future

For more information about units of measurement, limits and parameters of the Uber API, have a look at the API Overview.

We’ll be extending the package to cover the remaining API endpoints. But, for the moment, most of the core functionality is already covered.

The post ubeR: A Package for the Uber API appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R – Exegetic Analytics.

Florid’AISTATS

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

The next AISTATS conference is taking place in Florida, Fort Lauderdale, on April 20-22. (The website keeps the same address one conference after another, which means all my links to the AISTATS 2016 conference in Cadiz are no longer valid. And that the above sunset from Florida is named… cadiz.jpg!) The deadline for paper submission is October 13 and there are two novel features:

 1. Fast-track for Electronic Journal of Statistics: Authors of a small number of accepted papers will be invited to submit an extended version for fast-track publication in a special issue of the Electronic Journal of Statistics (EJS) after the AISTATS decisions are out. Details on how to prepare such extended journal paper submission will be announced after the AISTATS decisions.
 2. Review-sharing with NIPS: Papers previously submitted to NIPS 2016 are required to declare their previous NIPS paper ID, and optionally supply a one-page letter of revision (similar to a revision letter to journal editors; anonymized) in supplemental materials. AISTATS reviewers will have access to the previous anonymous NIPS reviews. Other than this, all submissions will be treated equally.

I find both initiatives worth applauding and replicating in other machine-learning conferences. Particularly in regard to the recent debate we had at Annals of Statistics.

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.

2016-11 The Butterfly Affectation

By pmur002

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report documents a variety of approaches to including an external vector image within an R plot. The image presents particular challenges because it contains features that are not natively supported by the R graphics system, which makes it hard for R to faithfully reproduce the image.

Paul Murrell

Download

To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech.

R Courses at London, Leeds and Newcastle

By csgillespie

(This article was first published on R – Why?, and kindly contributed to R-bloggers)

Over the next few months we’re running a number of R courses at London, Leeds and Newcastle.

 • September 2016 (Newcastle)
  • Sept 12th: Introduction to R
  • Sept 13th: Statistical modelling
  • Sept 14th: Programming with R
  • Sept 15th: Efficient R: speeding up your code
  • Sept 16th: Advanced graphics
 • October 2016 (London)
  • Oct 3rd: Introduction to R
  • Oct 4th: Programming with R
  • Oct 5th, 6th: Predictive analytics
  • Oct 7th: Building an R package
 • November 2016 (Leeds)
  • Nov 21st, 22nd: Predictive analytics
  • Nov 23rd: Building an R package
 • December 2016 (London)
  • December 5th, 6th: Advanced programming. Held at the Royal Statistical Society (booking form).
 • January 2017 (Newcastle)
  • Jan 16th: Introduction to R
  • Jan 17th: Statistical modelling
  • Jan 18th: Programming with R
  • Jan 19th: Efficient R: speeding up your code
  • Jan 20th: Advanced graphics

See the website for course descriptions. If you have any questions, feel free to contact me: csgillespie@gmail.com

On-site courses available on request.

To leave a comment for the author, please follow the link and comment on their blog: R – Why?.

The R community is awesome (and fast)

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems.

In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

Please read on for my list of n=3 interactions.

 1. While discussing plotting market data I ran into a corner-case with ggplot2. Even though I figured out how to work around it, it is now fixed by the ggplot2 team!
 2. I wrote an entire article denouncing a default setting of a single argument in the ranger random forest library. The ranger author himself replied with a fix that is very clever and mathematically well-founded (I suspect he had been researching this issue a while on his own).
 3. I complained about summary presentation fidelity in base R summary.default. You guessed it: the volunteers have generously fielded a patch!

Like any real-world system R represents a sequence of history and compromises. Only unused systems can be perfect without compromise. It is very evident how eager and able the volunteers who maintain it are to make sure R represents very good compromises.

I would like to offer a sincere appreciation and thank you from me to the R community. If this is what you can expect using R it is yet another strong argument for R.

And personal thanks to: Martin Maechler, Hadley Wickham, and Marvin N. Wright.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Multidimensional clustering with web analytics data

By eoda GmbH

[R] Kenntnis-Tage 2016 | 2.11-3.11 | Kassel - Germany

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

Speaker of the [R] Kenntnis-Tage 2016: Alexander Kruse | etracker GmbH

Alexander Kruse works as a data analyst at etracker, a leading provider of products and services for optimizing websites and online marketing activities in Europe. By now, more than 110,000 customers are using etracker solutions, among them companies such as Jochen Schweizer, Vorwerk, the Lufthansa Worldshop and many more from the fields of e-commerce, media, brands and B2B.

Web analytics is the collection and evaluation of data regarding the behavior of website visitors. Alexander Kruse’s guest lecture, “Multidimensional clustering with web analytics data”, will show an approach for dividing the heterogeneous totality of website visitors into homogeneous segments or clusters using a multidimensional cluster analysis. The lecture will focus on choosing adequate variables for segmentation, determining the ideal number of clusters and performing a multidimensional clustering. For data preparation, analysis and visualization the statistics software R is used.
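
The lecture itself is not reproduced here, so the following is only a generic sketch of the three steps named above (choosing variables, picking a number of clusters, clustering), using base R's kmeans() on simulated data. The variable names, the simulated visitor groups and the choice of k-means with an elbow heuristic are assumptions for illustration, not the speaker's method.

```r
set.seed(42)

# Simulated visitor-level metrics (hypothetical variables): two visitor types
visits <- data.frame(
  pages_per_visit = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 8,  sd = 1)),
  minutes_on_site = c(rnorm(50, mean = 1, sd = 0.3), rnorm(50, mean = 10, sd = 2))
)

# Standardise so that no variable dominates the distance calculation
scaled <- scale(visits)

# Total within-cluster sum of squares for k = 1..6 (elbow heuristic)
wss <- sapply(1:6, function(k) kmeans(scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k", ylab = "Within-cluster SS")

# Fit the chosen number of clusters and inspect the segment sizes
fit <- kmeans(scaled, centers = 2, nstart = 10)
table(fit$cluster)
```

The sharp drop in wss between k = 1 and k = 2 is what the elbow heuristic looks for. With real web analytics data the choice is rarely this clear-cut.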

On http://www.eoda.de/de/R-Kenntnis-Tage.html you will find the agenda and further information as well as the registration form for the [R] Kenntnis-Tage 2016.

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.
