Visualizing Chess Data With ggplot

By Joshua Kunst


(This article was first published on Jkunst – R category, and kindly contributed to R-bloggers)

There are nice visualizations made from chess data out there:
piece movement,
piece survivability,
square usage by player.
Sadly, the authors don't always share the code/data to replicate the final result.
So I wrote some code to show how to do some of these great visualizations entirely in
R. Just for fun.

  1. The Data
  2. Piece Movements
  3. Survival rates
  4. Square usage by player
  5. Distributions for the first movement
  6. Who captures whom

The Data

The original data come from here; they were parsed and stored in the rchess package.

library("rchess")
data(chesswc)
str(chesswc)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1266 obs. of  11 variables:
##  $ event   : chr  "FIDE World Cup 2011" "FIDE World Cup 2011" "FIDE World Cup 2011" "FIDE World Cup 2011" ...
##  $ site    : chr  "Khanty-Mansiysk RUS" "Khanty-Mansiysk RUS" "Khanty-Mansiysk RUS" "Khanty-Mansiysk RUS" ...
##  $ date    : Date, format: "2011-08-28" "2011-08-28" ...
##  $ round   : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ white   : chr  "Kaabi, Mejdi" "Ivanchuk, Vassily" "Ibrahim, Hatim" "Ponomariov, Ruslan" ...
##  $ black   : chr  "Karjakin, Sergey" "Steel, Henry Robert" "Mamedyarov, Shakhriyar" "Gwaze, Robert" ...
##  $ result  : chr  "0-1" "1-0" "0-1" "1-0" ...
##  $ whiteelo: int  2344 2768 2402 2764 2449 2746 2477 2741 2493 2736 ...
##  $ blackelo: int  2788 2362 2765 2434 2760 2452 2744 2480 2739 2493 ...
##  $ eco     : chr  "D15" "E68" "E67" "B40" ...
##  $ pgn     : chr  "1. d4 d5 2. Nf3 Nf6 3. c4 c6 4. Nc3 dxc4 5. e3 b5 6. a4 b4 7. Nb1 Ba6 8. Ne5 e6 9. Nxc4 c5 10. b3 cxd4 11. exd4 Nc6 12. Be3 Be7"| __truncated__ "1. c4 Nf6 2. Nc3 g6 3. g3 Bg7 4. Bg2 O-O 5. d4 d6 6. Nf3 Nbd7 7. O-O e5 8. e4 c6 9. Rb1 exd4 10. Nxd4 Re8 11. h3 Nc5 12. Re1 a5"| __truncated__ "1. Nf3 Nf6 2. c4 g6 3. Nc3 Bg7 4. g3 O-O 5. Bg2 d6 6. O-O Nbd7 7. d4 e5 8. b3 exd4 9. Nxd4 Re8 10. Bb2 Nc5 11. Qc2 h5 12. Rad1 "| __truncated__ "1. e4 c5 2. Nf3 e6 3. d3 Nc6 4. g3 e5 5. Bg2 d6 6. O-O Be7 7. c3 Nf6 8. Nbd2 O-O 9. a3 b5 10. Re1 Kh8 11. d4 Bd7 12. b4 cxd4 13"| __truncated__ ...
chesswc %>% count(event)
event                  n
FIDE World Cup 2011  398
FIDE World Cup 2013  435
FIDE World Cup 2015  433
chesswc <- chesswc %>% filter(event == "FIDE World Cup 2015")

The most important variable here is the pgn of each game.
This pgn is a long string which represents the game. However, this format is not very visualization
friendly. That's why I implemented the history_detail() method for the Chess object. Let's check.

library("stringr")
set.seed(123)
pgn <- sample(chesswc$pgn, size = 1)
str_sub(pgn, 0, 50)
## [1] "1. d4 Nf6 2. Nf3 d5 3. c4 e6 4. e3 Be7 5. Nbd2 O-O"

Compare the previous string with the first 10 rows of the history_detail() output:

chss <- Chess$new()
chss$load_pgn(pgn)
## [1] TRUE
chss$history_detail() %>%
  arrange(number_move) %>% 
  head(10)
piece       from  to   number_move  piece_number_move  status     number_move_capture  captured_by
d2 Pawn     d2    d4   1            1                  NA         NA                   NA
g8 Knight   g8    f6   2            1                  game over  NA                   NA
g1 Knight   g1    f3   3            1                  NA         NA                   NA
d7 Pawn     d7    d5   4            1                  NA         NA                   NA
c2 Pawn     c2    c4   5            1                  captured   14                   d7 Pawn
e7 Pawn     e7    e6   6            1                  game over  NA                   NA
e2 Pawn     e2    e3   7            1                  game over  NA                   NA
f8 Bishop   f8    e7   8            1                  NA         NA                   NA
b1 Knight   b1    d2   9            1                  NA         NA                   NA
Black King  e8    g8   10           1                  game over  NA                   NA

The result is a data frame where each row is a piece's movement, showing explicitly the squares
it travels between and on which move number. Now we apply this function over the 433
games in the FIDE World Cup 2015.

library("foreach")
library("doParallel")
workers <- makeCluster(parallel::detectCores())
registerDoParallel(workers)

chesswc <- chesswc %>% mutate(game_id = seq(nrow(.)))

dfmoves <- adply(chesswc %>% select(pgn, game_id), .margins = 1, function(x){
  chss <- Chess$new()
  chss$load_pgn(x$pgn)
  chss$history_detail()
  }, .parallel = TRUE, .paropts = list(.packages = c("rchess")))

dfmoves <- tbl_df(dfmoves) %>% select(-pgn)
dfmoves %>% filter(game_id == 1, piece == "g1 Knight")
game_id  piece      from  to  number_move  piece_number_move  status     number_move_capture  captured_by
1        g1 Knight  g1    f3  5            1                  NA         NA                   NA
1        g1 Knight  f3    h2  37           2                  NA         NA                   NA
1        g1 Knight  h2    g4  39           3                  NA         NA                   NA
1        g1 Knight  g4    f2  85           4                  game over  NA                   NA

The dfmoves data frame will be the heart of all these plots, because it holds a lot of information and
is easy to consume.
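
For example, counting how many recorded movements each piece makes across all the games takes only a few lines (just a quick sketch; the output is omitted here):

# How often each piece moves across the 433 games (sketch; output not shown)
dfmoves %>%
  filter(!is.na(to)) %>%
  count(piece) %>%
  arrange(desc(n)) %>%
  head()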

Piece Movements

To replicate the result we need data to represent (and then plot) the
board. In the rchess package there are some helper functions for this, like rchess:::.chessboarddata().

dfboard <- rchess:::.chessboarddata() %>%
  select(cell, col, row, x, y, cc)

head(dfboard)
cell  col  row  x  y  cc
a1    a    1    1  1  b
b1    b    1    2  1  w
c1    c    1    3  1  b
d1    d    1    4  1  w
e1    e    1    5  1  b
f1    f    1    6  1  w

Now we add this information to the dfmoves data frame and calculate some fields
to know how to draw the curves (see here for more details).

dfpaths <- dfmoves %>%
  left_join(dfboard %>% rename(from = cell, x.from = x, y.from = y),
            by = "from") %>%
  left_join(dfboard %>% rename(to = cell, x.to = x, y.to = y) %>% select(-cc, -col, -row),
            by = "to") %>%
  mutate(x_gt_y = abs(x.to - x.from) > abs(y.to - y.from),
         xy_sign = sign((x.to - x.from)*(y.to - y.from)) == 1,
         x_gt_y_equal_xy_sign = x_gt_y == xy_sign)

The data is ready! Now we need some ggplot2: geom_tile for the board, the new geom_curve
to represent the pieces' paths, and some jitter to make this more artistic. Let's
plot the f1 Bishop's movements.

ggplot() +
  geom_tile(data = dfboard, aes(x, y, fill = cc)) +
  geom_curve(data = dfpaths %>% filter(piece == "f1 Bishop", x_gt_y_equal_xy_sign),
             aes(x = x.from, y = y.from, xend = x.to, yend = y.to),
             position = position_jitter(width = 0.2, height = 0.2),
             curvature = 0.50, angle = -45, alpha = 0.02, color = "white", size = 1.05) +
  geom_curve(data = dfpaths %>% filter(piece == "f1 Bishop", !x_gt_y_equal_xy_sign),
             aes(x = x.from, y = y.from, xend = x.to, yend = y.to),
             position = position_jitter(width = 0.2, height = 0.2),
             curvature = -0.50, angle = 45, alpha = 0.02, color = "white", size = 1.05) +
  scale_fill_manual(values =  c("gray10", "gray20")) +
  ggtitle("f1 Bishop") +
  coord_equal()

In the same way we can plot every piece.
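The code for that figure is not shown here, but one way to sketch it, reusing the dfpaths and dfboard objects from above, is to facet by piece instead of filtering (an approximation, not necessarily the code behind the original figure):

# All pieces at once: same layers as before, faceted by piece (sketch)
ggplot() +
  geom_tile(data = dfboard, aes(x, y, fill = cc)) +
  geom_curve(data = dfpaths %>% filter(!is.na(to), x_gt_y_equal_xy_sign),
             aes(x = x.from, y = y.from, xend = x.to, yend = y.to),
             position = position_jitter(width = 0.2, height = 0.2),
             curvature = 0.50, angle = -45, alpha = 0.02, color = "white", size = 1.05) +
  geom_curve(data = dfpaths %>% filter(!is.na(to), !x_gt_y_equal_xy_sign),
             aes(x = x.from, y = y.from, xend = x.to, yend = y.to),
             position = position_jitter(width = 0.2, height = 0.2),
             curvature = -0.50, angle = 45, alpha = 0.02, color = "white", size = 1.05) +
  scale_fill_manual(values = c("gray10", "gray20")) +
  facet_wrap(~piece, ncol = 8) +
  coord_equal()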

[Plot: movement paths for every piece]

I think it looks very similar to the original work by Steve Tung.

Survival Rates

For this plot we need to filter dfmoves by !is.na(status) so we know what happened to
every piece at the end of the game: whether it was captured or not. Then we summarize
across all the games.

dfsurvrates <- dfmoves %>%
  filter(!is.na(status)) %>%
  group_by(piece) %>%
  summarize(games = n(),
            was_captured = sum(status == "captured")) %>%
  mutate(surv_rate = 1 - was_captured/games)

dfsurvrates %>% arrange(desc(surv_rate)) %>% head()
piece       games  was_captured  surv_rate
Black King    433             0      1.000
White King    433             0      1.000
h2 Pawn       433           121      0.721
h7 Pawn       433           148      0.658
g2 Pawn       433           150      0.654
g7 Pawn       433           160      0.630

This serves as validation, because the kings are never captured. Now we use a helper function in the
rchess package, rchess:::.chesspiecedata(), to get the start position of every piece and then plot
the survival rates on the squares where the pieces start the game.

dfsurvrates <- dfsurvrates %>%
  left_join(rchess:::.chesspiecedata() %>% select(start_position, piece = name, color, unicode),
            by = "piece") %>%
  full_join(dfboard %>% rename(start_position = cell),
            by = "start_position")

# Auxiliary data to plot the board
dfboard2 <- data_frame(x = 0:8 + 0.5, y = 0 + 0.5, xend = 0:8 + 0.5, yend = 8 + 0.5)

ggplot(dfsurvrates) +
  geom_tile(data = dfsurvrates %>% filter(!is.na(surv_rate)),
            aes(x, y, fill = surv_rate)) +
  scale_fill_gradient(low = "darkred",  high = "white") +
  geom_text(data = dfsurvrates %>% filter(!is.na(surv_rate)),
            aes(x, y, label = scales::percent(surv_rate)),
            color = "gray70", size = 5) +
  scale_x_continuous(breaks = 1:8, labels = letters[1:8]) +
  scale_y_continuous(breaks = 1:8, labels = 1:8)  +
  geom_segment(data = dfboard2, aes(x, y, xend = xend, yend = yend), color = "gray70") +
  geom_segment(data = dfboard2, aes(y, x, xend = yend, yend = xend), color = "gray70") +
  ggtitle("Survival Rates for each piece") + 
  coord_equal() + 
  theme_minimal() +
  theme(legend.position = "none")

[Plot: survival rates for each piece]

Obviously the plot shows the same data as both text and color, and there is a lot of space without
information, but the idea is to use the chess board to represent the starting position of a chess game.

We can replace the text labels with the pieces' icons:

ggplot(dfsurvrates) +
  geom_tile(data = dfsurvrates %>% filter(!is.na(surv_rate)),
            aes(x, y, fill = 100*surv_rate)) +
  scale_fill_gradient(NULL, low = "darkred",  high = "white") +
  geom_text(data = dfsurvrates %>% filter(!is.na(surv_rate)),
            aes(x, y, label = unicode), size = 11, color = "gray20", alpha = 0.7) +
  scale_x_continuous(breaks = 1:8, labels = letters[1:8]) +
  scale_y_continuous(breaks = 1:8, labels = 1:8)  +
  geom_segment(data = dfboard2, aes(x, y, xend = xend, yend = yend), color = "gray70") +
  geom_segment(data = dfboard2, aes(y, x, xend = yend, yend = xend), color = "gray70") +
  ggtitle("Survival Rates for each piece") + 
  coord_equal() +
  theme_minimal() +
  theme(legend.position = "bottom")

[Plot: survival rates shown with the piece icons]

Square Usage By Player

For this visualization we will use the to variable. First of all we select the players
with the most games (as White) in the chesswc table. Then for each of them we count the to squares.

players <- chesswc %>% count(white) %>% arrange(desc(n)) %>% .$white %>% head(4)
players
## [1] "Karjakin, Sergey" "Svidler, Peter"   "Wei, Yi"         
## [4] "Adams, Michael"
dfmov_players <- ldply(players, function(p){ # p <- sample(players, size = 1)
  games <- chesswc %>% filter(white == p) %>% .$game_id
  dfres <- dfmoves %>%
    filter(game_id %in% games, !is.na(to)) %>%
    count(to) %>%
    mutate(player = p,
           p = n/length(games))
  dfres
})

dfmov_players <- dfmov_players %>%
  rename(cell = to) %>%
  left_join(dfboard, by = "cell")

ggplot(dfmov_players) +
  geom_tile(aes(x, row, fill = p)) +
  scale_fill_gradient("Movements to every celln(normalized by number of games)",
                      low = "white",  high = "darkblue") +
  geom_text(aes(x, row, label = round(p, 1)), size = 3, color = "white", alpha = 0.5) +
  facet_wrap(~player) +
  scale_x_continuous(breaks = 1:8, labels = letters[1:8]) +
  scale_y_continuous(breaks = 1:8, labels = 1:8)  +
  geom_segment(data = dfboard2, aes(x, y, xend = xend, yend = yend), color = "gray70") +
  geom_segment(data = dfboard2, aes(y, x, xend = yend, yend = xend), color = "gray70") +
  coord_equal() +
  theme_minimal() +
  theme(legend.position = "bottom")

[Plot: square usage by player]

Distributions For The First Movement

Now, with the same data and using the piece_number_move and number_move we can obtain
the distribution for the first movement for each piece.

piece_lvls <- rchess:::.chesspiecedata() %>%
  mutate(col = str_extract(start_position, "\\w{1}"),
         row = str_extract(start_position, "\\d{1}")) %>%
  arrange(desc(row), col) %>%
  .$name

dfmoves_first_mvm <- dfmoves %>%
  mutate(piece = factor(piece, levels = piece_lvls),
         number_move_2 = ifelse(number_move %% 2 == 0, number_move/2, (number_move + 1)/2 )) %>%
  filter(piece_number_move == 1)
ggplot(dfmoves_first_mvm) +
  geom_density(aes(number_move_2), fill = "#B71C1C", alpha = 0.8, color = NA) +
  scale_y_continuous(breaks = NULL) +
  facet_wrap(~piece, nrow = 4, ncol = 8, scales = "free_y")  +
  xlab("Density") + ylab("Number Move") + 
  xlim(0, 40) +
  theme_gray() +
  theme(panel.background = element_rect(fill = "gray90"))

[Plot: distribution of the first movement for each piece]

Notice the similarities between the White King and the h1 Rook due to castling; the same
effect is present between the Black King and the h8 Rook.
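
A quick numeric check of this (a small sketch on top of dfmoves_first_mvm; output omitted) is to compare the typical move on which these pieces make their first move:

# Castling moves the king and the rook together, so their first moves should cluster
dfmoves_first_mvm %>%
  filter(piece %in% c("White King", "h1 Rook", "Black King", "h8 Rook")) %>%
  group_by(piece) %>%
  summarize(median_first_move = median(number_move_2, na.rm = TRUE))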

Who Captures Whom

For this plot we'll use the igraph package and the ForceAtlas2
package, an R implementation by Adolfo Alvarez of the Force Atlas 2 graph layout
designed for Gephi.

We get the rows with status == "captured" and count them by the piece and captured_by variables. The resulting data
frame will be the edges of our igraph object, built with the graph.data.frame function.

library("igraph")
library("ForceAtlas2")

dfcaptures <- dfmoves %>%
  filter(status == "captured") %>%
  count(captured_by, piece) %>%
  ungroup() %>% 
  arrange(desc(n))

dfvertices <- rchess:::.chesspiecedata() %>%
  select(-fen, -start_position) %>%
  mutate(name2 = str_replace(name, " \\w+$", unicode),
         name2 = str_replace(name2, "White|Black", ""))

g <- graph.data.frame(dfcaptures %>% select(captured_by, piece, weight = n),
                      directed = TRUE,
                      vertices = dfvertices)

set.seed(123)
# lout <- layout.kamada.kawai(g)
lout <- layout.forceatlas2(g, iterations = 10000, plotstep = 0)

dfvertices <- dfvertices %>%
  mutate(x = lout[, 1],
         y = lout[, 2])

dfedges <- as_data_frame(g, "edges") %>%
  tbl_df() %>%
  left_join(dfvertices %>% select(from = name, x, y), by = "from") %>%
  left_join(dfvertices %>% select(to = name, xend = x, yend = y), by = "to")

To plot the network I prefer to use ggplot2 instead of igraph, simply because you get more control over the style
and colors.

ggplot() +
  geom_curve(data = dfedges %>%
               filter((str_extract(from, "\\d+") %in% c(1, 2) |
                         str_detect(from, "White"))),
             aes(x, y, xend = xend, yend = yend, alpha = weight, size = weight),
             curvature = 0.1, color = "red") +
  geom_curve(data = dfedges %>%
               filter(!(str_extract(from, "\\d+") %in% c(1, 2) |
                          str_detect(from, "White"))),
             aes(x, y, xend = xend, yend = yend, alpha = weight, size = weight),
             curvature = 0.1, color = "blue") +
  scale_alpha(range = c(0.01, 0.5)) +
  scale_size(range = c(0.01, 2)) +
  geom_point(data = dfvertices, aes(x, y, color = color), size = 15, alpha = 0.9) +
  scale_color_manual(values = c("gray10", "gray90")) +
  geom_text(data = dfvertices %>% filter(str_length(name2) != 1),
            aes(x, y, label = name2), size = 5, color = "gray50") +
  geom_text(data = dfvertices %>% filter(str_length(name2) == 1),
            aes(x, y, label = name2), size = 9, color = "gray50") +
  ggtitle("Red: white captures black | Blue: black captures white")

[Plot: who captures whom, as a network]

It's known that we usually exchange pieces of the same value: queen for queen, knight for bishop, etc. The interesting
fact we see here is the d2 pawn/c7 pawn/g1 knight relationship, because d2 pawn/c7 pawn is not so symmetrical, and
it's explained by the popularity of the
Sicilian Defence
at master level (1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4).

I hope you enjoyed this post in the same way I enjoyed doing it :D. If you notice a mistake please let me know.

To leave a comment for the author, please follow the link and comment on their blog: Jkunst – R category .


Source:: R News

Shiny Developer Conference | Stanford University | January 2016

By Joe Cheng

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

In the three years since we launched Shiny, our focus has been on helping people get started with Shiny. But there’s a huge difference between using Shiny and using it well, and we want to start getting serious about helping people use Shiny most effectively. It’s the difference between having apps that merely work, and apps that are performant, robust, and maintainable.

That’s why RStudio is thrilled to announce the first ever Shiny Developer Conference, to be held at Stanford University on January 30-31, 2016, three months from today. We’ll skip past the basics, and dig into principles and practices that will simultaneously simplify and improve the robustness of your code. We’ll introduce you to some brand new tools we’ve created to help you build ever larger and more complex apps. And we’ll show you what to do if things go wrong.

Check out the agenda to see the complete lineup of speakers and talks.

We’re capping the conference at just 90 people, so if you’d like to level up your Shiny skills, register now at http://shiny2016.eventbrite.com.

Hope to see you there!


Note that this conference is intended for R users who are already comfortable writing Shiny apps. We won’t cover the basics of Shiny app creation at all. If you’re looking to get started with Shiny, please see our tutorial.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Source:: R News

R and Impala: it’s better to KISS than using Java

By Gergely Daróczi


(This article was first published on Data Science Los Angeles » R, and kindly contributed to R-bloggers)

One of the things I like best about working at CARD.com is that I am not only crunching R code 24/7, but I also have the chance to interact with and improve the related data infrastructure using some interesting technologies.

After joining the company in January, I soon realized that while Impala is a very powerful database for handling data that does not comfortably fit in MySQL, it's still not as fast as one might expect when querying large amounts of data from R. Sometimes I had to wait several minutes for a query to run! So I used this spare time to think about how to improve the workflow.

Interacting with Impala from R is pretty straightforward: just install and load the RImpala package, which uses the JDBC driver to communicate with Impala. It does the job very well when fetching aggregated data from the database, but it gets extremely slow when loading more than a thousand or so rows, and that is not something you can resolve by throwing more hardware at the problem.

When loading larger amounts of data, the related R process runs with 100% CPU usage on one core, while the very same query run from bash via impala-shell returns the results pretty fast. Why not export the data to a CSV file via impala-shell then?

TL;DR: loading data into/from Impala via an intermediary CSV file may perform a lot better compared to using the JDBC driver.

Benchmark

To compare the performance of the two approaches in a reproducible way, I started an Elastic MapReduce cluster on AWS with a single m3.xlarge instance running AMI version 3.8.0 with Impala 1.2.4 and R already pre-installed. Then I downloaded the dbgen utility to generate some data for the benchmarks, as described in the Amazon EMR docs:

$ mkdir test; cd test
$ wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar
$ java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /tmp/dbgen -b 1 -c 0 -t 0

Then put the generated pipe-delimited data files on HDFS:

$ hadoop fs -mkdir /data/
$ hadoop fs -put /tmp/dbgen/* /data/
$ hadoop fs -ls -h -R /data/

And load the data into Impala (not dealing with transforming the data into the Parquet file format or other tweaks, as for now we are comparing the data transfer speed of the connectors):

CREATE EXTERNAL TABLE transactions (
    id BIGINT, customer_id BIGINT, book_id BIGINT,
    quantity INT, transaction_date TIMESTAMP)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  LOCATION '/data/transactions/';

Now we need to install the RImpala package, then download and extract the zip of JDBC jars as described on the package GitHub page:

> install.packages('RImpala')
> download.file('https://github.com/Mu-Sigma/RImpala/blob/master/impala-jdbc-cdh5.zip?raw=true', 'jdbc.zip', method = 'curl', extra = '-L')
> unzip('jdbc.zip')

And all set to interact with the database after initializing the connection to Impala:

> library(RImpala)
Loading required package: rJava
> rimpala.init(libs = 'impala-jdbc-cdh5')
[1] "Classpath added successfully"
> rimpala.connect()
[1] TRUE
> rimpala.query('select count(1) from books')
  count(1)
1 15969976

Now let’s see what happens when we load 10, 100, 1000 or let’s say 10K rows:

> install.packages('microbenchmark'); library(microbenchmark)
> microbenchmark(
+     l1e1 = rimpala.query('select * from books limit 10'),
+     l1e2 = rimpala.query('select * from books limit 100'),
+     l1e3 = rimpala.query('select * from books limit 1000'),
+     l1e4 = rimpala.query('select * from books limit 10000'), times = 10)
Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
 l1e1   373.8289   391.8877   393.5581   392.6247   400.7414   409.0315    10
 l1e2   526.9520   534.7739   547.1202   543.5515   551.1544   578.5519    10
 l1e3  1236.0779  1872.6034  1887.2793  1948.8703  2125.3348  2222.2258    10
 l1e4 17801.2652 23253.7269 24784.2390 25836.3453 26850.4088 27717.6926    10

Almost 30 seconds to fetch 10K rows! Things are getting very slow, right? So let's create a minimal working R implementation of the proposed method: export the results from Impala to a CSV file and load that via data.table::fread (chosen for its read performance, and because I am using data.table in most of my R scripts anyway):

> library(data.table)
> query_impala <- function(query) {
+
+     ## generate a temporary file name
+     fn <- tempfile()
+
+     ## remove after read
+     on.exit(unlink(fn))
+
+     ## export results to this file
+     system(paste(
+         'impala-shell -B --quiet -q',
+         shQuote(query),
+         '-o', fn,
+         '"--output_delimiter=,"',
+         '--print_header > /dev/null'))
+ 
+     ## read (and return) data like a pro
+     fread(fn)
+ 
+ }

Well, this function is extremely basic and can work only on localhost. For a more general approach with SSH access to remote databases, logging features and a bit of error handling, please see the updated query_impala function referenced at the end of this post.

But this simple function is fair enough for some quick benchmarks on how JDBC and the CSV export/import hack perform with different numbers of rows fetched from Impala. Let's run a loop to load 1, 10, 100, 1K, 10K and 100K values from the database via the two methods, each repeated 10 times for later comparison:

> benchmarks <- lapply(10^(0:5), function(limit) {
+     query <- paste('select id from books limit', limit + 1)
+     res <- microbenchmark(
+         rimpala  = rimpala.query(query),
+         csv_hack = query_impala(query),
+         times = 10)
+     res$limit <- limit
+     res
+ })

And let’s transform the results of the benchmarks to an aggregated data.table, and plot the averages on a joint (log-scaled) graph:

> df <- rbindlist(benchmarks)[, .(time = mean(time)), by = .(expr, limit)]
> library(ggplot2)
> ggplot(df, aes(x = limit, y = time/1e9, color = expr)) + geom_line() +
+     scale_y_log10(breaks = c(0.5, 1, 5, 15, 60, 60*5, 15*60)) +
+     scale_x_log10(breaks = 10^(0:5),
+                   labels = c(1, 10, 100, '1K', '10K', '100K')) +
+     xlab('Number of rows') + ylab('Seconds') +
+     theme_bw() + theme('legend.position' = 'top')

Unfortunately, I did not have the patience to run this benchmark on more rows or columns, but this is already rather impressive in my (personal) opinion. In short, if you are querying more than 100 rows from Impala and you have (SSH) console access to the server, you'd better use the CSV export instead of waiting for the JDBC driver to deliver the data for you.

Quick comparison of the CSV export and the RImpala approach

Please find this quick comparison of the discussed methods for fetching data from Impala to R:

RImpala
  • Advantages: can connect to a remote database without SSH access; available on CRAN
  • Disadvantages: slow when querying many rows; Java dependency on the client side; 20 megabytes of jar files for the driver

CSV export
  • Advantages: scales nicely; no Java and jar dependencies, it's damn simple
  • Disadvantages: needs SSH access for remote connections; not on CRAN (yet)

Second thoughts

So why is this happening? I was not 100% sure, but I suspected it must be something with the jar files or how they are used in R. The query takes the very same amount of time inside Impala whether you export the data into a CSV file or pass it via the JDBC driver, but parsing and loading it takes extremely long with the latter.

When I mentioned these interesting results to Szilard over lunch, he suggested I try querying Impala directly with the RJDBC package. It sounded pretty insane to use a different R wrapper around the very same jar files the RImpala package uses, but I decided to do this extra test after all to make sure it was a Java issue and not an R implementation issue, in line with my proposal of keeping things simple (KISS) over using Java.

So I unzipped all the jar files used by the RImpala package above and created a new archive containing the merged content, in a file named impala-jdbc-driver.jar. Then I loaded the RJDBC package and initialized a connection:

> library(RJDBC)
> drv  <- JDBC("org.apache.hive.jdbc.HiveDriver", "impala-jdbc-driver.jar", "'")
> conn <- dbConnect(drv, "jdbc:hive2://localhost:21050/;auth=noSasl")

Then we can use the very convenient dbGetQuery method from the DBI package to fetch rows from Impala, with the following impressive results:
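
The benchmark figure itself did not survive this copy of the post, but the comparison can be re-run along the same lines as before, reusing the query_impala helper and the conn object defined above (a sketch; the timings are not reproduced here):

> microbenchmark(
+     rjdbc    = dbGetQuery(conn, 'select id from books limit 10000'),
+     csv_hack = query_impala('select id from books limit 10000'),
+     times = 10)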

So my hypothesis turned out to be wrong! The JDBC driver performs pretty well, even better than the CSV hack. I was even tempted to revert our production R scripts to use JDBC instead of the temporary-file-based function below to read data from Impala, but decided to keep the non-Java approach for multiple reasons after all:

  • No Java dependency on the server. Storage is cheap nowadays, and I've already created a minimal Docker image including R, Java and rJava, but do we really need that 60% increase in the Docker image size? Compare the size of the minimal R Docker image with that of the R+Java image.
  • No memory issues when loading large amounts of data. By default, the rJava package starts the JVM with 512 MB of memory, which might not be enough for some queries, so you have to update this default value via eg options(java.parameters = '-Xmx1024m') before loading the rJava package.
  • I prefer using SSH to access the data even if SSL encryption is available with the JDBC driver as well. This might sound silly, but managing users and authentication methods can be a lot easier via traditional Linux users/groups compared to Impala, especially with older CDH releases. Not speaking about in-database permissions here, of course.
  • Although JDBC can be perfect for reading data from Impala, writing to the database might be a nightmare. I am not aware of any bulk import feature via the JDBC driver, and separate INSERT statements are extremely slow. So instead of preparing SQL statements, I prefer creating an intermediary dump file to be imported by Impala on the command line — via a helper R function that does all these steps automatically. I did not prepare any benchmarks on this, but believe me, it’s a LOT faster. The same also stands for eg Redshift, where loading data from S3 or remote hosts via SSH and using COPY FROM instead of INSERT statements can result in multiple orders of magnitude speedup. This hack seems to be used by the Spark folks as well.
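
As an illustration of that last point, here is a minimal sketch of such a write helper (a hypothetical function, not the one we use in production; it assumes a pipe-delimited external table like the transactions table created above, and that hadoop and impala-shell are available on the same host):

write_impala <- function(df, table, hdfs_dir) {
    ## dump the data frame to a temporary pipe-delimited file
    fn <- tempfile(fileext = '.csv')
    on.exit(unlink(fn))
    write.table(df, fn, sep = '|', quote = FALSE,
                row.names = FALSE, col.names = FALSE)
    ## push the file into the HDFS directory backing the external table
    system(paste('hadoop fs -put', fn, hdfs_dir))
    ## make Impala pick up the new data file
    system(paste('impala-shell -q', shQuote(paste('refresh', table))))
}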

Proof-of-concept demo function to use intermediary CSV files to export data from Impala

If you find this useful, please see the function below, which automates the required steps of using an intermediary file instead of JDBC to load data from Impala:

  • connect to a remote host via SSH
  • create a temporary CSV file on the remote host
  • dump the results of the Impala query to the CSV file
  • copy the CSV to a local file
  • read the content of the CSV
  • clean up the local and remote workspaces.
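
The gist embedding the full function is no longer available in this copy of the post, so here is a rough sketch of the steps listed above (a hypothetical helper, assuming password-less SSH access to the host running impala-shell and scp available locally):

query_impala_remote <- function(query, host) {
    ## temporary CSV file names on the remote host and locally
    fn_remote <- paste0('/tmp/', basename(tempfile(fileext = '.csv')))
    fn_local  <- tempfile(fileext = '.csv')
    on.exit(unlink(fn_local))
    ## dump the query results to a CSV file on the remote host
    system(paste('ssh', host, shQuote(paste(
        'impala-shell -B --quiet -q', shQuote(query),
        '-o', fn_remote, '"--output_delimiter=,"', '--print_header'))))
    ## copy the file to the local machine, then clean up the remote host
    system(paste0('scp -q ', host, ':', fn_remote, ' ', fn_local))
    system(paste('ssh', host, shQuote(paste('rm -f', fn_remote))))
    ## read (and return) the data
    fread(fn_local)
}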

Comments and any kind of feedback are highly welcome!


Source:: R News

Multiple legends for the same aesthetic

By Nicola Sturaro Sommacal

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.


Source:: R News

Our new R package

By Gianluca Baio

(This article was first published on Gianluca Baio’s blog, and kindly contributed to R-bloggers)

As part of the work she’s doing for her PhD, Christina has done some (fairly major, I’d say!) review of the literature about prevalence studies on PCOS $-$ that’s a rather serious, albeit probably fair to say quite under-researched area.

When it came to analysing the data she had collected, naturally I directed her towards doing some Bayesian modelling. In many cases, these are not too complicated $-$ often the outcome is binary and so “fixed” or “random” effect models are fairly simple to structure and run. One interesting point was that, because there often wasn’t very good or comprehensive evidence, setting up the model using some reasonable (and, crucially, fairly easy to elicit from clinicians) prior information did help in obtaining more stable estimates.

So, because we (she) have spent quite a lot of time working on this, I thought it would be good to structure all this into an R package. All of our models are actually run using JAGS as interfaced via the R2jags package and, I think, the nice idea is that in R the user can specify the kind of model they want to use. Our package, which incidentally is called bmeta, then builds a suitable model file for the selected assumptions in terms of outcome data and priors, and runs it via R2jags. The model file that is generated is automatically saved on the user's computer and can then be re-used as a template or modified as necessary (eg to include different priors or more complex structures).

Currently, Christina has implemented 22 models (ie combinations of data model and prior, including variations of fixed vs random effects) and in the package we have also implemented several graphical diagnostics, including:

  • forest plots to visualise the level of pooling of the data
  • funnel plots to examine publication bias
  • diagnostics plots to examine convergence of the underlying MCMC algorithm
The package will be on CRAN in the next couple of days, but it's already downloadable from this webpage. We'll also put up a more structured manual/guide shortly.

To leave a comment for the author, please follow the link and comment on their blog: Gianluca Baio’s blog.


Source:: R News

Instrumental Variables

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

We all “know” that correlation does not imply causation, that unmeasured and unknown factors can confound a seemingly obvious inference. But, who has not been tempted by the seductive quality of strong correlations?

Fortunately, it is also well known that a well done randomized experiment can account for the unknown confounders and permit valid causal inferences. But what can you do when it is impractical, impossible or unethical to conduct a randomized experiment? (For example, we wouldn’t want to ask a randomly assigned cohort of people to go through life with less education to prove that education matters.) One way of coping with confounders when randomization is infeasible is to introduce what Economists call instrumental variables. This is a devilishly clever and apparently fragile notion that takes some effort to wrap one’s head around.

On Tuesday October 20th, we at the Bay Area useR Group (BARUG) had the good fortune to have Hyunseung Kang describe the work that he and his colleagues at the Wharton School have been doing to extend the usefulness of instrumental variables. Hyunseung's talk started with elementary notions, like explaining the effectiveness of randomized experiments, described the essential idea of instrumental variables, and developed the background necessary for understanding the new results in this area. The slides from Hyunseung's talk are available for download in two parts from the BARUG website. As with most presentations, these slides are little more than the mute residue of the talk itself. Nevertheless, Hyunseung makes such imaginative use of animation and build slides that the deck is worth working through.

The following slide from Hyunseung’s presentation captures the essence of the instrumental approach.

The general idea is that one or more variables, the instruments, are added to the model for the purpose of inducing randomness into the outcome. This has to be done in a way that conforms with the three assumptions mentioned in the figure. The first assumption, A1, is that the instrument variables are relevant to the process. The second assumption, A2, states that randomness is only induced into the exposure variables and not also into the outcome. The third assumption, A3, is a strong one: there are no unmeasured confounders. The claim is that if these three assumptions are met then causal effects can be estimated with coefficients for the exposure variables that are consistent and asymptotically unbiased.

In the education example developed by Hyunseung, the instrumental variables are the subject’s proximity to 2 year and 4 year colleges. Here is where the “rubber meets the road” so to speak. Assessing the relevancy of the instrumental variables and interpreting their effects are subject to the kinds of difficulties described by Andrew Gelman in his post of a few years back.

In the second part of his presentation Hyunseung presents new work: (1) two methods that provide robust confidence intervals when assumption A1 is violated, (2) a method for implementing a sensitivity analysis to assess the sensitivity of an instrumental variable model to violations of assumptions A2 and A3, and (3) the R package ivmodel that ties it all together.

[Slide: summary of the contributions by Kang et al.]

To delve even deeper into this topic have a look at the paper: Instrumental Variables Estimation With Some Invalid Instruments and its Application to Mendelian Randomization.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

roxygen2 5.0.0

By hadleywickham

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

roxygen2 5.0.0 is now available on CRAN. roxygen2 helps you document your packages by turning specially formatted inline comments into R's standard Rd format. Learn more at http://r-pkgs.had.co.nz/man.html.

In this release:

  • Roxygen records its version in a single place: the RoxygenNote field in your DESCRIPTION. This should make it easier to see what’s changed when you upgrade roxygen2, because only files with differences will be modified. Previously every Rd file was modified to update the version number.
  • You can now easily document functions that you’ve imported from another package:
    #' @importFrom magrittr %>%
    #' @export
    magrittr::`%>%`

    All imported-and-re-exported functions will be documented in the same file (rexports.Rd), with a brief description and links to the original documentation.

  • You can more easily generate package documentation by documenting the special string “_PACKAGE“:
    #' @details Details
    "_PACKAGE" 

    The title and description will be automatically filled in from the DESCRIPTION.

  • New tags @rawRd and @rawNamespace allow you to insert raw (unescaped) text in Rd and the NAMESPACE. @evalRd() is similar, but instead of literal Rd, you give it R code that produces literal Rd code when run. This should make it easier to experiment with new types of output (a small sketch follows this list).
  • Roxygen2 now parses the source code files in the order specified in the Collate field in DESCRIPTION. This improves the ordering of the generated documentation when using @describeIn and/or @rdname split across several .R files, as often happens when working with S4.
  • The parser has been completely rewritten in C++. This gives a nice performance boost and improves the error messages: you now get the line number of the tag, not of the start of the block.
  • @family now cross-links each manual page only once, instead of linking to all aliases.
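
As a small sketch of the @evalRd idea (this example is not from the release notes; it simply illustrates generating Rd from R code at documentation time, with a made-up round2 function):

#' Round to two decimal places
#'
#' @param x A numeric vector.
#' @evalRd paste0("\\note{Documentation generated on ", Sys.Date(), ".}")
#' @export
round2 <- function(x) round(x, 2)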

There were many other minor improvements and bug fixes; please see the release notes for a complete list. A big thanks goes to all the contributors who made this release possible.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Source:: R News

R Articles at Simple Talk

By C

(This article was first published on R-Chart, and kindly contributed to R-bloggers)


Two articles about R posted at the Simple Talk blog in the last few weeks:
To leave a comment for the author, please follow the link and comment on their blog: R-Chart.


Source:: R News

Mango at RBelgium: Analytical Web Services

By Mango Blogger

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Last week Mango Solutions’ Principal Consultant, Stephanie Locke, presented Analytical Web Services for the RBelgium user group. Check out the full recording of her presentation here:

Analytical Web Services : In your day job, you might build some awesome bits of analysis in R; but now people want this information available real time and against everything that comes into the business. Help! What you need is an R web service, but not being a developer you have no clue how to go about it. This session takes you through how to convert your analysis into a web service and how to take into account important stuff like security, scalability and error handling!

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Source:: R News

Statistical Graphics and Visualization course materials

By civilstat


(This article was first published on Civil Statistician » R, and kindly contributed to R-bloggers)

I’ve just finished teaching the Fall 2015 session of 36-721, Statistical Graphics and Visualization. Again, it is a half-semester course designed primarily for students in the MSP program (Masters of Statistical Practice) in the CMU statistics department. I’m pleased that we also had a large number of students from other departments taking this as an elective.

For software we used mostly R (base graphics, ggplot2, and Shiny). But we also spent some time on Tableau, Inkscape, D3, and GGobi.

We covered a LOT of ground. At each point I tried to hammer home the importance of legible, comprehensible graphics that respect human visual perception.

Remaking pie charts is a rite of passage for statistical graphics students

My course materials are below. Not all the slides are designed to stand alone, but I have no time to remake them right now. I’ll post some reflections separately.

Download all materials as a ZIP file (38 MB), or browse individual files:

Please note:

  • The examples, papers, blogs and researchers linked here are just scratching the surface. I meant no offense to anyone left out. I’ve simply tried to link to blogs, Twitter, and researchers’ websites that are actively updated.
  • I have tried my best to include attribution, citations, and links for all images (besides my own) in the lecture slides. Same for datasets in the R code. Wherever I use scans from a book, I have contacted the authors and do so with their approval (Alberto Cairo, Di Cook, Mark Monmonier, Colin Ware, & Robin Williams). However, if you are the creator or copyright holder of any images here and want them removed or the attribution revised, please let me know and I will comply.
  • Most of the cited books have an Amazon Associates link. If you follow these links and buy something during that visit, I get a small advertising fee (in the form of an Amazon gift card). Each year so far, these fees have totaled under $100. I just spend it on more dataviz books.
To leave a comment for the author, please follow the link and comment on their blog: Civil Statistician » R.


Source:: R News