Alleviating AWS Athena Aggravation with Asynchronous Assistance

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare a dplyr chain, fire off a collect() and then wait.

And, wait.

And, wait.

And, wait.

Queries that take significant processing time or have large result sets do not play nicely with the provided ODBC and JDBC drivers. This means “hung” R sessions and severe frustration, especially when you can log in to the AWS Athena console and see that the results are right there!!

I’ve been crafting SQL by hand or using sql_render() to avoid this (when I remember to) but finally felt sufficient frustration to craft a better way, provided you can install and run rJava-based code (it’s 2018 and that still is not an easy given on many systems, unfortunately).

There are two functions below:

  • collect_async(), and
  • gather_results()

The collect_async() function is designed to be used like collect() but uses Athena components from the AWS SDK for Java to execute the SQL query behind the dplyr chain asynchronously. The companion function gather_results() takes the object created by collect_async() and checks to see if the results are ready. If they are, it will use the aws.s3 package to download them. Personally, I’d just aws s3 sync ... from the command line rather than use the aws.s3 package, but that’s not everyone’s cup of tea.

Once I figure out the best package API for this I’ll add it to the metis package. There are many AWS idiosyncrasies that need to be accounted for and I’d rather ship this current set of functions via the blog so folks can use it (and tweak it to their needs) before waiting for perfection.

Here’s the code:

library(rJava)
library(awsjavasdk)
library(aws.signature)
library(aws.s3)
library(odbc)
library(tidyverse)
library(dbplyr)

#' Collect Amazon Athena query results asynchronously
#' 
#' Long running Athena queries and Athena queries with large result
#' sets can seriously stall a `dplyr` processing chain due to poorly
#' implemented ODBC and JDBC drivers. The AWS SDK for Athena has 
#' methods that support submitting a query asynchronously for "batch"
#' processing. All Athena results are stored in CSV files in S3 and it's
#' easy to use the R `aws.s3` package to grab these or perform an
#' `aws s3 sync ...` operation on the command line.
#' 
#' @md
#' @param obj the `dplyr` chain
#' @param schema Athena schema (usually matches the `Schema` parameter to the 
#'        Simba ODBC connection)
#' @param region Your AWS region. All lower case with dashes (usually matches
#'        the `AwsRegion` parameter to the Simba ODBC connection)
#' @param results_bucket the S3 results bucket where query results are stored 
#'        (usually matches the `S3OutputLocation` parameter to the Simba ODBC
#'        connection)
#' @return a `list` with the query execution ID and the S3 bucket. This object
#'         is designed to be passed to the companion `gather_results()` if you
#'         want to use the `aws.s3` package to retrieve the results. Otherwise,
#'         sync the file however you want using the query execution id.
#' @note You may need to change up the authentication provider depending on how 
#'       you use credentials with Athena
collect_async <- function(obj, schema, region, results_bucket) {
  # body not reproduced here: it renders the dplyr chain to SQL (sql_render()),
  # submits the query to Athena asynchronously via the AWS SDK for Java (rJava),
  # and returns a list holding the query execution id and the results bucket
}

#' Gather the results of an asynchronous Athena query
#'
#' @md
#' @param async_result the result of a call to `collect_async()`
#' @return a data frame (tibble) or `NULL` if the query results are not ready yet
gather_results <- function(async_result) {
  # body not reproduced here: it checks S3 for the query's results CSV and,
  # if present, downloads and reads it via the `aws.s3` package
}
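To give a feel for the approach, here is a minimal sketch of what such a gather_results() can look like, using aws.s3 plus readr. This is my illustration, not hrbrmstr's implementation; it relies on Athena writing the result set as "query-execution-id.csv" inside the results bucket and skips the many AWS idiosyncrasies mentioned above:

gather_results_sketch <- function(async_result) {
  # Athena drops the result set into the results bucket as "<qex_id>.csv"
  csv_key <- sprintf("%s.csv", async_result$qex_id)
  bucket  <- sub("^s3://", "", async_result$results_bucket)  # strip the s3:// prefix to get the bare bucket name

  # not ready yet? return NULL, just like gather_results() above
  if (!aws.s3::object_exists(object = csv_key, bucket = bucket)) return(NULL)

  tmp <- tempfile(fileext = ".csv")
  aws.s3::save_object(object = csv_key, bucket = bucket, file = tmp)
  readr::read_csv(tmp)
}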

Now, we give it a go:

# Setup the credentials you're using
use_credentials("personal")

# load the AWS Java SDK classes
awsjavasdk::load_sdk()

# necessary for Simba ODBC and the async query ops
aws_region <- "us-west-2"          # placeholder; use the AwsRegion of your Athena setup
athena_schema <- "sampledb"        # the Athena sample database
athena_results_bucket <- "s3://aws-athena-query-results-redacted"

con <- DBI::dbConnect(odbc::odbc(), ...)  # your Simba Athena ODBC DSN / connection parameters

# the sample table in the sample db/schema
elb_logs <- tbl(con, "elb_logs")

elb_logs %>%
  filter(requestip == "253.89.30.138") %>%
  collect_async(
    schema = athena_schema,
    region = aws_region,
    results_bucket = athena_results_bucket
  ) -> async_result

async_result
## $qex_id
## [1] "d5fe7754-919b-47c5-bd7d-3ccdb1a3a414"
## 
## $results_bucket
## [1] "s3://aws-athena-query-results-redacted"

# For long queries we can wait a bit but the function will tell us if the results
# are there or not.

gather_results(async_result)
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   elbname = col_character(),
##   requestip = col_character(),
##   requestport = col_integer(),
##   backendip = col_character(),
##   backendport = col_integer(),
##   requestprocessingtime = col_double(),
##   backendprocessingtime = col_double(),
##   clientresponsetime = col_double(),
##   elbresponsecode = col_integer(),
##   backendresponsecode = col_integer(),
##   receivedbytes = col_integer(),
##   sentbytes = col_integer(),
##   requestverb = col_character(),
##   url = col_character(),
##   protocol = col_character()
## )
## # A tibble: 1 x 16
##   timestamp           elbname requestip     requestport backendip     backendport
##   <dttm>              <chr>   <chr>               <int> <chr>               <int>
## 1 2014-09-29 03:24:38 lb-demo 253.89.30.138       20159 253.89.30.138        8888
## # ... with 10 more variables: requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## #   clientresponsetime <dbl>, elbresponsecode <int>, backendresponsecode <int>,
## #   receivedbytes <int>, sentbytes <int>, requestverb <chr>, url <chr>, protocol <chr>
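If you would rather block until Athena finishes, a small polling wrapper around gather_results() does the trick. This is my own convenience helper, not part of the post's code; tune the interval to taste:

# poll until the results CSV shows up, then return the tibble
wait_for_results <- function(async_result, interval = 5) {
  repeat {
    res <- gather_results(async_result)
    if (!is.null(res)) return(res)
    Sys.sleep(interval)
  }
}

elb_hits <- wait_for_results(async_result)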

If you do try this out and end up needing to tweak it, feedback on what you had to do (via the comments) would be greatly appreciated.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


[Web2Day] Producing web content with R

By Colin Fay

(This article was first published on Colin Fay, and kindly contributed to R-bloggers)

Earlier this week, my talk at the Web2Day conference was put online. Here is an English summary for those who don’t understand French.

Disclaimer: this talk was given at a conference about web technologies, in other words, in front of a crowd that had never (or hardly ever) heard about R before. If you’re already familiar with R, Rmarkdown, Shiny and {plumber}, you won’t find anything groundbreaking here.

The video

If you understand French, here is the video:

Note: the title of the talk can be translated as “When a data science language produces web content.”

Doing web content with R

What is R?

The introduction of the talk is about “what is R?”. I won’t redo this
introduction here, as there are a lot of resources available online to
discover R
(and I’m pretty sure that if you’re here, you already know
R 😉 ):

Why produce web content with R?

There are many reasons why you would want to produce web content with R:

  • You need to communicate your results

  • You need to visualise your results dynamically

  • You need to share R content with others over the internet

  • You want to make R available through a user interface, so that
    people can get the full potential of R without writing any line of
    code

Why use R with web technologies?

  • To use R’s computational power

  • To call / use the latest data science methods from a webpage

How?

Easy to produce webpages with markdown

With Rmarkdown, you can produce a web page that contains the results of
your experiment / research / data collection / visualisation, done in R.

Markdown is an easy-to-use, simplified version of HTML which can be compiled to HTML, meaning that you can use any HTML, CSS or JavaScript content. For example, a table can be converted into a JS datatable, which is dynamic.

Here is an example of a webpage created with
R:

Nothing fancy here, but you got the idea.
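For instance, a minimal R Markdown document along these lines (a sketch of mine, not the example from the talk) knits to an HTML page containing an interactive JS datatable:

---
title: "A tiny report made with R"
output: html_document
---

```{r}
# rendered as an interactive DataTable in the resulting web page
DT::datatable(mtcars)
```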

Read more about Rmarkdown at https://rmarkdown.rstudio.com/

An API with {plumber}

{plumber} is an R package that can create an API, that is to say a web service that runs R and that you can access through a URL. This URL has endpoints, and when you send information to these endpoints, an R function is executed with the params given to the API.
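To make this concrete, a plumber file might look roughly like this (a sketch of mine; /whoplot and port 3953 are taken from the example below, the rest is made up for illustration):

# plumber.R
library(plumber)

#* Return the first rows of a built-in dataset as JSON
#* @get /cars
function() {
  head(cars, 5)
}

#* Return a plot as a PNG image
#* @png
#* @get /whoplot
function() {
  plot(cars)
}

# then, from an R session:
# plumber::plumb("plumber.R")$run(port = 3953)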

Once you’ve got this URL, it can easily be integrated into any webpage. For example, if my API at http://127.0.0.1:3953 produces a plot, I can do:





<h1>Yihaaa</h1>
<p>This graph is generated by R and it's awesome.</p>
<img src = "http://127.0.0.1:3953/whoplot">

As {plumber}, by default, produces JSON content (but you can serve a lot of other types of content), it can be easily integrated into any web page. That means you can make R do all the heavy computation, and just get the result as JSON to be integrated into your webpage.

About Shiny

Using Shiny to produce web app

Shiny is an amazing framework from RStudio for producing web apps that run R, written with R code: meaning that an app can run any R model, use any R dataviz library, take advantage of R file-importing packages, and so on.

Creating Shiny apps has been one of our main areas of work at ThinkR lately, and we’ve built quite large applications that run complex models, which are then visualised in the same app. What’s the point of these apps? To give access to these models to people who don’t know how to code (and don’t want to).

Basic Shiny Designs

When designing a Shiny app, you can use a lot of already implemented
design libraries and templates. These produce nice results, and
everything can be written in R, which makes it easy to use for R
developers who don’t know html, css and JS
(even if we can definitely
say that when it comes to building large applications, you should at
some point learn at least html and css).

These basic packages produce content like the one I’m showing in the talk:

Here, I’m creating a little dashboard that connects to Twitter and does simple dataviz, counting, and text mining. Everything here is done in R.

Note: this app will be made open source one day, when I’ve found some time to finish it.

Advanced Shiny design

Or, what I call: Webdev + R = ❤

When doing more advanced Shiny designs, we can use htmlTemplates (a feature which is not well known enough). This functionality allows you to create HTML and CSS templates just as any web developer would do, and of course with JS in them if you want to.

Once you’ve got this template, you can use it straight in Shiny.

For example:

/* styles.css */
body {
 font-family: 'Source Sans Pro', Helvetica, sans-serif;
 background: #121212;
 color: #999;
 padding: 20px;
}
tr {
 vertical-align: top;
}
...

These HTML and CSS templates are “easy” (well, relatively easy) to implement if you’re a web designer. But, let’s be honest, they are really hard to code if you’re a data scientist.

From the R point of view, this is as simple as creating a function that
does:

horizontalBox <- function(title, content){
  htmlTemplate("box.html", title = title, content = content)
}

You can recognize here that I am calling the box.html template, with the arguments of the function filling the elements between {{ and }} in the template.
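For instance, assuming a box.html file whose placeholders are {{ title }} and {{ content }}, the function can be dropped straight into a UI (a quick sketch of mine, not the app from the talk):

library(shiny)

ui <- fluidPage(
  horizontalBox(
    title = "Tweets per day",
    content = plotOutput("tweet_plot")
  )
)

server <- function(input, output) {
  output$tweet_plot <- renderPlot(plot(cars))
}

shinyApp(ui, server)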

Might we work together?

I am not a web designer, and web designers are not data scientists, so
neither of us can pretend to do a better job than the other. But the
good news is that everything is there for us to work together.

So, if ever some web designer is interested to work with me on a Shiny
app (everything open source of course), I’ll be glad to do so, as I’m
sure we have a lot to learn from each other.

Feel free to reach out to me through email or on Twitter!

Slides

The slides from the talk are available
here:

https://github.com/ColinFay/conf/blob/master/2018-06-Web2Day-Nantes/web2day.pdf

The content of the .R, Rmd, etc, are here:

https://github.com/ColinFay/conf/tree/master/2018-06-Web2Day-Nantes

To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.


Why I rarely use apply

By Florian Privé

(This article was first published on Florian Privé, and kindly contributed to R-bloggers)

In this short post, I talk about why I’m moving away from using function apply.

With matrices

It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.

N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
##    user  system elapsed 
##    0.73    0.05    0.78
system.time(res2 <- colMeans(X))
##    user  system elapsed 
##    0.05    0.00    0.05
stopifnot(isTRUE(all.equal(res2, res1)))

“Yeah, there are colSums and colMeans, but what about computing standard deviations?”

There are lots of apply-like functions in package {matrixStats}.

system.time(res3 <- apply(X, 2, sd))
##    user  system elapsed 
##    0.96    0.01    0.97
system.time(res4 <- matrixStats::colSds(X))
##    user  system elapsed 
##     0.2     0.0     0.2
stopifnot(isTRUE(all.equal(res4, res3)))

With data frames

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
apply(head(iris), 2, identity)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"

A DATA FRAME IS NOT A MATRIX (it’s a list).

The first thing that apply does is to convert the object to a matrix, which consumes memory and, in the previous example, transforms all the data into strings (because a matrix can have only one type).

What can you use as a replacement of apply with a data frame?

  • If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).

    sapply(iris, typeof)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
    ##     "double"     "double"     "double"     "double"    "integer"
  • If you want to operate on all rows, I recommend watching this webinar; a small purrr-based illustration follows below.
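For example, a row-wise computation can stay in list-land with purrr (a sketch of mine, not from the original post):

library(purrr)

# row-wise sum of the numeric columns of iris, without converting to a matrix:
# a data frame is a list of columns, and pmap() walks them in parallel
row_totals <- pmap_dbl(iris[1:4], function(...) sum(...))
head(row_totals)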

With sparse matrices

The memory problem is even more important when using apply with sparse matrices, which makes using apply very slow for such data.

library(Matrix)

X.sp <- rsparsematrix(N, M, density = 0.01)

## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))
##    user  system elapsed 
##    0.78    0.46    1.25
system.time(res6 <- Matrix::colMeans(X.sp))
##    user  system elapsed 
##    0.01    0.00    0.02
stopifnot(isTRUE(all.equal(res6, res5)))

You could implement your own apply-like function for sparse matrices by seeing a sparse matrix as a data frame with 3 columns (i and j storing positions of the non-null elements, and x storing the values of these elements). Then, you could use a group_by/summarize approach.

For instance, for the previous example, you can do this in base R:

apply2_sp <- function(X, FUN) {
  res <- numeric(ncol(X))
  X2 <- as(X, "dgTMatrix")
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}

system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
##    user  system elapsed 
##    0.03    0.00    0.03
stopifnot(isTRUE(all.equal(res7, res5)))
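The same idea expressed with dplyr's group_by()/summarise(), as suggested above (my sketch, not code from the original post):

library(dplyr)

# one row per non-zero entry, grouped by the (0-based) column index j
X2 <- as(X.sp, "dgTMatrix")

col_means <- tibble(j = X2@j, x = X2@x) %>%
  group_by(j) %>%
  summarise(col_sum = sum(x)) %>%
  # columns with no non-zero entry are absent, so join against all column indices
  right_join(tibble(j = 0:(ncol(X.sp) - 1)), by = "j") %>%
  arrange(j) %>%
  mutate(col_mean = coalesce(col_sum, 0) / nrow(X.sp)) %>%
  pull(col_mean)

stopifnot(isTRUE(all.equal(col_means, res5)))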

Conclusion

Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.

To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.


Seasonal decomposition of short time series

By R on Rob J Hyndman

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Many users have tried to do a seasonal decomposition with a short time series, and hit the error “Series has less than two periods”.
The problem is that the usual methods of decomposition (e.g., decompose and stl) estimate seasonality using at least as many degrees of freedom as there are seasonal periods. So you need at least two observations per seasonal period to be able to distinguish seasonality from noise.
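A minimal illustration (mine, not from the original post): with monthly data, anything shorter than 24 observations trips this check.

# 18 months of monthly data: less than two full yearly periods
x <- ts(rnorm(18), frequency = 12)

# both stop with an error saying the series has fewer than two periods
try(decompose(x))
try(stl(x, s.window = "periodic"))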

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman.


Variable vs. Participant-wise Standardization

By Dominique Makowski

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

To make sense of their data and effects, psychologists often standardize (Z-score) their variables. However, in repeated-measures designs, there are three ways of standardizing data:

  • Variable-wise: The most common method. A simple scaling and reducing of each variable by their mean and SD.
  • Participant-wise: Variables are standardized “within” each participant, i.e., for each participant, by the participant’s mean and SD.
  • Full: Participant-wise first and then re-standardizing variable-wise.

Unfortunately, the method used is often not explicitly stated. This is an issue as these methods can generate important discrepancies that contribute to the reproducibility crisis of psychological science.

In the following, we will see how to perform those methods and look for differences.
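To fix intuitions first, here is a toy illustration (mine, not from the post) of the difference between the first two methods, using base R's scale():

library(dplyr)

# two participants using very different ranges of the response scale
toy <- tibble(
  participant = rep(c("P1", "P2"), each = 3),
  score       = c(1, 2, 3, 10, 20, 30)
)

# variable-wise: one mean and SD for the whole column
toy %>% mutate(z_varwise = as.numeric(scale(score)))

# participant-wise: mean and SD computed within each participant
toy %>% group_by(participant) %>% mutate(z_parwise = as.numeric(scale(score)))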

The data

We will take a dataset in which participants were exposed to negative pictures and had to rate their emotions (valence) and the amount of memories associated with the picture (autobiographical link). One could make the hypothesis that for young participants with no context of war or violence, the most negative pictures (mutilations) are less related to memories than less negative pictures (involving for example car crashes or sick people). In other words, we expect a positive relationship between valence (with high values corresponding to less negativity) and autobiographical link.

Let’s have a look at the data, averaged by participants:

library(tidyverse)
library(psycho)

df <- psycho::emotion %>%  # Load the dataset from the psycho package
  filter(Emotion_Condition == "Negative")  # Discard neutral pictures

df %>% 
  group_by(Participant_ID) %>% 
  summarise(n_Trials = n(),
            Valence_Mean = mean(Subjective_Valence, na.rm=TRUE),
            Valence_SD = sd(Subjective_Valence, na.rm=TRUE),
            Autobiographical_Link_Mean = mean(Autobiographical_Link, na.rm=TRUE),
            Autobiographical_Link_SD = sd(Autobiographical_Link, na.rm=TRUE)) %>% 
  mutate(Valence = paste(Valence_Mean, Valence_SD, sep=" +- "),
         Autobiographical_Link = paste(Autobiographical_Link_Mean, Autobiographical_Link_SD, sep=" +- ")) %>% 
  select(-ends_with("SD"), -ends_with("Mean"))
Participant_ID n_Trials Valence Autobiographical_Link
10S 24 -58.07 +- 42.59 49.88 +- 29.60
11S 24 -73.22 +- 37.01 0.94 +- 3.46
12S 24 -57.53 +- 26.56 30.24 +- 27.87
13S 24 -63.22 +- 23.72 27.86 +- 35.81
14S 24 -56.60 +- 26.47 3.31 +- 11.20
15S 24 -60.59 +- 33.71 17.64 +- 17.36
16S 24 -46.12 +- 24.88 15.33 +- 16.57
17S 24 -1.54 +- 4.98 0.13 +- 0.64
18S 24 -67.23 +- 34.98 22.71 +- 20.07
19S 24 -59.61 +- 33.22 28.37 +- 12.55

As we can see from the means and SDs, there is a lot of variability between and within participants.

Standardize

We will create three dataframes standardized with each of the three techniques.

Z_VarWise <- df %>% 
  standardize()

Z_ParWise <- df %>% 
  group_by(Participant_ID) %>% 
  standardize() 

Z_Full <- df %>% 
  group_by(Participant_ID) %>% 
  standardize() %>% 
  ungroup() %>% 
  standardize() 

Effect of Standardization

Let’s see how these three standardization techniques affected the Valence variable.

At a general level

# Create convenient function
print_summary <- function(data){
  paste(deparse(substitute(data)), ":", 
        format_digit(mean(data[["Subjective_Valence"]])),
        "+-",
        format_digit(sd(data[["Subjective_Valence"]])),
        "[", format_digit(min(data[["Subjective_Valence"]])),
        ",", format_digit(max(data[["Subjective_Valence"]])),
        "]")
}

# Check the results
print_summary(Z_VarWise)
[1] "Z_VarWise : 0 +- 1.00 [ -1.18 , 3.10 ]"
print_summary(Z_ParWise)
[1] "Z_ParWise : 0 +- 0.98 [ -2.93 , 3.29 ]"
print_summary(Z_Full)
[1] "Z_Full : 0 +- 1.00 [ -2.99 , 3.36 ]"

At a participant level

# Create convenient function
print_participants <- function(data){
  data %>% 
    group_by(Participant_ID) %>% 
    summarise(Mean = mean(Subjective_Valence), 
              SD = sd(Subjective_Valence)) %>% 
    mutate_if(is.numeric, round, 2) %>% 
    head(5) 
    
}

# Check the results
print_participants(Z_VarWise)
# A tibble: 5 x 3
  Participant_ID    Mean    SD
  <chr>            <dbl> <dbl>
1 10S            -0.0500 1.15 
2 11S            -0.460  1.00 
3 12S            -0.0300 0.720
4 13S            -0.190  0.640
5 14S            -0.0100 0.710
print_participants(Z_ParWise)
# A tibble: 5 x 3
  Participant_ID  Mean    SD
  <chr>          <dbl> <dbl>
1 10S               0.    1.
2 11S               0.    1.
3 12S               0.    1.
4 13S               0.    1.
5 14S               0.    1.
print_participants(Z_Full)
# A tibble: 5 x 3
  Participant_ID  Mean    SD
  <chr>          <dbl> <dbl>
1 10S               0.  1.02
2 11S               0.  1.02
3 12S               0.  1.02
4 13S               0.  1.02
5 14S               0.  1.02

Distribution

data.frame(VarWise = Z_VarWise$Subjective_Valence,
           ParWise = Z_ParWise$Subjective_Valence,
           Full = Z_Full$Subjective_Valence) %>% 
  gather(Method, Variable) %>% 
  ggplot(aes(x=Variable, fill=Method)) +
  geom_density(alpha=0.5) +
  theme_minimal() 

The distributions appear to be similar…

Correlation

Let’s do a correlation between the variable-wise and participant-wise methods.

psycho::bayes_cor.test(Z_VarWise$Subjective_Valence, Z_ParWise$Subjective_Valence)
Results of the Bayesian correlation indicate moderate evidence (BF = 8.65) in favour of an absence of a negative association between Z_VarWise$Subjective_Valence and Z_ParWise$Subjective_Valence (r = -0.015, MAD = 0.047, 90% CI [-0.096, 0.057]). The correlation can be considered as small or very small with respective probabilities of 3.46% and 59.43%.
data.frame(Original = df$Subjective_Valence,
           VarWise = Z_VarWise$Subjective_Valence,
           ParWise = Z_ParWise$Subjective_Valence) %>% 
  ggplot(aes(x=VarWise, y=ParWise, colour=Original)) +
  geom_point() +
  geom_smooth(method="lm") +
  theme_minimal()

While the three standardization methods roughly present the same characteristics at a general level (mean 0 and SD 1) and a similar distribution, their values are very different and completely uncorrelated!

Test

Let’s now answer the original question by investigating the linear relationship between valence and autobiographical link. We can do this by running a mixed model with participants entered as random effects.

# Convenient function
print_model <- function(data){
  type_name <- deparse(substitute(data)) 

  lmerTest::lmer(Subjective_Valence ~ Autobiographical_Link + (1|Participant_ID), data=data) %>% 
    psycho::analyze(CI=NULL) %>%
    summary() %>% 
    filter(Variable == "Autobiographical_Link") %>% 
    mutate(Type = type_name,
           Coef = round(Coef, 2),
           p = format_p(p)) %>% 
    select(Type, Coef, p)
}

# Run the model on all datasets
rbind(print_model(df), 
      print_model(Z_VarWise),
      print_model(Z_ParWise),
      print_model(Z_Full))
       Type Coef       p
1        df 0.09    > .1
2 Z_VarWise 0.07    > .1
3 Z_ParWise 0.08 = 0.08°
4    Z_Full 0.08 = 0.08°

As we can see, in our case, using participant-wise standardization resulted in an effect significant at the p < .1 level! But keep in mind that this is not always the case. It can be the contrary, or the methods can generate very similar results. No method is better or more justified, and the choice depends on the specific case, context, data and goal.

Conclusion

  1. Standardization can be useful in some cases and should be justified
  2. Variable and Participant-wise standardization methods produce “in appearance” similar data
  3. Variable and Participant-wise standardization can lead to different and uncorrelated results
  4. The choice of the method can strongly influence the results and thus, should be explicitly stated

We showed here yet another way of sneakily tweaking the data that can change the results. To prevent its use for p-hacking, we can only support the generalization of open-data, open-analysis and preregistration.

Credits

The psycho package helped you? Don’t forget to cite the various packages you used 🙂

You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: An Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

Previous blogposts

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.


From Data to Viz | Find the graphic you need

By Holtz

(This article was first published on Blog – The R Graph Gallery, and kindly contributed to R-bloggers)

I’m delighted to announce a new dataviz project called ‘Data to Viz‘.

—> data-to-viz.com

What it is


From Data to Viz is a classification of chart types based on input data format. It comes in the form of a decision tree leading to a set of potentially appropriate visualizations to represent the dataset.

The decision trees are available in a poster that has been presented at UseR in Brisbane.

Philosophy


The project is built on two underlying philosophies. First, that most data analysis can be summarized in about twenty different dataset formats. Second, that both data and context determine the appropriate chart.

Thus, our suggested method consists of identifying and trying all feasible chart types to find out which suits your data and idea best.

Once this set of graphics is identified, data-to-viz.com aims to guide you toward the best decision.

Content


Several sections are available on top of the decision tree:
  • portfolio – an overview of all chart possibilities. For each, an extensive description is given, showing variations, pros and cons, common pitfalls and more.
  • stories – for each input data format, a real-life example is analyzed to illustrate the different chart types applicable to it.
  • caveat gallery – a list of common dataviz pitfalls, with suggested workarounds.

Link with R


From Data to Viz aims to give general advice on data visualization and is not specifically targeting R users.

However, 100% of the charts are made using R, mostly using ggplot2 and the tidyverse. The reproducible code snippets are always available. The biggest part of the website is built using R Markdown, using a good amount of hacks described here.

The website is tightly linked with the R graph gallery. Once you’ve identified the graphic that suits your needs, you will be redirected to the appropriate section of the gallery to get the R code in minutes.

Next step


From Data to Viz is still in beta, and a lot remains to be done. The caveat gallery is incomplete and some chart types are missing. Also, a few inaccuracies may be present in the decision tree. Last but not least, the English is horrible, but this is unlikely to change, unfortunately; I apologize for that.

If you find any mistake or potential improvement, please file an issue on GitHub, contact me at Yan.holtz.data@gmail.com or on twitter, or leave a comment below. Any feedback will be very welcome.

To leave a comment for the author, please follow the link and comment on their blog: Blog – The R Graph Gallery.


Stencila – an office suite for reproducible research

By Emily Packer

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Stencila launches the first version of its (open source) word processor and spreadsheet editor designed for researchers.

By Michael Aufreiter, Substance, and Aleksandra Pawlik and Nokome Bentley, Stencila

Stencila is an open source office suite designed for researchers. It allows the authoring of interactive, data-driven publications in visual interfaces, similar to those in conventional office suites, but is built from the ground up for reproducibility.

Stencila aims to make it easier for researchers with differing levels of computational skills to collaborate on the same research article. Researchers used to tools like Microsoft Word and Excel will find Stencila’s interfaces intuitive and familiar. And those who use tools such as Jupyter Notebook or R Markdown are still able to embed code for data analysis within their research articles. Once published, Stencila documents are self-contained, interactive and reusable, containing all the text, media, code and data needed to fully support the narrative of research discovery.

Source: https://stenci.la

The Stencila project aims to be part of the wider vision to enable the next generation of research article – all the way from authoring through to publication as a reproducible, self-contained webpage. A key limitation of the current research publishing process is that conventional document formats (e.g. Word, PDF and LaTeX) do not support the inclusion of reproducible research elements, nor do they produce content in the structured format used for science publishing and dissemination (XML). Stencila aims to remove the need for manual conversion of content from source documents to XML and web (HTML) publishing formats, whilst enabling the inclusion of source data and computational methods within the manuscript. We hope that establishing a digital-first, reproducible archive format for publications will facilitate research communication that is faster and more open, and which lowers the barrier for collaboration and reuse. The development of Stencila is driven by community needs and in coordination with the goals of the Reproducible Document Stack, an initiative started by eLife, Substance and Stencila.

A word processor for creating journal-ready scientific manuscripts

Stencila’s article editor builds on Texture, an open source editor built for visually editing JATS XML documents (a standard widely used by scientific journals). Supporting all elements of a standardised research article, the editor features semantic content-oriented editing that allows the user to focus on the research without worrying about layout information, which is normally stripped during the publishing process. While Texture implements all static elements (abstract, figures, references, citations and so on), Stencila extends Texture with code cells which enable computed, data-driven figures.

Spreadsheets for source data and analysis

In Stencila, datasets are an integral part of the publication. They live as individual spreadsheet documents holding structured data. This data can then be referenced from the research article to drive analysis and plots. As within Excel, cells can contain formulas and function calls to run computations directly in a spreadsheet. But not only can users enter simple expressions, they can also add and execute code in a variety of supported programming languages (at the moment R, Python, SQL and Javascript).

A walk-through of some of the features of Stencila, using this Stencila Article. Source: YouTube; video CC-BY Stencila.

Code evaluation in the browser and beyond

Stencila’s user interfaces build on modern web technology and run entirely in the browser – making them available on all major operating systems. The predefined functions available in Stencila use Javascript for execution so they can be run directly in the editor. For example, the plotly() function generates powerful, interactive visualizations solely using Plotly’s Javascript library.

Stencila can also connect to R, Python and SQL sessions, allowing more advanced data analysis and visualization capabilities. Stencila’s execution engine keeps a track of the dependency between code cells, enabling a reactive, spreadsheet-like programming experience both in Stencila Articles and Sheets.

An example of using R within a Stencila Sheet. Source: YouTube; video CC-BY Stencila.

Reproducible Document Archive (Dar)

Stencila stores projects in an open file archive format called Dar. A Dar is essentially a folder with a number of files encompassing the manuscript itself (usually one XML per document) and all associated media.

The Dar format is open source: inspect it and provide feedback at https://github.com/substance/dar

Dar uses existing standards when possible. For instance, articles are represented as JATS XML, the standard preferred by a number of major publishers. The Dar format is a separate effort from Stencila, and aims to establish a strict standard for representing self-contained reproducible publications, which can be submitted directly to publishers. Any other tool should be able to easily read and write such archives, either by supporting it directly or by implementing converters.

Interoperability with existing tools and workflows

Stencila is developed not to replace existing tools, but to complement them. Interoperability is at the heart of the project, with the goal of supporting seamless collaboration between users of Jupyter Notebooks, R Markdown and spreadsheet applications. We are working closely with the communities of existing open source tools to improve interoperability. For instance, we are working with the Jupyter team on tools to turn notebooks into journal submissions. We are also evaluating whether the Stencila editor could be used as another interface to edit Jupyter Notebooks or R Markdown files: we hope this could help researchers who use existing tools to collaborate with peers who are used to other office tools, such as Word and Excel, and thus encourage wider adoption of reproducible computational research practices.

State of development

Over the past two years, we’ve built Stencila from the ground up as a set of modular components that support community-driven open standards for publishing and computation. Stencila Desktop is our prototype of a ‘researcher’s office suite’, built by combining these components into an integrated application. During this beta phase of the project, we are working to address bugs and add missing features, and welcome your feedback and suggestions (see below).

One of our next priorities will be to develop a toolset for generating a web page from a reproducible article in the Dar format. Using progressive enhancement, the reader should be able to reproduce a scientific article right from the journal’s website in various forms, ranging from a traditional static representation of the manuscript and its figures to a fully interactive, executable publication.

We will continue working on Stencila’s various software components, such as the converter module and execution contexts for R and Python, towards improved integration and interoperability with other tools in the open science toolbox (e.g. Jupyter, RStudio and Binder).

Get involved

We’d love to get your input to help shape Stencila. Download Stencila Desktop and take it for a test drive. You could also try porting an existing manuscript over to Stencila using the Stencila command line tool. Give us your feedback and contribute ideas on our community forum or in our chat channel, or drop us an email at aleksandra@stenci.la or nokome@stenci.la.

Acknowledgments

Development of Stencila has been generously supported by the Alfred P. Sloan Foundation and eLife.

This post was originally published on eLife Labs.

eLife welcomes comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


Is GINA really about to die?!?

By Nathan

(This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers)

Introduction

During a recent negotiation of an informed consent form for use in a clinical trial, the opposing lawyer and I skirmished over the applicability of the Genetic Information Nondiscrimination Act of 2008, commonly known as GINA. Specifically, the opposing lawyer thought that guidance issued by the U.S. Office for Human Research Protections in 2009 was now outdated, in part because enforcement efforts were erratic. The argument was primarily driven by policy, rather than data.

Being a data-driven guy, I wanted to see whether the data supported the argument advanced by the other lawyer. Fortunately, the U.S. Equal Employment Opportunity Commission (EEOC), which is responsible for administering GINA complaints, maintains statistics regarding GINA claims and resolutions. I’m not great at making sense of numbers in a table, so I thought this presented the perfect opportunity to rvest some data!

# packages used for the scraping, wrangling and plotting below
library(rvest)
library(magrittr)
library(tidyverse)

Data Scraping

In standard rvest fashion, we’ll read a url, extract the table containing the GINA enforcement statistics, and then do some data cleaning. Once we read the table and gather all of the annual results into key/pair of year/value, we get the following results:

url <- "https://www.eeoc.gov/..."  # EEOC page with the GINA charge statistics table

GINA.resolutions <- read_html(url) %>%
  html_nodes("table") %>% 
  extract2(1) %>%
  html_table(trim = TRUE, fill = TRUE, header = TRUE) 

names(GINA.resolutions)[1] <- "metric"

GINA.resolutions <- GINA.resolutions %>%
  filter(! metric == "") %>% # Remove percentage rows
  filter(! metric == "Resolutions By Type") %>% # Remove blank line
  gather(year, value, 2:9) %>% # short and wide data to tall and skinny
  mutate(
    year = as.integer(year),
    value = gsub("[$%]", "", value)
  ) %>%
  mutate(
    value = as.numeric(value)
  ) %>%
  as.tibble()

GINA.resolutions
## # A tibble: 88 x 3
##    metric                      year value
##    <chr>                      <int> <dbl>
##  1 Receipts                    2010   201
##  2 Resolutions                 2010    56
##  3 Settlements                 2010     3
##  4 Withdrawals w/Benefits      2010     2
##  5 Administrative Closures     2010    11
##  6 No Reasonable Cause         2010    38
##  7 Reasonable Cause            2010     2
##  8 Successful Conciliations    2010     1
##  9 Unsuccessful Conciliations  2010     1
## 10 Merit Resolutions           2010     7
## # ... with 78 more rows

Claim Numbers over Time

Now that we have the data in a format we can use, we’ll look at the volume of claims and resolutions over time:

GINA.resolutions %>%
  filter(metric == "Receipts" | metric == "Resolutions") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Claims and Resolutions, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
   x = "", y = ""
  ) +
  scale_color_brewer("", palette = "Paired")

The number of GINA claims rose for the first few years, but then declined down to enactment-year levels. This could represent support for the opposing lawyer’s argument that enforcement is waning. However, it could just as likely be showing that the deterrent effect of the law has proven effective, and most firms subject to GINA are now aware of the law and have taken appropriate steps toward compliance. Given these competing interpretations, we’ll need to look at little deeper to see if we can derive trends from the data.

GINA Claim Resolutions

One of the arguments made by the opposing lawyer is that the Obama administration was pushing GINA enforcement, and that the Trump administration hates the law and won’t enforce it. We can look at the resolution types to test this hypothesis:

GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions") %>%
  ggplot(aes(year, value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  )

In 2017, the first year of the Trump administration, administrative closures were down and resolutions on the merits were up, which contradicts the opposing lawyer’s argument. While findings of no reasonable cause were up about 10-15%, findings of reasonable cause were up 50%; if anything, this also contradicts the opposing lawyer’s argument. Monetary settlements appear to be relatively flat from 2013 – 2017, and in any event a million dollars isn’t a ton of money in light of the EEOC’s annual budget of about $376 million (note that the EEOC handles many other types of charges besides GINA).

The resolution type that jumped most markedly in 2017 was “unsuccessful conciliation.” A conciliation is where the EEOC “attempt[s] to achieve a just resolution of all violations found and to obtain agreement that the respondent will eliminate the unlawful employment practice and provide appropriate affirmative relief.” 29 C.F.R. § 1601.24. It’s unclear why this jump occurred from the summary statistics provided by the EEOC.

Finally, I thought it was useful to plot all the resolution types together to show relative numbers:

GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions" & metric != "Monetary Benefits (Millions)*") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
#  scale_y_sqrt() +
  scale_color_brewer("Resolution Type", palette="Paired")

From this perspective, it does look like the long-term trend has been for the EEOC to dismiss a majority of GINA-related charges as unfounded (i.e., no reasonable cause). However, for the cases that do have merit, it appears the EEOC has reversed an initial trend showing a preference toward administrative closure.

Conclusion

In all, I didn’t find the opposing lawyer’s argument particularly compelling in light of the data from President Trump’s first year in office. However, the first month of 2017 was President Obama’s last in office, and there was a flurry of activity by many regulatory agencies. It wouldn’t surprise me if the EEOC also participated in a high volume of lame-duck activity, and a lot of activity in January 2017 could have skewed the annual results. Monthly statistics would be nice but didn’t appear to be readily available. The goal with any R project is for it to be repeatable with additional data, so it will be interesting to see what the data from FY2018 shows.

This wasn’t a particularly complicated coding project – in fact, this writeup took me longer to produce than writing the actual code and coming to conclusions about whether GINA is on its last legs or not. Despite that fact, I thought it was a good example of how data science can be used to inform solutions to simple as well as complex problems.

To leave a comment for the author, please follow the link and comment on their blog: R – Nathan Chaney.


Introducing the Kernelheaping Package II

By INWT-Blog-RBloggers

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)
In the first part of Introducing the Kernelheaping Package I showed how to compute and plot kernel density estimates on rounded or interval-censored data using the Kernelheaping package. Now, let's make a big leap forward to the 2-dimensional case. Interval censoring can be generalised to rectangles or, alternatively, even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes, which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data into a smooth point estimate.

For the German capital city of Berlin, for example, there exists an open data initiative where data on e.g. demographics is publicly available. We first load a dataset on the Berlin population, which can be downloaded from: https://www.statistik-berlin-brandenburg.de/opendata/EWR201512E_Matrix.csv

```r
library(dplyr)
library(fields)
library(ggplot2)
library(Kernelheaping)
library(maptools)
library(RColorBrewer)
library(rgdal)

gpclibPermit()
```

```r
# read the downloaded population matrix (read.csv2 assumed for the semicolon-separated file)
data <- read.csv2("EWR201512E_Matrix.csv")
```

We also need the corresponding shapefile of Berlin's districts, which can be downloaded from: https://www.statistik-berlin-brandenburg.de/opendata/RBS_OD_LOR_2015_12.zip

```r
# read the district shapes and prepare the input for the density estimation;
# the reading and fortifying steps are not shown here
berlin <- ...
berlinDf <- ...

... %>%
  do.call(rbind, .) %>%
  cbind(data$E_E65U80)
```

In the next step we calculate the bivariate kernel density with the "dshapebivr" function (this may take some minutes) using the prepared data and the shape file:

```r
est <- dshapebivr(...)

# keep only grid cells with positive density for plotting
kData <- ... %>%
  filter(Density > 0)
```

Now, we are able to plot the density together with the administrative districts:

```r
ggplot(kData) +
  geom_raster(aes(long, lat, fill = Density)) + 
  ggtitle("Bivariate density of Inhabitants between 65 and 80 years") +
  scale_fill_gradientn(colours = c("#FFFFFF", "#5c87c2", "#19224e")) +
  geom_path(color = "#000000", data = berlinDf, aes(long, lat, group = group)) +
  coord_quickmap()
```
This map gives a much better overall impression of the distribution of older people than a simple choropleth map:

```r
ggplot(berlinDf) +
  geom_polygon(aes(x = long, y = lat, fill = E_E65U80, group = id)) + 
  ggtitle("Number of Inhabitants between 65 and 80 years by district") +
  scale_fill_gradientn(colours = c("#FFFFFF", "#5c87c2", "#19224e"), "n") +
  geom_path(color = "#000000", data = berlinDf, aes(long, lat, group = group)) +
  coord_quickmap()
```
Often, as is the case with Berlin, we may have large uninhabited areas such as woods or lakes. Furthermore, we may want to compute the proportion of older people relative to the overall population in a spatial setting. The third part of this series shows how you can compute boundary-corrected and smooth proportion estimates with the Kernelheaping package.

Further parts of the article series Introducing the Kernelheaping Package:

To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.


How to Aggregate Data in R

By Jake Hoare

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

The process involves two stages. First, collate individual cases of raw data together with a grouping variable. Second, perform whichever calculation you want on each group of cases. These two stages are wrapped into a single function.

To perform aggregation, we need to specify three things in the code:

  • The data that we want to aggregate
  • The variable to group by within the data
  • The calculation to apply to the groups (what you want to find out)

Example data

The raw data shown below consists of one row per case. Each case is an employee at a restaurant.

Load the example data by running the following R code:

 
library(flipAPI)
data = DownloadXLSX("https://wiki.q-researchsoftware.com/images/1/1b/Aggregation_data.xlsx", want.row.names = FALSE, want.data.frame = TRUE)

Perform aggregation with the following R code.

agg = aggregate(data,
                by = list(data$Role),
                FUN = mean)

This produces a table of the average salary and age by role, as below.

The aggregate function

The first argument to the function is usually a data.frame.

The by argument is a list of variables to group by. This must be a list even if there is only one variable, as in the example.

The FUN argument is the function which is applied to all columns (i.e., variables) in the grouped data. Because we cannot calculate the average of categorical variables such as Name and Shift, they result in empty columns, which I have removed for clarity.

Other aggregation functions

Any function that can be applied to a numeric variable can be used within aggregate. Maximum, minimum, count, standard deviation and sum are all popular.
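For example (a quick sketch; the Salary and Age column names are assumed from the example data above):

# count of employees in each Role (any single column works for counting rows)
aggregate(data["Name"], by = list(Role = data$Role), FUN = length)

# standard deviation of the numeric columns by Role
aggregate(data[c("Salary", "Age")], by = list(Role = data$Role), FUN = sd)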

For more specific purposes, it is also possible to write your own function in R and refer to that within aggregate. I’ve demonstrated this below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by Role and by Shift together.

second = function(x) {
            if (length(x) == 1)
                return(x)
            return(sort(x, decreasing = TRUE)[2])}

agg = aggregate(data,
                by = list(data$Role, data$Shift),
                FUN = second)

Additional features

The aggregate function has a few more features to be aware of:

  • Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation (see the sketch after this list).
  • Setting drop = TRUE means that any groups with zero count are removed.
  • na.action controls the treatment of missing values within the data.
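The formula interface looks like this (again assuming a Salary column in the example data):

# mean Salary for each combination of Role and Shift
aggregate(Salary ~ Role + Shift, data = data, FUN = mean)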

TRY IT OUT
The analysis in this post was performed in Displayr using R. You can repeat or amend this analysis for yourself in Displayr.


To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.
