## Visualisation of Likert scale results

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

I wrote a function to visualise results of Likert scale items. Please find the function below the post. Here is an example plot:

The function is called `plot.likert` and takes the following arguments:

- vec: The vector with the raw results.
- possible.values: A vector with all the possible values. This matters when not every possible response was actually ticked by your participants. Defaults to all values found in `vec`.
- left: Annotation for the left side of the plot.
- right: Annotation for the right side of the plot.
- plot.median: Plot the median as a little line at the top? Defaults to FALSE.
- plot.sd: Plot the standard deviation around the mean? Defaults to TRUE.
- include.absolutes: Include the absolute counts as small bold black numbers above the plot? Defaults to TRUE.
- include.percentages: Include the percentage values as blue numbers above the plot? Defaults to TRUE.
- own.margins: Override the default margins for the plot. Defaults to c(2, 2, 3, 2).
- othermean: Plot another mean into the visualisation as a grey line above the bars; I used this to compare the results to older results for the same questions. Defaults to NULL (no other mean is plotted in this case).
- ...: Additional parameters passed on to the initial call to plot().
The example call for the plot shown above is:
```r
plot.likert(sample(-2:2, 75, replace = T,
                   prob = c(0, .2, .2, .3, .3)),
            left = "strongly disagree",
            right = "strongly agree",
            own.margins = c(2, 2, 5, 2),
            main = "I like this visualisation of Likert scale results.",
            possible.values = -2:2,
            othermean = 1.09)
```

NB: I am aware of the debate about calculating means on Likert scale items. However, it is still done a lot, and you can also include the median after all…

Here’s the function:

```r
plot.likert <- function(vec, possible.values = sort(unique(vec)),
                        left = "left pole", right = "right pole",
                        plot.median = F, plot.sd = T,
                        include.absolutes = T, include.percentages = T,
                        own.margins = c(2, 2, 3, 2), othermean = NULL, ...) {
  tab <- table(vec)
  # Add zero counts for possible values that nobody ticked
  if (length(tab) != length(possible.values)) {
    values.not.in.tab <- possible.values[!(possible.values %in% names(tab))]
    for (val in values.not.in.tab) {
      tab[as.character(val)] <- 0
    }
    # Sort numerically, not lexicographically (matters for negative values)
    tab <- tab[order(as.numeric(names(tab)))]
  }
  prop.tab <- prop.table(tab) * 100
  v.sd <- sd(vec, na.rm = T)
  v.m <- mean(vec, na.rm = T)
  v.med <- median(vec, na.rm = T)
  x.pos <- as.numeric(names(tab))
  old.mar <- par("mar")
  par(mar = own.margins)
  # Setting up the plot region
  plot(x = c(min(x.pos, na.rm = T) - 1.1, max(x.pos, na.rm = T) + 1.1),
       y = c(0, 100), type = "n", xaxt = "n", yaxt = "n",
       xlab = "", ylab = "", bty = "n", ...)
  # Bars
  rect(xleft = x.pos - .4, ybottom = 0, xright = x.pos + .4,
       ytop = prop.tab, border = "#00000000", col = "#ADD8E6E6")
  # Lower black line
  lines(x = c(min(x.pos, na.rm = T) - .6, max(x.pos, na.rm = T) + .6),
        y = c(0, 0), col = "black", lwd = 2)
  # Upper black line
  lines(x = c(min(x.pos, na.rm = T) - .6, max(x.pos, na.rm = T) + .6),
        y = c(100, 100), col = "black", lwd = 2)
  # Blue vertical lines
  for (x.i in x.pos) {
    lines(x = c(x.i, x.i), y = c(0, 100), col = "blue")
  }
  # Grey rectangles at the sides
  rect(xleft = min(x.pos, na.rm = T) - 1.1, ybottom = 0,
       xright = min(x.pos, na.rm = T) - .6, ytop = 100,
       border = "#00000000", col = "grey")
  rect(xleft = max(x.pos, na.rm = T) + .6, ybottom = 0,
       xright = max(x.pos, na.rm = T) + 1.1, ytop = 100,
       border = "#00000000", col = "grey")
  mtext(names(prop.tab), side = 1, at = x.pos)
  # Percentages and counts at the top
  if (include.percentages) mtext(paste(round(prop.tab, 0), "%"), side = 3,
                                 at = x.pos, line = -.3, col = "blue")
  if (include.absolutes) mtext(tab, side = 3, at = x.pos,
                               line = .5, cex = .8, font = 2)
  # Mean line
  lines(x = c(v.m, v.m), y = c(95, 85), lwd = 6, col = "blue")
  # Median line
  if (plot.median) lines(x = c(v.med, v.med), y = c(95, 85),
                         lwd = 4, col = "#00FF00AA")
  # Other mean line
  if (!is.null(othermean)) lines(x = c(othermean, othermean), y = c(85, 75),
                                 lwd = 6, col = "#00000099")
  # SD whiskers around the mean
  if (plot.sd) {
    arrows(x0 = c(v.m, v.m), x1 = c(v.m - v.sd, v.m + v.sd),
           y0 = c(90, 90), y1 = c(90, 90), angle = 90, length = 0, lwd = 1)
  }
  # Pole labels on the left and right
  mtext(left, side = 2, line = -.5)
  mtext(right, side = 4, line = -.5)
  par(mar = old.mar)
}
```

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Plotcon 2016 – Speakers and topics in R

(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

### Jenny Bryan

Topic: Extract plotting intent from spreadsheets in R

Bio: Jennifer Bryan is an Associate Professor in the Statistics Department and the Michael Smith Laboratories at the University of British Columbia in Vancouver. She’s a biostatistician specialized in genomics and takes a special interest and delight in data analysis and statistical computing.

### Kent Russell

Topic: Research in Finance Alive with Interactivity

Bio: Kent blends technology and finance as a portfolio manager in Birmingham, Alabama. His favorite tool is R combined with HTML/JavaScript to explore, decide, and communicate. Last year he built an R htmlwidget each week at http://buildingwidgets.com.

### Hadley Wickham

Topic: New open viz in R

Bio: Hadley is Chief Scientist at RStudio and a member of the R Foundation. He builds tools (both computational and cognitive) that make data science easier, faster, and more fun. His work includes packages for data science (ggplot2, dplyr, tidyr), data ingest (readr, readxl, haven), and principled software development (roxygen2, testthat, devtools). He is also a writer, educator, and frequent speaker promoting the use of R for data science.

### Nick Elprin

Topic: Automating visualization with cloud task scheduling

Bio: Before starting Domino, Nick was a senior technologist and technology manager at Bridgewater Associates, where he managed a team that designed, developed, and delivered Bridgewater’s next generation research platform. He has a BA and MS from Harvard College in computer science.

### Carson Sievert

Topic: Practical tools for exploratory web graphics

Bio: Carson is a PhD student in the Department of Statistics at Iowa State University working on interactive graphics, statistical computation, sports statistics, and web technologies. Carson is primarily interested in problems where visualization can augment/enhance/improve statistical methods and/or automated tasks. He is the lead R developer at Plotly.

### Kristen Beck

Topic: Bringing microscopic data to life using R Markdown and Plotly

Bio: Dr. Beck is a research staff member in the Industrial and Applied Genomics group at IBM Research. As a bioinformatician, she contributes to the development of a web application for the exploration and analysis of microbiomes of food ingredients. This research is part of the Consortium for Sequencing the Food Supply Chain, which aims to detect pathogenic bacteria, identify food fraud, and detect antimicrobial resistance. Terabytes of sequencing and derived data must be processed into intelligible reports for nonscientist users. Enhanced visualizations are essential for distilling dense data and for communicating scientific results that have broader implications for food safety. Dr. Beck received a Ph.D. in Biochemistry, Molecular, Cellular, and Developmental Biology with a Designated Emphasis in Biotechnology from the University of California, Davis. She was the recipient of two NIH Doctoral T32 Training Grants in Biomolecular Technology and Molecular and Cellular Biology.

### Michael Freeman

Topic: Visualizing Concepts with D3.js

Bio: Michael Freeman is a faculty member at the University of Washington’s Information School where he teaches courses on interactive data visualization, web development, and data science. With a background in public health (MPH), Michael is passionate about developing visual narratives to help explain complex concepts to broad audiences.

### David Robinson

Topic: gganimate: Animation within the grammar of graphics

Bio: David Robinson is a Data Scientist at Stack Overflow. In May 2015 he received his PhD in Quantitative and Computational Biology from Princeton University, where he worked with John Storey on statistical genomics and experiment design. He is the author of the broom, fuzzyjoin and gganimate R packages, and writes about R, statistics, and education at his blog Variance Explained.

### Sahir Bhatnager

Topic: Finance visualizations for quants in R

Bio: Sahir is a PhD student at McGill University. He is interested in statistical methods for synthesizing genomic data. His current research focuses on developing a methodological approach for identifying clusters of features that are sensitive to environmental exposures.

### Tanya Cashorali

Topic: Sports data viz in R and R Shiny

Bio: Tanya Cashorali is the Chief Data Officer of Stattleship, a Boston-based sports content and data business that connects brands with sports fans through social media. She is also the founding partner of TCB Analytics, a Boston-based data consultancy. Tanya started her career in the data-rich field of bioinformatics and applied her experience to other data-rich verticals such as telecom, finance, and sports. She brings over 10 years of experience in data scientist roles as well as managing and training data analysts. She has helped grow a handful of Boston startups, and prior to launching TCB Analytics she worked as a data scientist at the Fortune 500 company Biogen.

## Plot some variables against many others with tidyr and ggplot2

(This article was first published on blogR, and kindly contributed to R-bloggers)

Want to see how some of your variables relate to many others? Here’s an example of just this:

```r
library(tidyr)
library(ggplot2)

mtcars %>%
  gather(-mpg, -hp, -cyl, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = mpg, color = hp, shape = factor(cyl))) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
```

This plot shows a separate scatter plot panel for each of many variables against `mpg`; all points are coloured by `hp`, and the shapes refer to `cyl`.

Let’s break it down.

This post is an extension of a previous one that appears here: https://drsimonj.svbtle.com/quick-plot-of-all-variables.

In that prior post, I explained a method for plotting the univariate distributions of many numeric variables in a data frame. This post does something very similar, but with a few tweaks that produce a very useful result. So, in general, I’ll skip over a few minor parts that appear in the previous post (e.g., how to use `purrr::keep()` if you want only variables of a particular type).
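As a base-R aside (no extra packages assumed), the kind of column filtering that `purrr::keep()` performs can be sketched with `Filter()`; `iris` is used here purely for illustration, since it mixes numeric and factor columns:

```r
# Keep only the numeric columns of a data frame before gathering
# (a base-R analogue of purrr::keep(iris, is.numeric))
num_cols <- Filter(is.numeric, iris)

names(num_cols)
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```

Because a data frame is a list of columns, `Filter()` drops the factor column `Species` and returns a data frame of the remaining four numeric columns.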

## Tidying our data

As in the previous post, I’ll mention that you might be interested in using something like a `for` loop to create each plot. Personally, however, I think this is a messy way to do it. Instead, we’ll make use of the `facet_wrap()` function in the `ggplot2` package, but doing so requires some careful data prep. Thus, assuming our data frame has all the variables we’re interested in, the first step is to get our data into a tidy form that is suitable for plotting.

We’ll do this using `gather()` from the `tidyr` package. In the previous post, we gathered all of our variables as follows (using `mtcars` as our example data set):

```r
library(tidyr)

mtcars %>% gather(key = "key", value = "value") %>% head()
#>   key value
#> 1 mpg  21.0
#> 2 mpg  21.0
#> 3 mpg  22.8
#> 4 mpg  21.4
#> 5 mpg  18.7
#> 6 mpg  18.1
```

This gives us a `key` column with the variable names and a `value` column with their corresponding values. This works well if we only want to plot each variable by itself (e.g., to get univariate information).

However, here we’re interested in visualising multivariate information, with a particular focus on one or two variables. We’ll start with the bivariate case. Within `gather()`, we’ll first drop our variable of interest (say `mpg`) as follows:

```r
mtcars %>% gather(-mpg, key = "var", value = "value") %>% head()
#>    mpg var value
#> 1 21.0 cyl     6
#> 2 21.0 cyl     6
#> 3 22.8 cyl     4
#> 4 21.4 cyl     6
#> 5 18.7 cyl     8
#> 6 18.1 cyl     6
```

We now have an `mpg` column with the values of `mpg` repeated for each variable in the `var` column. The `value` column contains the values corresponding to the variable in the `var` column. This simple extension is how we can use `gather()` to get our data into shape.

## Creating the plot

We now move to the `ggplot2` package in much the same way we did in the previous post. We want a scatter plot of `mpg` against each variable in the `var` column, whose values are in the `value` column. Creating a scatter plot is handled by `ggplot()` and `geom_point()`. Getting a separate panel for each variable is handled by `facet_wrap()`. We also want the scales for each panel to be “free”. Otherwise, `ggplot` will constrain them all to be equal, which doesn’t make sense for plotting different variables. For a clean look, let’s also add `theme_bw()`.

```r
mtcars %>%
  gather(-mpg, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = mpg)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
```

We now have a scatter plot of every variable against `mpg`. Let’s see what else we can do.

## Extracting more than one variable

We can layer other variables into these plots. For example, say we want to colour the points based on `hp`. To do this, we also drop `hp` within `gather()`, and then include it appropriately in the plotting stage:

```r
mtcars %>%
  gather(-mpg, -hp, key = "var", value = "value") %>%
  head()
#>    mpg  hp var value
#> 1 21.0 110 cyl     6
#> 2 21.0 110 cyl     6
#> 3 22.8  93 cyl     4
#> 4 21.4 110 cyl     6
#> 5 18.7 175 cyl     8
#> 6 18.1 105 cyl     6

mtcars %>%
  gather(-mpg, -hp, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = mpg, color = hp)) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
```

Let’s go crazy and change the point shape by `cyl`:

```r
mtcars %>%
  gather(-mpg, -hp, -cyl, key = "var", value = "value") %>%
  head()
#>    mpg cyl  hp  var value
#> 1 21.0   6 110 disp   160
#> 2 21.0   6 110 disp   160
#> 3 22.8   4  93 disp   108
#> 4 21.4   6 110 disp   258
#> 5 18.7   8 175 disp   360
#> 6 18.1   6 105 disp   225

mtcars %>%
  gather(-mpg, -hp, -cyl, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = mpg, color = hp, shape = factor(cyl))) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
```

## Perks of ggplot2

If you’re familiar with `ggplot2`, you can go to town. For example, let’s add loess lines with `stat_smooth()`:

```r
mtcars %>%
  gather(-mpg, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = mpg)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
```

The options are nearly endless at this point, so I’ll stop here.

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.


## Merge a list of datasets together

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

Last week I showed how to read a lot of datasets at once with R, and this week I’ll continue from there and show a very simple function that uses this list of read datasets and merges them all together.

First we’ll use `read_list()` to read all the datasets at once (for more details read last week’s post):

```r
library("readr")
library("tibble")

data_files <- list.files(pattern = ".csv")

print(data_files)
```

```
## [1] "data_1.csv" "data_2.csv" "data_3.csv"
```

```r
list_of_data_sets <- read_list(data_files, read_csv)

glimpse(list_of_data_sets)
```

```
## List of 3
##  $ data_1:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of  3 variables:
##   ..$ col1: chr [1:19] "0,018930679" "0,8748013128" "0,1025635934" "0,6246140983" ...
##   ..$ col2: chr [1:19] "0,0377725807" "0,5959457638" "0,4429121533" "0,558387159" ...
##   ..$ col3: chr [1:19] "0,6241767189" "0,031324594" "0,2238059868" "0,2773350732" ...
##  $ data_2:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of  3 variables:
##   ..$ col1: chr [1:19] "0,9098418493" "0,1127788509" "0,5818891392" "0,1011773532" ...
##   ..$ col2: chr [1:19] "0,7455905887" "0,4015039612" "0,6625796605" "0,029955339" ...
##   ..$ col3: chr [1:19] "0,327232932" "0,2784035673" "0,8092386735" "0,1216045306" ...
##  $ data_3:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of  3 variables:
##   ..$ col1: chr [1:19] "0,9236124896" "0,6303271761" "0,6413583054" "0,5573887416" ...
##   ..$ col2: chr [1:19] "0,2114708388" "0,6984538266" "0,0469865249" "0,9271510226" ...
##   ..$ col3: chr [1:19] "0,4941919971" "0,7391538511" "0,3876723797" "0,2815014394" ...
```

You see that all these datasets have the same column names. We can now merge them using this simple function:

```r
multi_join <- function(list_of_loaded_data, join_func, ...){

  require("dplyr")

  output <- Reduce(function(x, y) {join_func(x, y, ...)}, list_of_loaded_data)

  return(output)
}
```

This function uses `Reduce()`. `Reduce()` is a very important function that can be found in all functional programming languages. What does `Reduce()` do? Let’s take a look at the following example:

```r
Reduce(`+`, c(1, 2, 3, 4, 5))
```

```
## [1] 15
```

`Reduce()` has several arguments, but you need to specify at least two: a function (here `+`) and a vector or list (here `c(1, 2, 3, 4, 5)`). The next code block shows what `Reduce()` basically does:

```r
0 + c(1, 2, 3, 4, 5)
0 + 1 + c(2, 3, 4, 5)
0 + 1 + 2 + c(3, 4, 5)
0 + 1 + 2 + 3 + c(4, 5)
0 + 1 + 2 + 3 + 4 + c(5)
0 + 1 + 2 + 3 + 4 + 5
```

The `0` here acts as the initial value, or “init”. You can also pass this “init” value to `Reduce()` explicitly:

```r
Reduce(`+`, c(1, 2, 3, 4, 5), init = 20)
```

```
## [1] 35
```

So what `multi_join()` does is the same operation as in the example above, but the function is a user-supplied join or merge function, and the list is the list of datasets read with `read_list()`.

Let’s see what happens when we use `multi_join()` on our list:

```r
merged_data <- multi_join(list_of_data_sets, full_join)

class(merged_data)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
glimpse(merged_data)
```

```
## Observations: 57
## Variables: 3
## $ col1 <chr> "0,018930679", "0,8748013128", "0,1025635934", "0,6246140...
## $ col2 <chr> "0,0377725807", "0,5959457638", "0,4429121533", "0,558387...
## $ col3 <chr> "0,6241767189", "0,031324594", "0,2238059868", "0,2773350...
```

You should make sure that all the data frames have the same column names, but you can also join data frames with different column names by supplying the `by` argument to the join function. This is possible thanks to `...`, which lets you pass further arguments to `join_func()`.
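To illustrate the role of `by` without relying on any packages, here is a minimal sketch of the same `Reduce()` pattern using base R’s `merge()` instead of a dplyr join (the three data frames are made up for the example; `all = TRUE` mimics a full join):

```r
# Three hypothetical data frames that share only an "id" column
df1 <- data.frame(id = 1:3, x = c(10, 20, 30))
df2 <- data.frame(id = 2:4, y = c("a", "b", "c"))
df3 <- data.frame(id = 3:5, z = c(TRUE, FALSE, TRUE))

# The same fold that multi_join() performs, with "by" forwarded to the join
merged <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
                 list(df1, df2, df3))

dim(merged)
# 5 rows (ids 1 to 5) and 4 columns (id, x, y, z)
```

Rows whose `id` is missing from one of the data frames get `NA` in the corresponding columns, just as with `full_join()`.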

This function was inspired by the one found on the blog Coffee and Econometrics in the Morning.


## R moves up to 5th place in IEEE language rankings

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

IEEE Spectrum has just published its third annual ranking with its 2016 Top Programming Languages, and the R Language is once again near the top of the list, moving up one place to fifth position.

As I said last year (when R moved up to take sixth place), this is an extraordinary result for a domain-specific language. The other four languages in the top 5 (C, Java, Python and C++) are all general-purpose languages, suitable for just about any programming task. R, by contrast, is a language specifically for data science, and its high ranking here reflects both the critical importance of data science as a discipline today and of R as the language of choice for data scientists.

IEEE Spectrum ranks languages according to a large number of factors, including search rankings and trends, social media mentions, and job postings. (You can adjust the weighting of these factors to generate your own rankings using this interactive tool.) It also includes scholarly citations of the languages, a factor that influenced R’s rise in this ranking:

Another language that has continued to move up the rankings since 2014 is R, now in fifth place. R has been lifted in our rankings by racking up more questions on Stack Overflow—about 46 percent more since 2014. But even more important to R’s rise is that it is increasingly mentioned in scholarly research papers. The Spectrum default ranking is heavily weighted toward data from IEEE Xplore, which indexes millions of scholarly articles, standards, and books in the IEEE database. In our 2015 ranking there were a mere 39 papers talking about the language, whereas this year we logged 244 papers.

In related news, R also increased its ranking in the recently released RedMonk Language Rankings for June 2016, moving up one spot to take 12th place. Unlike IEEE Spectrum, RedMonk ranks languages using just two criteria: the activity of the language on GitHub and Stack Overflow. Analyst Stephen O’Grady had this to say about R’s performance in the RedMonk rankings:

Out of all the back half of the Top 20 languages, R has shown the most consistent upwards movement over time. From its position of 17 back in 2012, it has made steady gains over time, but had seemed to stall at 13 having stuck there for three consecutive quarters. This time around, however, R took over #12 from Perl which in turn dropped to #13. There’s still an enormous amount of Perl in circulation, but the fact that the more specialized R has unseated the language once considered the glue of the web says as much about Perl as it does about R.

Again, R’s steady growth in this and numerous other surveys and rankings over time reflects the growing importance of data science applied using R.


## Data Visualisation

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

by Nic Crane

Data visualisation is a key piece of the analysis process. At Mango, we consider the ability to create compelling visualisations to be sufficiently important that we include it as one of the core attributes of a data scientist on our data science radar.

Although visualisation of data is important in order to communicate the results of an analysis to stakeholders, it also forms a crucial part of the exploratory process. In this stage of analysis, the basic characteristics of the data are examined and explored.

The real value of data analyses lies in accurate insights, and mistakes in this early stage can lead to the realisation of the favourite adage of many statistics and computer science professors: “garbage in, garbage out”.

Whilst it can be tempting to jump straight into fitting complex models to the data, overlooking exploratory data analysis can lead to the violation of the assumptions of the model being fit, and so decrease the accuracy and usefulness of any conclusions to be drawn later.

This point was demonstrated in a beautifully simple way by the statistician Francis Anscombe, who in 1973 designed a set of small datasets, each showing a distinct pattern of results. Although the four datasets comprising Anscombe’s Quartet have identical or near-identical means, variances, correlations between variables, and linear regression lines, they look strikingly different when plotted, highlighting the inadequacy of relying on simple summary statistics in exploratory data analysis.
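R ships Anscombe’s quartet as the built-in `anscombe` data frame, so the point is easy to verify yourself (this snippet is an illustration, not code from the original post):

```r
data(anscombe)  # built into base R (datasets package)

# Summary statistics are nearly identical across the four x/y pairs...
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
round(stats, 2)

# ...but plotting reveals four very different patterns
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
}
par(op)
```

Every pair has a mean x of 9, a mean y of about 7.5, and a correlation of about 0.816, yet the scatterplots show a line, a curve, an outlier-driven fit, and a vertical cluster.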

The accompanying Shiny app allows you to view various aspects of each of the four datasets. The beauty of Shiny’s interactive nature is that you can quickly change between each dataset to really get an in-depth understanding of their similarities and differences.


## Eclipse – an alternative to RStudio – part 2

By Placidia

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In part one of this tutorial, I showed how to install R and LaTeX for Eclipse. Sadly, Eclipse doesn’t yet know where to find these resources, so we need to configure it appropriately. Statet provides a wizard to walk you through the setup, but I find the wizard difficult to follow. However, we can set up everything from the menus.

If you have used Eclipse for other projects, such as Java or Python, the setup for Statet will be familiar and intuitive. But if this is your first time on Eclipse, there’s a lot to take in. Basically, there are four steps to go through.

• Switch to the Statet perspective
• Tell Eclipse how to use R. This requires two configurations.
• Tell Eclipse how to use LaTeX
• Tell Eclipse where to find your pdf viewer

Let’s walk through each of these steps in detail.

## The Statet perspective

Eclipse offers its users a way to organize programming projects by displaying the files and resources these projects require and by coordinating their use. A perspective is Eclipse speak for what you see after launch: the display of windows, folder trees and other features you might want for your work flow. You can change the displays during your session. Eclipse remembers what you did and launches a new session with the same perspective that you had upon shutdown.

The previous tutorial ended with a clean install of Eclipse, to which we had just installed Statet. Upon relaunch, we get the default Java perspective. You can see the word Java in highlight at the upper right corner. To get the Statet perspective, select Window -> Perspective -> Open perspective -> Other.

This gives you the list of all installed perspectives. Select Statet from the list.

Done! Now you should see the default layout of the Statet perspective. Note that Statet appears in highlight to the right of the Java button. We can switch between these perspectives by clicking their respective buttons. Alternatively, you can go through the drop down menu as before, selecting Window and Perspective.

## A quick tour of the Statet perspective

I’ll talk more about working with this perspective in a later post, but for now, let’s take a quick look at what you get.

• A menu bar with the usual items. We have already used Window and Help.
• An icon toolbar with shortcuts to useful items from the menu bar. I usually work with the icons.
• Upper left: the project manager. When we create projects and add project folders, they will show up here.
• Top centre: a blank window for Editors. To Eclipse, anything that creates a document or displays a document is an Editor. This is where you will write your Sweave documents. It’s also where the compiled PDFs will appear.
• Bottom centre: a blank window for the R console.
• Upper right: Outline. This panel provides an outline of your LaTeX/Sweave document. Section headings and sub-headings appear as indents. The outline also references any R chunks you may have in your document.
• Lower left: other options. Click on the tabs for the object browser, R help files and other useful information. The Object browser gives you the output from the R `search()` function.

None of these goodies are actually available to us yet, because we need to set up the Run configurations for R and LaTeX.

## The R run configuration

Let’s try something that doesn’t work! Select Run -> Run as, or click on the context menu next to the green menu button.

Typically, the green button will run the previous configuration, but since we have no launch history, nothing happens! We need to create a Run configuration first.

Select Run Configurations to get the configuration screen. Select R Console and click the document icon at the upper left to get a template for the new configuration.

Note the error message along the top of the screen: The R environment preferences of the Workspace are invalid. They are invalid because we haven’t set them up yet. We’ll get to that later.

Give the configuration a useful name. I’m running R 3.3.0, Supposedly Educational, so I name the configuration Educational. Then enter the working directory. The plus sign to the right of the form opens to allow you to browse to the folder of your choice.

You could configure a separate configuration for every R project and direct R to launch from the project folder. I typically use the Eclipse workspace here, and change my working directory from the R console as needed.

Click Apply to save the changes and select the R Config tab. This is where we tell Eclipse which version of R we want to use. Currently, we have no configurations, so let’s add one.

## The R configuration screen

Click on Select configuration and then click on Configure.

Select R Environment and click Add.

If you installed R using the default settings, this screen will typically complete itself with little prompting from you.

• Give the environment a name.
• Insert the location of R home. The context-menu item Try find automatically usually works. If it fails, open an R console (outside of Eclipse!) and run `R.home()`. This will give you the absolute path to the top-level directory of your R installation. Use this value.

Now click on Detect Default Properties/Settings. Click OK to complete the screen.

Eclipse can build an R index to enable access to package help files. There are several options here from the drop down menu. I use Check and update automatically. And finally, don’t forget to click Apply.

### Back to the Run Configuration template

The previous step returns you to the Run Configuration template. But this time, we can supply an R environment. With only one version of R on my system, I select the default. If you are running several versions of R, you can create a configuration for each and select the one you need for the current Run configuration. Complete the screen and click Apply.

### Finishing up

The default settings from the remaining tabs typically work fine. However, if you need to specify additional R environment variables, a specific version of Java or working directories, you can do this here. I only need to make a few adjustments, all on the Common tab.

I select Debug and Run. This lets my configuration appear on the menus of the green Run button and of the bug shaped Debug button.

You may also want to adjust the encoding. The template defaults to the default encoding of your operating system. On my Linux box, this is UTF-8. Windows users may have a different default.

### Encoding for Windows users in bilingual workplaces

If you are running Windows as an English speaker, your default will be cp1250. If you need to write a document in a different language, cp1250 may not play well with LaTeX’s babel package. If you find that your PDFs look weird, check the encoding under the Common tab and switch to UTF-8, or whatever you need.

### And finally …

Click Apply one last time to save the configuration. Now click Close to leave the template, or Run to test your work. An R console should launch in the console window of Eclipse.

The first time, you will have to wait a few moments before you get the R prompt while the system indexes your help files. I choose the global index option. Subsequent launches will go a lot faster.

## Configure LaTeX

If you followed the previous steps, you should now be able to run an R console from within Eclipse. You can also create R scripts, save and run them, as I show in a different tutorial. But the power of Statet lies in allowing you to embed R commands in a LaTeX document, thus combining the analysis and its documentation in one place. To get this up and running, Eclipse needs to know where to find LaTeX and how to run it.

After what you’ve been through with configuring R, configuring LaTeX should be straightforward. Let’s walk through the details.

Select Run -> External Tools -> External tools configuration from the menu, or select the context menu next to the green arrow with the briefcase and External tools configuration.

The external tools configuration menu is similar to the Run configuration menu that we saw before. Highlight Sweave document and click on the icon for a new launch configuration.

The configuration menu appears. Give the configuration a name. Most of the defaults work fine as they are, but you will probably want to change the build option. Click on the tab marked LaTeX.

This screen lets you control the LaTeX build. The default creates a dvi file, which is a bit old fashioned. I prefer to use pdflatex to obtain my document as a pdf.

Select the menu arrow next to dvi and select pdflatex from the menu.

And don’t forget to click Apply to save the change.

Select the Common tab and select External tools. This causes the Sweave configuration to appear under the External Tools icon on the toolbar. Check that the encoding suits the language you plan to use. Finally, click Apply.

Close the menu. Eclipse is now configured for Sweave.

### Setting the PDF viewer

When Sweave compiles your document, it will attempt to display the PDF in the editor window using the default Eclipse viewer. Current versions of Eclipse do not specify a default PDF reader, so we need to do that now.

Select Window -> Preferences to get the preferences menu. Select Editors -> File associations. Click Add.

Enter `*.pdf` in the form. Click OK.

We have just added a new file extension. Now, we need to add an associated viewer. Click Add next to the window of Associated Editors. A list will appear of applications internal to Eclipse, followed by external applications. I am running Kubuntu, so I select Okular. On a Windows system, you could select the Adobe Acrobat Reader, or whatever you have installed. It should appear on the list.

## Troubleshooting

Congratulations! Grab a coffee and enjoy using R and Sweave in Eclipse. If you followed all the instructions as indicated, you should be able to compose a document in LaTeX with R code chunks and compile your work to a PDF document.

But what if it fails?

• Double check the previous steps.
• Ensure that you clicked Apply on every screen that has an Apply button. These are a bit random, so make sure you didn’t miss one of them.
• Failures often occur in completing the R configuration menu. In particular, make sure that you clicked Detect default properties/settings from the R Config menu.


## Interactive Subsetting Exercises

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The function “`subset()`” is intended as a convenient, interactive substitute for subsetting with brackets. `subset()` extracts subsets of matrices, data frames, or vectors (including lists), according to specified conditions.
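For instance (a small illustration, separate from the exercises below), a single `subset()` call can combine a row condition with column selection, replacing the equivalent bracket expression:

```r
# Rows with mpg >= 21, keeping only the columns mpg through hp
with_subset <- subset(mtcars, mpg >= 21, select = mpg:hp)

# The equivalent bracket-based subsetting
with_brackets <- mtcars[mtcars$mpg >= 21, c("mpg", "cyl", "disp", "hp")]

identical(with_subset, with_brackets)  # TRUE
```

Note that `select` accepts unquoted column ranges, which is part of what makes `subset()` convenient for interactive use.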

Answers to the exercises are available here.

Exercise 1

Subset the vector, “`mtcars[,1]`“, for values greater than “`15.0`“.

Exercise 2

Subset the dataframe, “`mtcars`”, for rows with “`mpg`” greater than, or equal to, 21 miles per gallon.

Exercise 3

Subset “`mtcars`” for rows with “`cyl`” less than “`6`“, and “`gear`” exactly equal to “`4`“.

Exercise 4

Subset “`mtcars`” for rows greater than, or equal to, 21 miles per gallon. Also, select only the columns, “`mpg`” through “`hp`“.

Exercise 5

Subset “`airquality`” for “`Ozone`” greater than “`28`“, or “`Temp`” greater than “`70`“. Return the first five rows.

Exercise 6

Subset “`airquality`” for “`Ozone`” greater than “`28`“, and “`Temp`” greater than “`70`“. Select the columns, “`Ozone`” and “`Temp`“. Return the first five rows.

Exercise 7

Subset the “`CO2`” dataframe for “`Treatment`” values of “`chilled`“,
and “`uptake`” values greater than “`15`“. Remove the category, “`conc`“. Return the first 10 rows.

Exercise 8

Subset the “`airquality`” dataframe for rows without “`Ozone`” values of “`NA`“.

Exercise 9

Subset “`airquality`” for “`Ozone`” greater than “`100`“. Select the columns “`Ozone`“, “`Temp`“, “`Month`” and “`Day`” only.

Exercise 10

Subset “`LifeCycleSavings`” for “`sr`” greater than “`8`“, and less than “`10`“. Remove columns “`pop75`” through “`dpi`“.

Image by Clker-free-vector-images (Pixabay post) [CC0 Public Domain], via Pixabay.