Copying tables from R to Outlook

By Abhijit

I work in an ecosystem that uses Outlook for e-mail. When I have to communicate results with collaborators, one of the most frequent tasks I face is taking tabular output from R (a summary table or some other tabular result) and sending it to collaborators in Outlook. One method is certainly to export the table to Excel and then copy it from there into Outlook. However, I prefer another method which works a bit quicker for me.

I’ve been writing full reports using Rmarkdown for a while now, and it’s my preferred report-generation method. Usually I use knitr::kable to generate a Markdown version of a table in R. I can then copy the generated Markdown version of the table into a Markdown editor (I use Minimalist Markdown Editor), then just copy the HTML-rendered table from the preview pane to Outlook. This seems to work pretty well for me.
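
For illustration, here is a minimal sketch of that workflow (the summary table below is just an example, not from the original post):

library(knitr)

## any summary table will do; this one averages mpg by cylinder count
tab <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

## print a Markdown version of the table to the console
kable(tab, format = "markdown", digits = 2)

## the Markdown output can then be pasted into a Markdown editor, and the
## HTML-rendered table copied from its preview pane into Outlook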

Source:: R News

testing R code [book review]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

When I saw this title among the CRC Press novelties, I immediately ordered it as I thought it fairly exciting. Now that I have gone through the book, the excitement has died. Maybe faster than need be, as I read it while being stuck in a soulless Schiphol airport and missing the only ice-climbing opportunity of the year!

Testing R Code was written by Richard Cotton and is quite short: once you take out the appendices and the answers to the exercises, it is about 130 pages long, with a significant proportion of code and output. And it is about some functions developed by Hadley Wickham from RStudio, for testing the coherence of R code in terms of inputs more than outputs. The packages covered are assertive and testthat, intended for run-time versus development-time testing respectively, meaning that the outputs and the inputs are what the author of the code intends them to be. The other chapters contain advice and heuristics about writing maintainable, testable code, and incorporating a testing feature in an R package.

While I am definitely a poorly qualified reader for this type of R book, my disappointment stems from my expectation of a book about debugging R code, which is possibly due to a misunderstanding of the term testing. This is an unrealistic expectation, for sure, as testing that a code produces what it is supposed to requires some advanced knowledge of what the output should be, at least in some representative situations. Which means using an interface like RStudio is capital in spotting unsavoury behaviours of some variables, if not foolproof in any case.

Filed under: R, Statistics, Travel Tagged: CRC Press, debugging, R, R package, RStudio, testing, Testing R Code

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.


Source:: R News

A (much belated) update to plotting Kaplan-Meier curves in the tidyverse

By Abhijit

One of the most popular posts on this blog has been my attempt to create Kaplan-Meier plots with an aligned table of persons-at-risk below it under the ggplot paradigm. That post was last updated 3 years ago. In the interim, Chris Dardis has built upon these attempts to create a much more stable and feature-rich version of this function in his package survMisc; the function is called autoplot.

Source:: R News

Playing with dimensions: from Clustering, PCA, t-SNE… to Carl Sagan!

By Pabloc


(This article was first published on R – Data Science Heroes Blog, and kindly contributed to R-bloggers)

Playing with dimensions

Hi there! This post is an experiment combining the result of t-SNE with two well-known clustering techniques: k-means and hierarchical clustering. This will be the practical section, in R.


But also, this post will explore the intersection point of concepts like dimension reduction, clustering analysis, data preparation, PCA, HDBSCAN, k-NN, SOM, deep learning…and Carl Sagan!

First published at: http://blog.datascienceheroes.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan

PCA and t-SNE

For those who don’t know the t-SNE technique (official site), it’s a projection -or dimension reduction- technique, similar in some aspects to Principal Component Analysis (PCA), used to visualize N variables in 2 dimensions (for example).

When the t-SNE output is poor, Laurens van der Maaten (t-SNE’s author) says:

As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong.

In my experience, doing PCA on dozens of variables with:

  • some extreme values
  • skewed distributions
  • several dummy variables

doesn’t lead to good visualizations.

Check out this example comparing the two methods:


Source: Clustering in 2-dimension using tsne

Makes sense, doesn’t it?

Surfing higher dimensions 🏄

Since one of the t-SNE results is a two-dimensional matrix in which each dot represents an input case, we can apply clustering and then group the cases according to their distance in this 2-dimensional map, much as a geographic map projects our 3-dimensional world onto two dimensions (paper).

t-SNE puts similar cases together and handles non-linearities in the data very well. After using the algorithm on several data sets, I believe that in some cases it creates circular, island-like shapes where the cases are similar.

However, I didn’t see this effect in the live demonstration from the Google Brain team: How to Use t-SNE Effectively. Perhaps that is because of the nature of the input data, which had only 2 variables.

The swiss roll data

t-SNE, according to its FAQ, doesn’t work very well with the swiss roll -toy- data. However, it’s a stunning example of how a 3-dimensional surface (or manifold) with a distinct spiral shape unfolds like paper thanks to a dimension reduction technique.

The image is taken from this paper where they used the manifold sculpting technique.


Now the practice in R!

t-SNE helps make the clusters more accurate because it converts the data into a 2-dimensional space in which the dots form roughly circular shapes (this suits k-means; non-spherical groups are one of its weak points when creating segments. More on this: K-means clustering is not a free lunch).

First, some data preparation before applying the clustering models.

library(caret)  
library(Rtsne)

######################################################################
## The WHOLE post is in: https://github.com/pablo14/post_cluster_tsne
######################################################################

## Download data from: https://github.com/pablo14/post_cluster_tsne/blob/master/data_1.txt (url path inside the gitrepo.)
data_tsne=read.delim("data_1.txt", header = T, stringsAsFactors = F, sep = "\t")

## Rtsne function may take some minutes to complete...
set.seed(9)  
tsne_model_1 = Rtsne(as.matrix(data_tsne), check_duplicates=FALSE, pca=TRUE, perplexity=30, theta=0.5, dims=2)

## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)  

Different runs of Rtsne lead to different results, so more than likely you will not see exactly the same model as the one presented here.

According to the official documentation, perplexity is related to the importance of neighbors:

  • “It is comparable with the number of nearest neighbors k that is employed in many manifold learners.”
  • “Typical values for the perplexity range between 5 and 50”

Object tsne_model_1$Y contains the X-Y coordinates (V1 and V2 variables) for each input case.

Plotting the t-SNE result:

## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +  
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab("") + ylab("") +
  ggtitle("t-SNE") +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank()) +
  scale_colour_brewer(palette = "Set2")


And there they are, the famous “islands” 🏝️. At this point, we could do some clustering just by looking at the map… but let’s try k-means and hierarchical clustering instead 😄. t-SNE’s FAQ page suggests decreasing the perplexity parameter to avoid this effect; nonetheless, I didn’t find a problem with this result.

Creating the cluster models

The next piece of code creates the k-means and hierarchical cluster models and then assigns the cluster number (1, 2 or 3) to which each input case belongs.

## keeping original data
d_tsne_1_original=d_tsne_1

## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 3)  
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)

## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))

## setting 3 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=3))  

Plotting the cluster models onto t-SNE output

Now it’s time to plot the result of each cluster model, based on the t-SNE map.

plot_cluster=function(data, var_cluster, palette)  
{
  ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
  geom_point(size=0.25) +
  guides(colour=guide_legend(override.aes=list(size=6))) +
  xlab("") + ylab("") +
  ggtitle("") +
  theme_light(base_size=20) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        legend.direction = "horizontal", 
        legend.position = "bottom",
        legend.box = "horizontal") + 
    scale_colour_brewer(palette = palette) 
}


plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Accent")  
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")

## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)  
grid.arrange(plot_k, plot_h,  ncol=2)  


Visual analysis

In this case, and based only on visual analysis, hierarchical clustering seems to make more sense than k-means. Take a look at the following image:


Note: dashed lines separating the clusters were drawn by hand

In k-means, some points at the bottom-left corner are quite close to each other compared with the distance between other points inside the same cluster, yet they belong to different clusters. Illustrating it:


So we’ve got: the red arrow is shorter than the blue arrow…

Note: different runs may lead to different groupings; if you don’t see this effect in that part of the map, look for it in another.

This effect doesn’t happen with hierarchical clustering. The clusters from this model seem more even. But what do you think?

Biasing the analysis (cheating)

It’s not fair to compare k-means like that. The last analysis is based on the idea of density clustering, a technique that is really good at overcoming the pitfalls of simpler methods.

The HDBSCAN algorithm bases its process on densities.

Find the essence of each one by looking at this picture:


Surely you understood the difference between them…

The last picture comes from Comparing Python Clustering Algorithms. Yes, Python, but it’s the same for R. The package is largeVis. (Note: install it with devtools::install_github("elbamos/largeVis", ref = "release/0.2").)
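
As a rough illustration of the idea (not from the original post): the t-SNE coordinates computed above can be fed to an HDBSCAN implementation. The sketch below uses the dbscan package's hdbscan() rather than largeVis, and minPts = 15 is an arbitrary choice.

## density-based clustering of the t-SNE map with HDBSCAN (via the dbscan
## package; largeVis also ships an implementation)
library(dbscan)

fit_cluster_hdbscan = hdbscan(as.matrix(d_tsne_1), minPts = 15)

## cluster 0 marks noise points; everything else is a density-based cluster
d_tsne_1_original$cl_hdbscan = factor(fit_cluster_hdbscan$cluster)
plot_cluster(d_tsne_1_original, "cl_hdbscan", "Dark2")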

Deep learning and t-SNE

Quoting Luke Metz from a great post (Visualizing with t-SNE):

Recently there has been a lot of hype around the term “deep learning”. In most applications, these “deep” models can be boiled down to the composition of simple functions that embed from one high dimensional space to another. At first glance, these spaces might seem too large to think about or visualize, but techniques such as t-SNE allow us to start to understand what’s going on inside the black box. Now, instead of treating these models as black boxes, we can start to visualize and understand them.

A deep comment 👏.

Final thoughts 🚀

Beyond this post, t-SNE has proven to be a really great general-purpose tool for reducing dimensionality. It can be used to explore the relationships inside the data by building clusters, or to analyze anomalous cases by inspecting the isolated points in the map.

Playing with dimensions is a key concept in data science and machine learning. The perplexity parameter is really similar to the k in the nearest neighbors algorithm (k-NN). Mapping data into two dimensions and then doing clustering? Hmmm, not new, buddy: Self-Organising Maps for Customer Segmentation.

When we select the best features to build a model, we’re reducing the data’s dimension. When we build a model, we are creating a function that describes the relationships in data… and so on…

Did you know the general concepts behind k-NN and PCA? Well, this is one more step: just plug the cables into the brain and that’s it. Learning general concepts gives us the opportunity to make these kinds of associations between all of these techniques. Rather than comparing programming languages, the power, in my opinion, lies in focusing on how data behaves, and how these techniques are, and can ultimately be, connected.

Explore the imagination with this Carl Sagan video: Flatland and the 4th Dimension, a tale about the interaction of 3D objects with a 2D plane…



To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Heroes Blog.


Source:: R News

Forecasting gentrification in city neighborhoods, with R

By David Smith

If you’ve lived in a big city, you’re likely familiar with the impact of gentrification. For longtime residents of a neighbourhood, it can represent a decline in the culture and vibrancy of your community; for recent or prospective residents, it can represent a financial opportunity in rising home prices. For those that live in a gentrifying neighbourhood, it’s one of those you-know-it-when-you-see-it things, but for economists and urban planners it can be difficult to identify. So a team of analysts at Urban Spatial built a longitudinal model based on census tract data to quantify gentrification. Their motivation?

Neighborhoods change because people and capital are mobile and when new neighborhood demand emerges, incumbent residents rightfully worry about displacement.

Acknowledging these economic and social realities, policy makers have a responsibility to balance economic development and equity. To that end, analytics can help us understand how the wave of reinvestment moves across space and time and how to pinpoint neighborhoods where active interventions are needed today in order to avoid negative outcomes in the future.

They also provide a detailed data visualization tutorial showing how they used R to visualize the results, like this sequence of panels showing the dramatic rise in prices for homes in the Mission District of San Francisco over the last eight years.

The full R code behind this and other interesting charts is included, often making extensive use of the ggmap package.

Zooming out from neighbourhoods to cities, the full report describes how the analysts built a model to predict gentrification within “legacy” US cities (and even produced detailed maps within those cities showing where gentrification was likely to occur). At the city level, the impact is shown in rising or declining average housing values:

If you’re into the economic aspect, check out the full report for lots of interesting analysis of housing trends in US cities. Or, if you just want to learn how all the charts and maps were made, click the link below to see the tutorial.

Urban Spatial: #Dataviz tutorial: Mapping San Francisco home prices using R (via Sharon Machlis)

Source:: R News

forecast 8.0

By Rob J Hyndman

(This article was first published on R – Hyndsight, and kindly contributed to R-bloggers)

In what is now a roughly annual event, the forecast package has been updated on CRAN with a new version, this time 8.0.

A few of the more important new features are described below.

Check residuals

A common task when building forecasting models is to check that the residuals satisfy some assumptions (that they are uncorrelated, normally distributed, etc.). The new function checkresiduals makes this very easy: it produces a time plot, an ACF, a histogram with a superimposed normal curve, and does a Ljung-Box test on the residuals with an appropriate number of lags and degrees of freedom.

fit <- auto.arima(WWWusage)
checkresiduals(fit)
## 
##  Ljung-Box test
## 
## data:  residuals
## Q* = 7.8338, df = 8, p-value = 0.4499
## 
## Model df: 2.   Total lags used: 10

This should work for all the modelling functions in the package, as well as some of the time series modelling functions in the stats package.

Different types of residuals

Usually, residuals are computed as the difference between observations and the corresponding one-step forecasts. But for some models, residuals are computed differently; for example, a multiplicative ETS model or a model with a Box-Cox transformation. So the residuals() function now has an additional argument to deal with this situation.

“Innovation residuals” correspond to the white noise process that drives the evolution of the time series model. “Response residuals” are the difference between the observations and the fitted values (as with GLMs). For homoscedastic models, the innovation residuals and the one-step response residuals are identical. “Regression residuals” are also available for regression models with ARIMA errors, and are equal to the original data minus the effect of the regression variables. If there are no regression variables, the errors will be identical to the original series (possibly adjusted to have zero mean).

library(ggplot2)
fit <- ets(woolyrnq)
res <- cbind(Residuals = residuals(fit), 
             Response.residuals = residuals(fit, type='response'))
autoplot(res, facets=TRUE)

Some new graphs

The geom_histogram() function in the ggplot2 package is nice, but it does not have a good default binwidth. So I added the gghistogram function, which provides a quick histogram with good defaults. You can also overlay a normal density curve or a kernel density estimate.

gghistogram(lynx)

The ggseasonplot function is useful for studying seasonal patterns and how they change over time. It now has a polar argument to create graphs like this.

ggseasonplot(USAccDeaths, polar=TRUE)

I often want to add a time series line to an existing plot. Base graphics has lines(), which works well when a time series is passed as an argument. So I added autolayer, which is similar (but more general). It is an S3 method like autoplot, and adds a layer to an existing ggplot object. autolayer will eventually form part of the next release of ggplot2, but for now it is available in the forecast package. There are methods provided for ts and forecast objects:

WWWusage %>% ets %>% forecast(h=20) -> fc
autoplot(WWWusage, series="Data") + 
  autolayer(fc, series="Forecast") + 
  autolayer(fitted(fc), series="Fitted")

Cross-validation

The tsCV and CVar functions have been added. These were discussed in a previous post.
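
As a rough illustration (not from the original post), tsCV returns forecast errors from a rolling forecast origin; the forecast function and horizon below are arbitrary choices.

## one-step time series cross-validation with naive forecasts;
## any function that returns a forecast object can be used
e <- tsCV(WWWusage, forecastfunction = naive, h = 1)
sqrt(mean(e^2, na.rm = TRUE))   # cross-validated RMSE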

Bagged ETS

The baggedETS function has been added, which implements the procedure discussed in Bergmeir et al (2016) for bagging ETS forecasts.
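
A minimal, hedged sketch of how it might be called (not from the original post; fitting can take a while since many bootstrapped series are modelled):

## fit a bagged ETS model and forecast from it
fit <- baggedETS(WWWusage)
fc <- forecast(fit, h = 20)
autoplot(fc)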

head and tail of time series

I’ve long found it annoying that head and tail do not work on multiple time series. So I added some functions to the package so they now work.
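
For example (a trivial sketch, not from the original post):

## with the forecast package loaded, head and tail work on multiple time series
head(EuStockMarkets, 4)
tail(EuStockMarkets, 4)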

Imports and Dependencies

The pipe operator from the magrittr package is now imported. So you don’t need to load the magrittr package to use it.

There are now no packages that are loaded with forecast – everything required is imported. This makes the start up much cleaner (no more annoying messages from all those packages being loaded). Instead, some random tips are occasionally printed when you load the forecast package (much like ggplot2 does).

There is quite a bit more — see the Changelog for a list.

To leave a comment for the author, please follow the link and comment on their blog: R – Hyndsight.


Source:: R News

How to create correlation network plots with corrr and ggraph (and which countries drink like Australia)

By Simon Jackson


@drsimonj here to show you how to use ggraph and corrr to create correlation network plots like these:

ggraph and corrr

The ggraph package by Thomas Lin Pedersen has just been published on CRAN and it’s so hot right now! What does it do?

“ggraph is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees.”

A relational metric I work with a lot is correlations. Because of this, I created the corrr package, which helps to explore correlations by leveraging data frames and tidyverse tools rather than matrices.

So…

  • corrr creates relational data frames of correlations intended to work with tidyverse tools like ggplot2.
  • ggraph extends ggplot2 to help plot relational structures.

Seems like a perfect match!

Libraries

We’ll be using the following libraries:

library(tidyverse)
library(corrr)
library(igraph)
library(ggraph)

Basic approach

Given a data frame d of numeric variables for which we want to plot the correlations in a network, here’s a basic approach:

# Create a tidy data frame of correlations
tidy_cors <- d %>% 
  correlate() %>% 
  stretch()

# Convert correlations stronger than some value
# to an undirected graph object
graph_cors <- tidy_cors %>% 
  filter(abs(r) > `VALUE_BETWEEN_0_AND_1`) %>% 
  graph_from_data_frame(directed = FALSE)

# Plot
ggraph(graph_cors) +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph()

Example 1: correlating variables in mtcars

Let’s follow this for the mtcars data set. By default, all variables are numeric, so we don’t need to do any pre-processing.

We first create a tidy data frame of correlations to be converted to a graph object. We do this with two corrr functions: correlate(), to create a correlation data frame, and stretch(), to convert it into a tidy format:

tidy_cors <- mtcars %>% 
  correlate() %>% 
  stretch()

tidy_cors
#> # A tibble: 121 × 3
#>        x     y          r
#>    <chr> <chr>      <dbl>
#> 1    mpg   mpg         NA
#> 2    mpg   cyl -0.8521620
#> 3    mpg  disp -0.8475514
#> 4    mpg    hp -0.7761684
#> 5    mpg  drat  0.6811719
#> 6    mpg    wt -0.8676594
#> 7    mpg  qsec  0.4186840
#> 8    mpg    vs  0.6640389
#> 9    mpg    am  0.5998324
#> 10   mpg  gear  0.4802848
#> # ... with 111 more rows

Next, we convert these values to an undirected graph object. The graph is undirected because correlations do not have a direction; for example, correlations do not assume cause or effect. This is done using the igraph function graph_from_data_frame() with directed = FALSE.

Because we typically don’t want to see ALL of the correlations, we first filter() out any correlations with an absolute value less than some threshold. For example, let’s include correlations that are .3 or stronger (positive OR negative):

graph_cors <- tidy_cors %>%
  filter(abs(r) > .3) %>%
  graph_from_data_frame(directed = FALSE)

graph_cors
#> IGRAPH UN-- 11 88 -- 
#> + attr: name (v/c), r (e/n)
#> + edges (vertex names):
#>  [1] mpg --cyl  mpg --disp mpg --hp   mpg --drat mpg --wt   mpg --qsec
#>  [7] mpg --vs   mpg --am   mpg --gear mpg --carb mpg --cyl  cyl --disp
#> [13] cyl --hp   cyl --drat cyl --wt   cyl --qsec cyl --vs   cyl --am  
#> [19] cyl --gear cyl --carb mpg --disp cyl --disp disp--hp   disp--drat
#> [25] disp--wt   disp--qsec disp--vs   disp--am   disp--gear disp--carb
#> [31] mpg --hp   cyl --hp   disp--hp   hp  --drat hp  --wt   hp  --qsec
#> [37] hp  --vs   hp  --carb mpg --drat cyl --drat disp--drat hp  --drat
#> [43] drat--wt   drat--vs   drat--am   drat--gear mpg --wt   cyl --wt  
#> + ... omitted several edges

We now plot this object with ggraph. Here’s a basic plot:

ggraph(graph_cors) +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name))

car-plot-basic-1.jpeg

and here’s one that’s polished to look nicer:

ggraph(graph_cors) +
  geom_edge_link(aes(edge_alpha = abs(r), edge_width = abs(r), color = r)) +
  guides(edge_alpha = "none", edge_width = "none") +
  scale_edge_colour_gradientn(limits = c(-1, 1), colors = c("firebrick2", "dodgerblue2")) +
  geom_node_point(color = "white", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph() +
  labs(title = "Correlations between car variables")

car-plot-1.jpeg

For an excellent resource on how these graphing parts work, Thomas has some great posts like this one on his blog, data-imaginist.com.

Example 2: countries with similar drinking habits

This example requires some data pre-processing, and we’ll only look at strong positive correlations.

I’m about to finish my job in Australia and am looking for work elsewhere. As is typical of Australians, a friend suggested I look for work in countries where people drink like us. This is probably not the best approach for job hunting, but it makes for a fun example!

Conveniently, FiveThirtyEight did a story on the amount of beer, wine, and spirits, drunk by countries around the world. Even more conveniently, the data is included in the fivethirtyeight package! Let’s take a look:

library(fivethirtyeight)

drinks
#> # A tibble: 193 × 5
#>              country beer_servings spirit_servings wine_servings
#>                <chr>         <int>           <int>         <int>
#> 1        Afghanistan             0               0             0
#> 2            Albania            89             132            54
#> 3            Algeria            25               0            14
#> 4            Andorra           245             138           312
#> 5             Angola           217              57            45
#> 6  Antigua & Barbuda           102             128            45
#> 7          Argentina           193              25           221
#> 8            Armenia            21             179            11
#> 9          Australia           261              72           212
#> 10           Austria           279              75           191
#> # ... with 183 more rows, and 1 more variables:
#> #   total_litres_of_pure_alcohol <dbl>

I wanted to find which countries in Europe and the Americas had similar patterns of beer, wine, and spirit drinking, and where Australia fit in. Using the countrycode package to bind continent information and find the countries I’m interested in, let’s get this data into shape for correlations:

library(countrycode)

# Get relevant data for Australia and countries
# in Europe and the Americas
d <- drinks %>% 
  mutate(continent = countrycode(country, "country.name", "continent")) %>% 
  filter(continent %in% c("Europe", "Americas") | country == "Australia") %>% 
  select(country, contains("servings"))

# Scale data to examine relative amounts,
# rather than absolute volume, of
# beer, wine and spirits drunk
scaled_data <- d %>% mutate_if(is.numeric, scale)

# Tidy the data
tidy_data <- scaled_data %>% 
  gather(type, litres, -country) %>% 
  drop_na() %>% 
  group_by(country) %>% 
  filter(sd(litres) > 0) %>% 
  ungroup()

# Widen into suitable format for correlations
wide_data <- tidy_data %>% 
  spread(country, litres) %>% 
  select(-type)

wide_data
#> # A tibble: 3 × 78
#>      Albania     Andorra `Antigua & Barbuda`   Argentina  Australia
#> *      <dbl>       <dbl>               <dbl>       <dbl>      <dbl>
#> 1 -1.0798330  0.68479335          -0.9327808  0.09658458  0.8657807
#> 2 -0.1146881 -0.04560957          -0.1607405 -1.34658934 -0.8054739
#> 3 -0.4577044  2.16796347          -0.5492974  1.24185582  1.1502628
#> # ... with 73 more variables: Austria <dbl>, Bahamas <dbl>,
#> #   Barbados <dbl>, Belarus <dbl>, Belgium <dbl>, Belize <dbl>,
#> #   Bolivia <dbl>, `Bosnia-Herzegovina` <dbl>, Brazil <dbl>,
#> #   Bulgaria <dbl>, Canada <dbl>, Chile <dbl>, Colombia <dbl>, `Costa
#> #   Rica` <dbl>, Croatia <dbl>, Cuba <dbl>, `Czech Republic` <dbl>,
#> #   Denmark <dbl>, Dominica <dbl>, `Dominican Republic` <dbl>,
#> #   Ecuador <dbl>, `El Salvador` <dbl>, Estonia <dbl>, Finland <dbl>,
#> #   France <dbl>, Germany <dbl>, Greece <dbl>, Grenada <dbl>,
#> #   Guatemala <dbl>, Guyana <dbl>, Haiti <dbl>, Honduras <dbl>,
#> #   Hungary <dbl>, Iceland <dbl>, Ireland <dbl>, Italy <dbl>,
#> #   Jamaica <dbl>, Latvia <dbl>, Lithuania <dbl>, Luxembourg <dbl>,
#> #   Macedonia <dbl>, Malta <dbl>, Mexico <dbl>, Moldova <dbl>,
#> #   Monaco <dbl>, Montenegro <dbl>, Netherlands <dbl>, Nicaragua <dbl>,
#> #   Norway <dbl>, Panama <dbl>, Paraguay <dbl>, Peru <dbl>, Poland <dbl>,
#> #   Portugal <dbl>, Romania <dbl>, `Russian Federation` <dbl>, `San
#> #   Marino` <dbl>, Serbia <dbl>, Slovakia <dbl>, Slovenia <dbl>,
#> #   Spain <dbl>, `St. Kitts & Nevis` <dbl>, `St. Lucia` <dbl>, `St.
#> #   Vincent & the Grenadines` <dbl>, Suriname <dbl>, Sweden <dbl>,
#> #   Switzerland <dbl>, `Trinidad & Tobago` <dbl>, Ukraine <dbl>, `United
#> #   Kingdom` <dbl>, Uruguay <dbl>, USA <dbl>, Venezuela <dbl>

This data includes the z-scores of the amount of beer, wine and spirits drunk in each country.

We can now go ahead with our standard approach. Because I’m only interested in which countries are really similar, we’ll filter(r > .9), as sketched below:
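
The code for this step isn’t reproduced in this digest; a hedged reconstruction following the basic approach above would look something like this (the aesthetic choices are illustrative):

## correlate the countries, keep only strong positive correlations,
## and plot the resulting network
graph_cors <- wide_data %>% 
  correlate() %>% 
  stretch() %>% 
  filter(r > .9) %>% 
  graph_from_data_frame(directed = FALSE)

ggraph(graph_cors) +
  geom_edge_link(aes(edge_alpha = r, edge_width = r), colour = "dodgerblue2") +
  guides(edge_alpha = "none", edge_width = "none") +
  geom_node_point(color = "white", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph() +
  labs(title = "Countries with similar drinking habits")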

country-plot-1.jpeg

It looks like the drinking behaviour of these countries groups into three clusters. I’ll leave it to you to think about what defines those clusters!

The important thing for my friend: Australia appears in the top-left cluster along with many West and North European countries like the United Kingdom, France, the Netherlands, Norway, and Sweden. Perhaps this is the region where I should look for work if I want to keep up Aussie drinking habits!

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

Source:: R News

How to annotate a plot in ggplot2

By Sharp Sight

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

“Master the basics.”

That’s a common mantra here at Sharp Sight.

Loyal readers know what I mean by “master the basics.” To master data science, you need to master the foundational tools.

That means knowing how to create essential plots like:

  • Bar charts
  • Line charts
  • Histograms

And performing data manipulations like:

  • Creating new variables
  • Subsetting
  • Aggregating
  • Sorting
  • Joining datasets

After the foundations, the little details matter
(like annotations)

Having said that, the little details matter.

After you master the basics, you need to learn the little details.

A great example of this is plot annotation.

Adding little details like plot annotations helps you communicate more clearly and “tell a story” with your plots.

Moreover, annotations help your data visualizations “stand on their own.”

What I mean by this is that they help your plots communicate and “tell a story” without you being there to do the communicating.

Certainly, as a data scientist, if you create a report or analysis there will be many instances where you will personally present your work to an audience. In these instances, you will be there to explain your work and give a personal “voice over” for your visualizations.

But that’s not always the case. For example, you’ll often send your analyses to partners as reference documents. Other times, you’ll publish them (maybe on a blog, or internal company website). In these instances, you won’t be there to explain your work. You need your work to “speak for you.” The charts need to communicate on their own. To make sure that your work is still effective when you are not there, you’ll often need to annotate your work.

Therefore, annotations are one of the “little details” that you need to learn after you learn the foundations.

A simple annotation in ggplot2

Here, I’ll show you a very simple example of a plot annotation in ggplot2.

Before I show it to you though, I want to introduce you to a learning principle that you can use when you’re learning and practicing data science techniques.

To learn annotate(), start with a very simple example

When you’re learning a new skill, it’s very effective to learn and practice with very simple examples. The simpler the better.

As a side note, this is one of the reasons that I strongly discourage the “jump in and build something” method of learning. When people “jump in” and work on a project, they typically select something that’s far too complicated, so they spend all of their time struggling to do simple things that they should have learned before working on a project.

So before you jump into a project, learn individual techniques. And when you’re learning a new technique, use very, very basic examples.

Code: how to create an annotation in ggplot2

With that in mind, I’ll show you a very, very simple plot annotation in ggplot2.

Here we have a histogram with a dashed line at the mean. The mean is 5.

library(ggplot2)

set.seed(10)
df.rnorm <- data.frame(rnorm = rnorm(10000, mean = 5))

ggplot(data = df.rnorm, aes(x = rnorm)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = 5, color = "red", linetype = "dashed") 

We want to add an annotation that explicitly calls out the value of the mean.

Add an annotation with the annotate() function

ggplot(data = df.rnorm, aes(x = rnorm)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = 5, color = "red", linetype = "dashed") +
  annotate("text", label = "Mean = 5", x = 4.5, y = 170, color = "white")



Here, we’ve added the annotation that says “Mean = 5.”

This is very straightforward.

To accomplish this, we’ve used the annotate() function.

The first argument of the function is “text”. This specifies that we want to use a text annotation. As it turns out, there are several different annotation types, including rectangles and line segments, so you need to specify exactly what type of annotation you want to add. Because we’re adding a text annotation, “text” is the appropriate option.

The next piece of syntax within annotate() is the label = parameter. label just specifies the exact text that we want to add to the plot. Here, we’re specifying that we want to add the text “Mean = 5”.

Next, we specify the exact location of the annotation by detailing the x and y coordinates with the x and y parameters respectively.

Finally, we specify the color. In this case, I’ve set the color to “white”.

You could simplify this code even further by removing the color specification and letting it default to black. When you practice making annotations, I would probably recommend that you remove the color specification for the sake of simplicity. (You are practicing R, right?)

Having said that, in this case, the white text looks best against the dark grey histogram, so I left that code in.

To remember the annotate() technique, you need to practice it

annotate() is a fairly simple technique.

Nevertheless, my bet is that many people will forget it in the long run because they fail to practice it. They’ll “cut and paste” once or twice, and then quickly forget it.

I’ve said before that it’s not enough just to learn a new technique. You need to remember it in the long run.

The problem is that even if you learn a new ggplot2 technique today, you’re very, very likely to forget within a few days.

Having said that, if you want to remember how to use annotate() and other R tools – if you want to master R – you need to practice.

Sign up to learn ggplot2

Discover how to rapidly master ggplot2 and other R data science tools.

Sign up for our email list HERE.

If you sign up, you’ll get free tutorials about ggplot2 and other R tools, delivered to your inbox.

The post How to annotate a plot in ggplot2 appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.


Source:: R News

ggedit 0.1.1: Shiny module to interactively edit ggplots within Shiny applications

By Jonathan Sidi

(This article was first published on R posts, and kindly contributed to R-bloggers)

ggedit is a package that lets users interactively edit ggplot layer and theme aesthetics. In a previous post we showed you how to use it in a collaborative workflow using standard R scripts. More importantly, we highlighted that after editing, ggedit returns to the user the updated gg plots, layers, scales and themes as both self-contained objects and script that you can paste directly into your code.

Installation


devtools::install_github('metrumresearchgroup/ggedit',subdir='ggedit')

version 0.1.1 Updates

  • ggEdit Shiny module: use ggedit as part of any Shiny application.
  • gggsave: generalized ggsave to write multiple outputs of ggplot to a single file and/or multiple files from a single call. Plots can be saved to various graphic devices.

ggEdit Shiny module

This post will demonstrate a new method to use ggedit, Shiny modules. A Shiny module is a chunk of Shiny code that can be reused many times in the same application, but generic enough so it can be applied in any Shiny app (in simplest terms think of it as a Shiny function). By making ggedit a Shiny module we can now replace any renderPlot() call that inputs a ggplot and outputs in the UI plotOutput(), with an interactive ggedit layout. The analogy between how to use the ggEdit module in comparison to a standard renderPlot call can be seen in the table below.

         Standard Shiny              Shiny Module
Server   output$id=renderPlot(p)     reactiveOutput=callModule(ggEdit,id,reactive(p))
UI       plotOutput(id)              ggEditUI(id)

We can see that there are a few differences in the calls. To call a module you need to run the Shiny function callModule, in this case ggEdit; next, a character id for the elements the module will create in the Shiny environment; and finally the arguments that are expected by the module, in this case a reactive object that outputs a ggplot or list of ggplots. This is coupled with ggEditUI, which together create a ggedit environment for editing the plots during a regular Shiny app.

In addition to the output UI, the user also gets a reactive output containing all the objects the regular ggedit package returns (plots, layers, scales, themes), in both object and script form. This has great advantages if you want to let users edit plots while keeping track of what they are changing. A realistic example of this would be clients (be it industry or academia) who are shown a set of default plots, with the appropriate data, and are then given the opportunity to customize them according to their specifications. Once they finish editing, the script is automatically saved to the server, updating the clients’ portfolio with their preferred aesthetics. No more email chains about changing a blue point to an aqua star!

Below is a small example of a static ggplot using renderPlot/plotOutput and how to call the same plot and a list of plots using ggEdit/ggeditUI. We added a small reactive text output so you can see the real-time changes of the aesthetic editing being returned to the server.

Source Code for example


library(shiny)
library(shinyAce)   # aceEditor() below comes from the shinyAce package
library(ggplot2)
library(ggedit)

server = function(input, output, session) {
  p1 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_point()
  p2 = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_line() + geom_point()
  p3 = list(p1 = p1, p2 = p2)
  output$p = renderPlot({p1})

  ## call the ggEdit module for a single plot and for a list of plots
  outp1 = callModule(ggEdit, 'pOut1', obj = reactive(list(p1 = p1)))
  outp2 = callModule(ggEdit, 'pOut2', obj = reactive(p3))

  ## show the updated layer call returned by the module
  output$x1 = renderUI({
    layerTxt = outp1()$UpdatedLayerCalls$p1[[1]]
    aceEditor(outputId = 'layerAce', value = layerTxt,
              mode = 'r', theme = 'chrome',
              height = '100px', fontSize = 12, wordWrap = TRUE)
  })

  ## show the updated theme call returned by the module
  output$x2 = renderUI({
    themeTxt = outp1()$UpdatedThemeCalls$p1
    aceEditor(outputId = 'themeAce', value = themeTxt,
              mode = 'r', theme = 'chrome',
              height = '100px', fontSize = 12, wordWrap = TRUE)
  })
}

ui = fluidPage(
  conditionalPanel("input.tbPanel=='tab2'",
                   sidebarPanel(uiOutput('x1'), uiOutput('x2'))),
  mainPanel(
    tabsetPanel(id = 'tbPanel',
                tabPanel('renderPlot/plotOutput', value = 'tab1', plotOutput('p')),
                tabPanel('ggEdit/ggEditUI', value = 'tab2', ggEditUI('pOut1')),
                tabPanel('ggEdit/ggEditUI with lists of plots', value = 'tab3', ggEditUI('pOut2'))
    )))

shinyApp(ui, server)

gggsave

ggsave is the device-writing function written for the ggplot2 package. A limitation of it is that only one figure can be written at a time. gggsave is a wrapper around ggsave that accepts a list of ggplots and then passes arguments to base graphics devices to create multiple outputs automatically, without the need for loops.


library(ggedit)

## single-file output to pdf
gggsave('Rplots.pdf', plot = pList)

## multiple-file output to pdf
## (argument name assumed to be 'onefile', mirroring the pdf device)
gggsave('Rplots.pdf', plot = pList, onefile = FALSE)

## multiple-file output to png
gggsave('Rplots.png', plot = pList)


Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings.


To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website.


Contact: For questions and comments, feel free to email me at: yonis@metrumrg.com or open an issue in github.

To leave a comment for the author, please follow the link and comment on their blog: R posts.


Source:: R News

ggraph: ggplot for graphs

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A graph, a collection of nodes connected by edges, is just data. Whether it’s a social network (where nodes are people, and edges are friend relationships), or a decision tree (where nodes are branch criteria or values, and edges decisions), the nature of the graph is easily represented in a data object. It might be represented as a matrix (where rows and columns are nodes, and elements mark whether an edge between them is present) or as a data frame (where each row is an edge, with columns representing the pair of connected nodes).

The trick comes in how you represent a graph visually; there are many different options each with strengths and weaknesses when it comes to interpretation. A graph with many nodes and edges may become an unintelligible hairball without careful arrangement, and including directionality or other attributes of edges or nodes can reveal insights about the data that wouldn’t be apparent otherwise. There are many R packages for creating and displaying graphs (igraph is a popular one, and this CRAN task view lists many others) but that’s a problem in its own right: an important part of the data exploration process is trying and comparing different visualization options, and the myriad packages and interfaces makes that process difficult for graph data.

Now, there’s the new ggraph package, recently published to CRAN by author Thomas Lin Pedersen, which promises to make exploring graph data easier. Unlike other graphing packages, ggraph uses the grammar of graphics paradigm of the ggplot2 package, unifying the data structures and attributes associated with graphics. It also includes a wide range of visual representations of graphs — layouts — and makes it easy to switch between them. The basic “mesh” visualization of nodes and edges provides 11 different options for arranging the nodes:

Other types of visualizations are supported, too: hive plots, dendrograms, treemaps, and circle plots, to name just a few. Note that only static graphs are available, though: unlike igraph and some other packages, you can’t rearrange the location of the nodes or otherwise manipulate the graphics with a mouse.

For the R programmer, most of the work is done by the ggraph function. It’s analogous to the ggplot function, except that you don’t provide data for the locations of the nodes; their position is selected by an algorithm. (Similarly, layout choices are automatically made for visualization types other than the mesh.) There are also various themes suited to graphs you can use to style your chart: goodbye gridlines and axes; hello labels, annotations and edge arrows.
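
As a rough illustration (not from the announcement post; the layout name and dataset are just examples), a minimal ggraph call looks like this:

## plot the highschool friendship edge list bundled with ggraph,
## letting a layout algorithm place the nodes
library(ggraph)
library(igraph)

gr <- graph_from_data_frame(highschool)

ggraph(gr, layout = "kk") +   # try other layouts: "fr", "circle", ...
  geom_edge_link() +
  geom_node_point() +
  theme_graph()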

The ggraph package is available on CRAN now, and works with R version 2.10 and later. For more on the ggraph package, see the announcement blog post linked below.

Data Imaginist: Announcing ggraph: A grammar of graphics for relational data

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News