leaflet 1.1.0

By Joe Cheng

leaflet-choro

Leaflet 1.1.0 is now available on CRAN! The Leaflet package is a tidy wrapper for the Leaflet.js mapping library, and makes it incredibly easy to generate interactive maps based on spatial data you have in R.

This release was nearly a year in the making, and includes many important new features.

  • Easily add textual labels on markers, polygons, etc., either on hover or statically
  • Highlight polygons, lines, circles, and rectangles on hover
  • Markers can now be configured with a variety of colors and icons, via integration with Leaflet.awesome-markers
  • Built-in support for many types of objects from sf, a new way of representing spatial data in R (all basic sf/sfc/sfg types except MULTIPOINT and GEOMETRYCOLLECTION are directly supported)
  • Projections other than Web Mercator are now supported via Proj4Leaflet
  • Color palette functions now natively support viridis palettes; use "viridis", "magma", "inferno", or "plasma" as the palette argument (see the short sketch after this list)
  • Discrete color palette functions (colorBin, colorQuantile, and colorFactor) work much better with color brewer palettes
  • Integration with several Leaflet.js utility plugins
  • Data with NA points or zero rows no longer causes errors
  • Support for linked brushing and filtering, via Crosstalk (more about this to come in another blog post)
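
As a quick hedged sketch (not from the release announcement itself), here is how the new hover labels and the viridis palette support might be used together, assuming the quakes dataset that ships with base R:

library(leaflet)

pal <- colorNumeric(palette = "viridis", domain = quakes$mag)

leaflet(quakes[1:100, ]) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    color = ~pal(mag),
    label = ~paste0("Magnitude: ", mag)  # textual label shown on hover
  ) %>%
  addLegend(pal = pal, values = ~mag, title = "Magnitude")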

Many thanks to @bhaskarvk who contributed much of the code for this release.

Going forward, our intention is to prevent any more Leaflet.js plugins from accreting in the core leaflet package. Instead, we have made it possible to write 3rd party R packages that extend leaflet (though the process to do this is not documented yet). In the meantime, Bhaskar has started developing his own leaflet.extras package; it already supports several plugins, for everything from animated markers to heatmaps.

Source:: R News

Data Transformation in R: The #Tidyverse-Approach of Organizing Data #rstats

By Daniel

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Yesterday, I had the pleasure to give a talk at the 8th Hamburg R User-Group meeting. I talked about data wrangling and data transformation, and how the philosophy behind the tidyverse makes these tasks easier. If you like, you can download the slides. Feel free to add your comments to the slide here.

To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.

Source:: R News

Finding Radiohead’s most depressing song, with R

By David Smith

Radiohead is known for having some fairly maudlin songs, but of all of their tracks, which is the most depressing? Data scientist and R enthusiast Charlie Thompson ranked all of their tracks according to a "gloom index", and created the following chart of gloominess for each of the band's nine studio albums. (Click for the interactive version, created with the highcharter package for R, which allows you to explore individual tracks.)

If you're familiar with the albums, this looks pretty reasonable. Radiohead's debut, "Pablo Honey", was fairly poppy musically, but contained some pretty dark lyrics (especially in the break-out hit, Creep). Their most recent album "A Moon Shaped Pool" is a fantastic listen, but it isn't exactly going to get the party started.

The “Gloom Index” charted above is a combination of three quantities, and then scaled from 1 (Radiohead’s gloomiest song, True Love Waits), to 100 (the cheeriest, 15 Step).

The first quantity is Spotify's "valence score", which Spotify describes as a "quantity describing the musical positiveness of a track". Valence scores range from 0 (songs that sound depressed or angry) to 1 (songs that are happy or euphoric). Charlie extracted the list of Radiohead's 101 singles and the valence score for each from the Spotify developer API, using the httr package for R. This is useful in its own right, but several songs have the same valence score, so Charlie also looks at song lyrics to further differentiate them.
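
As a rough, hedged illustration (not Charlie's code): the valence for a single track can be pulled from the Spotify Web API's audio-features endpoint with httr, assuming you already have an OAuth access token (token) and a Spotify track id (track_id); both names are hypothetical here:

library(httr)
library(jsonlite)

# hypothetical inputs: a valid OAuth token and a Spotify track id
resp <- GET(paste0("https://api.spotify.com/v1/audio-features/", track_id),
            add_headers(Authorization = paste("Bearer", token)))
fromJSON(content(resp, "text"))$valence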

The second quantity is the percentage of words in the lyrics that are "sad". Charlie scraped the song lyrics from Genius using the rvest package, and then used the tidytext package to break the lyrics into words, eliminate common "stop words" like 'the' and 'a', and count the number with negative sentiment.
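
A rough sketch of that tidytext step (not Charlie's original code), assuming a data frame called lyrics with one line of lyrics per row in a column named lyric, plus a track_name column (all hypothetical names):

library(dplyr)
library(tidytext)

negative_words <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

pct_sad_by_track <- lyrics %>%
  unnest_tokens(word, lyric) %>%           # one row per word
  anti_join(stop_words, by = "word") %>%   # drop "the", "a", ...
  group_by(track_name) %>%
  summarise(pct_sad = mean(word %in% negative_words$word))  # proportion of "sad" words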

The third quantity is the “lyrical density” (following a method described by Myles Harrison): the number of words per second, easily calculated from the Spotify track length data and the word counts calculated in the prior step.

The three quantities are combined together to create the “gloom index” as follows:

$$ \mathrm{gloomIndex} = \frac{1-\mathrm{valence}}{2} + \frac{\mathrm{pctSad} \times (1+\mathrm{lyricalDensity})}{2} $$

Roughly, this is the average of the valence score and (almost) the number of sad words per second. (I’m guessing Charlie adds 1 to the lyrical density to get the two terms to about the same scale, so that both have about equal weight.)
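
Putting the pieces together is then a one-liner; here is a hedged sketch, assuming a data frame track_df with columns valence, pct_sad, and lyrical_density (all hypothetical names):

library(dplyr)

track_df <- track_df %>%
  mutate(gloom_index = (1 - valence) / 2 +
                       pct_sad * (1 + lyrical_density) / 2)

Charlie then rescales the result to run from 1 to 100 across the catalogue.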

It would be interesting to compare the "Gloom Index" for Radiohead with that for other famously downbeat artists (say, Bon Iver or Low). You'd need to do away with scaling the Gloom Index from 1 to 100 as Charlie has done here, but the formula could easily be adapted to make a universal score. If you'd like to give it a try, all of the R code is included in Charlie's blog post, linked below.

RCharlie: fitteR happieR

Source:: R News

Raccoon | Ch 2.5 – Unbalanced and Nested Anova

By Quantide

Anova special cases - Unbalanced and nested anova

(This article was first published on R blog | Quantide – R training & consulting, and kindly contributed to R-bloggers)

Download the Unbalanced and Nested Anova cheat sheet in full resolution: Anova special cases

This article is part of Quantide’s web book “Raccoon – Statistical Models with R“. Raccoon is Quantide’s third web book after “Rabbit – Introduction to R” and “Ramarro – R for Developers“. See the full project here.

The second chapter of Raccoon is focused on the T-test and Anova. Through examples, it shows the theory and R code of:

This post is the fifth section of the chapter, covering unbalanced Anova with one observation dropped and fixed-effects nested Anova.

Throughout the web-book we will make extensive use of the package qdata, which contains about 80 datasets. You may find it here: https://github.com/quantide/qdata.

Example: Brake distance 1 (unbalanced anova with one obs dropped)

Data description

We use the same data as in our previous article about 3-way Anova, but here one observation (row) has been removed. The detailed data description is in 3-way Anova.

Data loading

data(distance)
head(distance)
## # A tibble: 6 × 4
##     Tire  Tread      ABS Distance
##   <fctr> <fctr>   <fctr>    <dbl>
## 1     GT     10  enabled 19.35573
## 2     GT    1.5 disabled 23.38069
## 3     MX    1.5  enabled 24.00778
## 4     MX     10  enabled 25.07142
## 5     LS     10 disabled 26.39833
## 6     GT     10  enabled 18.60888
str(distance)
## Classes 'tbl_df', 'tbl' and 'data.frame':    24 obs. of  4 variables:
##  $ Tire    : Factor w/ 3 levels "GT","LS","MX": 1 1 3 3 2 1 2 2 2 3 ...
##  $ Tread   : Factor w/ 2 levels "1.5","10": 2 1 1 2 2 2 1 1 2 2 ...
##  $ ABS     : Factor w/ 2 levels "disabled","enabled": 2 1 2 2 1 2 2 1 1 1 ...
##  $ Distance: num  19.4 23.4 24 25.1 26.4 ...

Let us drop one observation so that the factor-level combinations no longer all contain the same number of observations. We drop the observation with the values "LS 10 enabled":

distance1 <- distance[-24,]

Descriptives

As usual, we first plot the univariate effects:

plot.design(Distance ~ ., data = distance1)

Univariate effects plot of unbalanced model

Secondly we look at the two-way interaction plot:

op <- par(mfrow = c(3, 1))
with(distance1, {
  interaction.plot(ABS, Tire, Distance)
  interaction.plot(ABS, Tread, Distance)
  interaction.plot(Tread, Tire, Distance)
  }
)
par(op)

Two-way interaction effects plots of unbalanced model

Notice that none of the effects seem to change with respect to the previous 3-way anova example.

Inference and models

This section is where we start noticing some differences from the balanced model. First, let us fit the model with all interactions:

fm <- aov(Distance ~ ABS * Tire * Tread, data = distance1)
summary(fm)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## ABS             1  81.95   81.95  39.870 5.72e-05 ***
## Tire            2  47.95   23.98  11.665  0.00191 ** 
## Tread           1   0.19    0.19   0.092  0.76771    
## ABS:Tire        2   6.72    3.36   1.635  0.23898    
## ABS:Tread       1   3.26    3.26   1.588  0.23365    
## Tire:Tread      2   3.42    1.71   0.831  0.46107    
## ABS:Tire:Tread  2   4.99    2.50   1.215  0.33360    
## Residuals      11  22.61    2.06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As none of the interactions are significant, we drop them one by one, starting from the three-way interaction:

fm <- update(fm, . ~ . -ABS:Tire:Tread)
summary(fm)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## ABS          1  81.95   81.95  38.594 3.16e-05 ***
## Tire         2  47.95   23.98  11.291  0.00144 ** 
## Tread        1   0.19    0.19   0.089  0.77049    
## ABS:Tire     2   6.72    3.36   1.583  0.24259    
## ABS:Tread    1   3.26    3.26   1.537  0.23691    
## Tire:Tread   2   3.42    1.71   0.805  0.46829    
## Residuals   13  27.60    2.12                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We continue by dropping the two-way interactions:

fm1 <- update(fm, .~ABS+Tire+Tread)
summary(fm1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## ABS          1  81.95   81.95  35.972 1.13e-05 ***
## Tire         2  47.95   23.98  10.524  0.00094 ***
## Tread        1   0.19    0.19   0.083  0.77694    
## Residuals   18  41.01    2.28                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We then check that the model without interactions is not significantly worse than the model with interactions:

anova(fm, fm1)
## Analysis of Variance Table
## 
## Model 1: Distance ~ ABS + Tire + Tread + ABS:Tire + ABS:Tread + Tire:Tread
## Model 2: Distance ~ ABS + Tire + Tread
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     13 27.603                           
## 2     18 41.005 -5   -13.402 1.2624  0.337

As expected, the model with interactions is not significantly better than that without interactions.

The final model is hence the same as the model in the previous example of balanced Anova. However, notice that the sums of squares of the following two models (which we might expect to be equal) are different:

fm <- aov(Distance ~ ABS + Tire, data = distance1)
summary(fm)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## ABS          1  81.95   81.95   37.80 6.57e-06 ***
## Tire         2  47.95   23.98   11.06 0.000653 ***
## Residuals   19  41.19    2.17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fminv <- aov(Distance ~ Tire + ABS, data = distance1)
summary(fminv)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Tire         2  53.09   26.55   12.24 0.000383 ***
## ABS          1  76.80   76.80   35.42 9.95e-06 ***
## Residuals   19  41.19    2.17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since aov() performs Type I SS ANOVA (a detailed explanation of the types of Sums of Squares is given in the Appendix) and this example uses data from an unbalanced design, the previous two models give different results in terms of SS and respective p-values. In fact, Type I SS ANOVA depends on the order in which factors are included in the model: fm is based on SS(ABS) and SS(Tire|ABS), whereas fminv is based on SS(Tire) and SS(ABS|Tire).

In order to avoid this problem, we may use Type II ANOVA: the drop1() function allows us to do this.

drop1(object=fm,test="F")
## Single term deletions
## 
## Model:
## Distance ~ ABS + Tire
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>               41.194 21.404                      
## ABS     1    76.802 117.996 43.609  35.424 9.949e-06 ***
## Tire    2    47.950  89.144 35.159  11.058 0.0006532 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(object=fminv,test="F")
## Single term deletions
## 
## Model:
## Distance ~ Tire + ABS
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>               41.194 21.404                      
## Tire    2    47.950  89.144 35.159  11.058 0.0006532 ***
## ABS     1    76.802 117.996 43.609  35.424 9.949e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, the results are equal. Alternatively, the Anova() function of the car package is available; it supports Type II and Type III Sums of Squares too.
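
For example, a minimal sketch (not part of the original post, and assuming the car package is installed):

library(car)
Anova(fm, type = "II")   # Type II Sums of Squares
# Anova(fm, type = "III") requests Type III Sums of Squares instead
# (for Type III, sum-to-zero contrasts are usually advisable)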

Notice that, until now, at least six types of sums of squares have been introduced in the literature.
However, there are open discussions among statisticians about the use and pros/cons of different Types of SS.

Residual analysis

Together with the model results, one should always provide some statistics/plots on the residuals.

op <- par(mfrow = c(2, 2))
plot(fm)
par(op)

Residual plots of unbalanced model

In this case, since the leverages are not constant (unbalanced design), the fourth plot shows the leverages on the x-axis.

Example: Pin diameters (Fixed effects Nested ANOVA)

Data description

The dataframe considered in this example contains data collected from five different lathes, each of them used by two different operators. The goal of the study is to evaluate whether significant differences in the mean diameter of pins occur between lathes and/or operators. Notice that here we are concerned with the effect of operators, so the layout of the experiment is nested. If we were concerned with shift instead of operator, the layout would have been crossed rather than nested.

Data loading

data(diameters)
str(diameters)
## Classes 'tbl_df', 'tbl' and 'data.frame':    50 obs. of  3 variables:
##  $ Lathe   : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ Operator: Factor w/ 2 levels "D","N": 1 2 1 2 1 2 1 2 1 2 ...
##  $ Pin.Diam: num  0.125 0.124 0.118 0.116 0.123 0.122 0.126 0.126 0.118 0.125 ...

Descriptives

Let us first carry out some descriptive statistics and plots in order to get a glimpse of the data. The next few lines of code show descriptive statistics for each variable and the mean for each combination of Lathe and Operator factor levels.

summary(diameters)
##  Lathe  Operator    Pin.Diam    
##  1:10   D:25     Min.   :0.114  
##  2:10   N:25     1st Qu.:0.122  
##  3:10            Median :0.125  
##  4:10            Mean   :0.124  
##  5:10            3rd Qu.:0.126  
##                  Max.   :0.130
xtabs(formula=Pin.Diam~Lathe+Operator,data=diameters)
##      Operator
## Lathe     D     N
##     1 0.631 0.634
##     2 0.603 0.605
##     3 0.623 0.618
##     4 0.636 0.634
##     5 0.615 0.603

The statistics above are not entirely meaningful, since the operators working in the same part of the day (day or night) differ across lathes (nested anova). Let us draw a box-plot for each Lathe x Operator combination.

ggp <- ggplot(data = diameters, mapping = aes(x=Lathe, y=Pin.Diam, fill=Operator)) + 
  geom_boxplot()

print(ggp)

Boxplot of Pin.Diam by Lathe x Operator

It may seem natural to perform the following (incorrect) ANOVA to analyze diameters conditional on Lathe and Shift (i.e., considering Operator levels as equivalent to different shifts of working days), as in a classical factorial layout.

fm1 <- aov(formula = Pin.Diam~Lathe*Operator, data = diameters)
summary(fm1)
##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Lathe           4 0.0003033 7.583e-05   8.766 3.52e-05 ***
## Operator        1 0.0000039 3.920e-06   0.453    0.505    
## Lathe:Operator  4 0.0000147 3.670e-06   0.424    0.790    
## Residuals      40 0.0003460 8.650e-06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, the above results come from an incorrect model for the data under study. In fact, the actual data structure is the following:

diameters$Lathe_op <- factor(diameters$Lathe:diameters$Operator)
xtabs(formula=Pin.Diam~Lathe+Lathe_op,data=diameters)
##      Lathe_op
## Lathe   1:D   1:N   2:D   2:N   3:D   3:N   4:D   4:N   5:D   5:N
##     1 0.631 0.634 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
##     2 0.000 0.000 0.603 0.605 0.000 0.000 0.000 0.000 0.000 0.000
##     3 0.000 0.000 0.000 0.000 0.623 0.618 0.000 0.000 0.000 0.000
##     4 0.000 0.000 0.000 0.000 0.000 0.000 0.636 0.634 0.000 0.000
##     5 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.615 0.603

The correct ANOVA to perform is thus the following:

fm1 <- aov(formula=Pin.Diam~Lathe+Lathe/Operator,data=diameters)
summary(fm1)
##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Lathe           4 0.0003033 7.583e-05   8.766 3.52e-05 ***
## Lathe:Operator  5 0.0000186 3.720e-06   0.430    0.825    
## Residuals      40 0.0003460 8.650e-06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The / formula operator means that the levels of the Operator factor are nested within the levels of the Lathe factor. Reading the output, we find that the lathes seem to produce, on average, different products, whereas the difference between the mean pin diameters conditional on the operator does not seem to be significant. Although the final results of the two Anovas (the first of which is incorrect and the second correct!) may seem similar, nested Anova is the one to use because it controls the variability given by the (fixed) effect of the operators. Nested Anova is hence useful for reducing the general variability of the plan and obtaining more significant differences among the levels of the factors.

An equivalent model formulation for nested ANOVA is given by:

fm1a <- aov(formula=Pin.Diam~Lathe+Operator:Lathe,data=diameters)
summary(fm1a)
##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Lathe           4 0.0003033 7.583e-05   8.766 3.52e-05 ***
## Lathe:Operator  5 0.0000186 3.720e-06   0.430    0.825    
## Residuals      40 0.0003460 8.650e-06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

or

#alternative correct model
fm1b <- aov(formula=Pin.Diam~Lathe+Operator:Lathe_op,data=diameters)
summary(fm1b)
##                   Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Lathe              4 0.0003033 7.583e-05   8.766 3.52e-05 ***
## Operator:Lathe_op  5 0.0000186 3.720e-06   0.430    0.825    
## Residuals         40 0.0003460 8.650e-06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residuals analysis

Finally, residuals may be plotted for model diagnostics:

op <-  par(mfrow = c(2, 2))
plot(fm1)
par(op)

Residual plot of Lathe by Operator nested model

The post Raccoon | Ch 2.5 – Unbalanced and Nested Anova appeared first on Quantide – R training & consulting.

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide – R training & consulting.

Source:: R News

The Zero Bug

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.

The zero bug

Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem.

Our Example

For our example problem please consider aggregating significant earthquake events in the United States of America.

To do this we will start with data from:

The National Geophysical Data Center / World Data Service (NGDC/WDS): Significant Earthquake Database, National Geophysical Data Center, NOAA, doi:10.7289/V5TD9V7K.

The database is described thusly:

The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami.

I queried the form for "North America and Hawaii":"USA" in tab-delimited form. For simplicity and reproducibility I saved the result at the url given in the R example below. To start our example we can use R to load the data from the url and begin our project.

url <- 'http://www.win-vector.com/dfiles/earthquakes.tsv'
d <- read.table(url, 
                header=TRUE, 
                stringsAsFactors = FALSE,
                sep='\t')
View(d)
head(d[,c('I_D','YEAR','INTENSITY','STATE')])
#    I_D YEAR INTENSITY STATE
# 1 6697 1500        NA    HI
# 2 6013 1668         4    MA
# 3 9954 1700        NA    OR
# 4 5828 1755         8    MA
# 5 5926 1788         7    AK
# 6 5927 1788        NA    AK

We see the data is organized such that each row is an event (with I_D), and contains many informational columns including "YEAR" (the year the event happened) and "STATE" (the state abbreviation where the event happened). Using R tools and packages we can immediately start to summarize and visualize the data.

For example: we can count modern (say 1950 and later) US earthquakes by year.

library("ggplot2")
library("dplyr")

dModern <- d[d$YEAR>=1950, , drop=FALSE]

# histogram
ggplot(data=dModern, mapping=aes(x=YEAR)) +
  geom_histogram(binwidth=1) +
  ggtitle('number of modern USA earthquakes by year')

Or we can use dplyr to build the count summaries by hand and present the summary as a stem-plot instead of a histogram.

# aggregate the data
byYear <- dModern %>% 
  group_by(YEAR) %>%
  summarize(count = n()) %>%
  arrange(YEAR)

# produce a stem-style plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')

The problem

We have already snuck in the mistake. The by-hand aggregation “byYear” is subtly wrong. The histogram is almost correct (given its graphical convention of showing counts as height), but the stem plot is revealing problems.

Here is the problem:

summary(byYear$count)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    1.00    1.00    2.00    2.50    3.75    7.00 

Notice the above summary implies the minimum number of significant earthquakes seen in the United States in a modern year is 1. Looking at our graphs we can see it should in fact be 0. This wasn’t so bad for graphing, but can be disastrous in calculation or in directing action.

The fix

This is the kind of situation that anti_join is designed to fix (and is how replyr::replyr_coalesce deals with the problem).
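
As a small hedged aside (not in the original post), anti_join can be used to list the years that silently disappeared from the summary:

library("dplyr")

# years in the intended range that have no row in byYear
missing_years <- data.frame(YEAR = 1950:2016) %>%
  anti_join(byYear, by = "YEAR")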

A simple ad-hoc fix I recommend is to build a second synthetic summary frame that carries explicit zero counts. We then add these counts into our aggregation and get correct summaries.

For example the following code:

# add in zero summaries
zeroObs <- data.frame(YEAR=1950:2016, count=0)
byYear <- rbind(byYear, zeroObs) %>%
  group_by(YEAR) %>%
  summarize(count = sum(count)) %>%
  arrange(YEAR)

# re-plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')

gives us the corrected stem plot:

The figure above is the correct stem-plot. Remember: while a histogram denotes counts by filled-heights of bars, a stem-plot denotes counts by positions of visible points. The point being: in a proper stem-plot zero counts are not invisible (and are in fact distinguishable from missing summaries).

A harder example

This issue may seem trivial but that is partly because I deliberately picked a simple example where it is obvious there is missing data. This is not always the case. Remember: hidden errors can be worse than visible errors. In fact even in the original histogram it was not obvious what to think about the missing years 1950 (which apparently had no significant US earthquakes) and 2017 (which is an incomplete year). We really have to explicitly specify the complete universe (also called range or support set) of keys (as in YEAR=1950:2016, and not use a convenience such as YEAR=min(dModern$YEAR):max(dModern$YEAR)).

This is much more obvious if we summarize by state instead of year.

byState <- dModern %>% 
  group_by(STATE) %>%
  summarize(count = n()) %>%
  arrange(STATE)

ggplot(data=byState, mapping=aes(x=STATE, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=STATE, yend=0),
               color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byState$count))) +
  ggtitle('number of modern USA earthquakes by state')

We do not have a proper aggregation where each state count is represented with an explicit zero, and not by implicit missingness (which can’t differentiate states with zero-counts from non-states or misspellings). To produce the proper summary we need a list of US state abbreviations to merge in.

A user of the above graph isn't likely to be too confused. Such a person likely knows there are 50 states and will presume the missing states have zero counts. However, downstream software (which can be punishingly literal) won't inject such domain knowledge and will work on the assumption that there are 17 states and that all states have had significant earthquakes in modern times. Or suppose we are aggregating the number of treatments given to a set of patients; in this case the unacceptable confusion of not-present and zero-counts can hide huge problems: such as not noticing that some patients never received treatment.

R actually has a list of state abbreviations (in "datasets::state.abb") and we can use replyr::replyr_coalesce to quickly fix our issue:

library('replyr')

support <- data.frame(STATE= c('', datasets::state.abb),
                      stringsAsFactors = FALSE)

byState <- byState %>%
  replyr_coalesce(support,
                  fills= list(count= 0)) %>%
  arrange(STATE)

An important feature of replyr_coalesce is: it checks that the count-carrying rows (the data) are contained in the support rows (the range definition). It would (intentionally) throw an error if we tried to use just datasets::state.abb as the support (or range) definition as this signals the analyst didn’t expect the blank state in the data set.

“The Dog that Didn’t Bark”

An additional illustrative example is from Sir Arthur Conan Doyle’s story “The Adventure of Silver Blaze”.

In this story Sherlock Holmes deduces that a horse had been absconded not by a stranger, but by somebody well known at its stable. Holmes explains the key clue here:

   Gregory (Scotland Yard): "Is there any other point to which you 
                             would wish to draw my attention?"
   Holmes: "To the curious incident of the dog in the night-time."
   Gregory: "The dog did nothing in the night-time."
   Holmes: "That was the curious incident."

For this type of reasoning to work: Holmes has to know that there was a dog present, the dog would have barked if a stranger approached the stables, and none of his witnesses reported hearing the dog. Holmes needs an affirmative record that no barks were heard: a zero written in a row, not a lack of rows.

Conclusion

When describing "common pitfalls" the reaction is often: "That isn't really interesting, as I would never make such an error." I believe "the zero bug" is in fact common and goes unnoticed because it tends to hide. The bug is often there and invisible, but can produce invalid results.

Any summary that is produced by counting un-weighted rows can never explicitly produce a value of zero (it can only hint at such through missingness). n() can never form zero, so if zero is important it must be explicitly joined in after counting. In a sense organizing rows for counting with n() censors out zeros.
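
A tiny toy illustration of that point (added here, not from the original post):

library("dplyr")

d <- data.frame(key = c("a", "a", "b"), val = c(1, 2, 3))
d %>%
  filter(val > 2) %>%
  group_by(key) %>%
  summarize(count = n())
# only key "b" survives the filter, so key "a" is silently absent
# from the summary -- it is not reported with a count of 0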

I can't emphasize enough how important it is to only work with explicit representations (records indicating counts, even if they are zero), and not implicit representations (assuming non-matched keys indicate zero-counts). The advantages of explicit representations are one of the reasons R has a notation for missing values ("NA") in the first place.

This is, of course, not a problem due to use of dplyr. This is a common problem in designing SQL database queries where you can think of it as “inner-join collapse” (failing to notice rows that disappear due to join conditions).

My advice: you should be suspicious of any use of summarize(VALUE=n()) until you find the matching zero correction (replyr_coalesce, or a comment documenting why no such correction is needed). Looking from n() to a matching range-control should become as habitual as looking from an opening parenthesis to its matching closing parenthesis. Even in data science, you really have to know you have all the little things right before you rely on the large things.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Source:: R News

Free DataCamp for your Classroom

By DataCamp Blog

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Announcing: DataCamp for the classroom, a new free plan for Academics.

We want to support every student who wants to learn Data Science. That is why, as of today, professors/teachers/TAs/… can give their students 6 months of FREE access to the full DataCamp course curriculum when used in the classroom. Request your free classroom today.

1) Student Benefits

When you set up your DataCamp for the classroom account, each student will automatically have full access to the entire course curriculum (>250 hours). This includes access to premium courses such as:

In addition, students can participate in leaderboards and private discussion forums with their fellow classmates.

2) Professor/Instructors/Teacher/… Benefits

Via DataCamp for the classroom, the instructor has access to all our handy group features:

  • Progress Tracking: Track each of your students' progress in detail. You can see, for example, how many XP points they have gained, the number of courses completed and the total chapters finished.

  • Assignments & Deadlines: Save yourself time via automated grading. Automatically set homework using assignment and deadlines throughout the semester.

  • Export Data: Get all of your students' data via the export functionality. See in depth how active students are on DataCamp, when they started and completed chapters, and use this data to make your own analysis.

  • Learning Management System Integration: Seamlessly integrate DataCamp with your university's learning management system (Sakai, Blackboard, Canvas…). Automatically send student assignment completion data where it is expected to be found.

3) Get Started

Get started with DataCamp for the classroom and join professors from Duke University, Notre-Dame, Harvard, University College London, UC Berkeley and many more, in learning data science by doing with DataCamp.

START TODAY

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

Source:: R News

Three R Shiny tricks to make your Shiny app shine (2/3): Semi-collapsible sidebar

By guillotantoine

(This article was first published on R – Enhance Data, and kindly contributed to R-bloggers)

EDIT:

Actually there is a much easier way to do so, by just adding the code below to the UI:
tags$script(HTML("$('body').addClass('sidebar-mini');"))

Thanks to @_pvictorr for suggesting it!

Original post:

In this tutorial series, we are going to see three tricks to do the following in a Shiny app:

  1. Add Next and Previous buttons to navigate in a tabBox
  2. Build a semi-collapsible sidebar that keeps the icons visible on collapse
  3. Add buttons to a datatable output to delete/modify/do an action on a given row.

Today, we are going to see how to build a semi-collapsible sidebar.

2.Semi-collapsible sidebar

We are trying to build a semi-collapsible sidebar which only shows the icons, instead of completely hiding the sidebar, when clicking on the collapse button. You can see an animated example below:

a. What you need to build this:

As in the previous tutorial, you only need R with the shiny and shinydashboard packages. We'll use R, some JavaScript and some CSS to do this.

b. It all starts with a Dashboard

We just need a classic Shiny app, with a ui.R file containing a dashboard and a server.R file.

  1. The sidebar menu will be put in the server since we want to be able to hide it when clicking on the collapse button (and we want to be able to remove text from it when hiding it). In addition to this, putting it in the server can be useful if you want to change the sidebar dynamically in your server.
  2. We are linking the application with a CSS file to change the way the sidebar appearance change on collapse.

ui.R file

library(shiny)
library(shinydashboard)

dashboardPage(
 dashboardHeader(title="Semi-collapsible Sidebar"),
###point 1.
 dashboardSidebar(sidebarMenuOutput("Semi_collapsible_sidebar")),
###point 2.
 dashboardBody(tags$head(tags$link(rel = "stylesheet",
type = "text/css", href = "style.css")))
)

server.R file

library(shiny)
library(shinydashboard)
shinyServer(function(input,output){

 output$Semi_collapsible_sidebar=renderMenu({

 sidebarMenu(
    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("Widgets", icon = icon("th"), tabName = "widgets",
      badgeLabel = "new",
      badgeColor = "green"),
    menuItem("Charts", icon = icon("bar-chart-o"),
    menuSubItem("Sub-item 1", tabName = "subitem1"),
    menuSubItem("Sub-item 2", tabName = "subitem2")
 ))

})
}) # closes shinyServer

c. Creating the CSS file:

The semi-collapsed bar needs:

  • To show only the icons from the uncollapsed sidebar.
  • To be narrower to free space.

When performing the collapse action, Shiny translates the sidebar to the left by its width in pixels (which hides it).
To achieve the semi-collapsed sidebar, the new CSS will overwrite the previous translation to no translation and change the sidebar width to 40px (this is the icon width).

To do this, create a www/ folder in your working directory and put the following style.css file in it.

.sidebar-collapse .left-side, .sidebar-collapse .main-sidebar{ 
 transform: translate(0,0);
 width: 40px;
}

Start your app again: the sidebar now semi-collapses! However, the text is still there and the "new" badge label overlaps the icons. We need to hide all of these on collapse!

d. Hiding the text and badge on collapse:

To hide the text on collapse, Shiny needs to know that the sidebar is collapsed. To do so, we will add a Shiny Input that will respond to a click on the collapse button.

tags$script("$(document).on('click', '.sidebar-toggle',
function () {
Shiny.onInputChange('SideBar_col_react', Math.random())
});")

'.sidebar-toggle' is the collapse button's class. Hence, if the button is clicked, a random number is generated and assigned to input$SideBar_col_react.
The script above needs to be added to ui.R after dashboardSidebar().

vals=reactiveValues()
vals$collapsed=FALSE
observeEvent(input$SideBar_col_react,
{ vals$collapsed=!vals$collapsed }
) 

Now, we need to observe this input and decide whether or not the sidebar is collapsed. So we will create a reactive value with a collapsed attribute, which will be TRUE if the sidebar is collapsed and FALSE otherwise. To do so, you can add the previous script before output$Semi_collapsible_sidebar.

Now the server knows whether or not the sidebar is collapsed, and the sidebar will change according to this value using the following code:

output$Semi_collapsible_sidebar=renderMenu({
 if (vals$collapsed)
   sidebarMenu(
    menuItem(NULL, tabName = "dashboard", icon = icon("dashboard")),
    menuItem(NULL, icon = icon("th"), tabName = "widgets"),
    menuItem(NULL, icon = icon("bar-chart-o"),
    menuSubItem(span(class="collapsed_text","Sub-item 1"), tabName = "subitem1"),
    menuSubItem(span(class="collapsed_text","Sub-item 2"), tabName = "subitem2")
 ))
 else
   sidebarMenu(
    menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("Widgets", icon = icon("th"), tabName = "widgets", badgeLabel = "new",
    badgeColor = "green"),
    menuItem("Charts", icon = icon("bar-chart-o"),
    menuSubItem("Sub-item 1", tabName = "subitem1"),
    menuSubItem("Sub-item 2", tabName = "subitem2")
  ))
 })
})

Basically, if vals$collapsed is TRUE, the labels will be set to NULL and the badge will be hidden. Otherwise, the sidebar will be shown as previously.

e. Why can't I uncollapse the menuSubItem?

To be frank, I don't know (if anyone has a clue, please comment). There is a way to collapse it back using this:

tags$script("$(document).on('click', '.treeview.active', function () {
 $(this).removeClass('active');
 $(this).find( 'ul' ).removeClass('menu-open');
 $(this).find( 'ul' ).css('display', 'none');
});")),

When the menu item is active, it gets the class active. So, when clicking on such an element, the script removes all the attributes linked to the menu item being open and hides it using the CSS display option.

f. Improving the aspect

Right now the collapse is not smooth, and when hovering over the text in the menu items, it is whitish (over white), which is not great. To improve this, we will just modify the CSS to this:

.sidebar-collapse .left-side, .sidebar-collapse .main-sidebar{ 
 transform: translate(0,0);
 width: 40px;
 transition: transform 1s, width 1s;
}

.collapsed_text { 
 color: black;
}

.main-sidebar, .left-side
{
 transition: transform 1s, width 1s;
}

The .collapsed_text is a custom class we created at step c.

g. Comments

I think all of this could also be done using js and css only.
However, handling the collapse on the server side allows us to do more advanced things like hiding/showing the sidebar programmatically, changing what the server displays depending on the sidebar status, and even showing different semi-collapsed/not-collapsed sidebars (which can be useful from a UI/UX point of view).

You can find the code HERE.

Thanks for reading the tutorial,
Antoine

You can follow me on Twitter:

Follow @AntGuilllot

To leave a comment for the author, please follow the link and comment on their blog: R – Enhance Data.

Source:: R News

Coming soon!

By Gianluca Baio

We’ve just received a picture of the cover of the BCEA book, which is really, really close to being finally published!

I did mention this in a few other posts (for example here and here) and it has been in fact a rather long process, so much so that I have made a point of joking in my talks about BCEA that we’d be publishing the book in 2036 $-$ so we’re in fact nearly 20 years early…

The official book website is actually already online and I’ll prepare another one (I know, I know $-$ it’s empty for now!), where we’ll put all the relevant material (examples, code, etc).

I think this may be very helpful for practitioners and our ambition is to make the use of BCEA as wide as possible among health economists $-$ even those who do not currently use R. The final chapter of the book also presents our BCEAweb application, which can do (almost everything!) that the actual package can (the nice thing about it is that the computational engine is stored on a remote server and so the user does not even need to have R installed on their machine).

We'll probably have to make this part of the next edition of our summer school.

Source:: R News

How to make a global map in R, step by step

By Sharp Sight

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

In the last several blog posts at Sharp Sight, I’ve created several different maps.

Maps are great for practicing data visualization. First of all, there’s a lot of data available on places like Wikipedia that you can map.

Moreover, creating maps typically requires several essential skills in combination. Specifically, you commonly need to be able to retrieve the data (e.g., scrape it), mold it into shape, perform a join, and visualize it. Because creating maps requires several skills from data manipulation and data visualization, creating them will be great practice for you.

And if that’s not enough, a good map just looks great. They’re visually compelling.

With that in mind, I want to walk you through the logic of building one step by step.

In the last several blog posts about maps, I gave some detail about how to create them, but in this post, I want to give a little more detail about the logic.

I want you to understand the thinking behind the code. This post will show you how to make a map, and explain the R code that creates it.

Ok, let’s get to it.

Install packages

First, we'll just load the packages we're going to use (install them first with install.packages() if you don't have them already). Pretty straightforward.

#=================
# INSTALL PACKAGES
#=================
library(tidyverse)
library(rvest)
library(magrittr)
library(ggmap)
library(stringr)

Scrape data from website

The data that we’re going to plot exists as a table on a website.

The data is a list of the top 10 countries with the highest ‘talent competitiveness’ (i.e., the countries that are most attractive to talented workers). The data was created as part of an analysis by the business school, INSEAD.

If you go to the webpage that summarizes the talent competitiveness report, you’ll find the following table:

Global Talent Competitiveness Index 2017 Rankings: Top Ten

This is the data we’re after, and we have to scrape it and pull it into R.

To do this, we need to use a few tools from the rvest package.

First, we'll use read_html() to connect to the URL where the data lives (i.e., connect to the webpage).

Next, you’ll see that we use the magrittr pipe %>%. We’re using the pipe operator to chain together a set of other web scraping functions. Specifically, you’ll see that we’re using html_nodes() and html_table(). Essentially, we’re using these in combination to extract the table that contains the data and parse it into a dataframe.

#============
# SCRAPE DATA
#============

html.global_talent <- read_html("https://www.insead.edu/news/2017-global-talent-competitiveness-index-davos")

df.global_talent_RAW <- html.global_talent %>%
                          html_nodes("table") %>%
                          extract2(1) %>%
                          html_table()

Data manipulation: split and recombine into long-form dataframe

If you inspect the data, you’ll notice that the ranks are split up between two columns. The country names are also split between two columns.

print(df.global_talent_RAW)

# X1              X2  X3          X4
#  1     Switzerland   6   Australia
#  2       Singapore   7  Luxembourg
#  3  United Kingdom   8     Denmark
#  4   United States   9     Finland
#  5          Sweden  10      Norway

This is an improper structure for the data, so we need to restructure it. Specifically, we need all of the ranks in one column and all of the countries in another column.

To do this we’re going to split df.global_talent_RAW into two parts: df.global_talent_1 and df.global_talent_2. Then we’ll take those two parts and stack them up on top of each other, such that all of the ranks are in one column, and the country names are in another column.

To split the data, we’ll just use dplyr::select() to select the specific columns we want in df.global_talent_1 and df.global_talent_2 respectively.

Notice that we also rename the columns using dplyr::rename():

#=============================================
# SPLIT INTO 2 DATA FRAMES
# - the data are split into 4 columns, whereas
#   we want all of the data in columns
#=============================================

df.global_talent_1 <- df.global_talent_RAW %>% select(X1, X2) %>% rename(rank = X1, country = X2)
df.global_talent_2 <- df.global_talent_RAW %>% select(X3, X4) %>% rename(rank = X3, country = X4)

After splitting them, we’ll recombine them with rbind():

#===========
# RECOMBINE
#===========

df.global_talent <- rbind(df.global_talent_1, df.global_talent_2)


# INSPECT
glimpse(df.global_talent)
print(df.global_talent)

Data manipulation: trim excess whitespace

Now, we’ll do some simple string manipulation to modify the country names.

Specifically, if you use glimpse() and look at the data, you’ll see that the country names have leading spaces.

glimpse(df.global_talent)

# Observations: 10
# Variables: 2
# $ rank    <chr> " 1", " 2", " 3", " 4", " 5", " 6", " 7", " 8", " 9", " 10"
# $ country <chr> " Switzerland", " Singapore", " United Kingdom", " United States", " Sw...

We need to remove these.

To do so, we’ll use str_trim() from the stringr package.

#==========================
# STRIP LEADING WHITE SPACE
#==========================

df.global_talent <- df.global_talent %>% mutate(country = str_trim(country)
                                                ,rank = str_trim(rank)
                                                )

Notice that we’re doing this inside of dplyr::mutate(), which allows us to modify a variable inside of a data frame. Essentially, we’re taking the dataset df.global_talent and piping it into mutate(). Then inside of mutate(), we’re calling str_trim() to do the real work of stripping off the excess whitespace.

Now if we print out the data, it looks good.

# INSPECT
print(df.global_talent)

#   rank         country
#     1     Switzerland
#     2       Singapore
#     3  United Kingdom
#     4   United States
#     5          Sweden
#     6       Australia
#     7      Luxembourg
#     8         Denmark
#     9         Finland
#    10          Norway

All of the ranks are in one column, and the country names are in another column.

Get world map

Next, we’re going to get a map of the world.

This is very straightforward. To do this, we'll use map_data("world").

#==============
# GET WORLD MAP
#==============

map.world <- map_data("world")

Recode country names

The next thing we need to do is join the global talent data, df.global_talent (which we want to map), to the map itself, map.world.

To do this, we need to use a join operation. However, to join these two separate datasets together, we'll need the country names to be exactly the same. The problem is that the country names are not all exactly the same in the two different datasets.

For the time being, I’ll set aside how to detect these dissimilarities between country names. Suffice it to say, the country names in df.global_talent do not all exactly match the country names in map.world.

That being the case, we’ll need to recode the names that don’t match. Specifically, out of the 10 countries in df.global_talent, 2 don’t match the names in map.world: United States and United Kingdom.
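
A hedged aside (not in the original post): one quick way to spot such mismatches is to compare the two name columns directly.

# names in the scraped data with no counterpart in the map data
setdiff(df.global_talent$country, map.world$region)
# per the discussion above, this flags "United States" and "United Kingdom"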

We’ll recode them with dplyr::recode():

#===========================================
# RECODE NAMES
# - Two names in the 'global talent' data
#   are not the same as the names in the 
#   map
# - We need to re-name these so they match
# - If they don't match, we won't be able to 
#   join the datasets
#===========================================

# INSPECT
as.factor(df.global_talent$country) %>% levels()

# RECODE NAMES
df.global_talent$country <- recode(df.global_talent$country
                                  ,'United States' = 'USA'
                                  ,'United Kingdom' = 'UK'
                                  )

This is fairly straightforward. The recode() function operates like all of the other dplyr functions. The first argument inside of recode() is the item we want to change (df.global_talent$country). Then we have a set of name-pairs, with the old name on the left hand side (i.e., United States) and the new name on the right hand side (i.e., USA).

If we take a quick look after executing this, you’ll see that United States and United Kingdom were successfully recoded to USA and UK:

# INSPECT
print(df.global_talent)

#   rank     country
#     1 Switzerland
#     2   Singapore
#     3          UK
#     4         USA
#     5      Sweden
#     6   Australia
#     7  Luxembourg
#     8     Denmark
#     9     Finland
#    10      Norway

Join 2 datasets together

Next, we’ll join together map.world and df.global_talent.

#================================
# JOIN
# - join the 'global talent' data 
#   to the world map
#================================

# LEFT JOIN
map.world_joined <- left_join(map.world, df.global_talent, by = c('region' = 'country'))

If you’re familiar with joins, this is fairly straightforward. Of course, if you don’t use joins often, this might not make sense.

Essentially, we’re using left_join() to join these together. Explaining how joins work is beyond the scope of this blog post, but I’ll quickly explain how this works.

Take a look at the dataset map.world. It contains the data to create a world map.

head(map.world)

It contains essentially all of the countries of the world, as well as information that’s required to plot those countries as polygons. That being the case, it contains information on over 100 countries.

On the other hand, df.global_talent, only contains 10 countries.

The objective right now is to join them together, trying to “join” the two datasets by looking for a match on the country name.

If the join operation finds a match, then great. It’s a match, and it will combine the records from the two different datasets.

But what if there’s not a match? For example, “Brazil” is in map.world, but it’s not in df.global_talent. What happens then?

How these cases of “match” and “no match” are handled is the critical feature of different join types.

In the case of a “left join” we’ll keep everything in the “left hand” dataset. Which is the “left hand” dataset? Don’t overthink it. It’s the dataset that’s syntactically on the left in the following line of code:

left_join(map.world, df.global_talent, by = c('region' = 'country'))

That is, map.world is the “left hand” dataset.

In a left join, the operation will keep everything in the left hand dataset, even if there’s not a match. Moreover, when there is a match, it will attach the data in the “right hand” dataset.

This is important, because we want to plot all of the countries in the world (so we need to keep everything in map.world). This is why we’re using left_join().

Create “flag” to highlight specific countries

Even though we’re going to plot the entire world map, the point of this map is to highlight the top ten countries with the highest talent competitiveness. That being the case, we need a way to distinguish those countries in the newly merged data.

There are a few ways to do this, but we’ll do it by using dplyr::mutate() to create a “flag” variable called fill_flg.

Basically, if rank is NA (i.e., missing), we'll set the flag to "F" (false), and otherwise we'll set fill_flg to "T".

#===================================================
# CREATE FLAG
# - in the map, we're going to highlight
#   the countries with high 'talent competitiveness'
# - Here, we'll create a flag that will
#   indicate whether or not we want to 
#   "fill in" a particular country 
#   on the map
#===================================================

map.world_joined <- map.world_joined %>% mutate(fill_flg = ifelse(is.na(rank),F,T))
head(map.world_joined)

This will give us an indicator variable that we can use to highlight the countries with high talent competitiveness.

Create point locations for Singapore and Luxembourg

One last thing before we plot the map.

Two of the countries with high talent competitiveness, Singapore and Luxembourg, are very small. This makes them quite difficult to see on a map.

To make them more visible, we’re going to plot them as points.

To do this, we need to create a separate dataset that contains the latitude and longitude data for the center of these countries.

We’ll simply create a new dataset df.country_points that contains the country names. Then we’ll use ggmap::geocode() to get the latitude and longitude information.

Finally, we’ll use cbind() to attach the lat/long data to df.country_points.

#=======================================================
# CREATE POINT LOCATIONS FOR SINGAPORE AND LUXEMBOURG
# - Luxembourg and Singapore are countries with
#   high 'talent competitiveness'
# - But, they are both small on the map, and hard to see
# - We'll create points for each of these countries
#   so they are easier to see on the map
#=======================================================

df.country_points <- data.frame(country = c("Singapore","Luxembourg"),stringsAsFactors = F)
glimpse(df.country_points)

#--------
# GEOCODE
#--------

geocode.country_points <- geocode(df.country_points$country)

df.country_points <- cbind(df.country_points,geocode.country_points)

# INSPECT
print(df.country_points)

#    country        lon       lat
#  Singapore 103.819836  1.352083
# Luxembourg   6.129583 49.815273

When we finally inspect the data, we see that this dataset now has the names of those two countries, and the lat/long data that we need in order to plot them as points.

Plot the map using ggplot2

Here we go. Everything is ready. We can plot.

#=======
# MAP
#=======

ggplot() +
  geom_polygon(data = map.world_joined, aes(x = long, y = lat, group = group, fill = fill_flg)) +
  geom_point(data = df.country_points, aes(x = lon, y = lat), color = "#e60000") +
  scale_fill_manual(values = c("#CCCCCC","#e60000")) +
  labs(title = 'Countries with highest "talent competitiveness"'
       ,subtitle = "source: INSEAD, https://www.insead.edu/news/2017-global-talent-competitiveness-index-davos") +
  theme(text = element_text(family = "Gill Sans", color = "#FFFFFF")
        ,panel.background = element_rect(fill = "#444444")
        ,plot.background = element_rect(fill = "#444444")
        ,panel.grid = element_blank()
        ,plot.title = element_text(size = 30)
        ,plot.subtitle = element_text(size = 10)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "none"
        )

Here’s what the ggplot code produces:

Let’s break this down a little.

You’ll notice that we’re plotting the whole world map. We’re able to do this because we used the left_join() above, and kept all of our countries.

Now take a look at the highlighted countries. These are the 10 countries in the data that we scraped. How did we highlight them? We used the “flag” variable that we created, fill_flg, and mapped it to the fill = aesthetic. You’ll see a few lines later that we use scale_fill_manual(). scale_fill_manual() is specifying that we want to color a country grey (i.e., #CCCCCC) if fill_flg = F, and red (#e60000) if fill_flg = T. Do you get the logic now? We created the flag specifically so we could do this. We mapped the flag to the fill aesthetic, and then used scale_fill_manual() to control the exact colors.

One more thing to point out. We used 2 distinct layers in this map: the country map (given by geom_polygon()) and a separate layer of points (given by geom_point()). And what are the points that we've plotted? The locations of Singapore and Luxembourg. Earlier in this tutorial, we created that separate dataset with Singapore and Luxembourg, and this is why. We wanted to plot those 2 countries as points (because they are otherwise too small to really see on the map).

To do this, you mostly just need the foundations

Loyal Sharp Sight readers and talented data scientists will know the drill: learn the foundations.

  • ggplot2
  • dplyr tools, especially mutate()
  • a few tools from base R like cbind, rbind, ifelse()
  • a few tools from stringr

If you’re a beginner, the full code in this blog post might look complicated, but you need to realize that there are only a couple dozen core tools that we’re using here. That’s it.

That’s why I strongly recommend that you master the foundations first. If you can master a few dozen critical, high-frequency tools, a whole world of possibilities will open up. I repeat: if you want to master data science, master the basics.

Sign up to learn data science

People who know data science will have massive opportunities in the age of big data.

Sign up now to get our free Data Science Crash Course, where you’ll learn:

  • a step-by-step data science learning plan

  • the 1 programming language you need to learn

  • 3 essential data visualizations
  • how to do data manipulation in R
  • how to get started with machine learning
  • and more …

SIGN UP NOW

The post How to make a global map in R, step by step appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.

Source:: R News

Who is Alan Turing?

By @aschinchon

This government is committed to introducing posthumous pardons for people with certain historical sexual offence convictions who would be innocent of any crime now (British Government Spokesperson, September 2016)

Last September, the British government announced its intention to pursue what has become known as the Alan Turing law, offering exoneration to the tens of thousands of gay men convicted of historic charges. The law was finally unveiled on 20 October 2016.

This plot shows the daily views of Alan Turing's wikipedia page during the last 365 days:

There are three huge peaks, on May 27th, July 30th, and October 29th, that can be easily detected using the AnomalyDetection package:

After replacing these anomalies via simple linear imputation, it is clear that the time series has suffered a significant impact since the last days of September:

To estimate the number of incremental views since September 28th (the date I have chosen as the starting point), I use the CausalImpact package:


The last plot shows the accumulated effect. After 141 days, there have been around 1 million incremental views of Alan Turing's wikipedia page (more than 7,000 per day), and the effect does not seem ephemeral.

Alan Turing has won another battle, this time posthumous. And thanks to it, a lot of people have discovered his amazing legacy: long live Alan Turing.

This is the code I wrote to do the experiment:

library(httr)
library(jsonlite)
library(stringr)
library(xts)
library(highcharter)
library(AnomalyDetection)
library(imputeTS)
library(CausalImpact)
library(dplyr)

# Views last 365 days
(Sys.Date()-365) %>% str_replace_all("[[:punct:]]", "") %>% substr(1,8) -> date_ini
Sys.time()       %>% str_replace_all("[[:punct:]]", "") %>% substr(1,8) -> date_fin
url="https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Alan%20Turing/daily"
paste(url, date_ini, date_fin, sep="/") %>% 
  GET  %>% 
  content("text") %>% 
  fromJSON %>% 
  .[[1]] -> wikistats

# To prepare dataset for highcharter
wikistats %>% 
  mutate(day=str_sub(timestamp, start = 1, end = 8)) %>% 
  mutate(day=as.POSIXct(day, format="%Y%m%d", tz="UTC")) -> wikistats

# Highcharts viz
rownames(wikistats)=wikistats$day
wikistats %>% select(views) %>% as.xts  %>% hchart

# Anomaly detection
wikistats %>% select(day, views) -> tsdf
tsdf %>%  
  AnomalyDetectionTs(max_anoms=0.01, direction='both', plot=TRUE)->res
res$plot

# Imputation of anomalies
tsdf[tsdf$day %in% as.POSIXct(res$anoms$timestamp, format="%Y-%m-%d", tz="UTC"),"views"]<-NA 
ts(tsdf$views, frequency = 365) %>% 
  na.interpolation() %>% 
  xts(order.by=wikistats$day) -> tscleaned
tscleaned %>% hchart

# Causal Impact from September 28th
x=sum(index(tscleaned)<"2016-09-28 UTC")
impact <- CausalImpact(data = tscleaned %>% as.numeric, 
                       pre.period = c(1,x),
                       post.period = c(x+1,length(tscleaned)), 
                       model.args = list(niter = 5000, nseasons = 7),
                       alpha = 0.05)
plot(impact)

Source:: R News