By Sharp Sight
Maps are great for practicing data visualization. First of all, there’s a lot of data available on places like Wikipedia that you can map.
Moreover, creating maps typically requires several essential skills in combination. Specifically, you commonly need to be able to retrieve the data (e.g., scrape it), mold it into shape, perform a join, and visualize it. Because creating maps requires several skills from data manipulation and data visualization, creating them will be great practice for you.
And if that’s not enough, a good map just looks great. They’re visually compelling.
With that in mind, I want to walk you through the logic of building one step by step.
In the last several blog posts about maps, I gave some detail about how to create them, but in this post, I want to give a little more detail about the logic.
I want you to understand the thinking behind the code. This post will show you how to make a map, and explain the R code that creates it.
Ok, let’s get to it.
First, we’ll just load the packages we’re going to use (this assumes they’re already installed). Pretty straightforward.
#=================
# LOAD PACKAGES
#=================
library(tidyverse)
library(rvest)
library(magrittr)
library(ggmap)
library(stringr)
Scrape data from website
The data that we’re going to plot exists as a table on a website.
The data is a list of the top 10 countries with the highest ‘talent competitiveness’ (i.e., the countries that are most attractive to talented workers). The data was created as part of an analysis by the business school, INSEAD.
If you go to the webpage that summarizes the talent competitiveness report, you’ll find the following table:
Global Talent Competitiveness Index 2017 Rankings: Top Ten
This is the data we’re after, and we have to scrape it and pull it into R.
To do this, we need to use a few tools from the rvest package.
First, we’ll use read_html() to connect to the URL where the data lives (i.e., connect to the webpage).
Next, you’ll see that we use the magrittr pipe %>%. We’re using the pipe operator to chain together a set of other web scraping functions. Specifically, you’ll see that we’re using html_nodes() and html_table(), with magrittr’s extract2(1) in between to pick out the first table on the page. Essentially, we’re using these in combination to extract the table that contains the data and parse it into a dataframe.
#============
# SCRAPE DATA
#============
html.global_talent <- read_html("https://www.insead.edu/news/2017-global-talent-competitiveness-index-davos")

df.global_talent_RAW <- html.global_talent %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table()
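If you want to see what that pipeline does without hitting the live page, here’s a minimal sketch that runs the same steps on a small inline HTML string. The HTML snippet and the object names (html.toy, df.toy) are made up for illustration; only the function calls mirror the code above.

```r
library(rvest)
library(magrittr)

# Hypothetical stand-in for the web page: one table with no header row
html.toy <- read_html(paste0(
  "<table>",
  "<tr><td>1</td><td>Switzerland</td><td>6</td><td>Australia</td></tr>",
  "<tr><td>2</td><td>Singapore</td><td>7</td><td>Luxembourg</td></tr>",
  "</table>"
))

df.toy <- html.toy %>%
  html_nodes("table") %>%  # find every <table> node in the document
  extract2(1) %>%          # keep the first one (equivalent to [[1]])
  html_table()             # parse that node into a data frame

print(df.toy)
```

Because the toy table has no header row, html_table() should fall back to generic column names, which is the same situation we run into with the real table.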
Data manipulation: split and recombine into long-form dataframe
If you inspect the data, you’ll notice that the ranks are split up between two columns. The country names are also split between two columns.
print(df.global_talent_RAW)
# X1 X2              X3 X4
# 1  Switzerland     6  Australia
# 2  Singapore       7  Luxembourg
# 3  United Kingdom  8  Denmark
# 4  United States   9  Finland
# 5  Sweden          10 Norway
This is an improper structure for the data, so we need to restructure it. Specifically, we need all of the ranks in one column and all of the countries in another column.
To do this we’re going to split df.global_talent_RAW into two parts: df.global_talent_1 and df.global_talent_2. Then we’ll take those two parts and stack them up on top of each other, such that all of the ranks are in one column, and the country names are in another column.
To split the data, we’ll just use dplyr::select() to select the specific columns we want in df.global_talent_1 and df.global_talent_2 respectively.
Notice that we also rename the columns using dplyr::rename():
#=============================================
# SPLIT INTO 2 DATA FRAMES
# - the data are split into 4 columns, whereas
#   we want all of the data in 2 columns
#=============================================
df.global_talent_1 <- df.global_talent_RAW %>%
  select(X1, X2) %>%
  rename(rank = X1, country = X2)

df.global_talent_2 <- df.global_talent_RAW %>%
  select(X3, X4) %>%
  rename(rank = X3, country = X4)
After splitting them, we’ll recombine them with rbind():
#===========
# RECOMBINE
#===========
df.global_talent <- rbind(df.global_talent_1, df.global_talent_2)

# INSPECT
glimpse(df.global_talent)
print(df.global_talent)
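Here’s the same split-and-stack logic on a tiny made-up table, in case you want to experiment with it in isolation. The data and the names df.wide, df.left, df.right, and df.long are hypothetical. As a small shortcut, this sketch renames inside select() instead of calling rename() separately; the result should be the same.

```r
library(dplyr)

# Hypothetical wide table shaped like the scraped one
df.wide <- data.frame(
  X1 = c("1", "2"), X2 = c("Switzerland", "Singapore"),
  X3 = c("6", "7"), X4 = c("Australia", "Luxembourg"),
  stringsAsFactors = FALSE
)

# Left half and right half, renamed to the same column names
df.left  <- df.wide %>% select(rank = X1, country = X2)
df.right <- df.wide %>% select(rank = X3, country = X4)

# Stack the halves; rbind() needs the column names to match
df.long <- rbind(df.left, df.right)
print(df.long)
```

The renaming step matters: rbind() on data frames matches columns by name, so both halves need identical column names before they can be stacked.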
Data manipulation: trim excess whitespace
Now, we’ll do some simple string manipulation to modify the country names.
Specifically, if you use glimpse() and look at the data, you’ll see that the country names have leading spaces.
glimpse(df.global_talent)
# Observations: 10
# Variables: 2
# $ rank    <chr> " 1", " 2", " 3", " 4", " 5", " 6", " 7", " 8", " 9", " 10"
# $ country <chr> " Switzerland", " Singapore", " United Kingdom", " United States", " Sw...
We need to remove these.
To do so, we’ll use str_trim() from the stringr package.
#==========================
# STRIP LEADING WHITE SPACE
#==========================
df.global_talent <- df.global_talent %>%
  mutate(country = str_trim(country)
         ,rank = str_trim(rank)
         )
Notice that we’re doing this inside of dplyr::mutate(), which allows us to modify a variable inside of a data frame. Essentially, we’re taking the dataset df.global_talent and piping it into mutate(). Then inside of mutate(), we’re calling str_trim() to do the real work of stripping off the excess whitespace.
Now if we print out the data, it looks good.
# INSPECT
print(df.global_talent)
# rank country
# 1    Switzerland
# 2    Singapore
# 3    United Kingdom
# 4    United States
# 5    Sweden
# 6    Australia
# 7    Luxembourg
# 8    Denmark
# 9    Finland
# 10   Norway
All of the ranks are in one column, and the country names are in another column.
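To see what str_trim() does on its own, outside of mutate(), here’s a small sketch on a made-up character vector (the vector x is just for illustration):

```r
library(stringr)

# Made-up input with leading and trailing spaces
x <- c(" Switzerland", " United Kingdom ", "Sweden")

trimmed_both <- str_trim(x)                 # trims both ends by default
trimmed_left <- str_trim(x, side = "left")  # trims leading whitespace only

print(trimmed_both)
print(trimmed_left)
```

Notice that the space inside "United Kingdom" is left alone. str_trim() only touches whitespace at the ends of a string.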
Get world map
Next, we’re going to get a map of the world.
This is very straightforward. To do this, we’ll use map_data("world") from ggplot2 (which in turn needs the maps package to be installed).
#==============
# GET WORLD MAP
#==============
map.world <- map_data("world")
Recode country names
The next thing we need to do is join the global talent data, df.global_talent (which we want to map), to the map itself, map.world.
To do this, we need to use a join operation. However, to join these two separate datasets together, we’ll need the country names to be exactly the same. The problem is that the country names are not all exactly the same in the two different datasets.
For the time being, I’ll set aside how to detect these dissimilarities between country names. Suffice it to say, the country names in df.global_talent do not all exactly match the country names in map.world.
That being the case, we’ll need to recode the names that don’t match. Specifically, out of the 10 countries in df.global_talent, 2 don’t match the names in map.world: United States and United Kingdom.
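For the curious, one simple way to spot mismatches like these is setdiff(). This is only a sketch of one possible approach, not necessarily how the two names were originally found. The vector map_regions below is a small hypothetical stand-in for unique(map.world$region), and talent_countries is a subset of the scraped names.

```r
# Hypothetical stand-in for unique(map.world$region)
map_regions <- c("Switzerland", "Singapore", "UK", "USA", "Sweden", "Brazil")

# A few country names as they appear in the scraped data
talent_countries <- c("Switzerland", "Singapore", "United Kingdom",
                      "United States", "Sweden")

# Names in the talent data that have no exact match in the map
unmatched <- setdiff(talent_countries, map_regions)
print(unmatched)
```

With the real data, the equivalent call would be setdiff(df.global_talent$country, unique(map.world$region)).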
We’ll recode them with dplyr::recode():
#===========================================
# RECODE NAMES
# - Two names in the 'global talent' data
#   are not the same as the names in the
#   map
# - We need to re-name these so they match
# - If they don't match, we won't be able to
#   join the datasets
#===========================================

# INSPECT
as.factor(df.global_talent$country) %>% levels()

# RECODE NAMES
df.global_talent$country <- recode(df.global_talent$country
                                   ,'United States' = 'USA'
                                   ,'United Kingdom' = 'UK'
                                   )
This is fairly straightforward. Unlike most dplyr functions, which take a data frame, recode() works on a vector. The first argument inside of recode() is the vector we want to change (df.global_talent$country). Then we have a set of name-pairs, with the old name on the left hand side (e.g., United States) and the new name on the right hand side (e.g., USA).
If we take a quick look after executing this, you’ll see that United States and United Kingdom were successfully recoded to USA and UK:
# INSPECT
print(df.global_talent)
# rank country
# 1    Switzerland
# 2    Singapore
# 3    UK
# 4    USA
# 5    Sweden
# 6    Australia
# 7    Luxembourg
# 8    Denmark
# 9    Finland
# 10   Norway
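If you want to try recode() by itself, here’s a minimal sketch on a plain character vector. The vector countries is made up for illustration.

```r
library(dplyr)

countries <- c("United States", "United Kingdom", "Sweden")

# Old name on the left, new name on the right;
# values that are not listed are left unchanged
recoded <- recode(countries,
                  "United States"  = "USA",
                  "United Kingdom" = "UK")
print(recoded)
```

Values that don’t appear in the name-pairs (like Sweden here) pass through untouched, which is why we only had to list the two mismatched countries.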
Join 2 datasets together
Next, we’ll join together map.world and df.global_talent.
#================================
# JOIN
# - join the 'global talent' data
#   to the world map
#================================

# LEFT JOIN
map.world_joined <- left_join(map.world, df.global_talent, by = c('region' = 'country'))
If you’re familiar with joins, this is fairly straightforward. Of course, if you don’t use joins often, this might not make sense.
Essentially, we’re using left_join() to join these together. Explaining how joins work is beyond the scope of this blog post, but I’ll quickly explain how this works.
Take a look at the dataset map.world. It contains the data to create a world map.
It contains essentially all of the countries of the world, as well as information that’s required to plot those countries as polygons. That being the case, it contains information on over 100 countries.
On the other hand, df.global_talent, only contains 10 countries.
The objective right now is to join the two datasets by looking for a match on the country name.
If the join operation finds a match, then great. It’s a match, and it will combine the records from the two different datasets.
But what if there’s not a match? For example, “Brazil” is in map.world, but it’s not in df.global_talent. What happens then?
How these cases of “match” and “no match” are handled is the critical feature of different join types.
In the case of a “left join” we’ll keep everything in the “left hand” dataset. Which is the “left hand” dataset? Don’t overthink it. It’s the dataset that’s syntactically on the left in the following line of code:
left_join(map.world, df.global_talent, by = c('region' = 'country'))
That is, map.world is the “left hand” dataset.
In a left join, the operation will keep everything in the left hand dataset, even if there’s not a match. Moreover, when there is a match, it will attach the data in the “right hand” dataset.
This is important, because we want to plot all of the countries in the world (so we need to keep everything in map.world). This is why we’re using left_join().
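To make the “match” and “no match” behavior concrete, here’s a minimal sketch with two tiny made-up datasets. The names map.toy, talent.toy, and joined.toy, and the coordinate values in them, are hypothetical and only there to illustrate the join.

```r
library(dplyr)

# Hypothetical miniature "map": two points for Sweden, one for Brazil
map.toy <- data.frame(region = c("Sweden", "Sweden", "Brazil"),
                      long   = c(18.1, 12.0, -47.9),
                      lat    = c(59.3, 57.7, -15.8),
                      stringsAsFactors = FALSE)

# Hypothetical miniature "talent" data: only Sweden is ranked
talent.toy <- data.frame(country = "Sweden",
                         rank    = "5",
                         stringsAsFactors = FALSE)

# Every row of map.toy is kept; rank is attached where the names match
joined.toy <- left_join(map.toy, talent.toy, by = c("region" = "country"))
print(joined.toy)
```

All three rows of the left hand dataset survive. The Sweden rows pick up their rank, and the Brazil row gets NA for rank because there was nothing to match.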
Create “flag” to highlight specific countries
Even though we’re going to plot the entire world map, the point of this map is to highlight the top ten countries with the highest talent competitiveness. That being the case, we need a way to distinguish those countries in the newly merged data.
There are a few ways to do this, but we’ll do it by using dplyr::mutate() to create a “flag” variable called fill_flg.
Basically, if rank is NA (i.e., the country had no match in the join), we’ll set fill_flg to F (i.e., FALSE), and otherwise we’ll set fill_flg to T (TRUE).
#===================================================
# CREATE FLAG
# - in the map, we're going to highlight
#   the countries with high 'talent competitiveness'
# - Here, we'll create a flag that will
#   indicate whether or not we want to
#   "fill in" a particular country
#   on the map
#===================================================
map.world_joined <- map.world_joined %>%
  mutate(fill_flg = ifelse(is.na(rank),F,T))

head(map.world_joined)
This will give us an indicator variable that we can use to highlight the countries with high talent competitiveness.
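Here’s the flag logic on a two-row made-up dataset (df.flag_toy is hypothetical). The sketch also shows an equivalent, shorter way to write the same flag with !is.na(), in case you prefer it.

```r
library(dplyr)

# Hypothetical joined data: Sweden matched a rank, Brazil did not
df.flag_toy <- data.frame(region = c("Sweden", "Brazil"),
                          rank   = c("5", NA),
                          stringsAsFactors = FALSE)

df.flag_toy <- df.flag_toy %>%
  mutate(fill_flg     = ifelse(is.na(rank), FALSE, TRUE),  # same logic as above
         fill_flg_alt = !is.na(rank))                      # equivalent shorthand

print(df.flag_toy)
```

Both columns come out the same: TRUE for the ranked country and FALSE for the unranked one.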
Create point locations for Singapore and Luxembourg
One last thing before we plot the map.
Two of the countries with high talent competitiveness, Singapore and Luxembourg, are very small. This makes them quite difficult to see on a map.
To make them more visible, we’re going to plot them as points.
To do this, we need to create a separate dataset that contains the latitude and longitude data for the center of these countries.
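We’ll simply create a new dataset df.country_points that contains the country names. Then we’ll use ggmap::geocode() to get the latitude and longitude information. (Note that geocode() queries the Google Geocoding API, and recent versions of ggmap require you to register an API key with register_google() before it will return results.)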
Finally, we’ll use cbind() to attach the lat/long data to df.country_points.
#=======================================================
# CREATE POINT LOCATIONS FOR SINGAPORE AND LUXEMBOURG
# - Luxembourg and Singapore are countries with
#   high 'talent competitiveness'
# - But, they are both small on the map, and hard to see
# - We'll create points for each of these countries
#   so they are easier to see on the map
#=======================================================
df.country_points <- data.frame(country = c("Singapore","Luxembourg"),stringsAsFactors = F)
glimpse(df.country_points)

#--------
# GEOCODE
#--------
geocode.country_points <- geocode(df.country_points$country)

df.country_points <- cbind(df.country_points,geocode.country_points)

# INSPECT
print(df.country_points)
#    country        lon       lat
#  Singapore 103.819836  1.352083
# Luxembourg   6.129583 49.815273
When we finally inspect the data, we see that this dataset now has the names of those two countries, and the lat/long data that we need in order to plot them as points.
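If you can’t (or don’t want to) call the geocoding service, one workaround is to type the coordinates in by hand. The sketch below does that using the lon/lat values printed above, and shows the same cbind() step. The names df.points and coords are hypothetical.

```r
# Country names, as in the code above
df.points <- data.frame(country = c("Singapore", "Luxembourg"),
                        stringsAsFactors = FALSE)

# Coordinates copied from the geocode() output shown above
coords <- data.frame(lon = c(103.819836, 6.129583),
                     lat = c(1.352083, 49.815273))

# cbind() attaches the columns side by side, matching rows by position
df.points <- cbind(df.points, coords)
print(df.points)
```

Keep in mind that cbind() lines rows up purely by position, so the coordinates have to be in the same order as the country names.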
Plot the map using ggplot2
Here we go. Everything is ready. We can plot.
#=======
# MAP
#=======
ggplot() +
  geom_polygon(data = map.world_joined, aes(x = long, y = lat, group = group, fill = fill_flg)) +
  geom_point(data = df.country_points, aes(x = lon, y = lat), color = "#e60000") +
  scale_fill_manual(values = c("#CCCCCC","#e60000")) +
  labs(title = 'Countries with highest "talent competitiveness"'
       ,subtitle = "source: INSEAD, https://www.insead.edu/news/2017-global-talent-competitiveness-index-davos") +
  theme(text = element_text(family = "Gill Sans", color = "#FFFFFF")
        ,panel.background = element_rect(fill = "#444444")
        ,plot.background = element_rect(fill = "#444444")
        ,panel.grid = element_blank()
        ,plot.title = element_text(size = 30)
        ,plot.subtitle = element_text(size = 10)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "none"
        )
Here’s what the ggplot code produces:
Let’s break this down a little.
You’ll notice that we’re plotting the whole world map. We’re able to do this because we used the left_join() above, and kept all of our countries.
Now take a look at the highlighted countries. These are the 10 countries in the data that we scraped. How did we highlight them? We used the “flag” variable that we created, fill_flg, and mapped it to the fill = aesthetic. You’ll see a few lines later that we use scale_fill_manual(). scale_fill_manual() is specifying that we want to color a country grey (i.e., #CCCCCC) if fill_flg = F, and red (#e60000) if fill_flg = T. Do you get the logic now? We created the flag specifically so we could do this. We mapped the flag to the fill aesthetic, and then used scale_fill_manual() to control the exact colors.
One more thing to point out. We used 2 distinct layers in this map: the country map (given by geom_polygon()) and a separate layer of points (given by geom_point()). And what are the points that we’ve plotted? The locations of Singapore and Luxembourg. Earlier in this tutorial, we created that separate dataset with Singapore and Luxembourg, and this is why. We wanted to plot those 2 countries as points (because they are otherwise too small to really see on the map).
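To see the two-layer logic without the full world map, here’s a stripped-down sketch. It uses two made-up squares in place of countries and one made-up point in place of a small country. The data frames polys and pts and all of their values are hypothetical.

```r
library(ggplot2)

# Two made-up square "countries"; group keeps their outlines separate
polys <- data.frame(
  long     = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat      = c(0, 0, 1, 1,  0, 0, 1, 1),
  group    = rep(c(1, 2), each = 4),
  fill_flg = rep(c(FALSE, TRUE), each = 4)
)

# One made-up point standing in for a country too small to see
pts <- data.frame(lon = 4, lat = 0.5)

p <- ggplot() +
  # layer 1: polygons, colored by the flag
  geom_polygon(data = polys, aes(x = long, y = lat, group = group, fill = fill_flg)) +
  # layer 2: points, drawn from a separate dataset
  geom_point(data = pts, aes(x = lon, y = lat), color = "#e60000") +
  # FALSE gets the first color (grey), TRUE gets the second (red)
  scale_fill_manual(values = c("#CCCCCC", "#e60000"))

print(p)
```

Each geom gets its own data = argument, which is what lets the polygons and the points come from two different datasets in the same plot.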
To do this, you mostly just need the foundations
Loyal Sharp Sight readers and talented data scientists will know the drill: learn the foundations. To build this map, that mostly meant:
- dplyr tools, especially mutate()
- a few tools from base R like cbind(), rbind(), and ifelse()
- a few tools from stringr
If you’re a beginner, the full code in this blog post might look complicated, but you need to realize that there are only a couple dozen core tools that we’re using here. That’s it.
That’s why I strongly recommend that you master the foundations first. If you can master a few dozen critical, high-frequency tools, a whole world of possibilities will open up. I repeat: if you want to master data science, master the basics.