By Sharp Sight
As a quick followup to last week’s mapping exercise (where we mapped the largest European cities), I want to map the largest cities in Asia.
When we did this last week, we used a variety of tools from the Tidyverse to scrape and wrangle the data, and we ultimately mapped the data using base ggplot2.
In this blog post, we’re going to scrape and wrangle the data in a very similar way, but we will visualize with a combination of ggmap() and ggplot(). Using ggmap() will allow us to plot the data on a specialized “watercolor” map. This is a minor change, but it makes the finalized map a little more aesthetically interesting.
Let’s jump in.
First, we’ll load the packages that we’re going to use.
#==============
# LOAD PACKAGES
#==============
library(rvest)
library(tidyverse)
library(ggmap)
library(stringr)
Next, we’ll scrape the data using the rvest package.
Explaining exactly how rvest works is beyond the scope of this post, but notice that we’re using several functions in series by using the pipe operator (%>%).
Essentially, we are using html_nodes() to select the “tables” from the Wikipedia page. We want the third table on the page, so we identify it with .[[3]]. Then we use html_table() to extract the text from the table.
The brilliant thing about rvest and the Tidyverse packages is that you can ‘chain’ these functions together in series using %>%. This makes writing and debugging your code easier, because you can build and test everything step by step.
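#===========================
# SCRAPE DATA FROM WIKIPEDIA
#===========================
html.population <- read_html("https://en.wikipedia.org/wiki/List_of_Asian_cities_by_population_within_city_limits")

df.asia_cities <- html.population %>%
  html_nodes("table") %>%
  .[[3]] %>%   # the third table on the page
  html_table(fill = TRUE)

# inspect
df.asia_cities %>% head()
df.asia_cities %>% names()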
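To try this chaining pattern without touching Wikipedia, here is a minimal sketch that parses a small hand-written HTML string instead of a live page. The HTML snippet, the object names (toy.html, df.toy), and the values are all made up for illustration; only the rvest functions are the ones discussed above.

```r
library(rvest)

# A tiny hand-written "page" with two tables (illustrative values only)
toy.html <- read_html('
  <table>
    <tr><th>note</th></tr>
    <tr><td>ignore me</td></tr>
  </table>
  <table>
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>City A</td><td>100</td></tr>
    <tr><td>City B</td><td>200</td></tr>
  </table>')

# Select every table, keep the second one, and convert it to a data frame
df.toy <- toy.html %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table()

# inspect
df.toy %>% names()
df.toy %>% head()
```

Because each step is a separate link in the chain, you can run the pipeline up to any %>% and inspect the intermediate result before adding the next step.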
After executing the web scraping code and inspecting the resulting data, we are going to begin some data wrangling.
First, we’re just going to remove some variables.
When we scraped the data, there were several columns that we don’t care about, like Date and Image. These were appropriate on the Wikipedia page, but we don’t need them for our data visualization.
There are several ways to remove these, but a quick way is to use the select() function from dplyr. To remove unwanted variables, all we need to do is put a minus sign (‘-’) in front of the variable name.
Syntactically, this is really easy. In fact, one of the reasons that I strongly recommend using tools from the Tidyverse (like dplyr::select()) is that they are easy to learn, easy to memorize, and ultimately easy to use.
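#============================
# REMOVE EXTRANEOUS VARIABLES
#============================
# Date and Image are the columns named in the text above;
# drop any other unwanted columns in the same way
df.asia_cities <- df.asia_cities %>% select(-Date, -Image)

# inspect
df.asia_cities %>% names()
df.asia_cities %>% head()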
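As a small self-contained illustration of dropping columns with a minus sign, here is a sketch on a made-up data frame. The object names (df.toy, df.toy_small) and the values are hypothetical; the column names Date and Image simply mirror the ones mentioned above.

```r
library(dplyr)

# Made-up data frame with two columns we do not need
df.toy <- tibble(
  City       = c("City A", "City B"),
  Date       = c("2015", "2016"),
  Image      = c("", ""),
  Population = c(100, 200)
)

# A minus sign in front of a column name drops that column
df.toy_small <- df.toy %>% select(-Date, -Image)

# inspect
df.toy_small %>% names()
```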
Now we have only three variables in the data, but we need to clean the variable names up a little.
Ideally, you want simple variable names. You also typically want your variable names to start with lower case letters (they are easier to type that way).
Here, we will manually provide new column names. We are using the c() function to create a vector of strings (the new names), and we are assigning that vector of names as the column names of df.asia_cities by using the colnames() function.
#===============
# RENAME COLUMNS
#===============
# names assumed from the columns used later in this post
colnames(df.asia_cities) <- c("city", "country", "population")

# inspect
df.asia_cities %>% colnames()
df.asia_cities %>% head()

#-------------------------------------------------------------------
# REMOVE EXTRA ROW AT TOP
# - when we scraped the data, part of the column name
#   for the "Population, City proper" column
#   was parsed as a row of data, instead of part of the column name
# - here, we're just removing that extraneous row
#-------------------------------------------------------------------
df.asia_cities <- df.asia_cities %>% filter(row_number() != 1)

# inspect
df.asia_cities %>% head()
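Here is a minimal sketch of the same two steps (renaming with colnames() and dropping a stray first row) on a made-up data frame. The data, the object name df.toy, and the use of filter(row_number() != 1) to drop the row are assumptions for illustration; there are other equally valid ways to remove a row.

```r
library(dplyr)

# Made-up data where the first row is really a leftover piece of the header
df.toy <- tibble(
  City                     = c("", "City A", "City B"),
  Country                  = c("", "Country X", "Country Y"),
  `Population City proper` = c("City proper", "100", "200")
)

# Assign simple, lower-case names as a character vector
colnames(df.toy) <- c("city", "country", "population")

# Drop the extraneous first row
df.toy <- df.toy %>% filter(row_number() != 1)

# inspect
df.toy %>% head()
```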
Now that the names are cleaned up, we will do a little cleaning on the data values.
On the Wikipedia table that originally contained the data, there were some footnotes associated with the population numbers. The footnotes were marked by bracketed numbers (e.g. [1]).
We need to remove those footnote markers. To do this, we will use stringr::str_replace_all() inside of dplyr::mutate(). Remember that dplyr::mutate() allows us to “mutate” a dataframe and modify existing variables.
We will use a regular expression inside of str_replace_all() to identify the footnote markers and “replace” them with empty strings (i.e., we are replacing the information with nothing at all). The square brackets must be escaped in the pattern, because an unescaped [ ] pair defines a character class in a regular expression.
#==========================================================================
# REMOVE "notes" FROM POPULATION NUMBERS
# - two cities had extra characters for footnotes
#   ... we will remove these using stringr::str_replace_all() and dplyr::mutate()
#==========================================================================
df.asia_cities <- df.asia_cities %>%
  mutate(population = str_replace_all(population, "\\[.*\\]", "") %>% parse_number())

# inspect
df.asia_cities %>% head()
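To see what the regular expression does in isolation, here is a small sketch on a made-up character vector. The values and the object names (raw.population, clean.population) are hypothetical. Note the escaped brackets: without the backslashes, "[.*]" would be read as a character class that matches a literal period or asterisk rather than a bracketed footnote.

```r
library(stringr)
library(readr)

# Made-up population strings, two of them with footnote markers
raw.population <- c("1,000,000", "2,500,000[12]", "3,250,000[a]")

clean.population <- raw.population %>%
  str_replace_all("\\[.*\\]", "") %>%   # strip anything inside square brackets
  parse_number()                        # drop the commas and convert to numeric

# inspect
clean.population
```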
Now that we’ve removed the footnote characters, we need to create a new variable.
We will create a variable that has both the city and country in the following format: “Tokyo, Japan”.
You might be wondering why we are creating this new variable; the data already has a ‘city’ column as well as a ‘country’ column. Why do we want to duplicate this by creating a combined city/country column?
You’ll see why in a moment, but essentially, we will need this new column when we “geocode” our data to get the latitude and longitude coordinates of every city (we will need the lat/long when we plot the data).
The problem is that if we geocode based on only the city name (without the country), the geocoding process can encounter some errors due to ambiguity. (For example, would the city name “Naples” refer to Naples, Florida or Naples, Italy?)
To make sure that we don’t have any problems, we will create a variable that contains both city and country information.
#==============================================================
# CREATE VARIABLE: "city_full_name"
# - we need to have a combined name of the form 'City, Country'
# - we need this because when we use the geocode() function to
#   get long/lat data, there is some ambiguity in the city names
#==============================================================
df.asia_cities <- df.asia_cities %>%
  mutate(city_full_name = str_c(city, country, sep = ', '))

# inspect
df.asia_cities %>% head()
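A tiny sketch of the same idea, using the “Naples” example from above. The data frame df.toy is made up for illustration.

```r
library(dplyr)
library(stringr)

# Two different cities that share a name
df.toy <- tibble(
  city    = c("Naples", "Naples"),
  country = c("Italy", "United States")
)

# Combine city and country into a single, unambiguous string
df.toy <- df.toy %>%
  mutate(city_full_name = str_c(city, country, sep = ", "))

# inspect
df.toy$city_full_name
```

With the country appended, the two rows are no longer identical strings, so a geocoding service has enough context to tell them apart.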
Before we move on, we’ll quickly reorder the variables.
This is a quick and simple use of dplyr::select(). When we use select(), all we need to do is list out the variable names in the exact order we want them to appear in the data frame.
#==================
# REORDER VARIABLES
#==================
df.asia_cities <- df.asia_cities %>%
  select(city, country, city_full_name, population)

# inspect
df.asia_cities %>% head()

#========================================
# COERCE TO TIBBLE
# - this just makes the data print better
#========================================
df.asia_cities <- df.asia_cities %>% as_tibble()
Now we’re going to obtain the longitude and latitude data by using ggmap::geocode().
After obtaining the geo-data, we’ll join it to the original data using cbind().
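#========================================================
# GEOCODE
# - here, we're just getting longitude and latitude data
#   using ggmap::geocode()
#========================================================
data.geo <- geocode(df.asia_cities$city_full_name)

# join the coordinates to the city data
df.asia_cities <- cbind(df.asia_cities, data.geo)

# inspect
df.asia_cities %>% head()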
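Since geocode() calls an external service (and, as far as I know, recent versions of ggmap need a Google API key registered with register_google() before it returns results), here is an offline sketch of just the cbind() step. The data frame data.geo_toy is a hand-typed stand-in for geocode() output with rounded, approximate coordinates; all object names are hypothetical.

```r
# City names in the 'City, Country' format
df.toy <- data.frame(
  city_full_name = c("Tokyo, Japan", "Delhi, India"),
  stringsAsFactors = FALSE
)

# Stand-in for the geocode() result: one row per city, columns lon and lat
# (approximate values typed by hand, not returned by any service)
data.geo_toy <- data.frame(
  lon = c(139.7, 77.2),
  lat = c(35.7, 28.6)
)

# cbind() matches rows purely by position, so check the row counts first
stopifnot(nrow(df.toy) == nrow(data.geo_toy))
df.toy <- cbind(df.toy, data.geo_toy)

# inspect
head(df.toy)
```

Because cbind() pairs rows by position rather than by a key, it only works when the geocoded rows come back in the same order as the input, which is the case when you geocode the column directly.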
To map the data points that we’ve just gathered, we will need a map on which to plot them.
In recent posts, we have been using the map_data() function to retrieve a set of polygons; essentially, we’ve been just getting the country shapes, plotting them with ggplot(), and then plotting data points on top of the polygons.
Here though, we’re going to do something different. We will use the get_map() function to get a map of Asia from a 3rd party source. I won’t explain ggmap and get_map() completely here, but essentially, these tools allow you to get maps from Google Maps and other sources.
Here, we will retrieve a map from Stamen Maps. Stamen is a design firm in San Francisco that has created a set of maps that we can query by using get_map(). There are a variety of different “types” of maps, and in this case, we will get a “watercolor” map.
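#=============
# GET ASIA MAP
#=============
# the location and zoom values here are assumptions;
# adjust them to frame the region you want
map.asia <- get_map("Asia", zoom = 3, source = "stamen", maptype = "watercolor")

# inspect the map
map.asia %>% ggmap()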
What’s great about this (and one of the reasons that I like the Tidyverse) is that the tools are largely interchangeable. It is extremely easy to use a map from get_map() in a visualization instead of a basic polygon from map_data().
Ok. Now that we have all of the components (the cleaned dataset and the background watercolor-map of Asia) we can plot the data.
First, we will do a quick “first iteration” version just to check everything. This version is unformatted; we just want to plot the data to make sure that the points are aligned properly, and that our data is properly “cleaned.”
#============================================================================
# PLOT CITIES ON MAP
# - we are using the watercolor map of asia as the background (using ggmap())
# - we are using geom_point() to plot the city data as points
#   on top of the map
#============================================================================

# FIRST ITERATION
ggmap(map.asia) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1)
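The two geom_point() layers work together: the first draws a translucent filled circle, and the second draws a hollow outline (shape = 1) on top of it. Here is a minimal sketch of that layering on a plain ggplot() canvas, without the map background, so it runs without downloading any tiles. The data frame df.toy and its values are made up for illustration.

```r
library(ggplot2)

# Made-up points with rough lon/lat values and arbitrary sizes
df.toy <- data.frame(
  lon        = c(139.7, 77.2, 121.5),
  lat        = c(35.7, 28.6, 31.2),
  population = c(300, 200, 100)
)

p.toy <- ggplot(df.toy, aes(x = lon, y = lat, size = population)) +
  geom_point(color = "red", alpha = .3) +   # translucent filled circle
  geom_point(color = "red", shape = 1)      # hollow outline drawn on top
```

Print p.toy to draw the plot. Swapping ggplot(df.toy, ...) for ggmap(map.asia) and passing the data to each layer, as in the code above, puts the same points on the watercolor background.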
To be clear, the “first iteration” map that I’m showing you looks pretty good; there’s nothing wrong with the data, etc. However, when I initially ran this code, I found a few things amiss, and had to go back and make some adjustments to the previous data wrangling code. Keep that in mind. As you progress through a project, you may find things that are wrong with your data, and you need to iteratively go back and adjust your code until you get everything just right.
Now that we have a “first iteration” of our map that looks good, we’re going to polish everything.
This is where we will modify the theme elements of the plot, add a title and subtitle, and remove extraneous non-data elements (like the axis ticks).
#==================================================
# FINALIZED MAP
# - here I've added titles, modified theme elements
#   like the text, etc
#==================================================
ggmap(map.asia) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .1) +
  geom_point(data = df.asia_cities, aes(x = lon, y = lat, size = population), color = "red", shape = 1) +
  labs(x = NULL, y = NULL) +
  labs(size = 'Population (millions)') +
  labs(title = "Largest Cities in Asia", subtitle = "source: https://en.wikipedia.org/wiki/List_of_Asian_cities_by_population_within_city_limits") +
  scale_size_continuous(range = c(.6,18), labels = scales::comma_format(), breaks = c(1500000, 10000000, 20000000)) +
  theme(text = element_text(color = "#4A4A4A", family = "Gill Sans")) +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(plot.title = element_text(size = 32)) +
  theme(plot.subtitle = element_text(size = 10)) +
  theme(legend.key = element_rect(fill = "white"))
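The labels argument of scale_size_continuous() takes a function that turns the numeric breaks into legend text. As a small sketch of what scales::comma_format() does to the three breaks used above (the object names legend.breaks, format_comma, and legend.labels are hypothetical):

```r
library(scales)

# The same break values passed to scale_size_continuous() above
legend.breaks <- c(1500000, 10000000, 20000000)

# comma_format() returns a function; calling it formats the numbers
format_comma <- comma_format()
legend.labels <- format_comma(legend.breaks)

# inspect
legend.labels
```

This is why the size legend shows values with thousands separators instead of scientific notation.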
And here is the finalized map:
I want to stress that if you don’t already know the most important functions backwards and forwards, you should focus your time on memorizing those first. Your first goal is to memorize the essential syntax. Once you’ve mastered that basic toolkit, small projects like this are excellent practice, and they will help you “put the pieces together.”