By Sharp Sight
What that means is that you need to identify the most important tools and functions of the Tidyverse, and then practice them until you are fluent.
But once you have mastered the essential functions as isolated units, you need to put them together. By putting the individual pieces together, you solidify your knowledge of how they work individually, and you also begin to learn how to combine small tools to create novel effects.
With that in mind, I want to show you another small project. Here, we’re going to use a fairly small set of functions to create a map of the largest cities in Europe.
As we do this, pay attention:
- How many packages and functions do you really need?
- Evaluate: how long would it really take to memorize each individual function? (Hint: it’s much, much less time than you think.)
- Which functions have you seen before? Are some of the functions and techniques used more often than others (if you look across many different analyses)?
Ok, with those questions in mind, let’s get after it.
First we’ll just load a few packages.
#==============
# LOAD PACKAGES
#==============
library(rvest)
library(tidyverse)
library(ggmap)
library(stringr)
Next, we’re going to use the rvest package to scrape data from Wikipedia. The data that we are gathering is data about the largest cities in Europe. You can read more about the data on Wikipedia.
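#===========================
# SCRAPE DATA FROM WIKIPEDIA
#===========================
# NOTE: part of this listing was lost. The read_html() call (with the
# Wikipedia URL) and the index inside .[[ ]] are missing, so the "…" gaps
# below must be filled in before the code will run.
html.population <- read_html("…")

df.euro_cities <- html.population %>%
  html_nodes("table") %>%
  .[[…]] %>%
  html_table()

# inspect
df.euro_cities %>% head()
df.euro_cities %>% names()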
Here at Sharp Sight, we haven’t worked with the rvest package in too many examples, so you might not be familiar with it.
Having said that, just take a close look. How many functions did we use from rvest? Could you memorize them? How long would it take?
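To get a feel for how few rvest functions are involved, here is a minimal offline sketch. It parses a tiny hand-written HTML table instead of the live Wikipedia page; the table contents and the name df.demo are made-up placeholders, not data from the article.

```r
library(rvest)

# A tiny HTML snippet standing in for a web page (hypothetical, made-up values)
page <- read_html("
  <table>
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Alpha</td><td>100</td></tr>
    <tr><td>Beta</td><td>200</td></tr>
  </table>")

# The same three steps as a real scrape: find the tables, pick one, convert it
df.demo <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

df.demo %>% names()
df.demo %>% head()
```

On a real page, the only parts that change are the URL passed to read_html() and the index of the table you want.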
Ok. Now we’ll do a little data cleaning.
First, we’re going to remove some of the variables using dplyr::select(). We put a minus sign (‘-’) in front of the name of each variable that we want to remove.
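#============================
# REMOVE EXTRANEOUS VARIABLES
#============================
# NOTE: the select() call that drops the unwanted columns (each name
# prefixed with a minus sign) is missing from this listing.

# inspect
df.euro_cities %>% names()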
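As a rough illustration of the minus-sign technique, here is a small sketch on a toy data frame. The column names (city, image, ref) and values are hypothetical; they are not the column names from the Wikipedia table.

```r
library(dplyr)

# Toy data frame; the column names and values are hypothetical
df.demo <- data.frame(
  city  = c("Alpha", "Beta"),
  image = c(NA, NA),
  ref   = c("[1]", "[2]"),
  stringsAsFactors = FALSE
)

# A minus sign drops a column; every other column is kept
df.demo <- select(df.demo, -image, -ref)

df.demo %>% names()
```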
After removing the variables that we don’t want, we only have four variables. These remaining raw variable names could be cleaned up a little.
Ideally, we want names that are lower case (because they are easier to type). We also want variable names that are brief and descriptive.
In this case, renaming these variables to be brief, descriptive, and lower-case is fairly straightforward. Here, we will use very simple variable names: rank, city, country, and population.
To add these new variable names, we can simply assign them by using the colnames() function.
#===============
# RENAME COLUMNS
#===============
colnames(df.euro_cities) <- c("rank", "city", "country", "population")

# inspect
df.euro_cities %>% names()
df.euro_cities %>% head()
Now that we have clean variable names, we will do a little modification of the data itself.
When we scraped the data from Wikipedia, some extraneous characters appeared in the population variable. Essentially, there were some leading digits and special characters that appear to be useless artifacts of the scraping process. We want to remove these extraneous characters and parse the population data into a proper numeric.
To do this, we will use a few functions from the stringr package.
First, we use str_extract() to extract the population data. When we do this, we are extracting everything from the ‘♠’ character to the end of the string (note: to do this, we are using a regular expression in str_extract()).
This is a quick way to get the numbers at the end of the string, but we actually don’t want to keep the ‘♠’ character. So, after we extract the population numbers (along with the ‘♠’), we then strip off the ‘♠’ character by using str_replace().
#========================================================================
# CLEAN UP VARIABLE: population
# - when the data are scraped, there are some extraneous characters
#   in the "population" variable.
#   ... you can see leading numbers and some other items
# - We will use stringr functions to extract the actual population data
#   (and remove the stuff we don't want)
# - We are executing this transformation inside dplyr::mutate() to
#   modify the variable inside the dataframe
#========================================================================
df.euro_cities <- df.euro_cities %>%
  mutate(population = str_extract(population, "♠.*$") %>%
           str_replace("♠", "") %>%
           parse_number())

# inspect
df.euro_cities %>% head()
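To see what each step of that pipeline does, here is a small sketch on a single value. The raw string is a hypothetical stand-in for a scraped population cell (some leading digits, the spade character, then the number); the names raw.population and clean.population are also just for illustration.

```r
library(stringr)
library(readr)

# Hypothetical raw value: leading digits, the spade character, then the number.
# "\u2660" is the Unicode escape for the spade character.
raw.population <- "0001\u26601,234,567"

clean.population <- raw.population %>%
  str_extract("\u2660.*$") %>%   # keep the spade and everything after it
  str_replace("\u2660", "") %>%  # drop the spade itself
  parse_number()                 # turn "1,234,567" into a number

clean.population
```

parse_number() handles the comma separators, so no extra step is needed to strip them.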
We will also do some quick data wrangling on the city names. Two of the city names on the Wikipedia page (Istanbul and Moscow) had footnotes. Because of this, those two city names had extra bracket characters when we read them in (e.g. “Istanbul[a]”).
We want to strip off those footnotes. To do this we will once again use str_replace() to strip away the information that we don’t want.
#==========================================================================
# REMOVE "notes" FROM CITY NAMES
# - two cities had extra characters for footnotes
#   ... we will remove these using stringr::str_replace and dplyr::mutate()
# - the brackets are escaped so that they match literal "[" and "]"
#==========================================================================
df.euro_cities <- df.euro_cities %>%
  mutate(city = str_replace(city, "\\[.\\]", ""))

# inspect
df.euro_cities %>% head()
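The escaping of the brackets matters here, so a short sketch may help. It assumes footnote markers of the form "[a]" as described above; the "[b]" on Moscow and the vector names are assumptions for illustration.

```r
library(stringr)

# "Istanbul[a]" follows the example in the text; "[b]" is an assumed marker
city.raw <- c("Istanbul[a]", "Moscow[b]", "London")

# Escaped brackets match a literal "[", any one character, then a literal "]"
city.clean <- str_replace(city.raw, "\\[.\\]", "")
city.clean

# Unescaped, "[.]" is a character class that matches only a literal dot,
# so the footnote markers would be left untouched
city.unescaped <- str_replace(city.raw, "[.]", "")
city.unescaped
```

Names without a footnote, like "London" here, pass through unchanged.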
To make the data a little easier to explain, we’re going to filter the data to records where the population is at least 1,000,000.
Keep in mind: this is a straightforward use of dplyr::filter(); this is the sort of thing that you should be able to do with your eyes closed.
#=========================
# REMOVE CITIES UNDER 1 MM
#=========================
df.euro_cities <- filter(df.euro_cities, population >= 1000000)

#=================
# COERCE TO TIBBLE
#=================
df.euro_cities <- df.euro_cities %>% as_tibble()
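If you want to practice this step in isolation, here is a minimal sketch on a toy data frame. The city names and population values are made up for illustration and are not real figures.

```r
library(dplyr)

# Toy values for illustration only (not real population figures)
df.demo <- data.frame(
  city       = c("Alpha", "Beta", "Gamma"),
  population = c(2500000, 1000000, 400000),
  stringsAsFactors = FALSE
)

# Keep rows at or above the threshold, then coerce to a tibble
df.demo <- df.demo %>%
  filter(population >= 1000000) %>%
  as_tibble()

df.demo
```

Note that ">=" keeps a city with exactly 1,000,000 people, whereas ">" would drop it.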
Before we plot the cities on a map, we need to get geospatial information. That is, we need to geocode these records.
To do this, we will use the geocode() function to get the longitude and latitude.
After obtaining the geo data, we will join it back to the original data using cbind().
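#========================================================
# GEOCODE
# - here, we're just getting longitude and latitude data
#   using ggmap::geocode()
#========================================================
# NOTE: the listing is truncated here in the source; the geocode() call
# and the cbind() join described in the text are missing.
data.geo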
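Because the geocoding code is incomplete here, the following is only a sketch of the cbind() step, not the original code. It uses hand-typed, approximate coordinates in place of a live geocode() call (recent versions of ggmap generally need a Google API key registered with register_google() before geocode() will work). The names df.demo and data.geo.demo are hypothetical, and the population column is left as NA on purpose.

```r
# Stand-in for the cleaned city data (population left as NA placeholders)
df.demo <- data.frame(
  city       = c("Paris", "Berlin"),
  population = c(NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

# Stand-in for the geocoding result: approximate coordinates typed by hand,
# in the same row order as df.demo
data.geo.demo <- data.frame(
  lon = c(2.35, 13.40),
  lat = c(48.86, 52.52)
)

# cbind() pastes columns side by side; it matches rows by position, not by key
df.demo <- cbind(df.demo, data.geo.demo)

df.demo
```

Since cbind() aligns rows purely by position, it only works if the geocoding results come back in the same order as the cities. If the order could differ, a keyed join on the city name would be the safer choice.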
To map the data points, we also need a map that will sit in the background, underneath the points.
We will use the function map_data() to get a world map.
#==============
# GET WORLD MAP
#==============
map.europe <- map_data("world")
Now that the data are clean, and we have a world map, we will plot the data.
#=================================
# PLOT BASIC MAP
# - this map is "just the basics"
#=================================
ggplot() +
  geom_polygon(data = map.europe, aes(x = long, y = lat, group = group)) +
  geom_point(data = df.euro_cities, aes(x = lon, y = lat, size = population), color = "red", alpha = .3) +
  coord_cartesian(xlim = c(-9,45), ylim = c(32,70))
This first plot is a “first iteration.” In this version, we haven’t done any serious formatting. It’s just a “first pass” to make sure that the data are in the right format. If we had found anything “out of line,” we would go back to an earlier part of the analysis and modify our code to correct any problems in the data.
Based on this plot, it looks like the data are essentially correct.
Now, we just want to “polish” the visualization by changing colors, fonts, sizes, etc.
#====================================================
# PLOT 'POLISHED' MAP
# - this version is formatted and cleaned up a little
#   just to make it look more aesthetically pleasing
#====================================================

#-------------
# CREATE THEME
#-------------
# NOTE: the listing is truncated here in the source; the theme definition
# and the final ggplot() call are missing.
theme.maptheeme
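The theme code is cut off above, so what follows is not the theme from this post. It is only a rough sketch of what "polishing" a ggplot2 chart with a reusable theme object can look like. The name theme.sketch, every setting in it, and the toy points are assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical theme for illustration; not the theme from the original post
theme.sketch <- theme(
  panel.background = element_rect(fill = "#444444"),
  panel.grid       = element_blank(),
  axis.text        = element_blank(),
  axis.ticks       = element_blank(),
  axis.title       = element_blank(),
  plot.title       = element_text(size = 18, face = "bold"),
  legend.position  = "bottom"
)

# Toy points with approximate coordinates and made-up sizes
df.demo <- data.frame(
  lon        = c(2.35, 13.40),
  lat        = c(48.86, 52.52),
  population = c(2, 3)
)

# Add the theme object to a plot like any other layer
p <- ggplot(df.demo, aes(x = lon, y = lat, size = population)) +
  geom_point(color = "red", alpha = .5) +
  labs(title = "Theme sketch") +
  theme.sketch
```

Printing p renders the chart. Storing the theme in its own object keeps the plotting code short and lets you reuse the same formatting across several charts.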
Not too bad.
Keep in mind that as a reader, you get to see the finished product: the finalized visualization and the finalized code.
But as always, the process for creating a visualization like this is highly iterative. If you work on a similar project, expect to change your code dozens of times. You'll change your data-wrangling code as you work with the data and identify new items you need to change or fix. You'll also change your ggplot() visualization code multiple times as you try different colors, fonts, and settings.
If you master the basics, the hard things never seem hard
Creating this visualization is actually not terribly hard to do, but if you're somewhat new to R, it might seem rather challenging.
If you look at this and it seems difficult, then you need to understand: once you master the basics, the hard things never seem hard.
What I mean by that is that this visualization is nothing more than a careful application of a few dozen simple tools, arranged in a way to create something new.
Once you master individual tools from ggplot2, dplyr, and the rest of the Tidyverse, projects like this become very easy to execute.