As a beginning data scientist, you’ll have quite a few subject areas that you need to learn (and eventually master).
While you’ll certainly need to learn some math and statistics, math and stats are not the first things I recommend to most beginners.
Almost always, I recommend that people start with data visualization.
The reason is that data visualization is critical to almost every part of getting things done as a data scientist: reporting, analysis, and exploratory data analysis (e.g., EDA prior to machine learning). You need data visualization constantly. It’s necessary for nearly every data scientist at all levels.
Furthermore, I’ve argued that at junior levels of a data team job hierarchy, data visualization (when combined with data manipulation) is sufficient for being productive. If you’re a junior member of a data team, your core responsibilities may exclusively revolve around visualization (i.e., reporting, analysis, etc.).
Because it’s necessary (and in some cases, sufficient) for productivity, it’s a skill that you need to master early.
ggplot2 is the visualization tool I recommend
Of course, the question is, what tool should you use for data visualization?
Long-time readers of the Sharp Sight blog will know where I stand on this: I think that ggplot2 is a best-in-class data visualization tool, and arguably, the best data visualization tool.
As it turns out, a 2016 survey by O’Reilly Media also showed that ggplot2 is the most frequently used data visualization tool among employed data scientists. This is some evidence that you should learn it if you want to get a job as a data scientist.
ggplot2 teaches you how to think about visualization
But setting aside the popularity of ggplot and its usefulness as a baseline productivity tool, there’s a deep-seated reason why I am so assertive about suggesting ggplot:
ggplot teaches you how to think about visualizing your data.
It teaches you how to think about visualization, because there are two deep principles that underlie the syntax (and a third principle that sort of arises as a result of the first two).
3 critical principles of visualization
Two important data visualization principles are sort of hard-wired into the structure of ggplot2:
- mapping data to aesthetics
- building plots in layers
There’s also a third principle that sort of arises as a result of layering:
- building plots iteratively
Understanding these will sharpen your intuition about how to visualize data and how to attack particular problems for which visual tools are a good solution.
To understand these principles, how they operate, and why they’re so important, let’s look at an example.
Principle 1: Mapping data to aesthetics
Let’s say that we have a dataset:
    # LOAD PACKAGE: tidyverse
    library(tidyverse)

    # This is the data we're going to plot ...
    foo <- c(-122.419416, -121.886329, -71.05888, -74.005941, -118.243685,
             -117.161084, -0.127758, -77.036871, 116.407395, -122.332071,
             -87.629798, -79.383184, -97.743061, 121.473701, 72.877656,
             2.352222, 77.594563, -75.165222, -112.074037, 37.6173)
    bar <- c(37.77493, 37.338208, 42.360083, 40.712784, 34.052234,
             32.715738, 51.507351, 38.907192, 39.904211, 47.60621,
             41.878114, 43.653226, 30.267153, 31.230416, 19.075984,
             48.856614, 12.971599, 39.952584, 33.448377, 55.755826)
    zaz <- c(6471, 4175, 3144, 2106, 1450,
             1410, 842, 835, 758, 727,
             688, 628, 626, 510, 497,
             449, 419, 413, 325, 318)

    # CREATE DATA FRAME
    # note: tibble() replaces the deprecated data_frame(), and zaz is
    #       included so that all three variables live in the data frame
    df.dummy <- tibble(foo, bar, zaz)

    # INSPECT
    glimpse(df.dummy)
    head(df.dummy)
It has several numerical variables, so let’s make a quick scatterplot out of two of them, foo and bar.
Seemingly, there’s not much to see here, and the code to accomplish this is pretty straightforward (if you’ve learned the basic ggplot2 syntax):
    #-----------------------------------------------------------
    # LOAD GGPLOT
    # note: strictly speaking, we don't need to load this
    #       since we already loaded "tidyverse"
    #       however, this _is_ a blog post about ggplot2 after all ...
    #-----------------------------------------------------------
    library(ggplot2)

    #----------
    # PLOT DATA
    #----------
    ggplot(data = df.dummy, aes(x = foo, y = bar)) +
      geom_point()
Again, syntactically this is uncomplicated. Importantly though, underneath the syntax is a deep data visualization principle at work.
Once you get this principle, your understanding of data visualization will change forever (and you’ll become much more proficient with ggplot2).
When we create this chart, we’re actually mapping data to aesthetic attributes.
To explain what that means, let’s dissect the example a little bit.
The points in the scatterplot are “geometric objects” that we draw. In ggplot2 lingo, the points are “geoms.” More specifically, the points are “point geoms” that we can denote syntactically with geom_point().
But all “geometric objects” have “aesthetic attributes.” Aesthetic attributes are things like x-position, y-position, color, size, shape, and transparency.
When we create a data visualization in ggplot2, we’re ultimately creating a mapping between variables in our data and the aesthetic attributes of the geometric objects in our visualization.
I’m going to repeat that, because it’s very important:
When we visualize data, we are mapping between the variables in our data and the aesthetic attributes of the geometric objects that we plot.
To bring this back to our simple scatterplot example, when we create this plot, we are mapping foo to the x-position aesthetic, and we’re mapping bar to the y-position aesthetic.
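If you want to see that mapping for yourself, here’s a minimal sketch. It assumes a tiny, made-up data frame (df_small is hypothetical, not the dataset from this post) and uses ggplot2’s layer_data() helper to look at what the point layer actually receives after the mapping is applied.

```r
library(ggplot2)

# Hypothetical three-row data frame, used only for this illustration
df_small <- data.frame(foo = c(1, 2, 3), bar = c(10, 20, 15))

p <- ggplot(data = df_small, aes(x = foo, y = bar)) +
  geom_point()

# layer_data() returns the data ggplot2 computed for one layer:
# one row per point, one column per aesthetic attribute
pts <- layer_data(p, 1)

# The variables no longer appear as foo and bar; they appear as the
# aesthetics x and y, next to the other aesthetics of the point geom
names(pts)
```

The column names in pts are aesthetics rather than variable names, which is the mapping idea in concrete form: foo became x and bar became y.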
Mapping variables is a really important concept …
I know what you’re thinking:
“Yeah, I get it … ‘foo’ on the x-axis and ‘bar’ on the y-axis … I can do that in Excel.”
Not so fast.
Understand: this is a simple example, but there’s a very deep principle at work here.
Theoretically, geometric objects (i.e., the things that we draw in a plot, like points) don’t just have attributes like x-position and y-position. As I mentioned above, geometric objects have a variety of other aesthetic attributes like transparency, color, size, etc. Moreover, if we can map variables to attributes like x-position and y-position, we should be able to map variables to attributes like color and size, right?
… and this is exactly what ggplot2 allows you to do.
Mapping variables to parts of your plot is not limited to the x and y axes in ggplot2. This is where ggplot2 begins to differentiate itself.
ggplot2 allows us to manipulate that larger set of aesthetic attributes like color, size, transparency, and shape.
More importantly, it allows us to map variables to essentially any of these aesthetics.
To show you this, let’s extend our example and create a bubble chart.
Extended example: mapping a variable to size
All we need to do is map a new variable to the size aesthetic.
    #------------------------------------
    # NEW PLOT
    # - map a variable to size and replot
    #------------------------------------
    ggplot(data = df.dummy, aes(x = foo, y = bar)) +
      geom_point(aes(size = zaz))
What have we done here?
We’ve transformed the simple scatterplot into a bubble chart by mapping a new variable to the size aesthetic.
Let me say that again. We just changed a scatterplot to a bubble chart simply by mapping a new variable to the size aesthetic.
And it doesn’t end there.
As I already noted, there are other aesthetics to which you can map variables beyond x, y, and size. (Other aesthetics for point geoms would include shape and transparency.)
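As a rough sketch of that idea, the same point geom can take a variable on the color and transparency aesthetics instead of size. The data frame below is a hypothetical four-row stand-in (not the post’s dataset), and the fixed size of 4 is an arbitrary choice for legibility.

```r
library(ggplot2)

# Hypothetical data, used only for this illustration
df_small <- data.frame(
  foo = c(1, 2, 3, 4),
  bar = c(10, 20, 15, 25),
  zaz = c(5, 50, 500, 5000)
)

# Same geom, same x and y mappings; zaz now drives color and transparency
p_color <- ggplot(data = df_small, aes(x = foo, y = bar)) +
  geom_point(aes(color = zaz, alpha = zaz), size = 4)

p_color
```

Nothing about the geom changed here. Only the mapping inside aes() did.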
In some simplistic sense, that’s all we’re really doing when we visualize data.
When we create a visualization, we’re ultimately creating a mapping from variables in the data to aesthetic attributes of the geometric objects that we draw.
It’s simple, but critical: any visualization you see can be deconstructed into geom specifications and mappings from data to the aesthetic attributes of those geometric objects.
That might not sound like a big deal, but once you “get it” – once you really understand what this means – your approach to visualizing data will be changed forever. You’ll look at more complex visualizations and understand that they are easy to produce, if you know what geom to specify and how to map your variables. Nearly all visualizations become much easier to produce.
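One detail is worth a quick sketch: an aesthetic can be mapped to a variable (inside aes()) or set to a constant (outside aes()). The example assumes a small, made-up data frame, and the constants 5 and "red" are arbitrary.

```r
library(ggplot2)

# Hypothetical data, used only for this illustration
df_small <- data.frame(
  foo = c(1, 2, 3),
  bar = c(10, 20, 15),
  zaz = c(5, 50, 500)
)

# Mapped: size varies with the data, so it goes inside aes()
p_mapped <- ggplot(df_small, aes(x = foo, y = bar)) +
  geom_point(aes(size = zaz))

# Set: every point gets the same fixed value, so it goes outside aes()
p_set <- ggplot(df_small, aes(x = foo, y = bar)) +
  geom_point(size = 5, color = "red")
```

In p_mapped the point sizes differ because they come from zaz. In p_set every point is the same size and color because no variable is involved.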
Principle 2: Build plots in layers
In addition to learning to conceptualize visualizations as “mappings from data to aesthetics” there’s another principle you need to understand: building plots in layers.
The principle of layering is important because to create more advanced visualizations, you’ll often need to:
- Plot multiple datasets, or
- Plot a dataset with additional contextual information that’s contained in a second dataset, or
- Plot summaries or statistical transformations over the raw data
To see what I mean, let’s modify the bubble chart that I just showed you above.
We’re going to:
- Get some additional information
- Store it in a new data frame
- Plot it as a new layer, underneath the bubbles
    #--------------------------
    # GET ANOTHER LAYER OF DATA
    #--------------------------
    library(maps)
    df.more_data <- map_data("world")

    # PLOT
    ggplot(data = df.dummy, aes(x = foo, y = bar)) +
      geom_polygon(data = df.more_data, aes(x = long, y = lat, group = group)) +
      geom_point(aes(size = zaz), color = "red")
And this is what the new chart looks like:
Are you starting to get it?
This is just the bubble chart from earlier in the post with a new layer added. That’s. It.
We just transformed a bubble chart into a new visualization called a “dot distribution map,” which is much more insightful and much more visually interesting.
In the beginning of the post (when we created our dataset), I didn’t tell you that this is geospatial data. I didn’t tell you, because I wanted you to see that this dot distribution map is essentially the same as a bubble chart, with a new layer of contextual information plotted underneath the bubbles.
Mapping and layering allow us to create complex charts
Moreover, as we saw earlier, the bubble chart is just a modified scatterplot. It’s a scatterplot with an additional variable mapped to the size aesthetic.
So, this dot distribution map is just a bubble chart, and the bubble chart was just a scatterplot.
Ultimately, we used two of our data visualization principles – mapping and layering – in order to build this visualization from a scatterplot, to a bubble chart, to the dot distribution map that we now see:
- To create the scatterplot, we mapped foo to the x-aesthetic and mapped bar to the y-aesthetic
- To create the bubble chart, we mapped a new variable to the size-aesthetic
- To create the dot distribution map, we added a layer of polygon data under the bubbles.
Mapping and layering. That’s really the essence of it.
To become great at data visualization, you need to understand mapping variables to aesthetics and building plots in layers.
These are two critical ideas that you need to understand, both technically (in order to write ggplot code), but also conceptually. You need to start viewing data visualizations in this way. Once you do, you can begin deconstructing complex visualizations into simple, modular components.
Once you understand mapping and layering, you’ll begin to see that many “complex” visualizations are, in fact, quite simple to make (if you know how to think about putting them together).
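Layering also covers the case of plotting a summary or statistical transformation over the raw data. Here’s a minimal sketch, assuming made-up data with a roughly linear trend (df_trend is hypothetical) and a straight-line fit as the summary.

```r
library(ggplot2)

# Hypothetical data with a roughly linear trend plus noise
set.seed(1)
df_trend <- data.frame(x = 1:30)
df_trend$y <- 2 * df_trend$x + rnorm(30, sd = 5)

p_layers <- ggplot(df_trend, aes(x = x, y = y)) +
  # layer 1: the raw data as points
  geom_point() +
  # layer 2: a linear fit computed from the same data, drawn on top
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_layers
```

Both layers inherit the same x and y mappings from the ggplot() call; the second layer simply draws a transformed version of the data.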
Principle 3: Iteration
There’s actually a third principle at work here, that I haven’t mentioned yet: building plots iteratively.
This principle is only loosely related to the syntax, but it does arise as a consequence of the ggplot2 syntax.
Part of becoming a data scientist is not only learning syntax, but also learning workflow. You need to learn processes.
You won’t learn workflow directly when you learn ggplot2 syntax, but learning visualization workflow is easier when you learn ggplot2, primarily because of the “layerability” of the syntax.
Let me explain.
When we build plots in layers, we are ultimately building a plot iteratively: we layer in new information, piece by piece, or modify existing parts of the plot, piece by piece.
As an example, let’s go back to the chart that we created above.
We ultimately created a dot distribution map, but step-by-step, how did we actually build it?
We followed this basic process:
- We plotted a scatterplot by mapping variables to the x and y axes.
- We created a bubble chart by modifying the scatterplot: we mapped a new variable to the “size” aesthetic.
- We layered in polygons to show the shape of the countries underneath the points.
Ultimately, we can break down the creation of the dot distribution map into discrete steps. We built the map iteratively.
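Those steps can be sketched as code by storing the plot in an object and building on it. To keep the sketch self-contained, it uses three rows copied from the data in this post and a hypothetical rectangle (df_box) standing in for the country polygons, so it doesn’t need the maps package.

```r
library(ggplot2)

# Three rows copied from the post's data
df_small <- data.frame(
  foo = c(-122.419416, -74.005941, -0.127758),
  bar = c(37.77493, 40.712784, 51.507351),
  zaz = c(6471, 2106, 842)
)

# Hypothetical rectangle standing in for the world polygons
df_box <- data.frame(
  long  = c(-130, 10, 10, -130),
  lat   = c(30, 30, 60, 60),
  group = 1
)

# Shared starting point: the data and the x/y mappings
p_base <- ggplot(data = df_small, aes(x = foo, y = bar))

# Step 1: scatterplot
p_step1 <- p_base + geom_point()

# Step 2: bubble chart (map a new variable to size)
p_step2 <- p_base + geom_point(aes(size = zaz))

# Step 3: add a polygon layer first, so it is drawn underneath the bubbles
p_step3 <- p_base +
  geom_polygon(data = df_box, aes(x = long, y = lat, group = group),
               fill = "grey80") +
  geom_point(aes(size = zaz), color = "red")
```

Layers are drawn in the order they are added, which is why the polygon layer comes before the points in step 3.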
If we wanted to go further, we could continue to polish the map by performing additional steps:
- Add a legend title
- Modify the size scale
- Modify the colors (note that even getting the colors perfect requires a lot of iterative, trial-and-error tinkering)
If we performed these last few steps, our work could ultimately lead to a chart like this:
To a beginner, this finalized chart probably looks difficult to create. But, once you understand how to build a plot iteratively (in layers) it becomes easy.
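The polishing steps listed above might look something like the following sketch. The legend title, size range, color, and transparency are all placeholder choices of mine (not the values behind the finished chart), and the data is three rows copied from this post.

```r
library(ggplot2)

# Three rows copied from the post's data
df_small <- data.frame(
  foo = c(-122.419416, -74.005941, -0.127758),
  bar = c(37.77493, 40.712784, 51.507351),
  zaz = c(6471, 2106, 842)
)

p_polished <- ggplot(data = df_small, aes(x = foo, y = bar)) +
  # modify the colors: a fixed color and some transparency (placeholder values)
  geom_point(aes(size = zaz), color = "#CC0000", alpha = 0.6) +
  # modify the size scale: smallest bubble is size 2, largest is size 12
  scale_size_continuous(range = c(2, 12)) +
  # add a legend title (placeholder text)
  labs(size = "zaz value")

p_polished
```

Each line is one small adjustment, so you can add them one at a time, look at the result, and tweak.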
My point is that ggplot2’s syntactic layerability enables and rewards iteration.
The structure of the syntax sort of requires you to build plots in layers, and this in turn builds your intuition about iteration and data visualization workflow.
Ultimately, this knowledge about workflow is language-agnostic and transferable if you move to another tool.
To learn how to think about visualization, learn ggplot2
This is why I think that ggplot2 is truly a best-in-class data visualization tool, and the best tool to learn if you’re a beginner:
- ggplot2 makes complex visualizations relatively easy, by allowing you to break down complex visualizations into simple mappings and layers
- ggplot2 enables, and in some sense encourages, iterative creation
- ggplot2 trains you how to think about visualization (i.e., it trains you to think about visualizations as mappings and layers, and encourages you to work iteratively)
ggplot2 is an excellent tool for getting things done as a real world data scientist, but it also trains your mind how to think about visualizing data.
Now, I will admit that ggplot2 has a bit of a learning curve when you first get started, but once you “get it,” data visualization becomes much easier.
So by learning ggplot2, you are not just learning a toolkit. You also learn deep principles underlying data visualization. Once you learn these principles, your approach to visualizing data will change. Your ability to analyze data and create sophisticated visualizations will improve dramatically.
In turn, by mastering visualization – a core, necessary skill – you’ll become a better data scientist. You’ll be better at getting things done. And when you want to move on to higher-level skills like advanced visualization or machine learning, you’ll have the foundation you need.
The post The best R package for learning to “think about visualization” appeared first on SHARP SIGHT LABS.