Leave-one-out subset for your ggplot smoother

By Nathan

(This article was first published on R – Nathan Chaney, and kindly contributed to R-bloggers)

I have a dashboard at work that plots the number of contracts my office handles over time, and I wanted to add a trendline to show growth. However, trendline on the entire dataset skews low because our fiscal year just restarted. It took a little trial and error to figure out how to exclude the most recent year’s count so the trendline is more accurate for this purpose, so I thought I’d share my solution in a short post.

Here’s the code with a dummy dataset, with both the original trendline and the leave-one-out version:

x %
  arrange(desc(x)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_col() + 
  geom_smooth(method = "lm", se = F, color = "red") +
  geom_smooth(method = "lm", data = function(x) { slice(x, -1) }, se = F, fullrange = TRUE)

The magic happens by defining an anonymous function as the data argument for the geom_smooth function. I’m using the slice function to drop the first element of the data frame provided to geom_smooth.

There are two other important tips here. First, be sure and sort the data in the tidyverse pipe before passing it to ggplot — for my purpose, this was in descending date order because I wanted to drop the most recent entry. The other tip is an aesthetic one, which is to use the fullrange = TRUE argument in order to plot the trendline into the current date column to provide a rough prediction for the current date period.

NOTE: I’ve seen some commentary that the decision to omit elements as described here should be explicitly disclosed. However, I think it’s fairly apparent in this specific application that the extension of the smoother is predictive in nature, rather than a true representation of the last data element. I could probably shade or apply an alpha to the most recent data point using a similar technique to make this even more clear, but the purpose here was to demonstrate the leave-one-out technique.

What do you think of this solution?

To leave a comment for the author, please follow the link and comment on their blog: R – Nathan Chaney.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

From:: R News