Teasing Out Top Daily Topics with GDELT’s Television Explorer

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Earlier this year, the GDELT Project released their Television Explorer that enabled API access to closed-caption tedt from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables” which summarise what the “top” topics and phrases are across news stations every fifteen minutes. You should read that (long-ish) intro as there are many caveats to the data source and I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).

The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability work with this new “top topics” resource directly from R.

There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).

To start, we’ll need the data!

We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT which is how the file timestamps are stored by GDELT). The function takes care of generating the proper 15-minute sequences.

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")

from  2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics   [ [ [ [

The glimpse view shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on the station_top_topics:

## Variables: 2
## $ Station  "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics   [

Each individual data frame has the top topics of each tracked station.

To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:

select(trends, ts, station_top_topics) %>% 
  unnest() %>% 
  unnest() %>% 
  mutate(day = as.Date(ts)) %>% 
  rename(station=Station, topic=Topics) %>% 
  count(day, topic) %>% 
  group_by(day) %>% 
  top_n(25) %>% 
  slice(1:25) %>% 
  arrange(day, desc(n)) %>% 
  mutate(rnk = 25:1) -> top_25_trends

## Observations: 75
## Variables: 4
## $ day    2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic  "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n      546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk    25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...

Now, it’s just a matter of some ggplotting:

ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
    x=NULL, y=NULL, 
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05)) +

Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.