XML parsing made easy: is that podcast getting longer?

By nsaunders

(This article was first published on R – What You’re Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Sometime in 2009, I began listening to a science podcast titled This Week in Virology, or TWiV for short. I thought it was pretty good and listened regularly up until sometime in 2016, when it seemed that most episodes were approaching two hours in duration. I listen to several podcasts when commuting to/from work, which takes up about 10 hours of my week, so I found it hard to justify two hours for one podcast, no matter how good.

Were the episodes really getting longer over time? Let’s find out using R.

One thing I’ve learned as a data scientist: management want to see the key points first. So here it is:

Technical people want to see how we got there.

It turns out that the podcast has an RSS feed (in XML format), containing detailed information about every episode to date. Once there was the XML package, which was OK, but now we have rvest and xml2, which makes life much easier.

Full code and a markdown report can be found at Github, so don’t copy and paste from here. Here are the highlights. First, as data for each episode is found between item tags, it can be extracted from the feed like so:

feed_items %

Podcast date is stored in pubDate and duration in itunes:duration, so we can extract the text between those tags:

feed_items %>% 
  xml_nodes("pubDate") %>% 

feed_items %>% 
  xml_nodes("itunes:duration") %>% 

Durations less than 1 hour are represented as minutes:seconds, e.g. “55:42”, as opposed to “02:01:33”. So we can count the colons, prepend “00:” (zero hours) if there is only one, convert to hours, minutes and seconds using lubridate and then convert to numeric seconds. We can also convert the date/time string to class datetime.

# assume myDF is a data frame with columns named pubDate, duration
myDF %>%
  mutate(pubDate = as.POSIXct(strptime(pubDate, "%a, %d %b %Y %H:%M:%S +0000")),
         duration = ifelse(grepl(":d+:", duration), duration, paste0("00:", duration)),
         duration_seconds = as.numeric(hms(duration)))

We have dates, we have durations (bar one for episode 124 – not a big deal), we’re good to go.

I tried a scatter plot. I tried summarising mean duration by year. However, in the end I finally found a use for…the “joy plot” – or as it seems to have been renamed, the ridgeline plot. I still have mixed feelings about it: isn’t it just a density plot with facets? Should it obscure data by default? On the whole though, I think it has some merit for examining distributions over time.

And so back to the chart at the start, showing TWiV episode duration over time. We can see that the average episode in 2017 is around twice the duration of those in 2008 (close to 100 minutes versus 50). The duration distribution has shifted to the right over time but has come back to the left a little in 2017, from a maximum peak in 2016. So perhaps I’ll be able to listen again soon.

If only TWiV were more like TWiM – This Week in Microbiology.

TWiM rarely strays beyond 90 minutes and this year, forms a nice round peak right about the one hour mark. In my opinion, a sensible maximum for a podcast.

Filed under: R, statistics Tagged: parsing, podcast, rss, twiv, xml

To leave a comment for the author, please follow the link and comment on their blog: R – What You’re Doing Is Rather Desperate.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.