Sentiment Analysis of 5 popular romantic comedies

By Appsilon Data Science Blog

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

Background

With Valentine’s Day coming up, I was thinking about a fun analysis that I could convert into a blog post.

Inspired by the beautiful visualization “Based on a True True Story?”, I decided to do something similar and analyze the sentiment in the most popular romantic comedies.

A Google search for romantic comedies suggests a list of movies, and the top 5 are: “When Harry Met Sally”, “Love Actually”, “Pretty Woman”, “Notting Hill”, and “Sleepless in Seattle”.

How to do text analysis in R?

We can analyze the movies’ sentiment in R by loading the movie subtitles with the subtools package and then using tidytext to work with the text data.

library(subtools)
library(tidytext)
library(dplyr)
library(plotly)
library(purrr)
library(lubridate)
library(methods)

Working with movie data

I downloaded the srt subtitles for 5 comedies from Open Subtitles before the analysis.

Now let’s load them into R and have a sneak peek at what the data looks like.

romantic_comedies_titles <- c("Love Actually", "Notting Hill",
                              "Pretty Woman", "Sleepless in Seattle", "When Harry Met Sally")
subtitles_path <- "../assets/data/valentines/"

# Read each movie's .srt file and tag every subtitle line with the movie title
romantic_comedies <- romantic_comedies_titles %>% map(function(title){
  title_no_space <- gsub(" ", "_", tolower(title))
  title_file_name <- paste0(subtitles_path, title_no_space, ".srt")
  
  read.subtitles(title_file_name)$subtitles %>%
    mutate(movie_title = title)
})
head(romantic_comedies[[1]])
##   ID  Timecode.in Timecode.out
## 1  1 00:01:12.292 00:01:14.668
## 2  2 00:01:14.753 00:01:18.130
## 3  3 00:01:18.339 00:01:19.715
## 4  4 00:01:19.799 00:01:22.134
## 5  5 00:01:22.218 00:01:23.802
## 6  6 00:01:23.887 00:01:26.430
##                                                   Text   movie_title
## 1   Whenever I get gloomy with the state of the world, Love Actually
## 2 I think about the arrivals gate at Heathrow airport. Love Actually
## 3                  General opinion started to make out Love Actually
## 4         that we live in a world of hatred and greed, Love Actually
## 5                                but I don't see that. Love Actually
## 6                 Seems to me that love is everywhere. Love Actually
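
To make the file-naming convention in the loading step concrete, here is a small base R sketch of how a movie title maps to a subtitle path. The helper name `subtitle_file_name` is hypothetical and only used for illustration; the directory is the one from the code above.

```r
# Hypothetical helper (not part of the original analysis): builds the .srt
# path for a movie title the same way the loading code does.
subtitle_file_name <- function(title,
                               subtitles_path = "../assets/data/valentines/") {
  title_no_space <- gsub(" ", "_", tolower(title))  # "Love Actually" -> "love_actually"
  paste0(subtitles_path, title_no_space, ".srt")
}

subtitle_file_name("When Harry Met Sally")
```

So each downloaded file is expected to be named after the lower-cased title with underscores instead of spaces.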

Subtitles preprocessing

The next step is tokenization, chopping up the subtitles into single words. At this stage I also perform a minor cleaning task, which is removing stop words and adding information about the line and its duration.

tokenize_clean_subtitles <- function(subtitles, stop_words) {
  subtitles %>%
    unnest_tokens(word, Text) %>%                 # one row per word
    anti_join(stop_words, by = "word") %>%        # drop stop words
    left_join(subtitles %>% select(ID, Text), by = "ID") %>%  # bring back the full line text
    mutate(
      line = paste(Timecode.in, Timecode.out),
      duration = as.numeric(hms(Timecode.out) - hms(Timecode.in)))
}

data("stop_words")

head(stop_words)
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART
tokenize_romantic_comedies <- romantic_comedies %>%
  map(~tokenize_clean_subtitles(., stop_words))
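
For intuition, the preprocessing can be approximated in base R on a single subtitle line. This is only a rough sketch: `tokenize_line`, `toy_stop_words` and `timecode_to_seconds` are hypothetical names, the stop-word list is a tiny made-up sample, and `unnest_tokens()` treats punctuation and edge cases more carefully than the regular expression used here.

```r
# Rough base R approximation of tokenization: lower-case the text,
# replace everything except letters, apostrophes and spaces, then split.
tokenize_line <- function(text) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(text))
  words <- strsplit(cleaned, " +")[[1]]
  words[words != ""]
}

# Toy stop-word list, assumed for illustration only
# (the analysis itself uses tidytext's stop_words).
toy_stop_words <- c("i", "the", "at", "about", "that", "to", "me", "is")

tokens <- tokenize_line("Seems to me that love is everywhere.")
kept <- tokens[!tokens %in% toy_stop_words]
kept

# Convert an "HH:MM:SS.mmm" timecode to seconds to get a line's duration.
timecode_to_seconds <- function(timecode) {
  parts <- as.numeric(strsplit(timecode, ":", fixed = TRUE)[[1]])
  parts[1] * 3600 + parts[2] * 60 + parts[3]
}

# Duration of the first "Love Actually" line shown earlier, in seconds
line_duration <- timecode_to_seconds("00:01:14.668") -
  timecode_to_seconds("00:01:12.292")
line_duration
```

With this toy list, only "seems", "love" and "everywhere" survive the stop-word filter, and the first line lasts a little under two and a half seconds.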

After tokenizing the data I need to classify the sentiment of each word. In this analysis I simply want to know whether a word has positive or negative sentiment. The tidytext package comes with 3 lexicons, and the bing lexicon categorizes words as positive or negative. I use the bing lexicon to assign the extracted words to the desired classes; words that are not in the lexicon are treated as neutral.

bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)

assign_sentiment <- function(tokenize_subtitles, bing) {
  tokenize_subtitles %>%
    left_join(bing, by = "word") %>% 
    # words missing from the lexicon become neutral
    mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
    mutate(score = ifelse(sentiment == "positive", 1,
                          ifelse(sentiment == "negative", -1, 0)))
}

tokenize_romantic_comedies_with_sentiment <- tokenize_romantic_comedies %>%
  map(~assign_sentiment(., bing))
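
The scoring logic can be tried out on a handful of words without loading any lexicon. The sketch below uses a three-word toy lexicon that I made up for illustration (the real bing lexicon is far larger), and `score_words` is a hypothetical helper in which `match()` plays the role of the left join.

```r
# Toy lexicon, assumed for illustration only.
toy_lexicon <- data.frame(
  word = c("love", "hatred", "greed"),
  sentiment = c("positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

score_words <- function(words, lexicon) {
  # Look up each word; words missing from the lexicon give NA
  sentiment <- lexicon$sentiment[match(words, lexicon$word)]
  sentiment[is.na(sentiment)] <- "neutral"
  score <- ifelse(sentiment == "positive", 1,
                  ifelse(sentiment == "negative", -1, 0))
  data.frame(word = words, sentiment = sentiment, score = score,
             stringsAsFactors = FALSE)
}

scored <- score_words(c("love", "world", "hatred", "greed"), toy_lexicon)
scored
```

Here "world" is not in the toy lexicon, so it ends up neutral with a score of 0.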

Since I am interested in deciding the sentiment of each movie line, I need to aggregate the scores at the line level. I create a simple rule: if the overall sentiment score is >= 1 we classify the line as positive, as negative when it is <= -1, and as neutral in the other cases.

summarized_movie_sentiment <- function(tokenize_subtitles_with_sentiment) {
  tokenize_subtitles_with_sentiment %>%
    group_by(line) %>%
    # sum the word scores per line, then collapse the total to -1, 0 or 1
    summarise(sentiment_per_minute = sum(score),
              sentiment_per_minute = ifelse(sentiment_per_minute >= 1, 1,
                                            ifelse(sentiment_per_minute <= -1, -1, 0)),
              line_duration = max(duration),
              line_text = dplyr::first(Text),
              movie_title = dplyr::first(movie_title)) %>% 
    ungroup() %>%
    mutate(perc_line_duration = line_duration/sum(line_duration))
}
summarized_sentiment_romantic_comedies <- tokenize_romantic_comedies_with_sentiment %>%
  map(~summarized_movie_sentiment(.))
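
Here is a minimal base R sketch of the line-level rule on made-up word scores. The data and the helper `classify_line` are hypothetical and only illustrate the thresholds.

```r
# Toy word-level scores for three subtitle lines (values assumed for illustration).
word_scores <- data.frame(
  line  = c("A", "A", "A", "B", "B", "C"),
  score = c(1, 1, -1, -1, 0, 0)
)

# Total score per line
line_totals <- tapply(word_scores$score, word_scores$line, sum)

# The rule from the text: >= 1 is positive, <= -1 is negative, otherwise neutral.
classify_line <- function(total) {
  ifelse(total >= 1, 1, ifelse(total <= -1, -1, 0))
}

line_classes <- classify_line(line_totals)
line_classes

# Because the totals are whole numbers, the rule is equivalent to sign().
all(line_classes == sign(line_totals))
```

Line A has two positive words and one negative word, so it is classified as positive; line B is negative and line C is neutral.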

Crème de la crème – data viz

Once the data preparation and munging are done, the fun begins and I get to visualize the data. To achieve a look similar to that of “Based on a True True Story?”, I use stacked horizontal bar charts in plotly. The full bar represents the movie's running time, and each segment is a subtitle line whose width is proportional to its duration.

Hint: Hover over the chart to see the actual line and time. This only works in the original post.


plot_sentiment <- function(summarized_sentiment) {
  # Share of running time per sentiment class, in percent (order: -1, 0, 1)
  sentiment_freq <- round(
    summarized_sentiment %>%
      group_by(factor(sentiment_per_minute)) %>%
      summarize(duration = sum(perc_line_duration)) %>% .$duration * 100, 0)
  
  plot_title <- paste('', summarized_sentiment$movie_title[1], '',
                      'Positive', paste0(sentiment_freq[3], '%'),
                      'Negative', paste0(sentiment_freq[1], '%'))
  
  plot_ly(summarized_sentiment, y = ~movie_title, x = ~perc_line_duration,
          type = "bar", orientation = 'h', color = ~sentiment_per_minute,
          # "<br>" is plotly's line break in hover text
          text = ~paste("Time:", line, "<br>", "Line:", line_text),
          hoverinfo = 'text',
          colors = c("#01A8F1", "#f7f7f7", "#FA0771"),
          width = 800, height = 200) %>%
    layout(xaxis = list(title = "", showgrid = FALSE, showline = FALSE,
                        showticklabels = FALSE, zeroline = FALSE, domain = c(0, 1)),
           yaxis = list(title = "", showticklabels = FALSE),
           barmode = "stack",
           title = ~plot_title) %>%
    hide_colorbar()
}

htmltools::tagList(summarized_sentiment_romantic_comedies %>% map(~plot_sentiment(.)))
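
The percentages in the plot title are the shares of running time per sentiment class. The base R sketch below reproduces that calculation on made-up durations. Note that the plotting function picks the negative and positive shares by position, which assumes that all three classes occur in the movie; fixing the factor levels, as done here, is one possible way to keep the positions stable. All values and names in the sketch are assumptions for illustration.

```r
# Toy line-level summary (values assumed for illustration).
lines_summary <- data.frame(
  sentiment_per_minute = c(1, 0, -1, 0, 1),
  line_duration        = c(2, 5, 1, 1, 1)   # seconds
)
lines_summary$perc_line_duration <-
  lines_summary$line_duration / sum(lines_summary$line_duration)

# Fixed levels keep one slot per class, so a class that never occurs
# shows up as NA instead of shifting the other positions.
sentiment_class <- factor(lines_summary$sentiment_per_minute,
                          levels = c(-1, 0, 1))
shares <- tapply(lines_summary$perc_line_duration, sentiment_class, sum)

sentiment_share <- round(shares * 100)
sentiment_share
```

In this toy example the negative, neutral and positive lines take up 10%, 60% and 30% of the total duration.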

Next steps

Recently, I learned about the sentimentr package, which lets you analyze sentiment at the sentence level. It would be interesting to conduct the analysis that way and compare the resulting sentiment scores.

If you enjoyed this post, spread the ♥ and share it with someone who loves R as much as you do!

Read the original post at Appsilon Data Science Blog.
