Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R

By Abdul Majed Raja

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Sentiment Analysis is one of the most obvious things Data Analysts with unlabelled Text data (with no score or no rating) end up doing in an attempt to extract some insights out of it and the same Sentiment analysis is also one of the potential research areas for any NLP (Natural Language Processing) enthusiasts.

For an analyst, the same sentiment analysis is a pain in the neck because most of the primitive packages/libraries handling sentiment analysis perform a simple dictionary lookup and calculate a final composite score based on the number of occurrences of positive and negative words. But that often ends up in a lot of false positives, with a very obvious case being ‘happy’ vs ‘not happy’ – Negations, in general Valence Shifters.

Consider this sentence: ‘I am not very happy’. Any Primitive Sentiment Analysis Algorithm would just flag this sentence positive because of the word ‘happy’ that apparently would appear in the positive dictionary. But reading this sentence we know this is not a positive sentence.

While we could build our own way to handle these negations, there are couple of new R-packages that could do this with ease. One such package is sentimentr developed by Tyler Rinker.

Installing the package

sentimentr can be installed from CRAN or the development version can be installed from github.


Why sentimentr?

The author of the package himself explaining what does sentimentr do that other packages don’t and why does it matter?

“sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup. The next questions address why it matters.”

Sentiment Scoring:

sentimentr offers sentiment analysis with two functions: 1. sentiment_by() 2. sentiment()

Aggregated (Averaged) Sentiment Score for a given text with sentiment_by

sentiment_by('I am not very happy', by = NULL)

   element_id sentence_id word_count   sentiment
1:          1           1          5 -0.06708204

But this might not help much when we have multiple sentences with different polarity, hence sentence-level scoring with sentiment would help here.

sentiment('I am not very happy. He is very happy')

   element_id sentence_id word_count   sentiment
1:          1           1          5 -0.06708204
2:          1           2          4  0.67500000

Both the functions return a dataframe with four columns:

1. element_id – ID / Serial Number of the given text
2. sentence_id – ID / Serial Number of the sentence and this is equal to element_id in case of sentiment_by
3. word_count – Number of words in the given sentence
4. sentiment – Sentiment Score of the given sentence

Extract Sentiment Keywords

The extract_sentiment_terms() function helps us extract the keywords – both positive and negative that was part of the sentiment score calculation. sentimentr also supports pipe operator %>% which makes it easier to write multiple lines of code with less assignment and also cleaner code.

'My life has become terrible since I met you and lost money' %>% extract_sentiment_terms()
   element_id sentence_id      negative positive
1:          1           1 terrible,lost    money

Sentiment Highlighting:

And finally, the highight() function coupled with sentiment_by() that gives a html output with parts of sentences nicely highlighted with green and red color to show its polarity. Trust me, This might seem trivial but it really helps while making Presentations to share the results, discuss False positives and to identify the room for improvements in the accuracy.

'My life has become terrible since I met you and lost money. But I still have got a little hope left in me' %>% 
  sentiment_by(by = NULL) %>%

Output Screenshot:

Try using sentimentr for your sentiment analysis and text analytics project and do share your feedback in comments. Complete code used here is available on my github.

    Related Post

    1. Creating Reporting Template with Glue in R
    2. Predict Employee Turnover With Python
    3. Making a Shiny dashboard using ‘highcharter’ – Analyzing Inflation Rates
    4. Time Series Analysis in R Part 2: Time Series Transformations
    5. Time Series Analysis in R Part 1: The Time Series Object
    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News