How to Run Sentiment Analysis in Python using VADER

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers].

Words Sentiment Score

We have explained how to get a sentiment score for words in Python. Instead of building our own lexicon, we can use a ready-made one such as VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner and is specifically attuned to sentiments expressed in social media.

You can install the VADER library with pip (pip install vaderSentiment), or you can use the version that ships with NLTK. The VADER documentation has more details.

Examples of Sentiment Scores

The VADER library returns four values:

  • pos: the proportion of the text that is rated positive
  • neu: the proportion of the text that is rated neutral
  • neg: the proportion of the text that is rated negative
  • compound: the sum of all lexicon ratings, normalized to take values from -1 (most negative) to 1 (most positive)
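To make the normalization of the compound score concrete, here is a minimal sketch of the final squashing step only. It assumes the formula used in the VADER reference implementation, x / sqrt(x² + alpha) with alpha = 15; the function name and the sample sums are illustrative, and the real score also depends on the rule-based adjustments VADER applies before summing.

```python
import math

def normalize_compound(raw_sum, alpha=15.0):
    """Squash an unbounded sum of valence scores into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, score))

# Illustrative raw sums, not taken from real tweets.
for raw_sum in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw_sum, round(normalize_compound(raw_sum), 4))
```

The curve is steep near zero and flattens towards the extremes, so each additional sentiment-laden word moves the compound score less than the one before it.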

Notice that the pos, neu and neg scores add up to 1. The compound score is the most useful metric when we want a single measure of sentiment. Typical threshold values are the following:

  • positive: compound score >= 0.05
  • neutral: compound score strictly between -0.05 and 0.05
  • negative: compound score <= -0.05
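These thresholds translate directly into a small helper. This is only a sketch: the function name label_from_compound and its threshold parameter are our own choices, not part of the VADER API, and the two non-zero inputs are the compound scores reported for the example tweets in this article.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'positive', 'neutral' or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.8476))   # positive
print(label_from_compound(-0.6597))  # negative
print(label_from_compound(0.0))      # neutral
```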

Let’s see these features in practice. We will work with a sample of tweets obtained from NLTK.

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import twitter_samples 

nltk.download('twitter_samples')
nltk.download('vader_lexicon')

# get 5000 positive and 5000 negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

analyzer = SentimentIntensityAnalyzer()
 
 

Let’s get an arbitrary positive tweet and then a negative one.

# positive
all_positive_tweets[100]
 

Let’s have a look at the tweet.

"@metalgear_jp @Kojima_Hideo I want you're T-shirts ! They are so cool ! :D"

Let’s get its sentiment score:

analyzer.polarity_scores(all_positive_tweets[100])
 

The output is 56.8% positive and 43.2% neutral, with a compound score of 0.8476.

{'neg': 0.0, 'neu': 0.432, 'pos': 0.568, 'compound': 0.8476}

Let’s do the same for a negative tweet.

all_negative_tweets[20]

which is:

'I feel lonely someone talk to me guys and girls :(\n\[email protected] @imarieuda @EiroZPegasus @AMYSQUEE @UdotV'

Let’s get the sentiment score:

analyzer.polarity_scores(all_negative_tweets[20])
 

The output is 70.7% neutral and 29.3% negative, with a compound score of -0.6597.

{'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.6597}

Important Note about Sentiment Scores

In most NLP tasks we need to apply data cleansing first. In my opinion, this should be avoided when we run sentiment analysis, because cleansing removes signals that VADER relies on. Notice the following about VADER:

  • It is case sensitive. The sentence This is great has a different score than the sentence This is GREAT.
  • Punctuation matters. Exclamation marks, for example, increase the intensity of the sentiment.
  • Emoticons also have a score, and often a very strong one. Try <3, :), :p and :(
  • Words after @ and # have a neutral score.
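A quick way to check these points yourself is to score several variants of the same sentence side by side. The helper below is a sketch: compare_variants is a name we made up, and it only assumes an object with a polarity_scores method, such as the analyzer created earlier. No scores are shown here because they depend on the lexicon version you have installed.

```python
def compare_variants(analyzer, variants):
    """Return {text: compound score} for each variant, in the given order."""
    return {text: analyzer.polarity_scores(text)["compound"] for text in variants}

# Variants that differ only in case, punctuation or an emoticon.
variants = [
    "This is great",
    "This is GREAT",
    "This is great!!!",
    "This is great :)",
]
```

Calling compare_variants(analyzer, variants) returns one compound score per variant, so lowercasing the text or stripping punctuation before scoring would make several of these variants indistinguishable.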

Get the Sentiment Score of Thousands of Tweets

We will now show how to run sentiment analysis on many tweets at once, using the 10K sample of tweets obtained from NLTK. We start by creating a pandas data frame with two columns: tweets, and my_labels, which takes the value 0 (negative) or 1 (positive).

# label positive tweets as 1 and negative tweets as 0
my_labels = [1]*len(all_positive_tweets) + [0]*len(all_negative_tweets)

# build a new list so that all_positive_tweets is left unchanged
all_tweets = all_positive_tweets + all_negative_tweets

df = pd.DataFrame({'tweets' : all_tweets, 
                   'my_labels' : my_labels})

df 
 

Now we will add four new columns, neg, neu, pos and compound, one for each VADER score.

# score each tweet once and expand the score dicts into four columns
scores = df['tweets'].apply(analyzer.polarity_scores).tolist()
df = df.join(pd.DataFrame(scores, index=df.index))
df
 

Analyze the Sentiment Score

Now that we have tidied the data and gathered the required information in a structured format, we can apply any kind of analysis. For example, let’s have a look at the compound score for the positive and negative labels.

df.groupby('my_labels')['compound'].describe()
 

Let’s also have a look at the boxplot.

df.boxplot(by='my_labels', column='compound', figsize=(12,8))
 

Discussion

The results suggest that VADER is a reliable tool for sentiment analysis, especially on social media comments. As we can see from the box plot above, the tweets labelled positive received a much higher compound score, with the majority above 0.5. By contrast, the tweets labelled negative received a low compound score, with the majority lying below 0.
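One way to put a number on this impression is to turn the compound score into a 0/1 prediction and compare it with the labels. The sketch below is not part of the original analysis: the function name is ours, and cutting at 0 instead of using the ±0.05 neutral band is an assumption we make because the labels are binary. The toy values are invented for illustration and are not results from the tweet sample.

```python
def accuracy_at_threshold(compounds, labels, threshold=0.0):
    """Share of items whose thresholded compound score matches the 0/1 label."""
    pairs = list(zip(compounds, labels))
    if not pairs:
        raise ValueError("no scores to evaluate")
    # Predict 1 (positive) when the score is above the threshold, else 0.
    hits = sum(int(score > threshold) == label for score, label in pairs)
    return hits / len(pairs)

# Toy values for illustration only, not results from the tweet sample.
toy_compounds = [0.85, 0.40, -0.10, -0.66, 0.00, 0.30]
toy_labels = [1, 1, 1, 0, 0, 0]
print(accuracy_at_threshold(toy_compounds, toy_labels))
```

With the data frame built above, the same function can be called as accuracy_at_threshold(df['compound'], df['my_labels']).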
