How To Get A Sentiment Score For Words with Naive Bayes

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

This post shows how to get a sentiment score for individual words in Python, based on the ratio of their frequencies in positive and negative texts. We follow the Naive Bayes approach and use the manually annotated Twitter dataset that ships with NLTK. The sample is split into positive and negative tweets, with exactly 5,000 of each.

Positive and Negative Probability of a Word

We have provided an example of Naive Bayes Classification where we explain the theory. To compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:

  • \(freq_{pos}\) and \(freq_{neg}\) are the frequencies of that specific word in the positive and the negative class, respectively. In other words, the positive frequency of a word is the number of times the word appears in documents with the label 1.
  • \(N_{pos}\) and \(N_{neg}\) are the total numbers of words in all positive and all negative documents (here, all tweets), respectively.
  • \(V\) is the number of unique words in the entire set of documents, for all classes, whether positive or negative.
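
To make these quantities concrete, here is a minimal sketch that computes them on a made-up four-document corpus. The `toy_docs` data and the variable layout are illustrative assumptions, not part of the NLTK dataset.

```python
from collections import Counter

# Hypothetical toy corpus: (text, label), where 1 = positive and 0 = negative
toy_docs = [
    ("happy great day", 1),
    ("great news :)", 1),
    ("sad bad day", 0),
    ("bad news :(", 0),
]

freq_pos, freq_neg = Counter(), Counter()
for text, label in toy_docs:
    # pick the counter that matches the document's label
    target = freq_pos if label == 1 else freq_neg
    target.update(text.split())

N_pos = sum(freq_pos.values())          # total words in the positive class
N_neg = sum(freq_neg.values())          # total words in the negative class
V = len(set(freq_pos) | set(freq_neg))  # unique words across both classes

print(freq_pos["great"], freq_neg["great"])  # 2 0
print(N_pos, N_neg, V)                       # 6 6 8
```

Note that \(N_{pos}\) and \(N_{neg}\) sum all occurrences, while \(V\) counts each distinct word once.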

We’ll use these to compute the positive and negative probability for a specific word using this formula:

\(P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} \)
\(P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} \)

Notice that we add 1 to the numerator for additive smoothing, and \(V\) to the denominator so that the smoothed probabilities still sum to 1 over the vocabulary. This wiki article explains more about additive smoothing.
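
As a small worked example of the smoothing formula, the sketch below uses hypothetical counts for one class (the `counts` list and the helper name `smoothed_prob` are assumptions made for illustration). It shows that an unseen word gets a small non-zero probability and that the smoothed probabilities still sum to 1.

```python
import math

def smoothed_prob(freq, n_class, vocab_size):
    """Additive (Laplace) smoothing: (freq + 1) / (N_class + V)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical class: 8 vocabulary words, 3 of which never occur in this class
counts = [2, 1, 1, 1, 1, 0, 0, 0]
n_class = sum(counts)      # N = 6 words in total
vocab_size = len(counts)   # V = 8 unique words

probs = [smoothed_prob(c, n_class, vocab_size) for c in counts]
p_seen, p_unseen = probs[0], probs[-1]

print(p_seen)    # 3/14: a word seen twice
print(p_unseen)  # 1/14: an unseen word still gets a non-zero probability
print(math.isclose(sum(probs), 1.0))  # True
```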

Log likelihood

To compute the log likelihood of that same word, we use the following equation:

\(\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\)

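A word with a positive log likelihood carries positive sentiment, and a word with a negative log likelihood carries negative sentiment. A value of 0 means the word is equally likely in both classes. In theory, the log likelihood takes values from -inf to inf.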
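
The sign behaviour can be checked with a few hypothetical counts. This is only a sketch: the helper name `loglikelihood` and the numbers (6 words per class, a vocabulary of 8) are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def loglikelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    """log(P(W_pos) / P(W_neg)) with additive smoothing."""
    p_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_pos / p_neg)

# Hypothetical setting: N_pos = N_neg = 6 and V = 8
score_great = loglikelihood(2, 0, 6, 6, 8)  # only in positive docs: about 1.10
score_day = loglikelihood(1, 1, 6, 6, 8)    # equally frequent in both: 0.0
score_bad = loglikelihood(0, 2, 6, 6, 8)    # only in negative docs: about -1.10

print(score_great, score_day, score_bad)
```

Because of the smoothing, a word that never appears in one class still gets a finite score instead of a division by zero.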

Coding

First things first, we will load the libraries and the data.

import numpy as np
import nltk   # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
from collections import Counter
 
nltk.download('twitter_samples')

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')


# get a sample
all_positive_tweets[0:10]

A sample of positive tweets:

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of misc

Let’s continue by building the formulas above.

# Create three Counter objects to store positive, negative and total counts
V_freq = Counter()
freq_neg = Counter()
freq_pos = Counter()
 
 
# count every lower-cased, space-separated token in the positive tweets
for tweet in all_positive_tweets:
    for word in tweet.lower().split(" "):
        freq_pos[word] += 1
        V_freq[word] += 1


# do the same for the negative tweets
for tweet in all_negative_tweets:
    for word in tweet.lower().split(" "):
        freq_neg[word] += 1
        V_freq[word] += 1
        
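# Get V, the number of unique words across both classes
V = len(V_freq)


# get the total number of words in the positive and in the negative class
N_pos = sum(freq_pos.values())
N_neg = sum(freq_neg.values())


# define the loglikelihood function
# Note: the scores listed in the Results section come from a run without the
# parentheses around (freq + 1) and with N as the number of unique words,
# so this version gives different values, especially for the emoticons.
def word_loglikelihood(w):
    """Return log(P(W_pos) / P(W_neg)) with additive smoothing; 0 for unknown words."""
    w = w.lower()
    if w in V_freq:
        # the numerator must be parenthesized: (freq + 1) / (N + V)
        p_w_pos = (freq_pos.get(w, 0) + 1) / (N_pos + V)
        p_w_neg = (freq_neg.get(w, 0) + 1) / (N_neg + V)
        return np.log(p_w_pos / p_w_neg)
    return 0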

# get the sentiment score of the words that have appeared at least 100 times

wl_dict = {}

for v in V_freq.keys():
    if V_freq[v]>=100:
        wl_dict[v] = (word_loglikelihood(v))

Results

Let’s see the 20 words with the highest positive and negative sentiment respectively.

Positive Words

We will sort the wl_dict and we will get the top 20 words:

# sort the dictionary by value in descending order and keep the first 20 items
dict(sorted(wl_dict.items(), key=lambda item: item[1], reverse=True)[0:20])

And we get:

{':)': 18.63959625712883,
 ':-)': 17.004791788224438,
 ':d': 16.994987788504915,
 ':p': 15.43519992996352,
 ':))': 15.265300927223933,
 'thanks': 2.807678903079016,
 'great': 2.251290369790345,
 'thank': 2.1465800079419135,
 'happy': 2.003728796400575,
 'hi': 1.8458256608722114,
 '<3': 1.7086919541878434,
 'nice': 1.6204865995790583,
 '!': 1.6094369300047262,
 'our': 1.0426532872476826,
 'new': 1.0310186653054187,
 'an': 1.000973431018713,
 'us': 0.9842014904267523,
 'follow': 0.9831628218512771,
 'good': 0.9139890521105093,
 'your': 0.9117691277531761}

Negative Words

Similarly, we will return the top 20 negative words:

# sort the dictionary by value in ascending order and keep the first 20 items
dict(sorted(wl_dict.items(), key=lambda item: item[1])[0:20])

And we get:

{':-(': -16.72062100340213,
 ':((': -16.277236693311615,
 ':(((': -15.756702318116552,
 ':(': -8.222261541574424,
 'sad': -3.267660345532505,
 'miss': -2.573458846755138,
 'followed': -2.1355294279645554,
 'sorry': -2.1102119210977404,
 'why': -1.7176508188113777,
 'wish': -1.6211328621671777,
 "can't": -1.4271159399725901,
 'feel': -1.2425058888621603,
 'wanna': -1.2110897009657746,
 'want': -1.0801783823774231,
 'please': -1.045266117959381,
 'been': -0.9671488467945346,
 'still': -0.9373438513917545,
 'but': -0.9068343183526079,
 'too': -0.8574500795948041,
 'im': -0.8552028034988816}

The Takeaway

As we can see, we got the expected results. More specifically:

  • The emoticons (“emojis”) are the most powerful tokens. This is another reason to be careful when removing punctuation in sentiment analysis and other NLP tasks.
  • The top 5 positive tokens are smiley faces: :), :-), :D, :P and :)). Notice that we converted all letters to lower case, which is why they appear as :d and :p in the results.
  • Other positive words that we found were thanks, great, happy and nice.
  • The top 4 negative tokens are sad faces: :-(, :((, :((( and :(.
  • Other negative words that we found were sad, miss, sorry, why, wish, feel and but, among others.

We can improve the sentiment analysis by applying different tokenizers, text mining techniques and so on. If we want a sentiment score for a word but do not have annotated documents, we can work with other libraries such as VADER, as we have explained in another post.
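
As one illustration of a different tokenizer, here is a rough regex-based sketch that keeps emoticons as single tokens and separates them from adjacent words and punctuation. The pattern and the function name `tokenize` are assumptions written for this example. It is not the tokenizer used above, it will break URLs into pieces, and NLTK's own `TweetTokenizer` (in `nltk.tokenize`) is a more complete option.

```python
import re

# Hypothetical pattern, tried in order: emoticons, hearts, words, other symbols
TOKEN_RE = re.compile(
    r"""
    [:;=][-']?[()dpDP]+      # simple emoticons such as :) :-( :D :))
    | <3                     # heart
    | [@#]?\w+(?:'\w+)?      # words, @handles, #hashtags, contractions
    | [^\w\s]                # any other single symbol
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Lower-case the text and split it into words, emoticons and symbols."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("Thanks!! Can't wait :-) <3 #happy"))
# ['thanks', '!', '!', "can't", 'wait', ':-)', '<3', '#happy']
print(tokenize("so sad:((("))
# ['so', 'sad', ':(((']
```

With a plain `split(" ")`, "sad:(((" would be counted as one token, so the word and the emoticon would not contribute to their own frequencies.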
