Naive Bayes Classification in NLP tasks from Scratch

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers.]

In this tutorial, we compute a Bayesian score for each word as well as for each whole Subject Line (SL). The score indicates how likely a token or a Subject Line is to be “spam”. You can find the dataset here. We used the same dataset in the Email Spam Detector Tutorial, so feel free to compare the Bayesian approach with Logistic Regression.

Load the libraries

import pandas as pd
import numpy as np
import re
from collections import Counter
import string

Theory and Formulas

So how do you train a Naive Bayes classifier?

  • The first part of training a Naive Bayes classifier is to identify the number of classes that you have.
  • You then compute a probability for each class. \(P(D_{pos})\) is the probability that the document is positive. \(P(D_{neg})\) is the probability that the document is negative.
    Use the following formulas and store the values in a dictionary:

\(P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}\)

\(P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}\)

Where \(D\) is the total number of documents, or Subject Lines (SLs) in this case, \(D_{pos}\) is the total number of positive SLs and \(D_{neg}\) is the total number of negative SLs.
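As a minimal sketch of formulas (1) and (2), the snippet below computes the two class probabilities from a made-up list of labels and stores them in a dictionary. The `toy_labels` list and the `class_probs` name are illustrative assumptions and are not part of the dataset.

```python
from collections import Counter

# Hypothetical labels for six subject lines: 1 = spam (positive), 0 = ham (negative)
toy_labels = [1, 0, 0, 1, 0, 0]

doc_counts = Counter(toy_labels)   # number of documents per class
D_total = len(toy_labels)          # total number of documents, D

# P(D_pos) and P(D_neg) as in formulas (1) and (2)
class_probs = {
    "pos": doc_counts[1] / D_total,
    "neg": doc_counts[0] / D_total,
}
print(class_probs)
```

The two values always sum to 1, because every document belongs to exactly one of the two classes.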

Prior and Logprior

The prior probability represents the underlying probability in the target population that an SL is positive versus negative. In other words, if we had no specific information and blindly picked an SL out of the population set, what is the probability that it will be positive versus that it will be negative? That is the “prior”.

The prior is the ratio of the probabilities \(\frac{P(D_{pos})}{P(D_{neg})}\).
We can take the log of the prior to rescale it, and we’ll call this the logprior:

\(\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)\).

Note that \(\log(\frac{A}{B})\) is the same as \(\log(A) - \log(B)\). So the logprior can also be calculated as the difference between two logs:

\(\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}\)
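To illustrate formula (3), here is a small sketch with made-up document counts (the `toy_` names are assumptions for the example only). It checks that the log of the probability ratio and the difference of the log counts give the same number, since the total \(D\) cancels out.

```python
import math

# Hypothetical document counts
toy_D_pos, toy_D_neg = 2, 4
toy_D = toy_D_pos + toy_D_neg

# log of the ratio of class probabilities
logprior_from_probs = math.log((toy_D_pos / toy_D) / (toy_D_neg / toy_D))

# difference of the logs of the raw counts, formula (3)
logprior_from_counts = math.log(toy_D_pos) - math.log(toy_D_neg)

print(logprior_from_probs, logprior_from_counts)
```

With fewer positive than negative documents the logprior is negative, so an SL starts out leaning towards the negative class before any word is considered.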

Positive and Negative Probability of a Word

To compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:

  • \(freq_{pos}\) and \(freq_{neg}\) are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
  • \(N_{pos}\) and \(N_{neg}\) are the total number of positive and negative words for all documents (for all SLs), respectively.
  • \(V\) is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We’ll use these to compute the positive and negative probability for a specific word using this formula:

\(P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} \)
\(P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} \)

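Notice that we add 1 in the numerator and \(V\) in the denominator for additive (Laplace) smoothing, so a word that never appears in one class still gets a small non-zero probability. This wiki article explains more about additive smoothing.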
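The effect of the smoothing in formulas (4) and (5) can be sketched with made-up counts. The helper `smoothed_prob` and all the `toy_` values below are hypothetical and only serve the illustration.

```python
def smoothed_prob(freq, n_class, vocab_size):
    """Additive smoothing: (freq + 1) / (N_class + V), as in formulas (4) and (5)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical counts
toy_V = 8                      # unique words across both classes
toy_N_pos, toy_N_neg = 10, 20  # total words in each class
toy_freq_pos = 3               # the word appears 3 times in positive SLs
toy_freq_neg = 0               # and never in negative SLs

toy_p_w_pos = smoothed_prob(toy_freq_pos, toy_N_pos, toy_V)  # (3 + 1) / (10 + 8)
toy_p_w_neg = smoothed_prob(toy_freq_neg, toy_N_neg, toy_V)  # (0 + 1) / (20 + 8)
print(toy_p_w_pos, toy_p_w_neg)
```

Without the smoothing, the negative probability would be exactly zero and the log of the ratio in the next step would be undefined.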

Log likelihood

To compute the loglikelihood of that very same word, we can use the following equation:

\(\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}\)

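The overall score of each SL is:

\(p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i\)

where we add the logprior to the sum of the loglikelihoods of the \(N\) words in the SL. Because every term is a log of a positive-to-negative ratio, \(p\) is a log-odds score rather than a probability: a positive value points to the positive class and a negative value to the negative class.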
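A minimal sketch of this sum, using invented loglikelihood values (the dictionary, the `toy_score` helper and the tokens are assumptions made for the example):

```python
import math

# Hypothetical values: a logprior and a per-word loglikelihood table
toy_logprior = math.log(2 / 4)
toy_loglikelihoods = {"free": 1.2, "meeting": -0.8, "now": 0.3}

def toy_score(tokens):
    """logprior plus the loglikelihood of every token; unknown tokens add 0."""
    return toy_logprior + sum(toy_loglikelihoods.get(t, 0.0) for t in tokens)

s_spammy = toy_score(["free", "now", "unknown"])
s_hammy = toy_score(["meeting"])

# Positive score -> class 1, otherwise class 0
print(s_spammy, int(s_spammy > 0))
print(s_hammy, int(s_hammy > 0))
```

Words that are missing from the table contribute nothing, which mirrors treating out-of-vocabulary words as neutral.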

Coding

Let’s get our hands dirty by building the formulas above.

# load the data and encode spam=1 and ham=0
# convert the text to lower case and remove the punctuation
df = pd.read_csv("spam.csv")
df['target'] = df.target.map({'spam':1, 'ham':0})
df['text'] = df.text.apply(lambda x:x.lower())
df['text'] = df.text.apply(lambda x:x.translate(str.maketrans('', '', string.punctuation)))
df
 
# Get the V Freq
V_freq = Counter(" ".join(df['text'].values.tolist()).split(" "))

# Get the V
V = len(V_freq.keys())   

# get the freq_pos
freq_pos = Counter(" ".join(df.loc[df.target==1]['text'].values.tolist()).split(" ")) 

# get the freq_neg
freq_neg = Counter(" ".join(df.loc[df.target==0]['text'].values.tolist()).split(" ")) 

# get the number of positive and negative documents
D_pos = sum(df.target==1)
D_neg = sum(df.target==0)

# get the total number of words (tokens) in the positive and negative classes
N_pos = sum(freq_pos.values())
N_neg = sum(freq_neg.values())

logprior = np.log(D_pos/D_neg)
 

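def word_loglikelihood(w):
    # log of P(W_pos) / P(W_neg), formulas (4)-(6); unseen words contribute 0
    w = w.lower()
    if w in V_freq:
        # additive smoothing: (freq + 1) / (N + V)
        p_w_pos = (freq_pos.get(w, 0) + 1) / (N_pos + V)
        p_w_neg = (freq_neg.get(w, 0) + 1) / (N_neg + V)
        return np.log(p_w_pos / p_w_neg)
    else:
        return 0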
 

Let’s see the score of some words, like “lovable” and “free”.

[Figure: output of word_loglikelihood for “lovable” and “free”]

As we can see, the word “free” has a high score (>0), which means that this word is more related to spam emails. In contrast, the word “lovable” has a very low score (<0), which means that this word is not related to spam emails.

Let’s create a function that returns the score of the whole subject line by adding up the loglikelihood of each word plus the logprior.

def text_loglikelihood(mytxt):
    # score of a subject line: logprior plus the loglikelihood of each word
    mytxt = mytxt.lower().split(" ")
    score = logprior
    for w in mytxt:
        score += word_loglikelihood(w)
        # print(w, word_loglikelihood(w))
    return score

Get the score of the first SL from our data frame:

text_loglikelihood(df.iloc[0]['text'])

In our run, this returned:

-107.49288547485799

which, being negative, implies that this SL is more likely to be ham.

Make Predictions

Let’s say that we want to make predictions for all the SLs. We will add two columns: the score, and the predicted label, which is 1 when the score is positive and 0 otherwise.

df['score'] = df.text.apply(lambda x:text_loglikelihood(x))
df['prediction'] = df.score.apply(lambda x:int(x>0))

# confusion matrix
df.groupby(['target','prediction']).size().reset_index()
[Figure: confusion matrix of target versus prediction]

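Finally, the accuracy on the training dataset in our run was 99.5%. Since it is measured on the same SLs that were used to count the word frequencies, it is an optimistic estimate of how the classifier would do on unseen data.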

np.mean(df.target==df.prediction)
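For readers who want to see the shape of these two outputs without the dataset, here is a small sketch on a made-up data frame. The `toy` frame and its values are invented for illustration, and `pd.crosstab` is used as an alternative to the `groupby` above to lay the counts out as a grid.

```python
import pandas as pd

# Hypothetical targets and predictions for five subject lines
toy = pd.DataFrame({
    "target":     [1, 1, 0, 0, 0],
    "prediction": [1, 0, 0, 0, 1],
})

# Confusion matrix: rows are the true labels, columns the predicted labels
cm = pd.crosstab(toy["target"], toy["prediction"])
print(cm)

# Accuracy: share of rows where the prediction matches the target
accuracy = (toy["target"] == toy["prediction"]).mean()
print(accuracy)
```

The diagonal of the grid holds the correct predictions, so the accuracy equals the diagonal sum divided by the number of rows.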
