Example of a Machine Learning Algorithm to Predict Spam Emails in Python

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers.]

Ham or Spam

One of the most common projects, especially for teaching purposes, is to build a model that predicts whether a message is spam or not. Our dataset, spam.csv, contains the subject lines and the target, which takes the values 0 for ham and 1 for spam.

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

spam_data = pd.read_csv('spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)
 

text	target
0	Go until jurong point, crazy.. Available only ...	0
1	Ok lar... Joking wif u oni...	0
2	Free entry in 2 a wkly comp to win FA Cup fina...	1
3	U dun say so early hor... U c already then say...	0
4	Nah I don't think he goes to usf, he lives aro...	0
5	FreeMsg Hey there darling it's been 3 week's n...	1
6	Even my brother is not like to speak with me. ...	0
7	As per your request 'Melle Melle (Oru Minnamin...	0
8	WINNER!! As a valued network customer you have...	1
9	Had your mobile 11 months or more? U R entitle...	1
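
As a quick aside, here is a minimal sketch (on a hypothetical four-row frame) of how the np.where call above maps the string labels to 0/1:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for spam.csv: 'spam' becomes 1, everything else 0.
df = pd.DataFrame({'target': ['ham', 'spam', 'ham', 'spam']})
df['target'] = np.where(df['target'] == 'spam', 1, 0)
print(df['target'].tolist())   # [0, 1, 0, 1]
```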

Split the Data into Train and Test Dataset

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)


Build the tf-idf on N-grams

Fit and transform the training data X_train using a TfidfVectorizer, ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).

vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)


Add Features

Apart from the tokens, we can add features such as the length of the subject line, the number of digits, the number of dollar signs, and the number of non-word characters (anything other than a letter, digit, or underscore). Let’s create a function for that.

def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
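
A quick sanity check of add_feature on a toy sparse matrix (hypothetical shapes, not the real tf-idf matrix): each added feature appends one column.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def add_feature(X, feature_to_add):
    """Returns sparse feature matrix with added feature(s)."""
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

X = csr_matrix(np.ones((3, 4)))          # 3 documents, 4 token features
lengths = [10, 25, 7]                    # one numeric feature per document

X_aug = add_feature(X, lengths)          # single feature -> one new column
X_aug2 = add_feature(X, [lengths, [1, 2, 3]])  # list of features -> two new columns

print(X_aug.shape, X_aug2.shape)
```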


# Train Data
add_length=X_train.str.len()
add_digits=X_train.str.count(r'\d')
add_dollars=X_train.str.count(r'\$')
add_characters=X_train.str.count(r'\W')

X_train_transformed = add_feature(X_train_vectorized , [add_length, add_digits,  add_dollars, add_characters])

# Test Data
add_length_t=X_test.str.len()
add_digits_t=X_test.str.count(r'\d')
add_dollars_t=X_test.str.count(r'\$')
add_characters_t=X_test.str.count(r'\W')


X_test_transformed = add_feature(vect.transform(X_test), [add_length_t, add_digits_t,  add_dollars_t, add_characters_t])

Train the Logistic Regression Model

We will build the Logistic Regression Model and we will report the AUC score on the test dataset:

clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000)

clf.fit(X_train_transformed, y_train)

y_predicted = clf.predict(X_test_transformed)

auc = roc_auc_score(y_test, y_predicted)
auc
0.9674528462047772
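
Note that we passed hard 0/1 predictions to roc_auc_score; ROC AUC is normally computed on scores such as predict_proba, which usually gives a more meaningful (and often higher) value. A minimal sketch on synthetic data (a stand-in for the transformed spam features, not the real ones):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data: the first column carries the signal.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000).fit(X_tr, y_tr)

auc_labels = roc_auc_score(y_te, clf.predict(X_te))              # AUC on hard labels
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # AUC on probabilities
print(auc_labels, auc_scores)
```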

Get the Most Important Features

We will show the 50 most important features leading to Ham or Spam, respectively.

feature_names = np.array(list(vect.get_feature_names_out()) + ['length', 'digits', 'dollars', 'n_char'])
sorted_coef_index = clf.coef_[0].argsort()
smallest = feature_names[sorted_coef_index[:50]]
largest = feature_names[sorted_coef_index[:-51:-1]]
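
To see how the argsort trick ranks the coefficients, here is a toy example with made-up feature names and coefficients (not the article's fitted model):

```python
import numpy as np

# Hypothetical coefficients from a fitted linear model, one per feature.
feature_names = np.array(['free', 'dad', 'winner', 'hello', 'txt'])
coefs = np.array([2.1, -1.8, 3.0, -0.4, 1.2])

order = coefs.argsort()                 # ascending: most negative (ham) first
smallest = feature_names[order[:2]]     # strongest ham indicators
largest = feature_names[order[:-3:-1]]  # strongest spam indicators, descending
print(smallest, largest)
```

The most negative coefficients pull a message toward ham, the most positive toward spam, which is how the two arrays below were produced.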
 

Features which lead to Spam:

largest
 
array(['text', 'sale', 'free', 'uk', 'content', 'tones', 'sms', 'reply',
       'order', 'won', 'ltd', 'girls', 'ringtone', 'to', 'comes',
       'darling', 'this message', 'what you', 'new', 'www', 'co uk',
       'std', 'co', 'about the', 'strong', 'txt', 'your', 'user',
       'all of', 'choose', 'service', 'wap', 'mobile', 'the new', 'with',
       'sexy', 'sunshine', 'xxx', 'this', 'hot', 'freemsg', 'ta',
       'waiting for your', 'asap', 'stop', 'll have', 'hello', 'http',
       'vodafone', 'of the'], dtype='<U31')

Features which lead to Ham:

smallest
 
array(['ì_ wan', 'for 1st', 'park', '1st', 'ah', 'wan', 'got', 'say',
       'tomorrow', 'if', 'my', 'ì_', 'call', 'opinion', 'days', 'gt',
       'its', 'lt', 'lovable', 'sorry', 'all', 'when', 'can', 'hope',
       'face', 'she', 'pls', 'lt gt', 'hav', 'he', 'smile', 'wife',
       'for my', 'trouble', 'me', 'went', 'about me', 'hey', '30', 'sir',
       'lovely', 'small', 'sun', 'silent', 'me if', 'happy', 'only',
       'them', 'my dad', 'dad'], dtype='<U31')

Discussion

We provided a practical and reproducible example of how to build a decent Ham or Spam classifier, one of the classic tasks in NLP. Our model achieved an AUC score of about 0.97 on the test dataset. We also showed how to add hand-crafted features on top of the tokens, and how to identify the features that are most indicative of a Spam email and vice versa.
