How to Find Similar Documents using N-grams and Word Embeddings

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers.]

Introduction

A very common task in NLP is to measure the similarity between documents. The usual metric is cosine similarity, and there are two main approaches:

  • Transform the documents into a vector space by generating the Document-Term Matrix or the TF-IDF matrix. This approach is based on n-grams, usually up to bi-grams.
  • Transform the documents into a vector space by taking the average of the pre-trained word embeddings.
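
Both approaches end with the same step: comparing two vectors by the cosine of the angle between them. As a minimal sketch (the helper name `cosine_sim` and the toy vectors are illustrative and not part of the original post), the metric can be written with NumPy as follows:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy term-count vectors over a hypothetical 3-word vocabulary
doc_a = [1, 1, 0]
doc_b = [2, 2, 0]   # same direction as doc_a, only longer
doc_c = [0, 0, 3]   # shares no terms with doc_a

sim_ab = cosine_sim(doc_a, doc_b)   # close to 1: same direction
sim_ac = cosine_sim(doc_a, doc_c)   # 0: no overlap
```

Because the metric depends only on direction, a document that repeats the same words more often is still treated as similar to the shorter one.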

In this tutorial, we provide a hands-on example of how to find similar documents in a list of documents using these two approaches. We will not try to optimize the results with techniques such as stemming, lemmatization, different tokenizers, or different n-gram ranges.

For our example, we will use the Spam dataset, which contains a list of subject lines labeled as Ham or Spam. If you want to build a model that predicts whether an email is Ham or Spam, you can have a look at our tutorial.

Similar Documents using N-grams

Let’s start by loading the data:

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
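from sklearn.metrics.pairwise import cosine_similarity

# show up to 300 characters per cell so the full text is visible
pd.set_option("display.max_colwidth", 300)

# the CSV is expected to contain a 'text' column with the messages
df = pd.read_csv('spam.csv')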
df
 

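Now, let’s build the TF-IDF matrix. In this example we use unigrams only (ngram_range=(1,1)) and keep only the terms that appear in at least two documents (min_df=2):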

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), lowercase=True, min_df=2)
X = vectorizer.fit_transform(df.text)

Finally, let’s write a function that takes the content of a document, the existing documents, and the number of recommendations n, and returns the n most similar documents.

def similar_documents(text, df, n=10):
    df = df.copy()
    # transform() expects an iterable of documents, so wrap the single text in a list
    input_vect = vectorizer.transform([text])
    df['similarity'] = cosine_similarity(input_vect, X).flatten()
    return (df.sort_values(by='similarity', ascending=False)[['text', 'similarity']].head(n))


user_input = """Nah I don't think he goes to usf, he lives around here though"""

similar_documents(text=user_input, df=df, n=10)
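
The function above relies on the global `vectorizer` and `X`. As one possible alternative (a sketch only; `similar_documents_explicit` and the toy data are hypothetical names, not from the original post), the fitted vectorizer and the matrix can be passed in explicitly, which makes the function easy to try on a tiny corpus:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_documents_explicit(text, df, vectorizer, X, n=10):
    """Variant of similar_documents that receives the fitted vectorizer and matrix."""
    out = df.copy()
    # transform() expects an iterable of documents
    input_vect = vectorizer.transform([text])
    out["similarity"] = cosine_similarity(input_vect, X).flatten()
    return out.sort_values(by="similarity", ascending=False)[["text", "similarity"]].head(n)

# Hypothetical toy corpus with a 'text' column, mirroring the article's dataframe
toy_df = pd.DataFrame({"text": [
    "call me later",
    "free entry to win prize money",
    "free prize now",
]})
toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_df["text"])

result = similar_documents_explicit("free prize", toy_df, toy_vec, toy_tfidf, n=3)
```

In this toy run the short message that shares both query words ranks first, the longer message that also contains them ranks second, and the message with no shared words gets a similarity of zero.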
 


Similar Documents using Word Embeddings

Another approach is to work with word embeddings. We have provided a similar tutorial using GloVe. In this post, we will work with the spaCy library.

import spacy

# load the word embeddings
nlp = spacy.load("en_core_web_md")

# if we want to work with a 2D NumPy array, we need to unnest the column of arrays as follows
# np.stack(df.embedding.to_numpy()).shape
# np.vstack(df.embedding.to_numpy()).shape

# create a column of word embedding vectors
df['embedding'] = df['text'].apply(lambda x: nlp(x).vector)

df
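
spaCy documents `Doc.vector` as defaulting to an average of the token vectors, which is what `nlp(x).vector` returns above. The following sketch mimics that idea with a hand-made dictionary of 3-D vectors (the dictionary, the function name `average_embedding`, and the choice to skip unknown tokens are all simplifying assumptions, not spaCy's implementation):

```python
import numpy as np

# Hypothetical 3-D "embeddings"; the en_core_web_md vectors are 300-D
toy_vectors = {
    "free":  np.array([1.0, 0.0, 0.0]),
    "prize": np.array([0.0, 1.0, 0.0]),
    "now":   np.array([0.0, 0.0, 1.0]),
}

def average_embedding(text, vectors, dim=3):
    """Mean of the vectors of the known tokens; zeros if no token is known."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

doc_vec = average_embedding("Free prize", toy_vectors)        # [0.5, 0.5, 0.0]
reordered_vec = average_embedding("prize free", toy_vectors)  # same vector
```

Note that the average is the same for any ordering of the words, so this representation carries no information about word order.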
 

Let’s create a function similar to the one above, but this time based on the word embeddings.

def emb_similar_documents(text, df, n=10):
    df = df.copy()
    input_vect = nlp(text).vector
    # reshape the inputs to (1, 300), since the en_core_web_md vectors are 300-dimensional
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(input_vect.reshape(1,300), x.reshape(1,300))[0][0])
    return (df.sort_values(by='similarity', ascending=False)[['text', 'similarity']].head(n))


user_input = """I don't quite know what to do. I still can't get hold of anyone. I cud pick you up bout 7.30pm and we can see if they're in the pub?"""

emb_similar_documents(text=user_input, df=df, n=10)
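
The function above calls cosine_similarity once per row. A possible alternative, sketched here with plain NumPy on made-up 2-D vectors (`top_n_by_cosine` and `toy_matrix` are hypothetical names), is to stack all embeddings into one 2D array and score every document in a single matrix operation:

```python
import numpy as np

def top_n_by_cosine(query_vec, doc_matrix, n=10):
    """Return (indices, scores) of the n rows of doc_matrix most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    # Rows with a zero norm (e.g. a text with no known tokens) get a score of 0
    scores = np.divide(docs @ query, norms, out=np.zeros(len(docs)), where=norms > 0)
    order = np.argsort(-scores)[:n]
    return order, scores[order]

# Toy 2-D "embeddings", one row per document
toy_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
idx, sims = top_n_by_cosine([1.0, 0.0], toy_matrix, n=2)
```

With the article's dataframe, the stacked array from the commented-out line `np.vstack(df.embedding.to_numpy())` would play the role of `doc_matrix`, and the returned indices could be used with `df.iloc` to retrieve the texts.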
 

Final Thoughts

I believe that for short documents, like subject lines, the TF-IDF approach works better than word embeddings. Also, in my experience, taking the average of the word embeddings does not lead to a meaningful vector representation of the document. There are other techniques, like Doc2Vec, which we can discuss in another post. Stay tuned!
