GloVe Word Embeddings on Plot of the Movies

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers].


Every word can be represented as a vector in an N-dimensional space by training a machine learning model on a corpus of documents. The best-known approaches are Word2Vec, developed at Google, and GloVe, developed at Stanford University. We will work with a pre-trained GloVe model.
The idea is to represent every movie plot summary as a vector in 50-D space and to use this vector to find similar movies. Finally, we will reduce the dimensionality with the t-SNE algorithm in order to display the plot summaries in 2-D space.

We will work with the Wikipedia Movie Plots dataset, obtained from Kaggle.

The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

  • Release Year – Year in which the movie was released
  • Title – Movie title
  • Origin/Ethnicity – Origin of movie (e.g. American, Bollywood, Tamil, etc.)
  • Director – Director(s)
  • Cast – Main actors and actresses
  • Genre – Movie Genre(s)
  • Wiki Page – URL of the Wikipedia page from which the plot description was scraped
  • Plot – Long form description of movie plot (WARNING: May contain spoilers!!!)

GloVe Word Embeddings

You can find a good description of the GloVe model on the Stanford NLP project page. For this project, we will work with the 50-D GloVe model pre-trained on Wikipedia articles, which is available for download (the file used below is glove.6B.50d.txt).

Pipeline of the Analysis

We will clean the data by removing stop words, numbers and punctuation, and by converting the documents to lower case. Then we will add up the word embeddings of the words in each plot summary. Thus, every plot becomes one vector: the sum of the 50-D embeddings of its words.
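To make the "one vector per plot" idea concrete before touching the real data, here is a minimal sketch. The 3-D vectors and the helper name sum_embeddings are made up for illustration; the real pipeline uses the 50-D GloVe vectors loaded further down.

```python
import numpy as np

# Toy 3-D embeddings (made-up numbers, not real GloVe vectors)
toy_embeddings = {
    "dark":   np.array([1.0, 0.0, 2.0]),
    "knight": np.array([0.5, 1.0, 0.0]),
    "rises":  np.array([0.0, 2.0, 1.0]),
}

def sum_embeddings(text, embeddings, dim):
    """Sum the vectors of all known words; unknown words add nothing."""
    total = np.zeros(dim)
    for word in text.lower().split():
        total += embeddings.get(word, np.zeros(dim))
    return total

# "again" is not in the toy dictionary, so it contributes a zero vector
doc_vector = sum_embeddings("Dark knight rises again", toy_embeddings, dim=3)
print(doc_vector.tolist())  # [1.5, 3.0, 3.0]
```

The result has the same dimensionality as a single word vector, no matter how long the text is.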

import numpy as np
import pandas as pd
from scipy import spatial
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.corpus import stopwords
import re
import string

pd.set_option('max_colwidth', 200)

%matplotlib inline 

Load the Word Embeddings

We will build a key-value dictionary where the key is the word and the value is its word embedding.

embeddings_dict = {}
with open("glove.6B.50d.txt", 'r', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector
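The loader above relies on the plain-text GloVe layout: one word per line, followed by its vector components separated by spaces. The sketch below parses two made-up lines from an in-memory buffer (a stand-in for the real file) and checks that every word ends up with the same dimensionality. The name load_embeddings is ours, not part of any library.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: word, then its components
sample = io.StringIO("tea 0.1 -0.2 0.3\ncoffee 0.0 0.5 -0.4\n")

def load_embeddings(handle):
    """Parse 'word v1 v2 ...' lines into a dict of float32 arrays."""
    table = {}
    for line in handle:
        parts = line.split()
        table[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return table

toy = load_embeddings(sample)

# Sanity check: all vectors should share one dimensionality
dims = {vec.shape[0] for vec in toy.values()}
print(sorted(toy), dims)
```

Running the same dimensionality check on embeddings_dict is a cheap way to confirm that the file was read correctly (we would expect the single value 50).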

Let’s get the embedding of the word taste:

embeddings_dict['taste']

array([-0.25826  ,  0.20755  , -1.8089   , -0.0045707,  1.002    ,
        0.30873  ,  0.23493  ,  0.13954  , -0.16455  ,  1.1608   ,
        0.12976  ,  0.39695  ,  0.9664   , -0.22466  ,  0.20641  ,
        0.22372  , -0.043133 ,  0.26844  , -0.090198 , -1.213    ,
       -0.20942  ,  0.19578  ,  1.1608   ,  0.41065  , -0.088691 ,
       -0.77222  , -1.1368   ,  0.98575  ,  1.5428   , -0.05683  ,
        2.2402   ,  0.63528  , -0.072622 , -0.23851  ,  0.29049  ,
        0.12906  , -1.1139   ,  0.89749  ,  0.55485  , -0.77596  ,
        0.71329  ,  0.062237 , -0.13661  ,  0.19611  ,  0.58233  ,
        1.5621   ,  0.2034   ,  0.23999  , -0.032633 ,  0.28185  ],
      dtype=float32)

Load the Wikipedia Movie Plots

df = pd.read_csv('wiki_movie_plots_deduped.csv')

Text Cleaning

Do some text cleaning on the Plot column:

nltk.download('stopwords')

def process_text(text):
    """Clean a plot summary.

    Input:
        text: a string containing the raw text
    Output:
        text_clean: a string with the cleaned, space-separated words
    """
    # turn it to lower case
    text = text.lower()
    # replace apostrophe with space
    text = re.sub('\'', ' ', text)
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # remove digits
    text = re.sub(r'\d', '', text)
    # use a set for fast membership tests
    stopwords_english = set(stopwords.words('english'))

    # split on any whitespace and drop the stop words
    words = [word for word in text.split() if word not in stopwords_english]
    # joining with single spaces leaves no double, leading or trailing spaces
    text_clean = " ".join(words)

    return text_clean

Now that we have built the process_text function, we can apply it to every plot:

df['clean_text'] = df['Plot'].apply(lambda x: process_text(str(x)))
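To see what the cleaning steps do to a single sentence, here is a minimal sketch that follows the same order of operations. It uses a tiny hand-written stop word set instead of the NLTK list so that it runs without any download; the names clean and TOY_STOPWORDS are ours.

```python
import re
import string

# A tiny stand-in stop word set; the article uses NLTK's full English list
TOY_STOPWORDS = {"the", "is", "a", "who", "from"}

def clean(text, stop_words=TOY_STOPWORDS):
    text = text.lower()                                               # lower case
    text = text.replace("'", " ")                                     # apostrophes -> spaces
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d", "", text)                                    # digits
    return " ".join(w for w in text.split() if w not in stop_words)   # stop words

print(clean("The Narrator, who is 30, suffers from insomnia!"))
# narrator suffers insomnia
```

Note that punctuation is stripped before the stop word filter runs, so a contraction such as "don't" is first split into "don" and "t" and is only removed if those pieces are themselves in the stop word list.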

Build the Doc2Vec function

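The doc2vecF function returns the sum of the word embeddings of all the words in the corresponding plot (despite its name, this is a simple sum and not the Doc2Vec paragraph-vector model):

def doc2vecF(doc):
    # words that are missing from GloVe contribute a zero vector
    zero = np.zeros(50, "float32")
    vdoc = [embeddings_dict.get(x, zero) for x in doc.lower().split()]
    # an empty document maps to the zero vector
    if not vdoc:
        return zero
    return np.sum(vdoc, axis=0)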
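One design choice worth a quick check is summing versus averaging the word vectors. The sketch below uses random made-up vectors (not GloVe) to show that cosine similarity does not care which one we pick, because it ignores vector length, while the norm of a summed vector grows with the number of words. The second point may matter for any later step that is sensitive to scale, such as standardizing the features.

```python
import numpy as np

rng = np.random.default_rng(0)
words = rng.normal(size=(7, 50))   # 7 made-up word vectors for one plot
other = rng.normal(size=50)        # a made-up vector for another plot

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

summed = words.sum(axis=0)
averaged = words.mean(axis=0)

# Same direction, so the cosine similarity to any other vector is identical
print(np.isclose(cos(summed, other), cos(averaged, other)))  # True

# The lengths differ by a factor equal to the word count (7 here)
ratio = np.linalg.norm(summed) / np.linalg.norm(averaged)
print(ratio)
```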

Convert each Movie Plot to a “Document” Embedding

We will use doc2vecF to convert each plot to a 50-D vector. We will store the vectors in a list called data.

data = []
for i in df['clean_text']:
    data.append(doc2vecF(i))

embd = pd.DataFrame(data)
embd.shape
(34886, 50)

Find Similar Movies

The similarity can be defined by calculating the Cosine Similarity. Let’s build a function:

def cosine_similarity(A, B):

    dot = np.dot(A, B)
    norma = np.sqrt(np.dot(A, A))
    normb = np.sqrt(np.dot(B, B))
    cos = dot / (norma*normb)
    return cos

Let’s find the similarity of the words tea and coffee

tea = embeddings_dict['tea']
coffee = embeddings_dict['coffee']

cosine_similarity(tea, coffee)

We get a similarity of 0.807, which is quite high. We can also use the cosine_similarity function from scikit-learn (from sklearn.metrics.pairwise import cosine_similarity).
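To build some intuition for the scale of these values, here is a small self-contained sketch with hand-picked 2-D vectors (not word embeddings). Cosine similarity is the dot product divided by the product of the two norms, so it ranges from -1 to 1 and depends only on direction.

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 2], [2, 4])    # close to 1: same direction, different length
orthogonal = cosine([1, 0], [0, 1])        # 0.0: nothing in common
opposite = cosine([1, 2], [-1, -2])        # close to -1: opposite direction

print(same_direction, orthogonal, opposite)
```

A value of 0.807 for tea and coffee therefore means the two vectors point in nearly the same direction.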

Let’s get the plot of the movie Fight Club (1999) and try to find a similar movie:

The unnamed Narrator is a traveling automobile recall specialist who suffers from insomnia. When he is unsuccessful at receiving medical assistance for it, the admonishing doctor suggests he realize his relatively small amount of suffering by visiting a support group for testicular cancer victims. The group assumes that he, too, is affected like they are, and he spontaneously weeps into the nurturing arms of another man, finding a freedom from the catharsis that relieves his insomnia. He decides to participate in support groups of various kinds, always allowing the groups to assume that he suffers what they do. However, he begins to notice another impostor, Marla Singer, whose presence reminds him that he is attending these groups dishonestly, and this disturbs his bliss. The two negotiate to avoid their attending the same groups, but, before going their separate ways, Marla gives him her phone number.
On a flight home from a business trip, the Narrator meets Tyler Durden, a soap salesman with whom he begins to converse after noticing the two share the same kind of briefcase. After the flight, the Narrator returns home to find that his apartment has been destroyed by an explosion. With no one else to contact, he calls Tyler, and they meet at a bar. After a conversation about consumerism, outside the bar, Tyler chastises the Narrator for his timidity about needing a place to stay. Tyler requests that the Narrator hit him, which leads the two to engage in a fistfight. The Narrator moves into Tyler’s home, a large dilapidated house in an industrial area of their city. They have further fights outside the bar on subsequent nights, and these fights attract growing crowds of men. The fighting eventually moves to the bar’s basement where the men form a club (“Fight Club”) which routinely meets only to provide an opportunity for the men to fight recreationally.
Marla overdoses on pills and telephones the Narrator for help; he eventually ignores her, leaving his phone receiver without disconnecting. Tyler notices the phone soon after, talks to her and goes to her apartment to save her. Tyler and Marla become sexually involved. He warns the Narrator never to talk to Marla about him. More fight clubs form across the country and, under Tyler’s leadership (and without the Narrator’s knowledge), they become an anti-materialist and anti-corporate organization, Project Mayhem, with many of the former local Fight Club members moving into the dilapidated house and improving it.
The Narrator complains to Tyler about Tyler excluding him from the newer manifestation of the Fight Club organization Project Mayhem. Soon after, Tyler leaves the house without notice. When a member of Project Mayhem is killed by the police during a botched sabotage operation, the Narrator tries to shut down the project. Seeking Tyler, he follows evidence of Tyler’s national travels. In one city, a Project Mayhem member greets the Narrator as Tyler Durden. The Narrator calls Marla from his hotel room and discovers that Marla also believes him to be Tyler. Tyler suddenly appears in his hotel room, and reveals that they are dissociated personalities in the same body. When the Narrator has believed himself to be asleep, Tyler has been controlling his body and traveling to different locations.
The Narrator blacks out after the conversation, and when he awakes, he uncovers Tyler’s plans to erase debt by destroying buildings that contain credit card companies’ records. The Narrator tries to warn the police, but he finds that these officers are members of the Project. He attempts to disarm the explosives in a building, but Tyler subdues him and moves him to the uppermost floor. Held at gunpoint by Tyler, the Narrator realizes that, in sharing the same body with Tyler, he himself is actually in control holding “Tyler’s” gun. The Narrator fires it into his own mouth, shooting through the cheek without killing himself. Tyler collapses with an exit wound to the back of his head, and the Narrator stops mentally projecting him. Afterward, Project Mayhem members bring a kidnapped Marla to him, believing him to be Tyler, and leave them alone. Holding hands, the Narrator and Marla watch as the explosives detonate, collapsing many buildings around them.

# get the index of Fight Club
fight_club_index = df.loc[df['Title']=='Fight Club'].index[0]

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = cosine_similarity(embd[fight_club_index:fight_club_index+1], embd).flatten()

related_docs_indices = cosine_similarities.argsort()[:-3:-1]

# notice that the most similar movie is Fight Club itself, which is why we take the second most similar
df.iloc[related_docs_indices[1]]
Release Year                                                                                                                                                                                                           2013
Title                                                                                                                                                                                                              The East
Origin/Ethnicity                                                                                                                                                                                                    British
Director                                                                                                                                                                                           Director: Zal Batmanglij
Cast                                                                                                                     Director: Zal Batmanglij\r\nCast: Brit Marling, Alexander Skarsgård, Ellen Page, Patricia Clarkson
Genre                                                                                                                                                                                                               unknown
Wiki Page                                                                                                                                                           
Plot                \r\nJane, an operative for the private intelligence firm Hiller Brood, is assigned by her boss Sharon to infiltrate The East, an underground activist, anarchist and ecologist organization that has...
clean_text          jane operative private intelligence firm hiller brood assigned boss sharon infiltrate east underground activist anarchist ecologist organization launched several attacks corporations attempt expos...
Name: 21312, dtype: object
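The slice [:-3:-1] on the argsort result can look cryptic, so here is a minimal sketch on made-up similarity scores. argsort returns indices in ascending order of score; walking backwards from the end gives the highest scores first. The variant at the bottom, which skips the query explicitly and returns the top three, is our own illustration rather than the code used above.

```python
import numpy as np

# Made-up similarity scores of five movies against the query movie at index 2
scores = np.array([0.10, 0.85, 1.00, 0.40, 0.92])
query_index = 2

# Last two indices of the ascending order, reversed: the query itself, then the runner-up
top_two = scores.argsort()[:-3:-1]
print(top_two.tolist())  # [2, 4]

# More explicit: sort descending, drop the query, keep the k best
order = scores.argsort()[::-1]
top = [int(i) for i in order if i != query_index][:3]
print(top)  # [4, 1, 3]
```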

Let’s get the plot of The East (2013)

Jane, an operative for the private intelligence firm Hiller Brood, is assigned by her boss Sharon to infiltrate The East, an underground activist, anarchist and ecologist organization that has launched several attacks against corporations in an attempt to expose their corruption. Calling herself Sarah, she joins local drifters in hitching rides on trains, and when one drifter, Luca, helps her escape from the police, she identifies the symbol of The East hanging from Luca’s car mirror. Sarah self-inflicts an arm injury that she tells Luca was caused in the escape so he can get medical attention for her. He takes her to an abandoned house in the woods where members of The East live and one of the members, Doc, treats her cut.
Sarah is given two nights to recover before she must leave the squat. At an elaborate dinner involving straitjackets, Sarah is tested and fails, exposing how selfishly she and many others live their lives. Sarah is caught one night when spying by the deaf Eve and has a conversation in sign language with her. Sarah tells Eve that she is an undercover agent and threatens Eve with jail if she stays with the group; Eve leaves the next morning. Sarah is recruited to fill the missing member’s role on a “jam”, which is an old fashioned term for a direct action. After seeing the effectiveness of the pharmaceutical jam, compounded by her growing attraction to charismatic Benji (Alexander Skarsgård), Sarah gradually questions the moral underpinnings of her undercover duty. Sarah reluctantly participates in The East’s next jam and comes to learn that each member of The East has been personally damaged by corporate activities. For example, Doc has been poisoned by a fluoroquinolone antibiotic and his neurosystem is degenerating. The East infiltrates a fancy party for the senior executives of the pharmaceutical company responsible for Doc’s poisoning and puts a strong dose of the risky antibiotic into everyone’s champagne. The East announce this action via YouTube and over time one executive’s mind and body begin to degenerate as a side effect of the antibiotic, revealing publicly the extreme risks of the drug.
Another East member, Izzy (Ellen Page), is the daughter of a petrochemical CEO. The group uses the father/daughter connection to gain intimate access to the CEO and forces him to bathe in the waterway he has been using as a toxic dumping ground. This jam goes wrong when security guards arrive and shoot Izzy in the back as she and the others flee. Back at the squat, because of his poisoning, Doc’s hands tremble too much for him to perform surgery on Izzy. Sarah offers to do it for him and he tells her what to do. She manages to remove the bullet from Izzy’s abdomen, but Izzy dies and is buried near The East’s house.
Even though Sarah and Benji have grown closer and Sarah implores him to just disappear, he insists that they go together to complete the final jam. Sarah refuses at first but finally gives in and the two begin a long drive, during which Sarah falls asleep. When she awakens, she realizes that Benji is driving her to the Hiller Brood headquarters outside Washington, D.C. He reveals that he has always suspected her of being a Hiller Brood operative, and that Luca also thought this, but brought her in as a test. Benji wants Sarah to obtain a NOC list of Hiller Brood agents all over the world, which will be The East’s third and final jam, to “watch” them. Having successfully obtained the NOC list using her cell phone’s memory card, Sarah runs into Sharon in the hall. She confronts Sharon about the firm’s activities, thus revealing her new allegiances. Sharon has Sarah’s cellphone confiscated as she leaves the building. As Hiller Brood was sharing information about their activities with the FBI, The East’s hideout is raided and Doc is arrested. He sacrifices himself to ensure the getaway of the remaining members. Sarah tells Benji she has failed to get the NOC list. Benji reveals he means to use the list to expose publicly all the Hiller Brood agents. Since they are undercover, however, it is likely they could be killed. Sarah chooses not to go on the run with Benji. She and Benji part at a truck-stop as Benji heads out of the country. In truth, Sarah has the NOC list (because it was not on her phone, she had swallowed the memory card instead). It is clear that her time undercover with The East has changed her. The film ends with an epilogue of her personally contacting her former coworkers (those undercover) and attempting to demonstrate what nefarious corporate crimes Hiller Brood clients want to protect. She hopes to change each operative’s mind about their undercover activities and perhaps join her in ecological activism. 
She is paying tribute to the things in which The East believes and attempting to make a difference in her own way, without causing harm to anyone.

Let’s also get the cosine similarity of these two movies:

cosine_similarities[related_docs_indices[1]]


t-SNE on Movie Plots

Since there are too many movies to plot, for visualization purposes we will keep only the movies that belong to the genres drama, comedy and horror.

embd['Genre'] =df['Genre']
embd = embd.loc[embd.Genre.isin(['drama', 'comedy', 'horror'])]

X = embd.iloc[:,0:50]
y = embd.iloc[:,50]
# scale/normalize the data
X = StandardScaler().fit_transform(X)

# the two components
tSNE = pd.DataFrame(TSNE(n_components=2).fit_transform(X), columns = ['tSNE1', 'tSNE2'])
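# add the target by position; y still carries the pre-filter row labels,
# so assigning the Series directly would misalign the genres
tSNE['target'] = y.to_numpy()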

sns.scatterplot(x='tSNE1', y='tSNE2', data=tSNE, hue='target')
[Figure: t-SNE scatter plot of the plot embeddings (tSNE1 vs. tSNE2), colored by genre]

From the scatter plot above, it seems that the movies cannot be separated by genre based on these plot embeddings.
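A practical caveat when attaching genre labels to t-SNE output: pandas assigns a Series to a DataFrame column by index label, not by row position. After filtering, the labels keep their original row numbers, while a frame built from a NumPy array starts again at 0. The sketch below uses made-up genres and coordinates to show the difference.

```python
import pandas as pd

# Made-up genres; filtering keeps the original row labels 1 and 3
genres = pd.Series(["action", "drama", "western", "horror"])
kept = genres[genres.isin(["drama", "horror"])]

# A frame built from plain numbers gets a fresh 0..n-1 index
coords = pd.DataFrame({"tSNE1": [0.1, 0.2], "tSNE2": [0.3, 0.4]})

coords["by_label"] = kept                # aligned on index labels: row 0 finds no match
coords["by_position"] = kept.to_numpy()  # aligned on row order

print(coords)
```

In the by_label column the first row is missing and the second row receives the genre that belongs to the first, so it is worth checking the label column for NaN values before reading too much into a colored scatter plot.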
