Python Text Analysis With the Schrutepy Package
Following the success of the {schrute} R package, many requests came in to port the same dataset over to Python. The schrute and schrutepy packages serve one purpose only: to load the complete transcripts of The Office so you can perform NLP, text analysis, or whatever else you like with this fun dataset.
Quick start
Install the package with pip:
pip install schrutepy
Then import the dataset into a dataframe:
from schrutepy import schrutepy

df = schrutepy.load_schrute()
That’s it. Now you’re ready.
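If you want to confirm the load worked, a quick sanity check (just a sketch) is to look at the shape and column names of the dataframe:

# Optional sanity check on the loaded dataframe
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # e.g. season, episode, character, text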
Long example
Now we’ll quickly work through some common elementary text analysis functions.
from schrutepy import schrutepy
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import nltk
from nltk.corpus import stopwords
from PIL import Image
import numpy as np
import collections
import pandas as pd
Load the entire transcript with the load_schrute function
df = schrutepy.load_schrute()
Inspect the data
df.head()
| | index | season | episode | episode_name | director | writer | character | text | text_w_direction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good… | All right Jim. Your quarterlies look very good… |
| 1 | 2 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Oh, I told you. I couldnt close it. So… | Oh, I told you. I couldnt close it. So… |
| 2 | 3 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | So youve come to the master for guidance? Is … | So youve come to the master for guidance? Is … |
| 3 | 4 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Actually, you called me in here, but yeah. | Actually, you called me in here, but yeah. |
| 4 | 5 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right. Well, let me show you how its done. | All right. Well, let me show you how its done. |
Some of the records don’t contain dialogue, so drop them:
df = df.dropna()
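If you want to see how many records that removes, a quick check (run before the dropna above) is a sketch like this:

# Optional: inspect missing values before dropping them
print(df.isna().sum())        # missing values per column
print("Total rows:", len(df))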
Create a wordcloud of all the text in the entire series
text = " ".join(review for review in df.text)
print ("There are {} words in the combination of all review.".format(len(text)))
There are 3001517 words in the combination of all review.
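Note that len(text) is a character count. For a rough word count, one sketch is to split the combined dialogue on whitespace:

# Approximate word count: split the combined dialogue on whitespace
n_words = len(text.split())
print("There are roughly {} words in the combined dialogue.".format(n_words))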
# Create stopword list:
nltk.download('stopwords')
stopWords = set(stopwords.words('english'))

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopWords, background_color="white").generate(text)

# Display the generated image:
# the matplotlib way:
plt.figure(figsize=[30,15])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
[nltk_data] Downloading package stopwords to /home/xps/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Let’s do this same thing for a few of the characters. Might as well make a function at this point…
def plotDunder(character, df):
    mydf = df[df.character == character]
    text1 = " ".join(review for review in mydf.text)

    # Generate a word cloud image
    wordcloud = WordCloud(stopwords=stopWords, background_color="white").generate(text1)

    # Display the generated image:
    # the matplotlib way:
    plt.figure(figsize=[15,7])
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(character)
    plt.axis("off")
    plt.show()
fav = ["Michael", "David Wallace", "Dwight", "Jim", "Pam", "Oscar", "Phyllis", "Creed", "Ryan",]
for i in fav:
    plotDunder(i, df)
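As a quick aside (plain pandas, not part of the package), you can also check how many lines of dialogue each of these favorites has before comparing their vocabularies:

# How many lines of dialogue does each favorite character have?
line_counts = df[df.character.isin(fav)].character.value_counts()
print(line_counts)
line_counts.plot.bar(title="Lines of dialogue per character")
plt.show()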
Let’s make one in the shape of Dwight’s large head, using a mask image (schrutepy.png) saved in the working directory.
dwight_mask = np.array(Image.open("schrutepy.png"))
# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=dwight_mask,
               stopwords=stopWords, contour_width=1, contour_color='grey')

# Generate a wordcloud
wc.generate(text)

# show
plt.figure(figsize=[30,15])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

wc.to_file("final_schrute.png")
<wordcloud.wordcloud.WordCloud at 0x7fa1036a8b00>
Now let’s find and plot the most common word spoken by my favorite characters
def commonWord(character, df):
    mydf = df[df.character == character]
    text = " ".join(review for review in mydf.text)

    # Lowercase the dialogue, strip punctuation, and skip stopwords before counting
    wordcount = {}
    for word in text.lower().split():
        for punct in [".", ",", ":", "\"", "!", "“", "‘", "*"]:
            word = word.replace(punct, "")
        if word not in stopWords:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1

    # Draw a bar chart of the 10 most common words
    n_print = 10
    word_counter = collections.Counter(wordcount)
    lst = word_counter.most_common(n_print)
    df = pd.DataFrame(lst, columns=['Word', 'Count'])
    df.plot.bar(x='Word', y='Count', title=character)
for i in fav:
    commonWord(i, df)
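If you’d rather not chain str.replace calls, here is a hypothetical regex-based sketch of the same idea (commonWordRegex is not part of the original walkthrough):

import re

def commonWordRegex(character, df, n=10):
    # Same idea as commonWord, but tokenize with a regex instead of manual replaces
    mydf = df[df.character == character]
    text = " ".join(review for review in mydf.text)
    words = re.findall(r"[a-z']+", text.lower())  # keep letters and apostrophes
    counts = collections.Counter(w for w in words if w not in stopWords)
    top = pd.DataFrame(counts.most_common(n), columns=["Word", "Count"])
    top.plot.bar(x="Word", y="Count", title=character)

for i in fav:
    commonWordRegex(i, df)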
Star this repo on Github?
Want more content like this? Subscribe here