Introduction
“Words have no power to impress the mind without the exquisite horror of their reality.”
Edgar Allan Poe
One common way of distinguishing between history and prehistory is by the emergence of writing. In particular, in our modern era, text data has become ubiquitous. The study of either the past or the present often involves the analysis of text. From social media to scientific journals, words are everywhere. In this lesson, we will learn how to analyze and visualize textual data. We will use the Natural Language Toolkit (NLTK) to tokenize the text data, and the Matplotlib library to visualize our results.
Data source
The data used in this lesson is available on the Oxford Text Archive website. To learn more about textual data sources, see the post ‘Where to find and how to load historical data’.
Coding the past: visualizing text data
1. Importing text data with Python
To load the text files mentioned above, we will build a function. Before we start to write the function, all libraries necessary for this lesson will be loaded.
Using the with statement ensures that the file is closed when the block inside it finishes. Note that we use “latin-1” encoding. The function islice() creates an iterator over the file, and a for loop goes through it line by line. Each line is appended to the list my_text.
word_tokenize is a function from the NLTK library that splits a sentence into words. Every line is split into words in this way. Because tokenizing each line separately produces a list of lists, the result needs to be flattened into a single list. This is done with a list comprehension.
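The loading step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original function: the name load_text, the max_lines parameter, and the demonstration file are hypothetical. The tokenizer is passed in as an argument and defaults to str.split so that the sketch runs without NLTK data; to follow the lesson, pass nltk.tokenize.word_tokenize instead.

```python
from itertools import islice
import os
import tempfile


def load_text(path, tokenize=str.split, max_lines=None, encoding="latin-1"):
    """Read a text file line by line and return one flat list of tokens."""
    my_text = []
    # The with statement closes the file once the block finishes.
    with open(path, encoding=encoding) as f:
        # islice yields at most max_lines lines (all of them when None).
        for line in islice(f, max_lines):
            my_text.append(line)
    # Tokenizing each line separately gives a list of lists ...
    tokenized = [tokenize(line) for line in my_text]
    # ... which a nested list comprehension flattens into a single list.
    return [token for line_tokens in tokenized for token in line_tokens]


# Small demonstration on a throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="latin-1"
) as tmp:
    tmp.write("The colonies are free.\nWords are everywhere.\n")

demo_tokens = load_text(tmp.name)
os.remove(tmp.name)
print(demo_tokens)
```

With word_tokenize as the tokenizer, punctuation marks would come back as separate tokens rather than staying attached to the words, which is why the later cleaning step matters.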
Now we load the manifests of three authors: Oxenbridge Thacher, James Otis, and James Mayhew. The results are stored in three lists called thacher, otis, and mayhew.
If you check the length of the lists, you will see that Oxenbridge Thacher’s manifest has approximately 4,156 words; James Mayhew, 18,969 words; and James Otis, 34,031 words.
2. Removing stop words in Python
In this step, we use the NLTK stopwords list to remove words that add little meaning to our analysis. We also convert all characters to lowercase and drop all words with two or fewer characters.
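A minimal sketch of such a cleaning function is shown below. The function name and the tiny stop-word set are hypothetical and only there so that the example runs without downloading NLTK data; the comment shows how the real NLTK list could be plugged in.

```python
def remove_stop_words(words, stop_words):
    """Lowercase the tokens, then drop stop words and tokens of two or fewer characters."""
    lowered = [word.lower() for word in words]
    return [word for word in lowered if word not in stop_words and len(word) > 2]


# Tiny hypothetical stop-word set, used only so the sketch runs without NLTK data.
# To follow the lesson, build the set from NLTK instead:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))  # needs nltk.download("stopwords")
demo_stop_words = {"the", "and", "are", "not"}

demo_words = ["The", "Colonies", "are", "not", "to", "be", "taxed", "and", "God", "."]
cleaned = remove_stop_words(demo_words, demo_stop_words)
print(cleaned)  # ['colonies', 'taxed', 'god']
```

Lowercasing before the stop-word check matters because the NLTK stop words are all lowercase, so "The" would otherwise slip through.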
We apply the function to the three lists of words. After the cleaning process, the number of words is reduced to less than 50% of the original size.
3. How to count words in a list using Python
The function below counts the frequency of each word and returns a dataframe with the words and their frequencies, sorted by the frequency.
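One possible way to write such a counting function is sketched here, assuming collections.Counter for the counting and pandas for the dataframe. The function name and the column name "count" are assumptions; the column name "word" matches the join key used later in the lesson.

```python
from collections import Counter

import pandas as pd


def count_words(words):
    """Return a dataframe of words and their counts, most frequent first."""
    counts = Counter(words)
    # most_common() already orders the pairs by descending count.
    return pd.DataFrame(counts.most_common(), columns=["word", "count"])


demo_counts = count_words(["colonies", "tax", "colonies", "god", "tax", "colonies"])
print(demo_counts)
```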
4. Word count visualization
We will use the matplotlib library to create a bar plot with the 10 most frequent words in each manifest. We use iloc to select the first 10 rows of each dataframe. barh creates a horizontal bar plot where the words are on the y-axis and the frequency on the x-axis. After that, we set the title of each plot and make a series of adjustments, including removing the grid, removing part of the frame, and changing the font and background colors. Finally, we use the tight layout function to adjust the spacing between the plots.
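The plotting steps above could look roughly like this. The data, colors, and figure size are hypothetical placeholders rather than the values used in the original post, and the dataframes are assumed to have "word" and "count" columns already sorted by frequency.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical word counts standing in for the three manifests.
demo_frames = {
    "Thacher": pd.DataFrame({"word": ["colonies", "trade", "duty"], "count": [30, 12, 9]}),
    "Otis": pd.DataFrame({"word": ["rights", "colonies", "law"], "count": [80, 55, 40]}),
    "Mayhew": pd.DataFrame({"word": ["god", "people", "liberty"], "count": [60, 41, 33]}),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
fig.patch.set_facecolor("whitesmoke")  # background color of the figure

for ax, (author, df) in zip(axes, demo_frames.items()):
    top = df.iloc[:10]  # first 10 rows, i.e. the most frequent words
    ax.barh(top["word"], top["count"], color="steelblue")
    ax.invert_yaxis()  # put the most frequent word at the top
    ax.set_title(author, color="dimgray")
    ax.grid(False)  # no grid
    ax.set_facecolor("whitesmoke")
    # Remove part of the frame.
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.tick_params(colors="dimgray")  # tick and tick-label colors

fig.tight_layout()  # adjust the spacing between the plots
```

To see the result, drop the matplotlib.use("Agg") line in an interactive session and call plt.show(), or save the figure with fig.savefig().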
5. Calculating the proportion of each word and comparing the manifests
Finally, we calculate the proportion of each word in each manifest relative to the total number of words in that document and store it in a new column called “proportion”. We also create two new data frames, one for each pair of manifests: one to compare Thacher and Otis, and the other to compare Thacher and Mayhew. This is done with an outer join, using the word column as the key. The join keeps all the words, even those that appear in only one of the two datasets; the resulting missing values are then filled with 0.
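A sketch of the proportion and join steps, assuming pandas and dataframes with "word" and "count" columns. The helper name, the suffixes, and the demonstration data are hypothetical. Note that the outer join itself produces NaN for missing words; fillna(0) turns those into 0.

```python
import pandas as pd


def add_proportion(df):
    """Return a copy with each word's share of the document's total count."""
    df = df.copy()
    df["proportion"] = df["count"] / df["count"].sum()
    return df


# Hypothetical counts standing in for two manifests.
thacher_demo = add_proportion(pd.DataFrame({"word": ["colonies", "trade"], "count": [3, 1]}))
otis_demo = add_proportion(pd.DataFrame({"word": ["colonies", "rights"], "count": [1, 1]}))

# Outer join on "word": keeps words found in only one of the two documents.
thacher_otis_demo = pd.merge(
    thacher_demo[["word", "proportion"]],
    otis_demo[["word", "proportion"]],
    on="word",
    how="outer",
    suffixes=("_thacher", "_otis"),
).fillna(0)  # a word missing from one document gets proportion 0 there
print(thacher_otis_demo)
```

In this toy example, "trade" appears only in the first document and "rights" only in the second, so each gets a proportion of 0 on the other side.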
Now we will compare the manifests in pairs, starting by plotting the proportion of each word in Thacher on the x-axis and the proportion of the same word in Otis on the y-axis. We will use the scatter function to create a scatter plot in which the coordinates are the proportions of a given word in Thacher and Otis. We will also use the annotate function to label each point with the word. The same procedure will be used to compare Thacher and Mayhew. Note that the more similar the manifests, the more the points will concentrate along the diagonal line (same proportion in both manifests).
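The comparison plot could be sketched as follows. The joined dataframe and its column names are hypothetical stand-ins, and the dashed diagonal reference line is an addition for readability that may not be in the original plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical joined proportions for one pair of manifests.
pair_demo = pd.DataFrame({
    "word": ["colonies", "trade", "rights"],
    "proportion_thacher": [0.75, 0.25, 0.0],
    "proportion_otis": [0.50, 0.0, 0.50],
})

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(pair_demo["proportion_thacher"], pair_demo["proportion_otis"], alpha=0.6)

# Label each point with its word, slightly offset from the marker.
for row in pair_demo.itertuples(index=False):
    ax.annotate(
        row.word,
        (row.proportion_thacher, row.proportion_otis),
        xytext=(3, 3),
        textcoords="offset points",
        fontsize=8,
    )

# Diagonal reference line: words on it have the same proportion in both texts.
limit = pair_demo[["proportion_thacher", "proportion_otis"]].to_numpy().max()
ax.plot([0, limit], [0, limit], linestyle="--", color="gray")

ax.set_xlabel("Proportion in Thacher")
ax.set_ylabel("Proportion in Otis")
```

With real data, annotating every word would clutter the plot, so it may be worth labeling only words above some proportion threshold.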
Note that Thacher and Otis are more similar than Thacher and Mayhew. This is reflected in the scatter plots, where the points lie closer to the diagonal line in the plot relating Thacher and Otis than in the one relating Thacher and Mayhew. This is a simple way to compare the similarity of two texts. We can see, for example, that while Thacher talks a lot about “colonies”, Mayhew talks a lot about “god”.
Conclusions
- You can tokenize text data with the NLTK library function word_tokenize;
- With list comprehensions, you can clean text by eliminating irrelevant characters and words;
- You can visualize the frequency of words in a text with matplotlib.