Document Letter Frequency in Python

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers.]

Letter Frequency

This walk-through shows how to get the letter frequency of a document in Python, counting letters either over the whole text or over its unique words only. We then compare the observed relative frequencies with the letter frequency of the English language.

[Figure: horizontal bar plot of English letter frequency in texts and in dictionaries. Source: https://en.wikipedia.org/wiki/Letter_frequency]

From the horizontal bar plot above, we can easily see that the letter e is the most common in both English texts and dictionaries. Notice also that the distribution differs between texts and dictionaries.

Part A: Get the Letter Frequency in Documents

We will work with the Moby Dick book and report the frequency and the relative frequency of each letter. Finally, we will apply a chi-square test to check whether the distribution of letters in Moby Dick is the same as the one observed in English texts.

import pandas as pd
import numpy as np
import re
from collections import Counter

# read the whole book and lower-case it so that 'A' and 'a' are counted together
with open('moby.txt', 'r') as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()

# convert to a list where each character
# is an element
letter_list = list(file_name_data)

# get the frequency of each letter
my_counter = Counter(letter_list)

# convert the Counter into Pandas data frame
df = pd.DataFrame.from_dict(my_counter, orient='index').reset_index()
df = df.rename(columns={'index':'letter', 0:'frequency'})

# keep only the 26 English letters; .copy() avoids pandas' SettingWithCopyWarning
df = df.loc[df['letter'].isin(list('abcdefghijklmnopqrstuvwxyz'))].copy()

# relative frequency of each letter within the document
df['doc_rel_freq'] = df['frequency'] / df['frequency'].sum()
df = df.sort_values('letter')


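# load the English letter frequencies according to Wikipedia
# (the file needs a 'letter' column and a 'rel_freq' column, both used below)
english = pd.read_csv("english_freq.csv")

# attach the reference frequency to each letter
df = pd.merge(df, english, on="letter")

# expected count of each letter if the document followed the English frequencies
df['expected'] = np.round(df['rel_freq'] * df['frequency'].sum(), 0)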

df
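
The merge above only works if english_freq.csv has a `letter` column and a `rel_freq` column. The file itself is not shown in this post, so the following is a minimal sketch of the layout it is assumed to have. The in-memory CSV text is a hypothetical stand-in for the real file, it lists only three letters, and the values are approximate figures based on the Wikipedia table. The names `english_demo` and `counts_demo` and the toy counts are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for english_freq.csv: one row per letter,
# rel_freq given as a proportion (approximate values, three letters only)
csv_text = """letter,rel_freq
a,0.082
e,0.127
t,0.091
"""

english_demo = pd.read_csv(io.StringIO(csv_text))

# Toy observed counts for the same letters (made-up numbers)
counts_demo = pd.DataFrame({"letter": ["a", "e", "t"],
                            "frequency": [80, 130, 90]})

# validate= makes pandas raise if a letter appears more than once on either side
merged = pd.merge(counts_demo, english_demo, on="letter", validate="one_to_one")
print(merged)
```

Because `pd.merge` performs an inner join by default, any letter missing from either side is silently dropped, so it may be worth checking that the merged frame still has 26 rows.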

import matplotlib.pyplot as plt
# Jupyter/IPython magic; omit this line when running as a plain script
%matplotlib inline

df.plot(x="letter", y=["doc_rel_freq", "rel_freq"], kind="barh", figsize=(12,8))
 

Compare the Observed Frequencies with the Expected

We will apply the Chi-Square test to compare the observed with the expected letter frequencies.

from scipy.stats import chi2_contingency
# Chi-square test of independence: the observed and expected counts
# are treated as the two columns of a 26 x 2 contingency table.
c, p, dof, expected = chi2_contingency(df[['frequency', 'expected']])
p
 

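We get a p-value (p) of 0, which implies that the letter frequency in Moby Dick does not follow the same distribution as the one we see in English texts, although the Pearson correlation is very high (~99.6%). The two results are not contradictory: with a sample as large as a whole book, the chi-square test flags even small deviations as statistically significant.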

df[['frequency', 'expected']].corr()
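
chi2_contingency treats the observed and expected counts as two samples in a contingency table. A goodness-of-fit test, which compares observed counts against a fixed reference distribution, is another way to pose the same question. The following is a minimal sketch using `scipy.stats.chisquare` on made-up counts for a four-letter alphabet; the numbers are illustrative assumptions, not values from Moby Dick.

```python
import numpy as np
from scipy.stats import chisquare

# Made-up observed counts for a four-letter alphabet (illustrative only)
observed = np.array([120, 45, 30, 5])

# Made-up reference relative frequencies, normalized so they sum to 1
reference = np.array([0.55, 0.25, 0.15, 0.05])
reference = reference / reference.sum()

# Expected counts must add up to the same total as the observed counts,
# so they are left unrounded here
expected = reference * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```

With the data frame from this post, the analogous call would presumably pass `df['frequency']` as `f_obs` and `df['rel_freq']`, normalized to sum to 1 and multiplied by the total count, as `f_exp`.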
 

Part B: Get the Letter Frequency in Unique Words

We will apply the same logic as above, but this time we will count letters over the unique words only and compare the result with the letter frequency of English dictionary words according to Wikipedia.
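
To see why counting over unique words changes the picture, here is a minimal sketch on a made-up sentence (not taken from Moby Dick). A repeated word such as "the" contributes its letters every time in the whole-text count, but only once in the unique-word count.

```python
import re
from collections import Counter

# Made-up example text, already lower-cased
text = "the whale and the sea and the ship"

# Whole-text count: every occurrence of every word contributes
whole_text = Counter(ch for ch in text if ch.isalpha())

# Unique-word count: each distinct word contributes once
unique_words = set(re.findall(r"\w+", text))
unique_only = Counter("".join(unique_words))

print(whole_text["t"], unique_only["t"])  # 3 1
print(whole_text["e"], unique_only["e"])  # 5 3
```

Letters that appear in very common words are therefore weighted much more heavily in the whole-text count than in the unique-word count.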

# get the words (raw string so that \w reaches the regex engine unchanged)
words = re.findall(r'\w+', file_name_data)

# get the unique words
V = list(set(words))

# concatenate all words into one text
# and then get the list of each character
letter_list  = list(" ".join(V))

# get the frequency of each letter
my_counter = Counter(letter_list)


df = pd.DataFrame.from_dict(my_counter, orient='index').reset_index()
df = df.rename(columns={'index':'letter', 0:'frequency'})

# keep only the 26 English letters; .copy() avoids pandas' SettingWithCopyWarning
df = df.loc[df['letter'].isin(list('abcdefghijklmnopqrstuvwxyz'))].copy()

# relative frequency of each letter within the unique words
df['doc_rel_freq'] = df['frequency'] / df['frequency'].sum()
df = df.sort_values('letter')


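# load the English dictionary letter frequencies according to Wikipedia
# (the file needs a 'letter' column and a 'rel_freq' column, both used below)
english = pd.read_csv("english_dict_freq.csv")

# attach the reference frequency to each letter
df = pd.merge(df, english, on="letter")

# expected count of each letter if the unique words followed the dictionary frequencies
df['expected'] = np.round(df['rel_freq'] * df['frequency'].sum(), 0)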

df
 
 
import matplotlib.pyplot as plt
# Jupyter/IPython magic; omit this line when running as a plain script
%matplotlib inline

df.plot(x="letter", y=["doc_rel_freq", "rel_freq"], kind="barh", figsize=(12,8))
 

Compare the Observed Frequencies with the Expected

As before, we will apply the Chi-Square test.

# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(df[['frequency', 'expected']])
p
 
1.7915973729245735e-84

Again, we may infer that there is a statistically significant difference in the distribution of the letters between the unique words of our document and the English dictionary. Again, the Pearson correlation is high (~99.6%).

df[['frequency', 'expected']].corr()
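
A very small p-value next to a very high correlation can look contradictory. One way to see why it is not: for fixed proportions, the chi-square statistic grows in proportion to the number of letters counted. The sketch below uses the goodness-of-fit form of the statistic and made-up proportions for a three-letter alphabet (both are illustrative assumptions, and `chi_square_stat` is a hypothetical helper), and changes only the sample size.

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up proportions: the observed text deviates only slightly from the reference
observed_props = [0.51, 0.30, 0.19]
reference_props = [0.50, 0.30, 0.20]

stats = {}
for n in (1_000, 1_000_000):
    # Same proportions in both runs; only the number of letters changes
    observed = [p * n for p in observed_props]
    expected = [p * n for p in reference_props]
    stats[n] = chi_square_stat(observed, expected)
    print(n, round(stats[n], 2))
```

The proportions are identical in both runs, yet the statistic is 1,000 times larger for the larger sample, which is why a long text can yield a p-value near 0 even when the two distributions are almost the same.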
 

Discussion

The use of letter frequencies and frequency analysis plays a fundamental role in cryptograms and several word puzzle games, including Hangman, Scrabble and the television game show Wheel of Fortune. Letter frequencies also have a strong effect on the design of some keyboard layouts.

Today, we showed how easily you can get the letter frequency using Python, and how you can apply statistical tests to compare the distribution of letters between two documents, or between a document and a reference such as English texts.
