Redact Name Entities with SpaCy

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers].
Want to share your content on python-bloggers? click here.

When we work on NLP projects, we need to do text mining and data cleansing. A common task is to detect named entities, and sometimes it makes sense to replace the original text with the corresponding entity labels. This is a form of feature engineering. Imagine that you are modeling a collection of documents using TF-IDF on n-grams: it is often better to replace the specific values with their entity labels, so that, for example, all dates become DATE, all prices become MONEY, and so on.
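To make the n-gram argument concrete, here is a minimal sketch using only the standard library. The two documents and their redacted versions are hypothetical and written by hand (no entity detection happens here), and bigrams and shared_bigrams are illustrative helper names. It counts how many word bigrams two documents have in common before and after redaction.

```python
from collections import Counter


def bigrams(sentence):
    """Return the word bigrams of a whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    return list(zip(tokens, tokens[1:]))


def shared_bigrams(docs):
    """Count the bigrams that the two documents in docs have in common."""
    first, second = (Counter(bigrams(doc)) for doc in docs)
    return sum((first & second).values())


# Hypothetical documents; the redacted versions are written by hand.
raw = ["invoice due on 3 May", "invoice due on 17 June"]
redacted = ["invoice due on DATE", "invoice due on DATE"]

print(shared_bigrams(raw))       # 2
print(shared_bigrams(redacted))  # 3
```

After redaction the two dates collapse into one token, so the bigram ending in DATE becomes a feature that both documents share instead of two unrelated rare features.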

We will work with the SpaCy library. Let’s provide a practical example:

import pandas as pd
import spacy
pd.set_option("max_colwidth", 300)

nlp = spacy.load("en_core_web_sm")

# My sample data

df = pd.DataFrame({'Documents' : ["Apple is looking at buying U.K. startup for $1 billion",
                                  "San Francisco considers banning sidewalk delivery robots",
                                  "Amazon is hiring a new vice president of global policy",
                                  "George Pipis works for Predictive Hacks",
                                  "Today is Wednesday, 18:00",
                                  "Dear George, can you please respond to my email?"
                                 ]})

Now we will create a function called replace_ner that replaces each detected entity in the original text with its entity label. The trick is to iterate over the entities with reversed, so that replacement starts from the last detected entity. Each replacement changes the length of the text, so replacing from the front would shift the character offsets of the entities that follow.

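def replace_ner(mytxt):
    """Return mytxt with every detected entity replaced by its label."""
    clean_text = mytxt
    doc = nlp(mytxt)
    # Go through the entities from last to first so that the character
    # offsets of the entities still to be replaced remain valid
    for ent in reversed(doc.ents):
        clean_text = clean_text[:ent.start_char] + ent.label_ + clean_text[ent.end_char:]
    return clean_text

df['Redacted'] = df['Documents'].apply(replace_ner)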

df

As we can see, SpaCy detected some of the entities but missed others: it did not tag “George” in “Dear George” as a person (PERSON), nor “Predictive Hacks” as an organization (ORG). It did, however, correctly detect $1 billion, Apple, Amazon, San Francisco, Today, Wednesday, 18:00, and George Pipis.
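The reason for replacing from the last entity backwards can be checked without SpaCy. The sketch below is only an illustration: the (start, end, label) spans are written by hand to stand in for doc.ents and are not actual model output, and replace_spans is a hypothetical helper that mirrors the slicing used in replace_ner.

```python
# Hand-written (start_char, end_char, label) spans standing in for doc.ents;
# they are illustrative assumptions, not actual SpaCy output.
text = "Apple is looking at buying U.K. startup for $1 billion"
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]


def replace_spans(text, spans, from_end=True):
    """Replace each (start, end, label) span in text with its label."""
    ordered = reversed(spans) if from_end else spans
    for start, end, label in ordered:
        text = text[:start] + label + text[end:]
    return text


# Last to first: earlier offsets are untouched by later replacements.
print(replace_spans(text, spans, from_end=True))
# ORG is looking at buying GPE startup for MONEY

# First to last: the text shrinks after "Apple" becomes "ORG", so the
# remaining offsets point at the wrong characters and the result is garbled.
print(replace_spans(text, spans, from_end=False))
```

An alternative that works in either direction is to collect the untouched slices and the labels in a list and join them once at the end, but the reversed loop keeps the function short.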

To leave a comment for the author, please follow the link and comment on their blog: Python – Predictive Hacks.
