Unleashing the power of NLP


💡 What is NLP?

NLP is short for Natural Language Processing and it helps make sense of a difficult data type: written text.

📑 Basic concepts


(also as 📄PDF for you to download here)

  • Corpus: When you have your text data ready, you have your corpus. It’s a collection of documents.
  • Tokens: The single units of a text, usually words (but a token could also be a sentence, paragraph, or character).
  • Tokenization: When you hear the word tokenization, it means that you are splitting up the sentences into single words (tokens) and turning them into a bag of words. You can take this quite literally – a bag of words does not really take the order of the words into account. There are ways to account for the order using n-grams (for instance, bigrams would split the sentence “Rory lives in a world of books” into “Rory lives”, “lives in”, “in a”, “a world”, “world of”, “of books”), but this is limited.
  • Document-feature matrix (DFM): To generate the DFM you first split the text into its single terms (tokens), then count how frequently each token occurs in each document.
  • Stemming: With stemming, you reduce each word to its stem, which is often not a proper word itself (for example, “studies” and “studying” both become “studi”).
  • Lemmatization: With lemmatization, it’s slightly different. Instead of a truncated stem like “studi”, you end up with a meaningful dictionary form – “study” 🥳 (there is a short code sketch right after this list).
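
To make these concepts concrete, here is a minimal sketch of tokenization, n-grams, stemming, and lemmatization in Python using NLTK (assuming nltk is installed and its “punkt” and “wordnet” resources are downloaded; the example sentence is made up):

```python
import nltk
from nltk import ngrams
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")    # tokenizer models
nltk.download("wordnet")  # lemmatizer dictionary

sentence = "Rory studies in a world of books"

# Tokenization: split the sentence into single words (tokens)
tokens = word_tokenize(sentence)

# Bigrams keep a little bit of the word order
print(list(ngrams(tokens, 2)))

# Stemming: cut each token down to its (possibly non-word) stem
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # "studies" -> "studi"

# Lemmatization: map each token to a meaningful dictionary form
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # "studies" -> "study"
```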

The terms and concepts sheet also describes a typical workflow with the bag-of-words approach nicely. You typically read the data, tokenize it (and turn it into a bag full of words), pre-process it by stemming it (and removing stop words), count the single words, and turn the counts into a DFM (document-feature matrix) – and now you’re ready to go!
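
As a rough illustration of this workflow in Python, here is a minimal sketch with scikit-learn’s CountVectorizer (the tiny two-document corpus is made up; CountVectorizer handles tokenization, stop-word removal, and counting in one step):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up mini corpus: a collection of two documents
corpus = [
    "Rory lives in a world of books",
    "Rory studies the books every day",
]

# Tokenize, drop English stop words, and count tokens per document
vectorizer = CountVectorizer(stop_words="english")
dfm = vectorizer.fit_transform(corpus)      # sparse document-feature matrix

print(vectorizer.get_feature_names_out())   # the features (tokens)
print(dfm.toarray())                        # counts per document
```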

From here on, you can do multiple tasks – for instance, you can perform supervised tasks with dictionary approaches and classify sentiment or topics. But you can also use it for unsupervised tasks like structural topic models.
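
As an illustration of the supervised, dictionary-based idea, here is a deliberately tiny sentiment sketch (the word lists are made up and far too small for real use; real sentiment dictionaries contain thousands of entries):

```python
# Hypothetical mini sentiment dictionary
positive = {"love", "great", "wonderful"}
negative = {"hate", "boring", "terrible"}

def sentiment_score(text: str) -> int:
    """Count positive minus negative tokens in a text."""
    tokens = text.lower().split()
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

print(sentiment_score("Rory loves this wonderful world of books"))  # -> 1
```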

If you’re up for more on how to use 📦{quanteda} in R on these tasks, here is more:

What are alternatives to Bag-of-words?

One possible downside when using the bag-of-words approach described above is that you often cannot fully take the structure of the language into account (n-grams are one way, but they are limited). You also often need a lot of data to successfully train your model. An alternative is to use a pre-trained model. And here comes Google’s famous BERT model. BERT is the acronym for Bidirectional Encoder Representations from Transformers. To understand how a BERT model works, I like to look at how it understands your text and how you train it. Simply speaking, there are three essential components:

  • With BERT, you first encode the input. The model combines three embedding layers: the token embeddings (BERT uses special tokens, [CLS] and [SEP], to mark the beginning of the input and the sentence boundaries), the positional embeddings (where each token is placed in the sequence), and the segment embeddings (which tell the model to which sentence each token belongs).
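
To see what this input looks like in practice, here is a small sketch using the 🤗 Transformers tokenizer (assuming the transformers package is installed; the model files are downloaded on first use):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode a pair of sentences: BERT adds the special [CLS] and [SEP] tokens
encoded = tokenizer("Rory lives in a world of books.", "She reads every day.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'rory', 'lives', ..., '[SEP]', 'she', 'reads', ..., '[SEP]']

# Segment embedding input: 0 for tokens of sentence A, 1 for sentence B
print(encoded["token_type_ids"])
```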

And then there is the training:

  • The first half of the training involves masked language modeling (Masked LM). During training, a share of the tokens is masked and the model learns to predict each masked word from the words around it.
  • During the second half, you train the model on next sentence prediction: given two sentences, the model learns to predict whether the second sentence actually follows the first. This way, the model learns which sentences usually follow each other.
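
You can play with the masked-language-modeling objective directly via the 🤗 Transformers fill-mask pipeline; here is a small sketch (again assuming the transformers package is installed and the model is downloaded on first use):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from the context on both sides
for prediction in fill_mask("Rory lives in a world of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```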


(also as 📄PDF for you to download here)

If you want more, there are many great videos online explaining how a BERT model works. One major advantage is that BERT comes as a pre-trained language model: you can use your labeled data to fine-tune this pre-trained model to your task. One way of thinking about it is to think of a student: with the regular bag-of-words approach, you need to teach the student the language first. With BERT, you have a student who already knows the language, but you’re teaching the student a specific topic like biology.
There is a fantastic framework in Python to work with BERT models – 🤗 Huggingface.

It has several pre-trained models available on its website, including fantastic tutorials that are very easy to follow. If you’re up for using it in R, the great community has you covered. There is a tutorial for everything – and here’s one from RStudio explaining how to use the framework in R with 📦 {reticulate}.
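
To give a rough idea of what the fine-tuning step can look like in code, here is a minimal sketch with 🤗 Transformers and 🤗 Datasets (the two-example labeled dataset and the label meanings are made up purely for illustration; a real fine-tuning run needs much more data):

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Toy labeled data: 1 = about books, 0 = not about books (hypothetical labels)
data = Dataset.from_dict({
    "text": ["Rory lives in a world of books", "The weather is sunny today"],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Tokenize the texts so BERT can read them
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding=True),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
)
trainer.train()  # fine-tunes the pre-trained "student" on our specific topic
```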

And another cool thing: Once you understand how BERT works, you can also apply the logic to a variety of text, audio, or video data tasks 🥳

Here’s the visual of two possible workflows – they’re not that different, as you will see.

One approach uses a bag of words (for instance as described in more detail here) and the other approach uses BERT (with the 🤗 Huggingface framework – all helpful tutorials are linked below; also as 📄PDF for you to download here):

