Feature encoding methods – the Pandas way

[This article was first published on Python – Hutsons-hacks, and kindly contributed to python-bloggers.]

This tutorial explores the various ways categorical data can be encoded with Pandas and NumPy to prepare it for a machine learning or predictive modelling pipeline.

Encoding methods

There are three main methods explored here:

  1. Label encoding – assigns each category an integer based on where it falls in the label order. This can work well for ranked data and non-parametric methods, but tends to be used less with machine learning models because it implies an ordering between categories.
  2. One hot encoding (dummy variable encoding) – takes a group of categorical labels and assigns each its own binary dummy column: 1 (belongs to that category) or 0 (doesn't belong to that category).
  3. Manual encoding – applies an explicit condition to assign the numerical 1 or 0 encoding.
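The three methods above can be sketched in a few lines of Pandas and NumPy. This is a minimal illustration only – the column names and data are made up for the example and are not taken from the tutorial:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with an ordered categorical column.
df = pd.DataFrame({"size": ["small", "medium", "large", "medium"]})

# 1. Label encoding: map each category to an integer code.
# Using pd.Categorical with ordered categories preserves the rank order.
df["size_label"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
).codes

# 2. One hot (dummy) encoding: one binary column per category.
dummies = pd.get_dummies(df["size"], prefix="size")
df = pd.concat([df, dummies], axis=1)

# 3. Manual encoding: an explicit condition assigns the 1/0 value.
df["is_large"] = np.where(df["size"] == "large", 1, 0)

print(df)
```

Here `size_label` ends up as 0/1/2 following the declared category order, `get_dummies` adds one column per label, and `np.where` applies the hand-written rule.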

The tutorial

The tutorial is a YouTube video I created to help you grasp the concepts. I built it in Jupyter and have attached a Python (.py) file to support it. Watch the tutorial below:

Where to get the content?

The supporting code files can be found in my GitHub account. This includes a Jupyter notebook and a Python file. The next tutorial will look at how to do this type of encoding in scikit-learn and other Python libraries, so look out for that.

Signing off

I hope you enjoy my tutorials. Please stay tuned for the next video and subscribe to the YouTube channel.
