How to Normalize Data in Python

[This article was first published on PyShark, and kindly contributed to python-bloggers].

In this article we will explore how to normalize data in Python.


Introduction

One of the first steps in feature engineering for many machine learning models is ensuring that the data is scaled properly.

Some models, such as KNN, SVM, and linear regression (particularly when it is regularized or trained with gradient descent), are heavily affected by features with different scales.

Others, such as decision trees, bagging, and boosting algorithms, generally do not require any data scaling.

For scale-sensitive models, features with larger ranges of values play a bigger role in the algorithm’s decisions, simply because their larger magnitudes have a larger effect on the outputs.

In such cases, we turn to feature scaling to bring all features to a common scale, so that they are weighed equally when training the model.
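To see why scale matters for a distance-based model such as KNN, here is a minimal sketch using only the standard library. The two apples and their values are assumed purely for illustration; the point is how much of the Euclidean distance comes from each feature.

```python
import math

# Illustrative values (assumed for this sketch): weight in grams, price in dollars.
apple_a = (300, 3)
apple_b = (250, 2)

# Raw Euclidean distance between the two observations.
raw_distance = math.dist(apple_a, apple_b)

# Share of the squared distance contributed by the weight feature.
weight_gap_sq = (apple_a[0] - apple_b[0]) ** 2
price_gap_sq = (apple_a[1] - apple_b[1]) ** 2
weight_share = weight_gap_sq / (weight_gap_sq + price_gap_sq)

print(f"raw distance: {raw_distance:.2f}")
print(f"weight share of squared distance: {weight_share:.2%}")
```

The 50 g gap contributes almost all of the distance, while the $1 gap barely registers, even though both may be equally informative.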

The two most popular feature scaling techniques are:

  1. Z-Score Standardization
  2. Min-Max Normalization
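As a rough comparison of the two techniques, the sketch below applies both to the same list using only the standard library. The helper names `z_score` and `min_max` and the sample values are assumptions made for illustration, and the z-score here divides by the population standard deviation (a sample standard deviation is another common choice).

```python
import statistics

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

weights = [300, 250, 800]  # hypothetical sample values

standardized = z_score(weights)
normalized = min_max(weights)

print([round(z, 2) for z in standardized])
print([round(n, 2) for n in normalized])
```

Min-max output always lies between 0 and 1, whereas z-scores are centered on 0 and are not bounded.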

In this article, we will discuss how to perform min-max normalization of data using Python.

To continue following this tutorial we will need the following two Python libraries: scikit-learn (imported in code as sklearn) and pandas.

If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:

pip install scikit-learn
pip install pandas

What is normalization

In statistics and machine learning, min-max normalization is the process of rescaling data from its original range to the range between 0 and 1.

The resulting normalized values represent the original data on a 0-1 scale.

This allows us to compare multiple features together and get more relevant information, since all the data is now on the same scale.

In min-max normalization, for every feature, its minimum value gets transformed into 0 and its maximum value gets transformed into 1. All values in between get scaled to fall within the 0-1 range, based on where the original value sits relative to the minimum and maximum values of the feature.

Suppose you have an array of numbers \(A = [v_1, v_2, \dots, v_n]\).

We will first find the minimum and maximum values of the array: \(min_A\) and \(max_A\).

Then, using the min and max values, we will transform each original value \(v_i\) into a min-max normalized value \(v'_i\) using the following formula:

$$v'_i = \frac{v_i - min_A}{max_A - min_A}$$
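The formula translates almost directly into plain Python. The sketch below is one possible version: the function name `min_max_normalize`, the sample input, and the handling of a constant feature (where the maximum equals the minimum) are assumptions for illustration, not part of the definition above.

```python
def min_max_normalize(values):
    """Apply v' = (v - min) / (max - min) to every value in a list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant feature has no spread,
        # so map every value to 0.0 rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical input: the minimum maps to 0, the maximum maps to 1.
result = min_max_normalize([10, 20, 40])
print(result)
```

The middle value 20 lands at one third, because it sits one third of the way from the minimum (10) to the maximum (40).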


Normalization example

In this section we will take a look at a simple example of data normalization.

Consider the following dataset with prices of different apples:

Weight in g    Price in $
300            3
250            2
800            5

Plotting this dataset should look like this:

[Figure: not normalized data]

Here the weight appears to vary much more than the price, but it only looks this way because the two features are on different scales.

The price range is between $2 and $5, whereas the weight range is between 250g and 800g.

Let’s normalize this data!

Start with the weight feature:

Observation    \(v_i\)    \(v'_i = \frac{v_i - min_W}{max_W - min_W}\)
1              300        \(\frac{300-250}{800-250} = 0.09\)
2              250        \(\frac{250-250}{800-250} = 0\)
3              800        \(\frac{800-250}{800-250} = 1\)
\(min_W\)      250
\(max_W\)      800

And do the same for the price feature:

Observation    \(v_i\)    \(v'_i = \frac{v_i - min_P}{max_P - min_P}\)
1              3          \(\frac{3-2}{5-2} = 0.33\)
2              2          \(\frac{2-2}{5-2} = 0\)
3              5          \(\frac{5-2}{5-2} = 1\)
\(min_P\)      2
\(max_P\)      5

And combine the two features into one dataset:

Weight (normalized)    Price (normalized)
0.09                   0.33
0                      0
1                      1

We can now see that both features share the same 0-1 scale, so when we visualize the data, neither axis dominates the other:

[Figure: normalized data]

The graph looks almost identical, with the only difference being the scale of each axis.
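One way to check that normalization only changes the axes, and not the relationship between the features, is to compare the correlation before and after. The sketch below uses the apple data from the example; the helper functions `min_max` and `pearson` are written out by hand for illustration.

```python
import math

def min_max(values):
    """Rescale so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pearson(xs, ys):
    """Pearson correlation computed from its definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

weight = [300, 250, 800]
price = [3, 2, 5]

before = pearson(weight, price)
after = pearson(min_max(weight), min_max(price))

# The two correlations agree up to floating-point rounding.
print(math.isclose(before, after))
```

Min-max normalization shifts and stretches each feature by constants, so the relative positions of the points stay the same.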

Now let’s see how we can recreate this example using Python!


How to normalize data in Python

Let’s start by creating the dataframe that we used in the example above:

import pandas as pd

# Weight (in grams) and price (in dollars) of three apples
data = {'weight': [300, 250, 800],
        'price': [3, 2, 5]}

df = pd.DataFrame(data)

print(df)

And you should get:

   weight  price
0     300      3
1     250      2
2     800      5

Once we have the data ready, we can use the MinMaxScaler class and its methods (from the sklearn library) to normalize the data:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# fit_transform() learns each column's min and max, then rescales the values
normalized_data = scaler.fit_transform(df)

print(normalized_data)

And you should get:

[[0.09090909 0.33333333]
 [0.         0.        ]
 [1.         1.        ]]
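A fitted scaler also remembers the minimum and maximum of each column, so it can be reused on data it has not seen. The sketch below rebuilds the same dataframe and then scales one extra observation; the `new_apples` values are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'weight': [300, 250, 800],
                   'price': [3, 2, 5]})

scaler = MinMaxScaler()
scaler.fit(df)  # learns the per-column min and max

print(scaler.data_min_)  # per-column minimums seen during fit
print(scaler.data_max_)  # per-column maximums seen during fit

# Hypothetical new observation, scaled with the min/max learned above.
new_apples = pd.DataFrame({'weight': [525], 'price': [3.5]})
scaled_new = scaler.transform(new_apples)
print(scaled_new)

# inverse_transform() maps normalized values back to the original units.
restored = scaler.inverse_transform(scaled_new)
print(restored)
```

Both new values land at roughly 0.5, because 525 g and $3.5 sit halfway between the fitted minimum and maximum of their columns. Values outside the fitted range would fall below 0 or above 1.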

As you can see, fit_transform() returned a NumPy array, so the last step is to convert it back to a dataframe:

normalized_df = pd.DataFrame(normalized_data, columns=df.columns)

print(normalized_df)

And you should get:

     weight     price
0  0.090909  0.333333
1  0.000000  0.000000
2  1.000000  1.000000

This is identical to the result we calculated manually in the example, where the values were rounded to two decimal places.
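For a quick check without scikit-learn, the same formula can be applied directly with pandas arithmetic. This is an alternative sketch, not the approach used above, and it assumes no column is constant (otherwise the denominator would be zero).

```python
import pandas as pd

df = pd.DataFrame({'weight': [300, 250, 800],
                   'price': [3, 2, 5]})

# Column-wise min-max formula; pandas aligns the per-column min and max
# with the matching dataframe columns.
normalized_df = (df - df.min()) / (df.max() - df.min())

print(normalized_df)
```

This returns a dataframe directly, with the original column names preserved. The scaler object is still the better fit when the same transformation has to be reapplied to new data later.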


Conclusion

In this tutorial we discussed how to normalize data in Python.

Feel free to leave comments below if you have any questions or suggestions for edits, and check out more of my Machine Learning articles.

