# NumPy Hacks for Data Cleansing

**Python – Predictive Hacks**, kindly contributed to python-bloggers.


The goal of this article is to provide some “NumPy hacks” that are quite useful in the data science pipeline, especially during the data cleansing phase. As always, we will work with reproducible, practical examples, using the **Pandas** and **NumPy** libraries.

## random

**random.seed()**

NumPy gives us the possibility to generate random numbers. However, when we work with reproducible examples, we want the “random numbers” to be identical whenever we run the code. For that reason, we can set a random seed with the `random.seed()` function, which plays a similar role to the `random_state` parameter in the scikit-learn package.
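For instance, reseeding before each draw makes the “random” output repeatable (a minimal sketch):

```python
import numpy as np

# Same seed -> identical "random" draws on every run
np.random.seed(5)
first = np.random.rand(3)

np.random.seed(5)
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```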

**random.choice() | random.poisson() | random.rand()**

With NumPy we can generate random numbers from distributions such as the Poisson, normal and exponential, from the uniform distribution with `random.rand()`, and from a sample with `random.choice()`. Let’s generate a Pandas data frame using the `random` module.

**Example**: We will create a pandas data frame of 20 rows with the columns **gender**, **age**, **score_a**, **score_b** and **score_c**.

```python
import pandas as pd
import numpy as np

# set a random seed
np.random.seed(5)

# gender: 60% male, 40% female
# age: from a Poisson distribution with lambda=25
# scores: random integers from 0 to 100
df = pd.DataFrame({'gender': np.random.choice(a=['m', 'f'], size=20, p=[0.6, 0.4]),
                   'age': np.random.poisson(lam=25, size=20),
                   'score_a': np.random.randint(100, size=20),
                   'score_b': np.random.randint(100, size=20),
                   'score_c': np.random.randint(100, size=20)})
df
```

**random.shuffle()**

With `random.shuffle()` we can randomly shuffle NumPy arrays.

```python
# set a random seed
np.random.seed(5)
arr = df.values
np.random.shuffle(arr)
arr
```

## logical_and() | logical_or()

I have found `logical_and()` and `logical_or()` to be very convenient when dealing with multiple conditions. Let’s provide some simple examples.

```python
x = np.arange(5)
np.logical_and(x > 1, x < 4)
```

And we get:

```
array([False, False,  True,  True, False])
```

```python
np.logical_or(x < 1, x > 3)
```

And we get:

```
array([ True, False, False, False,  True])
```

## where()

The `where()` function is very helpful when we want to apply an if-else rule by assigning new values. Let’s say that we want to assign the value “Pass” when the score is at least 50 and “Fail” otherwise. Let’s do it for the `score_a` column.

```python
df['score_a_pass'] = np.where(df.score_a >= 50, "Pass", "Fail")
df.head()
```

## select()

If we want to add more conditions, even across multiple columns, then we should work with the `select()` function. Let’s say that I want to define a new column called `demo` as follows:

- if the gender is ‘m’ and the age is above 20, then ‘Mister’
- if the gender is ‘f’ and the age is above 20, then ‘Lady’
- if the gender is ‘m’ and the age is 20 or below, then ‘Boy’
- if the gender is ‘f’ and the age is 20 or below, then ‘Girl’
- otherwise, null

Let’s see how easily we can do it using `select()`:

```python
choices = ['Mister', 'Lady', 'Boy', 'Girl']
conditions = [
    (df['gender'] == 'm') & (df['age'] > 20),
    (df['gender'] == 'f') & (df['age'] > 20),
    (df['gender'] == 'm') & (df['age'] <= 20),
    (df['gender'] == 'f') & (df['age'] <= 20)
]
df['demo'] = np.select(conditions, choices, default=np.nan)
df.head(10)
```

Note that we could have used `logical_and()` in the conditions.
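As a minimal sketch on a small, made-up frame (not the article’s data), the same `demo` logic could be written with `logical_and()` instead of the `&` operator:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical, not the article's df)
df = pd.DataFrame({'gender': ['m', 'f', 'm', 'f'],
                   'age':    [25,  30,  15,  12]})

conditions = [
    np.logical_and(df['gender'] == 'm', df['age'] > 20),
    np.logical_and(df['gender'] == 'f', df['age'] > 20),
    np.logical_and(df['gender'] == 'm', df['age'] <= 20),
    np.logical_and(df['gender'] == 'f', df['age'] <= 20),
]
choices = ['Mister', 'Lady', 'Boy', 'Girl']
df['demo'] = np.select(conditions, choices, default='null')
print(df['demo'].tolist())  # ['Mister', 'Lady', 'Boy', 'Girl']
```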

## digitize()

Many times, we want to bucketize our data into bins. We have explained how to create bins with Pandas; let’s see how we can do it with NumPy. Let’s say that I want to create 5 bins from the `score_a` column.

```python
bins = np.array([0, 20, 40, 60, 80, 100])
df['Bins'] = np.digitize(df.score_a, bins)
df.head(10)
```
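Note that `digitize()` returns bin *indices*, not labels; with these ascending bin edges and scores in the 0–99 range the indices run from 1 to 5. A small sketch of mapping them to (hypothetical) labels:

```python
import numpy as np

bins = np.array([0, 20, 40, 60, 80, 100])
scores = np.array([5, 35, 60, 99])

# With ascending bins, digitize returns i such that bins[i-1] <= x < bins[i]
idx = np.digitize(scores, bins)
print(idx)  # [1 2 4 5]

# Hypothetical labels for the five buckets
labels = ['0-19', '20-39', '40-59', '60-79', '80-100']
print([labels[i - 1] for i in idx])  # ['0-19', '20-39', '60-79', '80-100']
```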

## split()

You can also split NumPy arrays into parts. Let’s say that you want to create train (60%), validation (20%) and test (20%) datasets.

```python
data_a, data_b, data_c = np.split(df.values,
                                  [int(0.6 * len(df.values)), int(0.8 * len(df.values))])
data_a  # first 60% of the rows
data_b  # next 20%
data_c  # final 20%
```
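Keep in mind that `split()` cuts consecutive slices, so if the rows are ordered it is common to shuffle first. A minimal sketch on a made-up array, with the same 60/20/20 boundaries:

```python
import numpy as np

data = np.arange(10).reshape(10, 1)

# np.split takes consecutive slices, so shuffle first for a random split
np.random.seed(5)
np.random.shuffle(data)
n = len(data)
train, val, test = np.split(data, [int(0.6 * n), int(0.8 * n)])
print(len(train), len(val), len(test))  # 6 2 2
```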

## clip()

Sometimes we want to restrict values to a range: values outside the interval are replaced by the minimum and the maximum, respectively. Let’s assume that we want the data to take values from 0 to 100, and our dataset contains values below 0 and values above 100. Let’s see the example below:

```python
x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])
np.clip(x, 0, 100)
```

As we can see below, the negative values became 0 and the values above 100 became 100:

```
array([ 30,  20,  50,  70,  50, 100,  10, 100,   0,   0, 100])
```

## extract()

Let’s say that we want to extract the values that satisfy some conditions. Assume that in the previous example we wanted to get the values which are less than 0 or greater than 100:

```python
np.extract((x > 100) | (x < 0), x)
```

And we get:

```
array([130, -20, -10, 200])
```

## unique()

The `unique()` function returns the unique values, but we can also use it to get a “**value counts**” of each element. For example:

```python
# How to count the unique values of an array
x = np.array([0, 0, 0, 1, 1, 1, 0, 0, 2, 2])
unique, counts = np.unique(x, return_counts=True)
dict(zip(unique, counts))
```

And we get:

```
{0: 5, 1: 3, 2: 2}
```

## argmax() | argmin() | argsort() | argpartition()

These functions are very useful. `argmax()` and `argmin()` return the index of the maximum and the minimum element respectively. Let’s say that we want to know the row index of the data frame with the maximum `score_a`:

```python
np.argmax(np.array(df.score_a))
```

and we get `17`. Now we can get the whole 17th row of the data frame:

```python
df.iloc[np.argmax(np.array(df.score_a))]
```

`argsort()` sorts the NumPy array and returns the indexes. Let’s say that I want to sort the data frame by the `score_a` column:

```python
df.iloc[np.argsort(np.array(df.score_a))]
```
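Since `argsort()` sorts in ascending order, reversing the returned index array gives a descending sort. A small sketch with made-up scores:

```python
import numpy as np

scores = np.array([40, 95, 10, 70])

# argsort() is ascending; reverse the index array for descending order
order = np.argsort(scores)[::-1]
print(order)          # [1 3 0 2]
print(scores[order])  # [95 70 40 10]
```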

If we want the indexes of the N largest values, then we can use `argpartition()`. Let’s say that we want to get the top 5 elements of the following array:

```python
x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])
indexes = np.argpartition(x, -5)[-5:]
indexes
```

```
array([ 2,  3,  5,  7, 10], dtype=int64)
```

```python
x[indexes]
```

```
array([ 50,  70, 100, 130, 200])
```
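One caveat: `argpartition()` only guarantees that the selected positions hold the N largest values; it does not sort them among themselves. If we want them largest-first, we can sort the partitioned slice, e.g.:

```python
import numpy as np

x = np.array([30, 20, 50, 70, 50, 100, 10, 130, -20, -10, 200])

# The last 5 positions hold the 5 largest values, in no particular order
top5 = np.argpartition(x, -5)[-5:]

# Sort those 5 indexes by value to get them largest-first
top5_sorted = top5[np.argsort(x[top5])[::-1]]
print(x[top5_sorted])  # [200 130 100  70  50]
```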

Let’s provide a final example with the following scenario. Say that we want to create three columns, Top1, Top2 and Top3, for each row based on the scores in the columns score_a, score_b and score_c: in which exam was the highest score, then the second highest, and finally the third highest. We can work with `argsort()`:

```python
Tops = pd.DataFrame(df[['score_a', 'score_b', 'score_c']]
                    .apply(lambda x: list(df[['score_a', 'score_b', 'score_c']]
                                          .columns[np.array(x).argsort()[::-1][:3]]), axis=1)
                    .to_list(),
                    columns=['Top1', 'Top2', 'Top3'])
Tops
```

## Sum-Up

NumPy is a very popular and powerful library. It is very fast and compatible with all the major AI and ML libraries, such as Scikit-Learn and TensorFlow. Thus, it is very important for every data scientist to be competent with NumPy. If you liked this article, then you may also like our tips about NumPy arrays.
