In earlier posts, we showed how to Resample Data By Groups in Python and how to do Undersampling by Groups in R. In this post, we show an efficient way to create balanced datasets that takes more than one variable into account. Let’s start by creating our “unbalanced” dataset with the following characteristics:

• 1000 observations
• A Category column with 3 levels, “A”, “B” and “C”, drawn with probabilities 30%, 50% and 20% respectively.
• A Sentiment column with 2 levels, “0” and “1”, drawn with probabilities 35% and 65% respectively.
• A Gender column with 2 levels, “M” and “F”, drawn with probabilities 70% and 30% respectively.

```
import numpy as np
import pandas as pd

# Draw each column independently with the stated class probabilities
df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})

df
```
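
Because the columns are drawn at random, the realized proportions only approximate the target probabilities. A minimal sketch for checking them is shown below. It assumes a seeded generator (`np.random.default_rng` with the arbitrary seed 42), which is our choice for reproducibility and not part of the setup above.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator so that the example is reproducible
rng = np.random.default_rng(42)

df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
    'Gender': rng.choice(['M', 'F'], size=1000, p=[0.70, 0.30]),
})

# Observed share of each level, computed separately for every column
proportions = {col: df[col].value_counts(normalize=True).sort_index()
               for col in df.columns}

for share in proportions.values():
    print(share, end='\n\n')
```

The printed shares should sit close to the target probabilities, with small deviations due to sampling noise.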

## Create a Balanced Dataset based on Sentiment

Let’s say that we want a new dataset with as many positive Sentiment observations as negative ones. We can achieve that by undersampling every Sentiment class down to the size of the smallest one.

```
df_grouped_by = df.groupby(['Sentiment'])

# Size of the smallest Sentiment class; every class is sampled down to it
n_min = df_grouped_by.size().min()

# GroupBy.sample (pandas 1.1+) draws n_min rows without replacement from every group
df_balanced = df_grouped_by.sample(n=n_min).reset_index(drop=True)
df_balanced
```

Let’s verify that the dataset is balanced.

```
df_balanced.groupby(['Sentiment']).size()
```
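
Since the rows are sampled at random, each run returns a different balanced subset. If a reproducible result is needed, one option is to pass a fixed `random_state` when sampling. The sketch below assumes pandas 1.1 or later (for `GroupBy.sample`) and rebuilds a small seeded version of the dataset so that it runs on its own. The seed values are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

grouped = df.groupby('Sentiment')
n_min = grouped.size().min()

# The same random_state selects the same rows on every call
first = grouped.sample(n=n_min, random_state=123)
second = grouped.sample(n=n_min, random_state=123)

print(first.equals(second))
print(first.groupby('Sentiment').size())
```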

## Create a Balanced Dataset based on Category and Sentiment

Now let’s say that we want to balance on Category and Sentiment jointly, so that every (Category, Sentiment) combination ends up with the same number of observations.

```
df_grouped_by = df.groupby(['Category', 'Sentiment'])

# Size of the smallest (Category, Sentiment) combination
n_min = df_grouped_by.size().min()

# GroupBy.sample (pandas 1.1+) draws n_min rows without replacement from every group
df_balanced = df_grouped_by.sample(n=n_min).reset_index(drop=True)
df_balanced
```

Let’s verify that the dataset is balanced.

```
df_balanced.groupby(['Category', 'Sentiment']).size()
```
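
The same pattern works for any combination of columns, so it can be wrapped in a small helper. The function name `undersample_by` and its parameters are hypothetical, introduced here only for illustration, and the sketch assumes pandas 1.1 or later for `GroupBy.sample`.

```python
import numpy as np
import pandas as pd


def undersample_by(data, columns, random_state=None):
    """Sample every group defined by `columns` down to the smallest group size.

    Hypothetical helper: only combinations that occur in `data` are considered.
    """
    grouped = data.groupby(columns)
    n_min = grouped.size().min()
    return grouped.sample(n=n_min, random_state=random_state).reset_index(drop=True)


# Small seeded dataset so that the example runs on its own (seed is arbitrary)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
    'Gender': rng.choice(['M', 'F'], size=1000, p=[0.70, 0.30]),
})

balanced = undersample_by(df, ['Category', 'Sentiment', 'Gender'], random_state=0)
print(balanced.groupby(['Category', 'Sentiment', 'Gender']).size())
```

Keep in mind that adding columns can only keep the smallest group the same size or shrink it, so balancing on more variables discards more rows.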

## Create a Balanced Dataset based on Sentiment within each Category

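Let’s say that we want the Sentiment classes to be balanced within each Category, while the Categories themselves may keep different sizes. We can do this by undersampling each Category separately and concatenating the results: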

```
balanced_parts = []

for category in df['Category'].unique():
    # Balance the Sentiment classes inside this Category only
    df_grouped_by = df.loc[df['Category'] == category].groupby(['Sentiment'])
    # GroupBy.sample (pandas 1.1+) draws the same number of rows from each class
    balanced_parts.append(df_grouped_by.sample(n=df_grouped_by.size().min()))

df_balanced = pd.concat(balanced_parts).reset_index(drop=True)

df_balanced
```

Let’s confirm that the Sentiment classes are balanced within each Category:

```
df_balanced.groupby(['Category', 'Sentiment']).size()
```
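
The loop can also be avoided. One possible alternative, which is not the approach used above, is to shuffle the rows once, number the rows inside each (Category, Sentiment) cell with `cumcount`, and keep only the rows whose number is below the smallest Sentiment count of their Category. The variable names and seeds below are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed, for illustration only
df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

# Shuffle once so that "the first k rows of a cell" is a random subset of it
shuffled = df.sample(frac=1, random_state=0)

# Position of every row inside its (Category, Sentiment) cell: 0, 1, 2, ...
position = shuffled.groupby(['Category', 'Sentiment']).cumcount()

# Smallest Sentiment count inside each Category
cell_sizes = shuffled.groupby(['Category', 'Sentiment']).size()
min_per_category = cell_sizes.groupby(level='Category').min()

# Keep a row only if its position is below the limit of its Category
limit = shuffled['Category'].map(min_per_category)
df_balanced = shuffled[position < limit].reset_index(drop=True)

print(df_balanced.groupby(['Category', 'Sentiment']).size().unstack())
```

Each row of the printed table is one Category, and its two Sentiment columns should hold the same count.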

## The Takeaway

Many Data Science pipelines need undersampling to reduce the bias introduced by unbalanced classes and features. In this tutorial, we showed an efficient way to create balanced datasets with a few lines of code.