How to Make Synthetic Datasets with Python: A Complete Guide for Machine Learning
A good dataset is difficult to find. Besides, sometimes you just want to make a point, and tedious loading and preparation are too much for those cases.
Today you’ll learn how to make synthetic datasets with Python and Scikit-Learn – a fantastic machine learning library. You’ll also learn how to play around with noise, class balance, and class separation.
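The article is structured as follows:
- Make your first synthetic dataset
- Add noise
- Tweak class balance
- Tweak class separation
- Conclusion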
Make your first synthetic dataset
Real-world datasets are often too much for demonstrating concepts and ideas. Imagine you want to visually explain SMOTE (a technique for handling class imbalance). You first have to find a class-imbalanced dataset and project it to 2-3 dimensions for visualizations to work.
There’s a better way.
The Scikit-Learn library comes with a handy make_classification() function. It’s not the only one for creating synthetic datasets, but you’ll use it heavily today. It accepts various parameters that let you control the look and feel of the dataset, but more on that in a bit.
To start, you’ll need to import the required libraries. Refer to the following snippet:
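A minimal sketch of the imports, assuming NumPy, Pandas, and Matplotlib are used alongside Scikit-Learn for the data handling and plots later on:

```python
# Assumed setup: Pandas for tabular handling, Matplotlib for the scatter plots,
# and Scikit-Learn's dataset generator.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
```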
You’re ready to create your first dataset. It’ll have 1000 samples assigned to two classes (0 and 1) with a perfect balance (50:50). All samples belonging to each class are centered around a single cluster. The dataset has only two features – to make the visualization easier:
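One way to build such a dataset is sketched below. The random_state value and the column names x1, x2, and y are illustrative choices, not requirements. Note that n_redundant must be set to 0 here, because its default of 2 would not fit into a two-feature dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification

# 1000 samples, 2 informative features, one cluster per class
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,           # the default (2) would exceed n_features
    n_clusters_per_class=1,
    random_state=42,         # illustrative seed for reproducibility
)

# Combine features and labels into one DataFrame (column names are assumptions)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = y
```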
A call to sample() prints out five random data points:
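As a sketch, assuming the data lives in a Pandas DataFrame named df with columns x1, x2, and y (names chosen for illustration), the call could look like this:

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = y

# Draw five random rows; random_state only makes the draw repeatable
sample_rows = df.sample(5, random_state=42)
print(sample_rows)
```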
This doesn’t give you the full picture of the dataset. It’s two-dimensional, so you can declare a function for data visualization. Here’s one you can use:
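A possible version of such a function is sketched below. The name plot_dataset, its parameters, and the figure size are hypothetical. It assumes a DataFrame with two feature columns and one label column, and draws one scatter series per class.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_dataset(df, x1="x1", x2="x2", y="y", title="", ax=None):
    """Scatter-plot two features, using one color per class label."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))
    # groupby iterates over labels in sorted order, so class 0 is drawn first
    for label, group in df.groupby(y):
        ax.scatter(group[x1], group[x2], label=f"Class {label}", alpha=0.7)
    ax.set_xlabel(x1)
    ax.set_ylabel(x2)
    ax.set_title(title)
    ax.legend()
    return ax
```

Call it as plot_dataset(df, title="Synthetic dataset") and follow with plt.show() to display the figure. With Matplotlib's default color cycle, class 0 is drawn in blue and class 1 in orange.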
Here’s how it looks:
That was fast! You now have a simple synthetic dataset you can play around with. Next, you’ll learn how to add a bit of noise.
Add noise
You can use the flip_y parameter of the make_classification() function to add noise.
This parameter represents the fraction of samples whose class is assigned randomly. Larger values introduce more noise in the labels and make the classification task harder. Note that flip_y is greater than 0 by default (0.01), which might lead to fewer than n_classes distinct labels in y in some cases[1].
Here’s how to use it with our dataset:
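A sketch of the call follows. The value 0.15 is an illustrative assumption, meaning roughly 15% of the labels are reassigned at random. The other arguments are assumed to match the first dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.15,             # assumed value: ~15% of labels assigned randomly
    random_state=42,
)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = y
```

A randomly reassigned label can land on its original class, so the share of samples that actually end up mislabeled is lower than flip_y.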
Here’s the corresponding visualization:
You can see many more orange points in the blue cluster and vice versa, at least when compared with Image 2.
That’s how you can add noise. Let’s shift the focus to class balance next.
Tweak class balance
It’s common to see at least a bit of class imbalance in real-world datasets. Some datasets suffer from severe class imbalance. For example, one in 1000 bank transactions could be fraudulent. This means the ratio of fraudulent to legitimate transactions is 1:999.
You can use the weights parameter to control class balance. It accepts a list with N - 1 values, where N is the number of classes, and the proportion of the last class is inferred automatically. We only have 2 classes, so there’ll be a single value in the list.
Let’s see what happens if we specify 0.95 as a value:
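As a sketch, with the remaining arguments assumed to match the first dataset, passing weights=[0.95] assigns about 95% of the samples to class 0:

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],          # proportion of class 0; class 1 gets the rest
    random_state=42,
)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = y

# Inspect the resulting class proportions
print(df["y"].value_counts(normalize=True))
```

The proportions are approximate, because the default flip_y of 0.01 still reassigns a small share of labels at random.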
Here’s what the dataset looks like:
As you can see, only about 5% of the dataset belongs to class 1. You can turn this around easily. Let’s say you want 5% of the dataset in class 0:
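A sketch of the reversed setup, again assuming the other arguments match the first dataset. Only the weights value changes:

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05],          # now class 0 is the minority (~5%)
    random_state=42,
)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = y

print(df["y"].value_counts(normalize=True))
```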
Here’s the corresponding visualization:
And that’s all there is to class balance. Let’s finish by tweaking class separation.
Tweak class separation
By default, there are some overlapping data points (class 0 and class 1). You can use the class_sep parameter to control how separated the classes are. The default value is 1.
Let’s see what happens if you set the value to 5:
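A sketch of that call, with the other arguments assumed to match the first dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1,
    class_sep=5,             # default is 1.0; larger values push clusters apart
    random_state=42,
)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = y

# Per-class feature means give a rough idea of where each cluster sits
print(df.groupby("y")[["x1", "x2"]].mean())
```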
Here’s what the dataset looks like:
As you can see, the classes are much more separated now. Higher parameter values result in better class separation, and vice versa.
You now know everything you need to make basic synthetic datasets for classification. Let’s wrap things up next.
Conclusion
Today you’ve learned how to make basic synthetic classification datasets with Python and Scikit-Learn. You can use them whenever you want to prove a point or implement some data science concept. Real datasets can be overkill for that purpose, as they often require rigorous preparation.
Feel free to explore official documentation to learn about other useful parameters.
Thanks for reading.
Learn more
- Top 3 Classification Machine Learning Metrics – Ditch Accuracy Once and For All
- ROC and AUC – How to Evaluate Machine Learning Models in No Time
- Precision-Recall Curves: How to Easily Evaluate Machine Learning Models in No Time
References
The post How to Make Synthetic Datasets with Python: A Complete Guide for Machine Learning appeared first on Better Data Science.