Part 1: Dummy Datasets with Pandas for Testing Purposes
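Sometimes we want to create dummy data frames, mainly for testing purposes.
Pandas gives us this possibility through its util.testing module. Note that pd.util.testing was deprecated in pandas 1.0 and is no longer available in pandas 2.x, so the calls shown in this part require an older pandas version.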
Dummy Data Frame
By default, it creates 30 rows and 4 columns named A, B, C and D, with an alphanumeric index.
import pandas as pd

pd.util.testing.makeDataFrame().head()
Dummy Data Frame with Missing Values
It assigns some NaN values randomly.
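The same idea can be sketched with public pandas and NumPy calls only, which is handy when pd.util.testing is not available. This is a minimal sketch and not the helper's actual implementation: the 10% missing rate, the seed, and the names `df_missing` and `mask` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the frame is reproducible

# 30 rows x 4 columns of standard-normal values, like the default dummy frame
df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Assumption: hide roughly 10% of the cells, chosen at random
mask = rng.random(df.shape) < 0.1
df_missing = df.mask(mask)  # cells where mask is True become NaN

print(df_missing.isna().sum())  # number of NaN values per column
```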
Dummy Data Frame of Time-Series format
Here the index is a time series, that is, a DatetimeIndex.
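A frame of this shape can also be built by hand with `pd.date_range`. The sketch below is an approximation under stated assumptions: the start date, the business-day frequency, and the name `df_time` are illustrative choices, not values taken from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumption: 30 consecutive business days starting on 2000-01-03
index = pd.date_range("2000-01-03", periods=30, freq="B")

df_time = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=index,
    columns=list("ABCD"),
)

print(df_time.head())
```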
Dummy Data Frame of Mixed Types
It creates a dummy data frame of mixed types, containing categorical, date-time and continuous variables.
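A small mixed-type frame can be assembled directly from a dictionary of columns. In the sketch below the column names (`amount`, `group`, `date`), the category labels, and the dates are all hypothetical; they only illustrate how the three kinds of variables can sit side by side.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df_mixed = pd.DataFrame(
    {
        # continuous variable
        "amount": rng.standard_normal(5),
        # categorical variable with three levels
        "group": pd.Categorical(["a", "b", "a", "c", "b"]),
        # date-time variable, one value per day
        "date": pd.date_range("2009-01-01", periods=5, freq="D"),
    }
)

print(df_mixed.dtypes)  # one dtype per column
```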
Dummy Data Frame with Periodical data
It creates a dummy data frame whose index is made of time periods (a PeriodIndex) rather than timestamps.
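Period-indexed data can be sketched with `pd.period_range`. The monthly frequency, the start period, and the name `df_period` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumption: 30 consecutive monthly periods starting in January 2000
index = pd.period_range("2000-01", periods=30, freq="M")

df_period = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=index,
    columns=list("ABCD"),
)

print(df_period.head())
```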
More rows and columns?
If we want a different number of rows and columns than the defaults (30 and 4 respectively), we can set testing.N to the number of rows and testing.K to the number of columns.
pd.util.testing.N = 10
pd.util.testing.K = 5
pd.util.testing.makeDataFrame()
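When the testing helpers are not available, a configurable equivalent is short to write. The function below is a hypothetical helper, not part of pandas: the name `make_dummy_frame`, its parameters, and the 10-character random index labels are assumptions that only mimic the behaviour described above.

```python
import string

import numpy as np
import pandas as pd


def make_dummy_frame(n_rows=30, n_cols=4, seed=None):
    """Return a random float frame with letter columns and a random string index."""
    if not 1 <= n_cols <= 26:
        raise ValueError("n_cols must be between 1 and 26 (one letter per column)")
    rng = np.random.default_rng(seed)
    columns = list(string.ascii_uppercase[:n_cols])
    # Random alphanumeric labels, 10 characters each
    alphabet = list(string.ascii_letters + string.digits)
    index = ["".join(rng.choice(alphabet, size=10)) for _ in range(n_rows)]
    values = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(values, index=index, columns=columns)


df_custom = make_dummy_frame(n_rows=10, n_cols=5, seed=0)
print(df_custom.shape)
```

Passing the sizes as arguments avoids mutating module-level settings, so two tests that need different shapes do not interfere with each other.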
Part 2: Dummy Datasets with Scikit-Learn for Modelling Purposes
Usually, we want to generate sample datasets for demonstration purposes, mainly to illustrate and test Machine Learning algorithms.
Scikit-learn gives us the power to do that with one line of code!
How to Create Dummy Datasets for Clustering Algorithms
We will work with the make_blobs function, which generates isotropic Gaussian blobs for clustering. For example, let’s say that we want to create a sample of 100 observations with 4 features and 2 clusters:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
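For inspection it can help to give the columns explicit names, since concatenating two unnamed frames yields two columns both labelled 0. The sketch below uses the same make_blobs call; the column names `feature_0` ... `feature_3` and `cluster` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)

# Name the feature columns and keep the cluster label in its own column
df_blobs = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df_blobs["cluster"] = y

print(df_blobs["cluster"].value_counts())    # observations per cluster
print(df_blobs.groupby("cluster").mean())    # per-cluster feature means
```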
How to Create Dummy Datasets for Classification Algorithms
When we want to generate a dataset for classification purposes, we can work with the make_classification function from scikit-learn.
The interesting thing is that it lets us define how many of the variables will be informative and how many will be redundant. So let’s say that we want to build a random classification problem of 100 samples with 2 classes and 10 features in total, where 5 of them are informative and the remaining 5 are redundant:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, n_classes=2, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
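One way to see what "redundant" means here: scikit-learn generates redundant features as linear combinations of the informative ones, so the 10 columns should span only 5 independent directions. The rank check below is a sketch of that idea, assuming the default shift and scale settings; the variable name `rank` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=2,
    random_state=1,
)

# Redundant columns are linear combinations of the informative columns,
# so the numerical rank should equal the number of informative features.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)
print("classes:", np.unique(y))
```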
How to Create Dummy Datasets for Regression Algorithms
Similarly, for regression purposes, we can work with the make_regression function. Let’s repeat the above example, but now the target will be a continuous variable.
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
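To see which features actually drive the target, make_regression can also return the coefficients of the underlying linear model via `coef=True`. The sketch below relies on the default settings (no noise, zero bias), under which the target is an exact linear combination of the informative features; the name `informative_idx` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=5,
    coef=True,          # also return the true coefficients
    random_state=1,
)

# Non-informative features get a coefficient of exactly zero
informative_idx = np.flatnonzero(coef)
print("informative feature indices:", informative_idx)

# With the default noise=0.0 and bias=0.0 the target is X @ coef
print("target reproduced:", np.allclose(X @ coef, y))
```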
When you want to start experimenting with algorithms, you do not always need to search the internet for suitable datasets, since you can generate your own structured yet random datasets.