How to Create Dummy Datasets in Python

[This article was first published on Python – Predictive Hacks, and kindly contributed to python-bloggers].

Part 1: Dummy Datasets with Pandas for Testing Purposes

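Mainly for testing purposes, we sometimes want to create dummy data frames. Pandas makes this possible with the util.testing module. Note that pandas.util.testing was deprecated in pandas 1.0 and removed in pandas 2.0, so the calls in this part require an older pandas release.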


Dummy Data Frame

By default, makeDataFrame creates 30 rows and 4 columns named A, B, C and D, with a random alphanumeric index.

import pandas as pd
pd.util.testing.makeDataFrame().head()
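
On pandas versions where util.testing is not available, a similar frame can be built by hand with NumPy. The following is only a minimal sketch: make_dummy_frame is a hypothetical helper name, and the row labels and standard-normal values are assumptions that imitate the default output rather than reproduce it.

```python
import numpy as np
import pandas as pd


def make_dummy_frame(n_rows=30, n_cols=4, seed=0):
    """Hypothetical helper: random normal values with columns A, B, C, ...

    Supports up to 26 columns because the names are single letters.
    """
    rng = np.random.default_rng(seed)
    columns = [chr(ord("A") + i) for i in range(n_cols)]
    # Simple string labels stand in for the random alphanumeric index
    index = [f"row{i:02d}" for i in range(n_rows)]
    values = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(values, index=index, columns=columns)


df = make_dummy_frame()
print(df.head())
```

Passing n_rows and n_cols explicitly, for example make_dummy_frame(10, 5), changes the shape without touching any module-level settings.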
 

Dummy Data Frame with Missing Values

makeMissingDataframe builds the same kind of frame but replaces some of the values with NaN at random.

pd.util.testing.makeMissingDataframe().head()
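
The same idea can be sketched without util.testing by masking random cells of an ordinary frame. The 10% missing rate and the variable names below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_missing = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Boolean array that is True for roughly 10% of the cells
mask = rng.random(df_missing.shape) < 0.1

# DataFrame.mask replaces the cells where the condition is True with NaN
df_missing = df_missing.mask(mask)
print(df_missing.isna().sum())
```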
  

Dummy Data Frame of Time-Series format

Here the index is a DatetimeIndex, so the frame has a time-series format.

pd.util.testing.makeTimeDataFrame().head()
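
A time-indexed dummy frame can also be put together with pd.date_range. This is a sketch under assumptions: the start date, the business-day frequency and the random values are picked for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 consecutive business days starting on Monday, 3 January 2000
dates = pd.date_range("2000-01-03", periods=30, freq="B")

df_time = pd.DataFrame(
    rng.standard_normal((30, 4)), index=dates, columns=list("ABCD")
)
print(df_time.head())
```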
 

Dummy Data Frame of Mixed Types

makeMixedDataFrame creates a dummy data frame that mixes categorical, date-time and continuous variables.

pd.util.testing.makeMixedDataFrame().head()
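
A mixed-type frame is easy to assemble from a dictionary of columns. The column names and values in this sketch are made up for illustration and do not match the output of makeMixedDataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df_mixed = pd.DataFrame(
    {
        "value": rng.standard_normal(5),                      # continuous
        "group": pd.Categorical(["a", "b", "a", "c", "b"]),   # categorical
        "label": [f"foo{i}" for i in range(1, 6)],            # plain strings
        "date": pd.date_range("2009-01-01", periods=5),       # date-time
    }
)
print(df_mixed.dtypes)
```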
 

Dummy Data Frame with Periodical data

makePeriodFrame creates a dummy data frame whose index is a PeriodIndex.

pd.util.testing.makePeriodFrame()
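
A frame with a period index can be sketched with pd.period_range. The monthly frequency and the start period are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 monthly periods starting in January 2000
periods = pd.period_range("2000-01", periods=30, freq="M")

df_period = pd.DataFrame(
    rng.standard_normal((30, 4)), index=periods, columns=list("ABCD")
)
print(df_period.head())
```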
 

More rows and columns?

To change the number of rows and columns from the defaults of 30 and 4 respectively, we can set testing.N to the number of rows and testing.K to the number of columns. The example below produces 10 rows and 5 columns.

pd.util.testing.N = 10
pd.util.testing.K = 5
pd.util.testing.makeDataFrame()
 


Part 2: Dummy Datasets with Scikit-Learn for Modelling Purposes

We often want to generate sample datasets for demonstration purposes, mainly to illustrate and test machine learning algorithms. Scikit-learn lets us do that with one line of code!

How to Create Dummy Datasets for Clustering Algorithms

We will work with the make_blobs function, which generates isotropic Gaussian blobs for clustering. For example, let’s say that we want to create a sample of 100 observations with 4 features and 2 clusters:

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
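
For easier inspection, the features and cluster labels can be combined into one frame with named columns. This is a small sketch, and the column names feature_0 … feature_3 and cluster are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)

# Name the feature columns and attach the cluster label as an extra column
df_blobs = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df_blobs["cluster"] = y

# Number of observations assigned to each cluster
print(df_blobs["cluster"].value_counts())
```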

How to Create Dummy Datasets for Classification Algorithms

When we want to generate a dataset for classification purposes, we can work with make_classification from scikit-learn. The interesting thing is that it lets us define how many of the variables will be informative and how many will be redundant. So let’s say that we want to build a random classification problem of 100 samples with 2 classes and 10 features in total, where 5 are informative and the remaining 5 are redundant:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, n_classes=2, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
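
A generated dataset like this can be fed straight into a model for a quick sanity check. The sketch below assumes a logistic regression and a 75/25 train/test split purely for illustration; any classifier could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=100, n_features=10, n_informative=5,
    n_redundant=5, n_classes=2, random_state=1,
)

# Hold out 25% of the samples, keeping the class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```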
  


How to Create Dummy Datasets for Regression Algorithms

Similarly, for regression purposes, we can work with make_regression. Let’s repeat the above example, but now the target will be a continuous variable.

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
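
make_regression can also return the coefficients of the underlying linear model through its coef=True argument, which is handy for checking what an algorithm should recover. A short sketch, relying on the default settings of no noise and zero bias:

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100, n_features=10, n_informative=5,
    coef=True, random_state=1,
)

# Only the informative features receive a non-zero coefficient
n_nonzero = np.count_nonzero(coef)
print("Non-zero coefficients:", n_nonzero)

# Without noise or bias, the target is an exact linear combination of X
is_exact = np.allclose(X @ coef, y)
print("Target equals X @ coef:", is_exact)
```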
 

Conclusion

When you would like to start experimenting with algorithms, it is not always necessary to search the internet for suitable datasets, since you can generate your own structured random datasets.
