Generating Fake Data for Privacy with Python

This article was first published on python – educational research techniques and kindly contributed to python-bloggers.

The privacy of individuals in a dataset can be protected through the development of fake data. Using false numbers makes it much more difficult to identify individual people within a dataset. In this post, we will look at how to generate fake numbers and names using Python.

Libraries & Data Preparation

The only library needed at first is “pydataset”, which allows us to load the data. We will use the data() function to load the “SLID” dataset into an object called “df”. Next, we will look at the first few rows using the .head() method. Below is the code.

# Load the SLID dataset that ships with pydataset
from pydataset import data
df = data('SLID')
df.head()

We have five columns of data that address wages, education level, age, sex, and language. However, for this example, we need to take several additional steps.

We are going to create four new columns that will be manipulated in the example below. These columns will be “name”, “credit_card”, “credit_code”, and “credit_company”. Each of these columns will start with a single default value that we will later overwrite. Below is the code.

# Add placeholder columns that stand in for sensitive data
df['name'] = "Dan"
df['credit_card'] = 1234567890
df['credit_code'] = 123
df['credit_company'] = 'comp'
df.head()

All of this new data will serve as data that needs protection. The original data isn’t needed; it just serves as a dataset that we are grafting the private data onto. Making a dataframe from scratch is a little complicated in Python and beyond the scope of this post, so we took a shortcut by adding to preexisting data. We will now see how to generate fake numbers and names.

Fake Numbers

The “faker” library has a class called “Faker” that can generate many kinds of fake data. We will demonstrate this by generating phony credit card numbers. Below is the code and output.

# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()

'6561857744400343'

To generate the false credit card number, we imported the Faker class from the faker library. Then we created an instance of the Faker class called “fake_data”. Lastly, we called the .credit_card_number() method on the “fake_data” object. Because no seed was set here, the number will differ each time the code is run.

We will now generate fake values for “credit_card”, “credit_code”, and “credit_company”.

# Mask the credit card columns with newly generated data using lambda functions
Faker.seed(0)
df['credit_code'] = df['credit_code'].apply(lambda x: fake_data.credit_card_security_code())
df['credit_company'] = df['credit_company'].apply(lambda x: fake_data.credit_card_provider())
df['credit_card'] = df['credit_card'].apply(lambda x: fake_data.credit_card_number())

# See the resulting pseudonymized data
df.head()

If you compare this output to the original you can see that the values have changed. We set the seed using Faker.seed(0) so we always get the same results. The next three lines of code each use an anonymous (lambda) function together with the .apply() method, which calls the function once for every row in the column. First, we select the column we want to overwrite. Second, we call the .apply() method on the same column. Inside .apply() we write lambda followed by the argument x. After the colon we indicate what should replace each value, using the appropriate method from the faker library; the original value x is ignored. Lastly, we display the results using the .head() method. Next, we will address the names of people.

Fake Names

There are at least three different methods for generating fake names: one that generates either male or female names, one that generates only male names, and one that generates only female names. Below is a brief example of each.

Faker.seed(0)
print(fake_data.name())
print(fake_data.name_male())
print(fake_data.name_female())

Norma Fisher
Jorge Sullivan
Elizabeth Woods

The code above is self-explanatory. We used the print function so that the result of each line is displayed. We will use the .name() method in the code below to generate fake names for our “name” column.

Faker.seed(0)
df['name'] = df['name'].apply(lambda x: fake_data.name())
df.head()

The steps for changing the names are the same as those we used for the credit card information, so we will not explain them again here.

Conclusion

The ability to generate fake data as shown in this post allows a great deal of flexibility in protecting people’s identities. However, masking must not destroy information that is needed for developing insights. For example, replacing credit card numbers with random ones could be catastrophic if the real numbers are needed for the analysis in a given context. Therefore, any such tool must be used with wisdom and caution.
