Generating Fake Data for Privacy with Python
Want to share your content on python-bloggers? click here.
The privacy of individuals in a dataset can be protected through the development of fake data. Using false numbers makes it much more difficult to identify individual people within a dataset. In this post, we will look at how to generate fake numbers and names using Python.
Libraries & Data Preparation
The only library needed at first is “pydataset”, which allows us to load the data. We will use the data() function to load the “SLID” dataset into an object called “df”. Next, we will look at the data using the .head() method. Below is the code and the output.
from pydataset import data
df = data('SLID')
df.head()

We have five columns of data that address wages, education level, age, sex, and language. However, for this example, we need to take several additional steps.
We are going to create four new columns that will be manipulated in the example below. These columns will be “name”, “credit_card”, “credit_code”, and “credit_company”. Each of these columns will have a default value that we will manipulate. Below is the code and output.
df['name'] = "Dan"
df['credit_card'] = 1234567890
df['credit_code'] = 123
df['credit_company'] = 'comp'
df.head()

All of this new data will serve as data that needs protection. The original data isn’t needed in itself; it just serves as a dataset that we are grafting the private data onto. Making a dataframe from scratch is a little complicated in Python and beyond the scope of this post, so we took a shortcut by adding to preexisting data. We will now see how to generate fake numbers and names.
Fake Numbers
The “faker” library has a class called “Faker” that can generate fake data for almost any circumstance. We will demonstrate this by generating phony credit card numbers. Below is the code and output.
# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()
'6561857744400343'
To generate the false credit card number, we imported the Faker class from the faker library. Then we created an instance of the Faker class called “fake_data”. Lastly, we used the .credit_card_number() method on the “fake_data” object.
We will now generate fake values for “credit_card”, “credit_code”, and “credit_company”.
# Mask card number with new generated data using a lambda function
Faker.seed(0)
df['credit_code'] = df['credit_code'].apply(lambda x: fake_data.credit_card_security_code())
df['credit_company'] = df['credit_company'].apply(lambda x: fake_data.credit_card_provider())
df['credit_card'] = df['credit_card'].apply(lambda x: fake_data.credit_card_number())
# See the resulting pseudonymized data
df.head()

If you compare this output to the original, you can see that the values have changed. We set the seed using Faker.seed(0) so that we always get the same results. The next three lines of code each use an anonymous (lambda) function, which the .apply() method calls once for every row in the column. First, we select the column we want to overwrite. Second, we use the .apply() method on that same column. Inside the .apply() method we write lambda followed by the argument x. After the x we indicate what we want done to the column using the appropriate method from the faker library; the x itself is ignored because every value is simply replaced. Lastly, we display the results using the .head() method. Next, we will address the names of people.
Fake Names
There are at least three different methods for generating fake names: one that generates male or female names, one that generates only male names, and one that generates only female names. Below is a brief example of each.
Faker.seed(0)
print(fake_data.name())
print(fake_data.name_male())
print(fake_data.name_female())
Norma Fisher
Jorge Sullivan
Elizabeth Woods
The code above is largely self-explanatory. We used the print() function so that the result of each of the three calls is displayed on its own line. We will use the .name() method in the code below to generate fake names for our “name” column.
Faker.seed(0)
df['name'] = df['name'].apply(lambda x: fake_data.name())
df.head()

The steps for changing the names are the same as what we did with the credit card information. As such, we will not reexplain it here.
Conclusion
The ability to generate fake data as shown in this post allows an incredible amount of flexibility in protecting people’s identity. However, nothing that is needed for developing insights should be lost in the process. For example, replacing credit card numbers with random values could be catastrophic if the original numbers carry information that matters in a given context. Therefore, any tool of this kind must be used with wisdom and caution.