Privacy of Continous Data with Python

This article was first published on python – educational research techniques , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

There are several ways that an individual’s privacy can be protected when dealing with continuous data. In this post, we will look at how protecting privacy can be accomplished using Python.

Libraries

We will begin by loading the necessary libraries. Below is the code.

from pydataset import data
import pandas as pd
ad

The library setup is simple. We are importing the data() function from pydataset. This will allow us to load the data we will use in this post. Below we will address the data preparation. We are also importing pandas to make a frequency table later on.

Data Preparation

The data preparation is also simple. We will load the dataset called “SLID” using the data() function into an object called df. We will then view the df object using the .head() method. Below is the code followed by the output.

df=data('SLID')
df.head()

The data set has five variables. The focus of this post will be on the manipulation of the “age” variable. We will now make a histogram of the data before we manipulate it.

View of Original Histogram

Below is the code output of the histogram of the “age” variable. The reason for making this visual is to provide a “before” picture of the data before changes are made.

df['age'].hist(bins=15)

We will now move to our first transformation which will involve changing the data to a categorical variable.

Change to Categorical

Changing continuous data to categorical is one way of protecting privacy as it removes individual values and replaces them with group values. Below is an example of how to do this with the code and the first few rows of the modified data.

df['age'] = df['age'].apply(lambda x:">=40"if x>=40 else"<40" )
df.head()

We are overwriting the “age” variable in the code using an anonymous function. On the “age” variable we use the .apply() method and replace values above 40 with “>=40” and values below 40 with “<40”. The data is now broken down into two groups, those above 40 and those below 40. Below is a frequency table of the transformed “age” variable.

df['age'].value_counts()
age
>=40    3984
<40     3441
Name: count, dtype: int64

The .value_counts() method comes from the pandas library. There are two groups now. The table above is a major transformation from the original histogram. Below is the code and output of a bar graph of this transformation

import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x="age", data=df)
plt.show()

This was a simple example. You do not have to limit yourself to only two groups to divide your data. How many groups depends on the context and the purpose of the use of this technique.

Top Coding

Top coding is a trick used to bring extremely high values down to a specific value. Again, the purpose of modifying these values in our context is to protect people’s privacy. Below is the code and output for this approach.

df=data('SLID')
df.loc[df['age'] > 75, 'age'] = 75
df['age'].hist(bins=15)

The code does the following.

  1. We load the “SLID” dataset again so that we can modify it again from its original state.
  2. We then use the .loc method to change all values in “age” above 75 to 75.
  3. Lastly, we create our histogram for comparison purposes to the original data

If you look to the far right you can see that spike in the number of data points at age 75 compared to our original histogram. This is a result of our manipulation of the data. Through doing this, we can keep all of our data for other forms of analysis while also protecting the privacy of the handful of people who are over the age of 75.

Bottom Coding

Bottom coding is the same as top coding except now you raise values below a threshold to a minimum value. Below is the code and output for this.

df=data('SLID')
df.loc[df['age'] < 20, 'age'] = 20
df['age'].hist(bins=15)

The code is the same as before with the only difference being the less than “<” symbol and the threshold being set to 20. As you compare this histogram to the original you can see a huge spike in the number of values at 20.

Conclusion

Data protection is an important aspect of the analysis role. The examples provided here are just some of the many ways in which the privacy of individuals can be respected with the help of Python

To leave a comment for the author, please follow the link and comment on their blog: python – educational research techniques .

Want to share your content on python-bloggers? click here.