Python for Data Privacy

Dr. Darrin

2 months ago

This article was first published on python – educational research techniques, and kindly contributed to python-bloggers.

Data privacy is a major topic among analysts who want to protect people’s information. There are often ethical expectations that personally identifiable information be protected. Whenever data is shared, you want to be sure that individual people cannot be identified within a dataset, since re-identification can lead to unforeseen consequences for them. This post will examine simple ways a data analyst can protect personal information.

Libraries & Data Preparation

This example needs only one library and minimal data preparation. The code is below.

# Load the SLID dataset from pydataset and preview the first rows
from pydataset import data
df = data('SLID')
df.head()

The only library we need is “pydataset” which contains the dataset we will use. In the second line, we create an object called “df” which contains our data. The data we are using is called “SLID” and contains data on individuals relating to their wages, education level, age, sex, and language.

We will now move to the first way to protect privacy when working with data.

Drop Columns

Sometimes protecting people’s identity can be as easy as dropping a column. Often, the column(s) that contain names, addresses, or phone numbers can be dropped. In our example below, we are going to pretend that the “language” column can be used to identify people, so we will drop it. Below is the code for this.

# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")

# Explore obtained dataset
suppressed_language.head()

To remove the “language” column we use the drop() method. Inside this method, we indicate the name of the column and the axis as well.
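When more than one column could identify a person, drop() also accepts a list of column names. The following is a minimal sketch on a small made-up DataFrame; the names, phone numbers, and wages are invented for illustration and do not come from SLID. It uses the columns= keyword, which has the same effect as passing axis="columns".

```python
import pandas as pd

# Hypothetical toy data; "name" and "phone" stand in for direct identifiers.
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "phone": ["555-0101", "555-0102", "555-0103"],
    "wages": [10.5, 11.0, 17.8],
})

# Drop several identifying columns at once. drop() returns a new DataFrame
# and leaves "toy" unchanged.
suppressed = toy.drop(columns=["name", "phone"])

print(list(suppressed.columns))
```

Because drop() returns a copy by default, the original data is still available if the identifiers are needed for an internal, non-shared version of the dataset.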

Drop Rows

It is also possible to drop rows. Dropping rows may be appropriate for outliers: if only a handful of individuals have a certain value in a column, it may be possible to identify them. In the code below, we drop all rows where education is greater than or equal to 14.

# Drop rows with education of 14 or higher
education = df.drop(df[df.education >= 14].index)

# See resulting DataFrame
education.head()

In the code, we used the drop() method again, this time passing it the index labels of the rows to remove. The expression df[df.education >= 14] selects the rows with education values greater than or equal to 14, and .index returns the labels of those rows. Because drop() removes rows by default, no axis argument is needed. If you look at the result, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or above.
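A common alternative to drop() is to keep the wanted rows with a boolean mask. The two approaches can differ when the column has missing values, because comparisons against NaN are always False. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from SLID) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing education value (row 4).
toy = pd.DataFrame(
    {"education": [15.0, 13.2, 14.0, np.nan, 8.0]},
    index=[1, 2, 3, 4, 5],
)

# Same approach as in the post: collect the labels of the rows to remove.
dropped = toy.drop(toy[toy.education >= 14].index)

# Alternative: keep rows with a boolean mask instead.
kept = toy[toy.education < 14]

print(list(dropped.index))  # [2, 4, 5] -> the NaN row is kept
print(list(kept.index))     # [2, 5]    -> the NaN row is filtered out
```

Which behavior is preferable depends on whether rows with unknown education should be shared at all, so it is worth checking for missing values before choosing.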

Data Masking

Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.

# Uniformly mask the education column 
df['education'] = '****'

# See resulting DataFrame
df.head()

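The code selects the education column and assigns the string of asterisks to every row. Note that this overwrites the column in df itself, so the original education values are no longer available in df. This approach is similar to dropping the column. However, there may be a reason to keep the column even if there is no useful information in it, for example when downstream code expects the column to exist.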
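If the unmasked values are still needed for later analysis, the masking can be done on a copy rather than in place. Here is a minimal sketch using assign(), which returns a new DataFrame; the toy data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for two SLID-like columns.
toy = pd.DataFrame({"education": [15.0, 13.2], "wages": [10.5, 11.0]})

# assign() returns a new DataFrame with the column replaced;
# the scalar '****' is broadcast to every row.
masked = toy.assign(education="****")

print(masked["education"].tolist())  # masked copy
print(toy["education"].tolist())     # original values are untouched
```

The masked copy can be shared while the original stays within the analyst's working environment.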

Replace Part of String

Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.

# Modify sex column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):] )

# See results
df.head()

The code involves rewriting the data in the “sex” column.

We do this by using the apply() method on this column. Inside the apply() method we use an anonymous function, which is defined with the keyword “lambda”.
After lambda, we name the argument “text” since we are modifying text.
After the colon, we tell Python to keep the first character of the string with “text[0]”. Next, we insert four asterisks **** after the first letter.
Lastly, we use the find() method to locate the string “le” in “text” and slice from that position to the end of the string with “text[text.find('le'):]”. As a result, “Male” becomes “M****le” and “Female” becomes “F****le”.

The apply() method works like a for loop, repeating this process for every row in the column.
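One caveat with the lambda is that find() returns -1 when the substring is absent, so slicing from that position would keep only the last character. The sketch below wraps the same idea in a named function that handles that case. The function name mask_middle, its parameters, and the value “Other” are hypothetical and chosen only for illustration.

```python
def mask_middle(text, keep_end="le", mask="****"):
    """Keep the first character and everything from `keep_end` onward."""
    if not text:
        # Nothing to keep from an empty string.
        return mask
    pos = text.find(keep_end)
    if pos == -1:
        # find() returns -1 when the substring is absent; mask everything
        # after the first character instead of slicing from -1.
        return text[0] + mask
    return text[0] + mask + text[pos:]


values = ["Male", "Female", "Other"]
print([mask_middle(v) for v in values])
# ['M****le', 'F****le', 'O****']
```

A function like this could be passed to apply() in place of the lambda, for example df['sex'].apply(mask_middle).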

Conclusion

Protecting personal information is critical when working with data. The ideas presented here are just some of the many ways that a data analyst can protect people’s personal information.
