Want to share your content on python-bloggers? click here.
Data privacy is a major topic among analysts who want to protect people’s information. There is often an ethical expectation that personally identifying information will be protected. Whenever data is shared, you want to be sure that individual people cannot be identified within the dataset, since identification can lead to unforeseen consequences for them. This post will examine simple ways a data analyst can protect personal information.
Libraries & Data Preparation
There are few libraries and minimal data preparation for this example. The code and output are below.
from pydataset import data
df=data('SLID')
df.head()

The only library we need is “pydataset”, which contains the dataset we will use. In the second line, we create an object called “df” that holds our data. The dataset is called “SLID” and contains information on individuals’ wages, education level, age, sex, and language.
We will now move to the first way to protect privacy when working with data.
Drop Columns
Sometimes protecting people’s identities can be as easy as dropping a column. Often, the column(s) that contain names, addresses, or phone numbers can be dropped. In our example below, we are going to pretend that the “language” column can be used to identify people, so we will drop it. Below is the code and the output for this.
# Attribute suppression on "language"
suppressed_language = df.drop('language', axis="columns")
# Explore obtained dataset
suppressed_language.head()
To remove the “language” column we use the drop() method. Inside this method, we indicate the name of the column and the axis as well.
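The same idea extends to several identifying columns at once. The following is a minimal sketch, assuming a small made-up DataFrame (the names "people", "name", and "phone" are hypothetical and not part of SLID), that passes a list of column names to drop().

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not from SLID
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "phone": ["555-0101", "555-0102", "555-0103"],
    "wages": [10.5, 11.0, 17.8],
    "age": [40, 19, 46],
})

# Columns treated as direct identifiers in this sketch
identifiers = ["name", "phone"]

# drop() returns a new DataFrame; `people` itself is left unchanged
suppressed = people.drop(columns=identifiers)

print(suppressed.columns.tolist())
```

Because drop() returns a new object by default, the original data stays available for internal use while only the suppressed version is shared.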
Drop Rows
It is also possible to drop rows. Dropping rows may be appropriate for outliers. If only a handful of individuals have a certain value in a column, it may be possible to identify them. In the code and output below, we drop all rows where education is greater than or equal to 14.
# Drop rows with education of 14 or higher
education = df.drop(df[df.education >= 14].index)
# See DataFrame
education.head()
In the code, we use the drop() method again, but this time we pass it the index labels of the rows to remove. The expression df[df.education >= 14] selects the rows with education values greater than or equal to 14, and the .index attribute returns their labels. If you look, you can see that several rows are now missing, such as 1, 3, 4, 6, 8, and 9, as all of these rows had education scores of 14 or above.
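A fixed cutoff such as 14 is one way to pick rows to remove. Another option, sketched below under assumptions of my own, is to drop rows whose value is shared by very few people. The toy data and the "min_count" threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "education": [12, 12, 12, 16, 16, 20],
    "wages": [9.0, 10.0, 11.0, 15.0, 16.0, 30.0],
})

# Assumed threshold: a value shared by fewer than 2 rows counts as rare
min_count = 2

# For each row, look up how many rows share its education value
counts = toy["education"].map(toy["education"].value_counts())

# Keep only the rows whose education value is common enough
filtered = toy[counts >= min_count]

print(filtered)
```

Here the single row with an education value of 20 is removed, since that person would be the only one in the data with that value.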
Data Masking
Data masking involves removing all or part of the information within a column. In the example below, we remove the values for education and replace them with asterisks.
# Uniformly mask the education column
df['education'] = '****'
# See resulting DataFrame
df.head()
The code selects the education column and assigns the string of asterisks to every row. Note that this overwrites the original values in “df”. This approach is similar to dropping the column. However, there may be a reason to keep the column even if there is no useful information in it.
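If the original values are still needed elsewhere, the mask can be applied to a copy instead. The sketch below uses the pandas assign() method on a small made-up DataFrame (the "toy" data is hypothetical) to return a masked copy while leaving the source untouched.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"education": [15.0, 13.2, 16.0], "age": [40, 19, 49]})

# assign() returns a copy with the column replaced; `toy` keeps its values
masked = toy.assign(education="****")

print(masked)
```

Only "masked" would be shared in this setup, while "toy" retains the real education values.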
Replace Part of String
Data masking can also include replacing part of the data within a column. In the code below, we will remove some of the information within the “sex” column.
#Modify Sex Column
df['sex'] = df['sex'].apply(lambda text: text[0] + '****' + text[text.find('le'):] )
#See Results
df.head()
The code involves rewriting the data in the “sex” column.
- We do this by using the apply() method on this column. Inside the apply() method, we use an anonymous function, which is defined with the keyword “lambda”.
 - After lambda, we name the argument “text” for practical reasons, since we are modifying text.
 - After the colon, we keep the first character of the string with “text[0]” and then append four asterisks.
 - Lastly, we use the find() method to locate the string “le” in “text” and keep everything from that position to the end of the string.
 
The apply() method allows us to loop through the column like a for loop and repeat this process for every row.
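The same result can also be reached without apply() by using the pandas .str accessor. The sketch below is a swapped-in variant rather than the method used above: it keeps the last two characters instead of searching for “le”, which gives the same output for “Male” and “Female”. The toy data and the "sex_masked" column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the two values found in the "sex" column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# First character, four asterisks, then the last two characters
toy["sex_masked"] = toy["sex"].str[0] + "****" + toy["sex"].str[-2:]

print(toy["sex_masked"].tolist())
# ['M****le', 'F****le', 'M****le']
```

One reason to prefer a fixed slice is that find() returns -1 when the substring is missing, in which case the lambda version would silently keep only the last character.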
Conclusion
Protecting personal information is critical whenever data is used or shared. The ideas presented here are just some of the many ways that a data analyst can protect people’s personal information.