Where To Find And How To Load Historical Data

This article was first published on coding-the-past and kindly contributed to python-bloggers.


‘Information is not knowledge’

Albert Einstein

With so much data available nowadays, I frequently feel overwhelmed when I have to find data to study a subject. Is this dataset reliable? How was the data treated? Where can I find the codebook with detailed information on the variables? These are only some of my concerns. When it comes to historical data, the task can be even harder. In this lesson, you will learn about five fascinating and reliable websites where you can find historical data, and how to load datasets in Python and R.

Data Sources

1. Harvard Business School

The Harvard Business School developed the project ‘Historical Data Visualization’ to foster the understanding of global capitalism over time. The page offers more than 40 datasets on a broad range of topics. For instance, you can find data on life expectancy, literacy rates or economic activity in several countries during the 19th and 20th centuries. Datasets are mostly in Excel format. Definitely worth a visit!

2. Human Mortality Database

The Human Mortality Database (HMD) provides death rates and life expectancy for several countries over the last two centuries. Although the platform requires a quick registration before giving you access to the data, it is comprehensive and straightforward to understand. Datasets are provided as tab-delimited text (ASCII) files.
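As a minimal sketch of how a tab-delimited text file could be loaded with Pandas, the example below reads a small in-memory sample instead of a downloaded file. The column names and values are made up for illustration and are not taken from HMD; a real file may also need extra options, for example to skip header lines.

```python
import io
import pandas as pd

# Made-up sample in a tab-delimited layout; these are NOT real HMD
# column names or figures, only placeholders for illustration.
sample = "Year\tAge\tRate\n1900\t0\t0.15\n1900\t1\t0.04\n1901\t0\t0.14\n"

# sep="\t" tells read_csv that fields are separated by tabs, not commas.
mortality = pd.read_csv(io.StringIO(sample), sep="\t")

print(mortality.shape)
print(mortality.columns.tolist())
```

With a file saved on disk, the first argument would be the file path instead of the io.StringIO object.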

3. National Centers for Environmental Information

Would you like to study how climate has changed over the last centuries? Then this is an invaluable source for you! The National Centers for Environmental Information is the leading authority for environmental data in the USA and provides high-quality data about climate, ecosystems and water resources. Data files can be downloaded in comma-separated values (CSV) format.

4. Clarin Historical Corpora

If you wish to work with text data, this is a valuable source of material. It offers access to ancient and medieval Greek texts, manifestos written during the American Revolution, court proceedings in England in the 18th century and many other intriguing materials. Files are usually provided in .txt format. The requirements to access files vary from case to case, since the data comes from different institutions.

5. Slave Voyages

This impressive platform, supported by the Hutchins Center of Harvard University, gathers data regarding the forced relocation of more than 12 million African people between the 16th and 19th centuries. Files are provided in SPSS or comma-separated values (CSV) format.

Coding the past: how to load data in Python

1. Pandas read_csv()

In this section, you will learn to load data into Python. You will be using data provided by the Slave Voyages website. The dataset contains data regarding 36,108 transatlantic slave trade voyages. Learn more about the variables here.

To load our data in Python, we will use Pandas, a Python library that provides data structures and analysis tools. The Pandas function read_csv() is the ideal option to load comma-separated values into a dataframe. A dataframe is one of the data structures provided by Pandas and it consists of a table with columns (variables) and rows (observations). Below, we use the default configuration of read_csv() to load our data. Note that the only argument passed to the function is the path of the file where you saved the dataset.


import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/historical_data/tastdb-exp-2019.csv")

2. Getting pandas dataframe info

A dataframe object is now created. It has several attributes or characteristics. For example, we can check its dimensions with shape and its column names with columns. Note that column names are the names of our variables. Moreover, you can also call methods, which, in general, carry out an operation to analyze the data contained in the dataframe. For example, the method describe() calculates summary statistics of each variable and head() filters and displays only the first n observations of your data. Check all Pandas DataFrame attributes and methods here.
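The difference between attributes and methods can be sketched on a tiny dataframe. The table below is made up for illustration (the names toy, year and count are not part of the Slave Voyages dataset); the point is only that attributes are read without parentheses, while methods are called with them.

```python
import pandas as pd

# Small made-up table; names and values are for illustration only.
toy = pd.DataFrame({"year": [1700, 1750, 1800], "count": [120, 300, 250]})

# Attributes are accessed without parentheses.
n_rows, n_cols = toy.shape      # (rows, columns)
names = list(toy.columns)       # variable names

# Methods are called with parentheses and may take arguments.
first_two = toy.head(2)         # first two observations
stats = toy.describe()          # summary statistics per variable

print(n_rows, n_cols, names)
print(first_two)
print(stats)
```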

Pandas DataFrame Object

Use the following code to check the dimensions and variable names of the dataset:


print("Dimensions: ", df.shape, 
      "Variable names: ", df.columns)

The attributes show that there are 276 variables and 36,108 observations in this dataset. Let us suppose you are only interested in the number of slaves disembarked (SLAMIMP) in America per year (YEARAM). You could load only these two variables using the read_csv() parameter usecols. This parameter receives a list with the names of the variables you wish to load. In larger datasets this parameter is very handy because it avoids loading variables that are not relevant to your study.


df = pd.read_csv("/content/drive/MyDrive/historical_data/tastdb-exp-2019.csv",
                 usecols=['YEARAM', 'SLAMIMP'])
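If you would like to try usecols without downloading the file, here is a small sketch that reads an in-memory CSV instead. Only the names YEARAM and SLAMIMP come from the dataset; the VOYAGE_LABEL column and all values are invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV text: VOYAGE_LABEL and every value are hypothetical.
csv_text = (
    "VOYAGE_LABEL,YEARAM,SLAMIMP\n"
    "a,1700,120\n"
    "b,1750,300\n"
    "c,1800,250\n"
)

# Only the columns listed in usecols are parsed and kept.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["YEARAM", "SLAMIMP"])

print(subset.columns.tolist())
print(subset.shape)
```

The column that is not listed in usecols never reaches the dataframe, which is what saves memory on wide files.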



Now the dataframe is loaded only with the two specified variables. As mentioned, Pandas dataframes offer methods to analyze the data. For example, df.head() displays the first five observations in our dataframe. You can set how many observations head() should return through the n parameter (default is 5).

Moreover, we can use describe() to obtain summary statistics of our variables. The summary statistics show that the earliest record is from the year 1514 and the latest from 1886. Also, the maximum number of slaves disembarked in one voyage was 1,700.
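The sketch below shows how head() and describe() could be used to answer this kind of question. It runs on a small made-up dataframe that reuses the column names YEARAM and SLAMIMP, so the numbers it prints are not the ones from the real dataset.

```python
import pandas as pd

# Made-up values that only mimic the structure of the two loaded variables.
voyages = pd.DataFrame({
    "YEARAM": [1700, 1750, 1800, 1820],
    "SLAMIMP": [120, 300, 250, 90],
})

# head(n=3) returns the first three observations instead of the default five.
first_three = voyages.head(n=3)

# describe() returns a dataframe of statistics indexed by their names
# ("count", "mean", "std", "min", "25%", "50%", "75%", "max").
summary = voyages.describe()
earliest = summary.loc["min", "YEARAM"]
latest = summary.loc["max", "YEARAM"]
largest = summary.loc["max", "SLAMIMP"]

print(first_three)
print(earliest, latest, largest)
```

Because describe() returns a dataframe, individual statistics can be picked out with .loc rather than read off the printed table.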


Coding the past: how to import a dataset in R

In R there are several functions that load comma-separated files. I chose fread from the data.table package, because it offers a straightforward parameter to select the variables you wish to load (select). fread returns a data.table, which is also a data frame and is similar to a pandas dataframe.



library(data.table)  # provides fread()
df <- fread("tastdb-exp-2019.csv", 
            select = c("YEARAM","SLAMIMP"))

To get summary statistics about your variables, you can use the function summary(df). To view the first n observations of your dataframe, use head(df, n). In R, summary() and head() produce results very similar to describe() and head() in Python.




More posts on how to find reliable data will be published soon!

