Greetings, humanists, social and data scientists!
Have you ever tried to load a large file in Python or R? Sometimes, when we have file sizes in the order of gigabytes, you may experience problems of performance with your program taking an unusually long time to load the data. SQL, or Structured Query Language, is used to deal with larger data files stored in relational databases and is widely used in the industry and even in research. Apart from being more efficient to prepare data, in your journey, you might encounter data sources whose main form of access is through SQL.
In this lesson you will learn how to use SQL in Python to retrieve data from a relational data base of the National Gallery of Art (US). You will also learn how to use a relational database management system (RDBMS) and pd.read_sql_query to extract data from it in Python.
1. Data source
The database used in this lesson is made available by National Gallery of Art (US) under a Creative Commons Zero license. The dataset contains data about more than 130,000 artworks and their artists since the Middle Ages until the present day.
It is a wonderful resource to study history and art. Variables available include the title of the artwork, dimensions, author, description, location, country where it was produced, the year the artist started the work and the year he or she finished it. These variables are only some examples, but there is much more to explore.
2. Download and install PostgreSQL and pgAdmin
After you install PostgreSQL, you will need to connect to the Postgre database server. In this tutorial, we will be using the pgAdmin application to establish this connection. It is a visual and intuitive interface and makes many operations easier to execute. The guide above will also guide you through the process of connecting to your local database. In the next steps, after being connected to your local database server, we will learn how to create a database that will store the National Gallery Dataset.
3. Creating the database and its tables
After you are connected to the server, click “Databases” with the right mouse button and choose “Create” and “Database…” as shown in the image below.
Next, give a title to your database as shown in the figure below. In our case, it will be called “art_db”. Click “Save” and it is all set!
With the database ‘art_bd’ selected, click the ‘Query Tool’ as shown below.
This will open a field where you can type SQL code. Our objective is to create the first table of our database, which will contain the content of ‘objects.csv’ available in the GitHub account of the National Gallery of Art, provided in the Data section above.
To create a table, we must specify the name and the variable type for each variable in the table. The SQL command to create a table is quite intuitive: CREATE TABLE name_of_your_table. Copy the code below and paste it in the window opened by the ‘Query Tool’. The code specify each variable of the objects table. This table contains information on each artwork available in the collection.
The last step is to load the data from the csv file into this table. This can be done through the ‘COPY’ command as shown below.
Great! Now you should have your first table loaded to your database. The complete database includes more than 15 tables. However, we will only use two of them for this example, as shown in the scheme below. Note that the two tables relate to each other through the key variable objectid.
To load the “objects_terms” table, please repeat the same procedure with the code below.
4. Exploring the data with SQL commands
Click the ‘Query Tool’ to start exploring the data. First, select which variables you would like to include in your analysis. Second, you tell SQL in which table this variables are. The code below selects the variables title and attribution from the objects table. It also limits the result to 5 observations.
Now, we would like to know what are the different kinds of classification in this dataset. To achieve that, we have to select the classification variable, but including only distinct values.
The result tells us that there are 11 classifications: “Decorative Art”, “Drawing”, “Index of American Design”, “Painting”, “Photograph”, “Portfolio”, “Print”, “Sculpture”, “Technical Material”, “Time-Based Media Art” and “Volume”.
Finally, let us group the artworks by classification and count the number of objects in each category.
COUNT(*) will count the total of items in the groups defined by
GROUP BY. When you select a variable you can give it a new name with
AS. Finally, the command
ORDER BY orders the classification by number of items in a descending order (DESC).
Note that prints is the largest classification, followed by photographs.
5. Using pd.read_sql_query to access data
Now that you have your SQL database working, it is time to access it with Python. Before using Pandas, we have to connect Python to our SQL database. We will do that with
psycopg2, a very popular PostgreSQL adapter for Python. Please, install it with
pip install psycopg2.
We use the
connect method of
psycopg2 to establish the connection. It takes 4 main arguments:
- host: in our case, the database is hosted locally, so we will pass localhost to this parameter. Note, however, that we could specify an IP if the server was external;
- database: the name given to your SQL database, art_db;
- user: user name required to authenticate;
- password: your database password.
The next step is to store our SQL query in a string Python variable. The query below performs a LEFT JOIN with the two tables in our database. The operation uses the variable
objectid to join the two tables. In practice we are selecting the titles, authors (attribution), classification – we keep only “Painting” with a WHERE command -, and term – we filter only terms that specify the “Style” of the painting.
Finally, we can extract the data. Use the
cursor() method of
conn to be able to “type” your SQL query. Pass the command variable and connection object to
pd.read_sql_query and it will return a Pandas dataframe with the data we selected. Next, commit and close cursor and connections.
6. Visualizing the most popular styles
From the data we gathered from our database, we would like to check which are the 10 most popular art styles in our data, by number of paintings. We can use the
value_counts() method of the column
term to count how many paintings are classified in each style.
The result is a Pandas Series where the index contains the styles and the values contain the quantities of paintings of the respective style. The remaining code produces an horizontal bar plot showing the top 10 styles by number of paintings. If you would like to learn more about data visualization with
matplotlib, please consult the lesson Storytelling with Matplotlib – Visualizing historical data.
Note that Realist, Baroque and Renaissance are the most popular art styles in our dataset.
Please feel free to share your thoughts and questions below!
- It is possible to create a SQL database from csv files and access it with Python;
- psycopg2 enables connection between Python and your SQL database;
- pd.read_sql_query can be used to extract data into a Pandas dataframe.