MongoDB and Python – Inserting and Retrieving Data – ETL Part 1

[This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting, and kindly contributed to python-bloggers].

Don’t be intimidated by “NoSQL”

For those who are behind the times, the so-called “NoSQL” movement has gained real momentum over the last five years, although it has been around much longer than that. The term “NoSQL” is a bit silly, but it conveys the point well enough for those who have lived in the traditional relational SQL world. Many databases fit into this category; we’ll start by focusing on one of the most popular, MongoDB, and use Python to send our data to it.

Let’s say we have been tasked with finding out whether there is a relationship between the favorite integers and favorite floating point numbers of people using a particular website. To collect that information, we set up a form where users must register a username, a password and the state they live in, along with their two favorite numbers. We also stipulate that the time the user submitted the form is irrelevant, but the time the record is sent to the database needs to be tracked.

Here’s an example of a user’s submission:

Username = Scott
Hashed Password = 34hl2jlkfdjlk23jlk23
Favorite Integer = 1
Favorite Float = 3.14
State = Colorado

Let’s represent that as a Python dictionary, where the datetime module is used to capture the current time in UTC format:

import datetime

username = 'scott'
hashed_password = '34hl2jlkfdjlk23jlk23'
favorite_integer = 1
favorite_float = 3.14
state = 'Colorado'

user_data = {
    'created_at': datetime.datetime.utcnow(),
    'username': username,
    'hashed_password': hashed_password,
    'favorite_integer': favorite_integer,
    'favorite_float': favorite_float,
    'state': state
}

In order to see the data types in the dictionary, we can print them out:

print("Here are your data types:")
for k, v in user_data.items():
    # Show each key next to the Python type of its value
    print(f"{k} - {type(v)}")

You’ll notice that the created_at value is a datetime object. This is important to keep in mind because we would like not only to store it this way but also to retrieve it as a datetime, which allows us to perform operations like grouping and sorting.
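
To make the point about grouping and sorting concrete, here is a minimal sketch using only the standard library. The extra usernames and timestamps are made up for illustration; nothing here touches a database yet.

```python
import datetime
from collections import Counter

# Hypothetical submissions; only 'created_at' matters for this illustration.
submissions = [
    {'username': 'scott', 'created_at': datetime.datetime(2020, 11, 16, 13, 42)},
    {'username': 'ana', 'created_at': datetime.datetime(2020, 11, 15, 9, 5)},
    {'username': 'lee', 'created_at': datetime.datetime(2020, 11, 16, 8, 30)},
]

# Sorting works directly because datetime objects are comparable.
oldest_first = sorted(submissions, key=lambda doc: doc['created_at'])

# Grouping by calendar day only needs the .date() of each timestamp.
per_day = Counter(doc['created_at'].date() for doc in submissions)

print([doc['username'] for doc in oldest_first])  # ['ana', 'lee', 'scott']
print(per_day[datetime.date(2020, 11, 16)])       # 2
```

Had created_at been stored as a string, both operations would first require parsing it back into a datetime.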

Introducing MongoDB

To interact with MongoDB, a database server needs to be running. This can be done in any number of ways; for this demonstration we’ll use https://cloud.mongodb.com/ for simplicity. If you do not have an account, you can set one up for free. We’ll walk through the most basic setup.

After setting up your account, create a cluster. Step 1: click the “Create a New Cluster” button.

Step 2: select the cluster tier (M0 Sandbox is a free tier).

That’s it! You have created a cluster, and after a couple of minutes it will be up and running. Once it shows up, click the “Connect” button. Note that I did not change the cluster name (Cluster0 by default).

This brings up a menu where you can find your MongoDB cluster and credentials. The easiest way to find this information is through the “Connect your application” button.

You will see a “Copy” button to the right of the connection information you’ll use. It is crucial to keep this information private in order to maintain security. Here, that is done by storing it in a .env file (a common place to keep secrets).

The .env file is a plain text file with one NAME=value pair per line, holding the connection details copied above. It simply needs to be stored in the root directory of the project (there are other ways of doing this, but this is a common practice).
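
As a rough illustration of the file format, the sketch below embeds a hypothetical .env as a string and parses it with a few lines of standard-library Python. The variable names (MONGO_USER, MONGO_PASSWORD, MONGO_CLUSTER) and all values are placeholders, not the ones used in the original project, and the parser is a simplified stand-in rather than how python-dotenv is actually implemented.

```python
# Hypothetical .env contents; names and values are placeholders.
SAMPLE_ENV = """
# MongoDB credentials (keep this file out of version control)
MONGO_USER=my_user
MONGO_PASSWORD=my_password
MONGO_CLUSTER=cluster0.example.mongodb.net
"""


def parse_env(text):
    """Parse simple NAME=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition('=')
        values[key.strip()] = value.strip()
    return values


settings = parse_env(SAMPLE_ENV)
print(sorted(settings))
```

In practice you would let python-dotenv do the loading; the point here is only what the file looks like.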

Using pymongo

The pymongo library provides an easy-to-use API to connect your Python runtime to a MongoDB server. All you need are your credentials and a couple of packages. In your terminal, use pip to install the required packages (dnspython lets pymongo resolve mongodb+srv:// connection strings, and python-dotenv simply makes loading the environment variables from the .env file easier).

pip install pymongo
pip install dnspython
pip install python-dotenv

At this point, connecting to your MongoDB server is simple. You’ll notice that we use the os library to read the environment variables after they have been loaded.

import os
import datetime

import dotenv
import pymongo

dotenv.load_dotenv()

mongo_database_name = 'example_db'
mongo_collection_name = 'example_collection'

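# NOTE: these environment variable names are assumptions; use whatever
# names you defined in your own .env file.
mongo_user = os.environ['MONGO_USER']
mongo_password = os.environ['MONGO_PASSWORD']
mongo_cluster = os.environ['MONGO_CLUSTER']  # host shown under "Connect your application"

db_client = pymongo.MongoClient(
    f"mongodb+srv://{mongo_user}:{mongo_password}@{mongo_cluster}/?retryWrites=true&w=majority"
)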
db = db_client[mongo_database_name]
collection = db[mongo_collection_name]

A few things to notice here:

  1. db_client is simply your connection to the server/cluster

  2. db is the connection to your database within the server/cluster

  3. collection is a term in MongoDB that refers to a location where you can store data (similar to a table in SQL)

You may also have noticed that you connected to both a database and a collection that you had never created. These will be created for you when you insert data the first time.
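
One detail worth sketching is how the connection string itself gets assembled. The helper below is a hypothetical example, not part of pymongo: it reads three assumed variable names (MONGO_USER, MONGO_PASSWORD, MONGO_CLUSTER), fails early if any is missing, and percent-escapes the credentials, since characters such as '@' or ':' in a password would otherwise break the URI.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=os.environ):
    """Build a mongodb+srv:// URI from (assumed) environment variable names."""
    required = ('MONGO_USER', 'MONGO_PASSWORD', 'MONGO_CLUSTER')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-escape the credentials so reserved characters are safe in a URI.
    user = quote_plus(env['MONGO_USER'])
    password = quote_plus(env['MONGO_PASSWORD'])
    return (
        f"mongodb+srv://{user}:{password}@{env['MONGO_CLUSTER']}"
        "/?retryWrites=true&w=majority"
    )


# Placeholder values, passed as a plain dict instead of the real environment.
example = build_mongo_uri({
    'MONGO_USER': 'scott',
    'MONGO_PASSWORD': 'p@ss:word',
    'MONGO_CLUSTER': 'cluster0.example.mongodb.net',
})
print(example)
```

The returned string can be passed straight to pymongo.MongoClient.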

Now that you have your connection, it is simple to insert data. Let’s insert the user_data dictionary previously created.

inserted_data = collection.insert_one(user_data)

if inserted_data.acknowledged:
    print('Data was stored!')
else:
    print('You had an issue writing to the database')

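Storing the result in a variable (in this case, inserted_data) lets you inspect the outcome: acknowledged reports whether the server acknowledged the write, and inserted_id holds the _id of the new document. Note that most failures, such as a lost connection or a duplicate _id, raise an exception rather than returning an unacknowledged result.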
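
A small gotcha is that insert_one adds an _id key to the dictionary you pass in if it does not already have one, so inserting the same dictionary object twice fails with a duplicate key error. The helper below is a hypothetical sketch (the name insert_copy is not part of pymongo) that inserts a shallow copy instead and returns the generated _id.

```python
def insert_copy(collection, document):
    """Insert a shallow copy so the caller's dict is not given an '_id' key."""
    result = collection.insert_one(dict(document))
    # pymongo raises an exception on failure, so with the default
    # (acknowledged) write concern, reaching this line means the write
    # was accepted by the server.
    return result.inserted_id


# Usage, assuming `collection` and `user_data` from the article:
#     new_id = insert_copy(collection, user_data)
#     print(f"Stored document with _id {new_id}")
```

Whether you want this behavior is a design choice; keeping the _id on the original dictionary can also be convenient.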

Retrieving data from MongoDB is easy as well.

database_return = collection.find_one()

print("Here is your returned data:")
print(database_return)

print("Here are your returned data types:")
for k, v in database_return.items():
    # Show each key next to the Python type of its value
    print(f"{k} - {type(v)}")

You’ll notice that the data all comes back with the same types it was inserted with! This is great news! However, you’ll also notice that there is a new field in your data, _id. Because user_data did not include one, it was generated automatically at insert time, and it uniquely identifies the document within its collection. This field is important, and keep in mind that it is immutable: updating the other fields of a document does not change its _id.
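
One subtlety about the round trip: MongoDB stores datetimes with millisecond precision, and pymongo by default returns them as naive datetime objects representing UTC. So a retrieved created_at can differ from the original in its last three microsecond digits. The helper below is a hypothetical sketch (as_stored is not a pymongo function) that approximates what a value looks like after the round trip, which is handy when comparing inserted and retrieved documents.

```python
import datetime


def as_stored(dt):
    """Approximate how a datetime looks after a MongoDB round trip.

    Assumes pymongo's default settings: values come back as naive
    datetimes in UTC, truncated to whole milliseconds.
    """
    if dt.tzinfo is not None:
        # Convert aware datetimes to UTC, then drop the timezone info.
        dt = dt.astimezone(datetime.timezone.utc).replace(tzinfo=None)
    # Keep whole milliseconds only (microseconds 123456 become 123000).
    return dt.replace(microsecond=dt.microsecond // 1000 * 1000)


original = datetime.datetime(2020, 11, 16, 21, 27, 1, 123456)
print(as_stored(original))  # 2020-11-16 21:27:01.123000
```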

Finally, always remember to close your connection to a database.

db_client.close()
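
If you would rather not rely on remembering the close() call, one option is to let a context manager do it. The sketch below uses contextlib.closing from the standard library; read_one and client_factory are hypothetical names chosen for this example. MongoClient can also be used directly in a with statement, which has the same effect.

```python
from contextlib import closing


def read_one(client_factory, database_name, collection_name):
    """Open a client, fetch one document, and close the client afterwards."""
    # closing() calls client.close() when the block exits, even on an exception.
    with closing(client_factory()) as client:
        return client[database_name][collection_name].find_one()


# Usage, assuming `uri` holds your connection string:
#     doc = read_one(lambda: pymongo.MongoClient(uri),
#                    'example_db', 'example_collection')
```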

Next time

Coming up next time, we’ll go over some slightly more complicated database inserts, queries and “gotchas”. As always, the code for this can be found on our GitHub repository.
