MongoDB and Python – Simplifying Your Schema – ETL Part 2

This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

MongoDB vs Relational SQL Benefits

In my previous post, we created a user object and saved it into a MongoDB collection. This was relatively straightforward but you may see some immediate challenges. While directly inserting data is an acceptable practice, it is similar to writing SQL in the backend of your application. SQL is just fine, but it may not always be the easiest to work with and can potentially lead to security problems (i.e. SQL injection attacks). It may also require you to write a lot of code to check the data schema, query efficiently, and/or write rules within your database.

Let’s take a look at an example schema of our previous data if it were normalized in a traditional SQL database. We would have two tables: users and states. The major change that happened here is that a states table was formed and linked to users with a the state_id key. All of the states would be listed in the states table and integers would represent those states in the users table. While this is a simple example, it illustrates the fact that complexity will increase quickly. At this point, to utilize the state data we need to perform a join.

*Note that the state field could simply be a string within the users table, but it is not a best practice (and does not illustrate the point I’m trying to make).

Screen Shot 2020-11-17 at 12.39.32 PM.png

Compare that schema to how this could simply be modeled in MongoDB.

Screen Shot 2020-11-17 at 12.53.22 PM.png

Let’s take our schema a little bit farther.

Imagine if we had to expand our application to do something more fun. Imagine if users could name their favorite restaurant from five states. That would require big expansions of the schema and then any changes to that schema require enormous amounts of effort to change. Here’s how that might look.

Screen Shot 2020-11-17 at 12.59.43 PM.png

Compare that to how this might look in MongoDB.

Screen Shot 2020-11-17 at 1.03.50 PM.png

Here’s an example of an object we would like to store in MongoDB. It follows the exact outline above, with favorite_restaurants being a nested object that uses the state as a key and restaurant name as the value.

Screen Shot 2020-11-17 at 1.47.39 PM.png

Let’s look at this in terms of Python code.

import os
import datetime

import dotenv
import pymongo

dotenv.load_dotenv()

mongo_database_name = 'example_db'
mongo_collection_name = 'example_collection'

db_client = pymongo.MongoClient(f"mongodb+srv://?retryWrites=true&w=majority")
db = db_client[mongo_database_name]
collection = db[mongo_collection_name]

username = 'scott'
hashed_password = '34hl2jlkfdjlk23jlk23'
favorite_integer = 1
favorite_float = 3.14
state = 'Colorado'
favorite_restaurants = {
    'Colorado': "Nick's Italian",
    'North Carolina': "Midnight Diner",
    'California': "Trujillo's Taco Shop",
    'Texas': "Killen's Barbecue",
    'New York': "Artichoke Basille's Pizza"
}

user_data = {
    'created_at': datetime.datetime.utcnow(),
    'username': username,
    'hashed_password': hashed_password,
    'favorite_integer': favorite_integer,
    'favorite_float': favorite_float,
    'state': state,
    'favorite_restaurants': favorite_restaurants
}

print("Here are your the data types:")
for k, v in user_data.items():
    print(f" - ")

inserted_data = collection.insert_one(user_data)

if inserted_data.acknowledged:
    print('Data was stored!')
else:
    print('You had an issue writing to the database')

database_return = collection.find_one()

print(f"Here is your returned data:")
print(database_return)

print("Here are your returned data types:")
for k, v in database_return.items():
    print(f" - ")

db_client.close()

It’s that simple. We simply inserted a dictionary with the states and restaurants and the user’s data is easily accessible within one single document. It is also worth noting that the data types and structures are also maintained when we query the data.

Next time

Coming up next time, we’ll go over some slightly more complicated database inserts, queries and “gotchas”. As always, the code for this can be found on our GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting .

Want to share your content on python-bloggers? click here.