MongoDB and Python – Avoiding Pitfalls by Using an “ORM” – ETL Part 3

[This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting, and kindly contributed to python-bloggers].

Avoiding MongoDB Pitfalls by Using an ORM

In my previous post, I showed how you can simplify your life by using MongoDB compared to a traditional relational SQL database. To put it simply, it is trivial to provide extra depth in MongoDB (or any document database) by nesting data structures. However, this level of simplicity has its drawbacks, and you need to be aware of these as a new user.

Let’s start by looking at a potential problem with the workflow from my previous post. We have a user in our app with the following information: username, hashed password, favorite integer, favorite float, and state. Typically, an app would allow only unique usernames to avoid confusion. However, our previous application would simply keep adding users regardless of duplication. To prevent that, rules would have to be added at the database level or written by hand in application code, and neither is trivial to do well.
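As a rough illustration of what hand-written rules involve, here is a sketch in plain Python. The check_user function and the existing_usernames set are hypothetical stand-ins (a real app would look usernames up in the collection); none of this comes from the previous post's code.

```python
# Expected Python type(s) for each field of the user record described above.
EXPECTED_TYPES = {
    "username": str,
    "hashed_password": str,
    "favorite_integer": int,
    "favorite_float": (int, float),
    "state": str,
}
REQUIRED_FIELDS = ("username", "hashed_password")


def check_user(user, existing_usernames):
    """Raise ValueError if the record breaks any of the hand-written rules."""
    for field in REQUIRED_FIELDS:
        if not user.get(field):
            raise ValueError(f"{field} is required")
    for field, expected in EXPECTED_TYPES.items():
        value = user.get(field)
        if value is not None and not isinstance(value, expected):
            raise ValueError(f"{field} has the wrong type: {type(value).__name__}")
    if user["username"] in existing_usernames:
        raise ValueError(f"username {user['username']!r} is already taken")


# Hypothetical usage: 'taken' stands in for a lookup against the collection.
taken = {"bob"}
check_user({"username": "alice", "hashed_password": "not-a-real-hash"}, taken)  # no error
```

Even this short version has to be called before every insert, and it does nothing about two inserts racing between the lookup and the write.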

However, if we use an ORM like mongoengine, we can solve this problem easily. If you are familiar with the sqlalchemy package in Python, you are probably already aware of the benefits of an ORM. If you are not familiar with this concept, there are plenty of great resources out there to look at ( https://www.google.com/search?q=benefits+of+an+orm&oq=benefits+of+an+ORM ). Those resources will be better than what I can provide, so feel free to use those. The short version is: an ORM will make your life easier because it handles your schema, provides validation, improves querying capabilities, etc.

Data in mongoengine can be represented with an object that defines the schema of the collection. Let’s consider the data from the last post in which a user has the following data points: username, hashed password, favorite floating point number, favorite integer, state, and favorite restaurants per state. The dictionary representation can be seen below.

[Screenshot: the user_data dictionary with the fields listed above]
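A dictionary with these fields might look like the following. Only the field names come from the post; every value, and the shape of the favorite_restaurants entries (a list of names per state), is an assumption made up for illustration.

```python
# Hypothetical values; only the keys are taken from the description above.
user_data = {
    "username": "jdoe",
    "hashed_password": "<hash produced by your password library>",
    "favorite_integer": 7,
    "favorite_float": 3.14,
    "state": "CO",
    # Nested structure: favorite restaurants keyed by state (assumed shape).
    "favorite_restaurants": {
        "CO": ["Example Diner", "Sample Bistro"],
        "NY": ["Placeholder Pizza"],
    },
}
```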

Previously, we simply took the user_data variable and inserted it using pymongo.

inserted_data = collection.insert_one(user_data)

It isn’t immediately apparent, but a major problem with this is that there are no restrictions. A username could be used multiple times, any data type could be passed in any field, etc. In order to handle these types of issues, you would have to write a lot of code. Instead, let’s take a look at how this can be represented in mongoengine.

import datetime

from mongoengine import Document, StringField, DateTimeField, IntField, FloatField, DictField, connect, disconnect_all


class User(Document):
    created_at = DateTimeField(default=datetime.datetime.utcnow)
    username = StringField(required=True, unique=True)
    hashed_password = StringField(required=True)
    favorite_integer = IntField()
    favorite_float = FloatField()
    state = StringField()
    favorite_restaurants = DictField()
    meta = {'collection': 'users'}

This is incredibly simple and the code almost speaks for itself (I love when that happens). The new User class simply inherits from the Document class, which does the heavy lifting for you. Let’s break down the individual items within the class:

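  • created_at – sets the field as a datetime type and defaults to utcnow when the document is created

  • username – sets the field as a string type, requires it to be entered and makes sure it’s unique

  • hashed_password – sets the field as a string type and requires it to be entered

  • favorite_integer, favorite_float, state – optional fields typed as integer, float and string respectively

  • favorite_restaurants – an optional dictionary field, which is where the nested data lives

  • meta – sets the collection name to ‘users’ (can be whatever you’d like)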

This simple little bit of code provides tremendous value. We’ll walk through some examples.

Inserting data is trivial: simply create the object and call its save method.

user1 = User(
    username=user_data['username'],
    hashed_password=user_data['hashed_password'],
    favorite_integer=user_data['favorite_integer'],
    favorite_float=user_data['favorite_float'],
    state=user_data['state'],
    favorite_restaurants=user_data['favorite_restaurants']
)

user1_insert = user1.save()

That’s it! You have inserted the user into your database. Let’s try to create another user with the same username. It should fail.

[Screenshot: saving a second user with the same username fails with an error]

Perfect! This does not allow you to duplicate usernames. What happens if we try to insert a string instead of an integer into the favorite_integer field? It should fail.

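# A new username this time, but a string where IntField expects an integer
bad_user = User(
    username='blah',
    hashed_password=user_data['hashed_password'],
    favorite_integer='one',  # invalid: not an integer
    favorite_float=user_data['favorite_float'],
    state=user_data['state'],
    favorite_restaurants=user_data['favorite_restaurants']
)

bad_user_insert = bad_user.save()  # expected to fail validation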

[Screenshot: the save fails with an error on the favorite_integer field]

Perfect! It recognizes we can’t enter a string where an integer should be. We don’t need to run through all of the fields, but this type checking works for all of them.

How easy is it to query the data? You can easily create filters, groupings, etc. However, we’ll just pull back the data we inserted. All you do is iterate through User.objects:

for user in User.objects:
    print("USER DATA")
    print("----------")
    print(user.username)
    print(user.hashed_password)
    print(user.favorite_integer)
    print(user.state)
    print(user.favorite_restaurants)

Output:

[Screenshot: the printed fields for each stored user]

That’s it! Trust me, you’ll want to use this over pymongo in many instances, but you will still need to know some MongoDB syntax when it comes time to create aggregation pipelines!
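To give a flavor of that syntax, a pipeline is just a list of stage dictionaries written in MongoDB's own operators. The sketch below counts users per state; the user_count name is made up, and the commented call assumes a recent mongoengine release, where aggregate accepts the pipeline list directly.

```python
# Each stage is a plain dict using MongoDB operators, not mongoengine fields.
pipeline = [
    # Group documents by their state field and count the members of each group.
    {"$group": {"_id": "$state", "user_count": {"$sum": 1}}},
    # Largest groups first.
    {"$sort": {"user_count": -1}},
]

# With an open connection and the User class from this post:
# for row in User.objects.aggregate(pipeline):
#     print(row["_id"], row["user_count"])
```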

Next time

We’ll move beyond these basics to go through some slightly more complicated queries and aggregations. As always, the code for this can be found on our GitHub repository.
