Loading large datasets into a database with Django
Introduction
Nowadays (it’s 2023, in case someone has just woken up from hibernation), many applications work with large amounts of data, often exceeding 1 million records in the database. Operating on such datasets is more problematic than working with smaller ones, mainly because of limited RAM and CPU resources.
In this tutorial I will show how to efficiently import large datasets into a database using Django migrations; a similar approach can be adapted to Alembic or Flask-Migrate.
The problem to solve
An ill-considered import of such a large dataset can load all objects into memory at once, which in turn may cause the migration process to crash.
Suppose we have a simple Person model as below:
class Person(models.Model):
    name = models.CharField(max_length=255)
    email = models.EmailField()
and the goal is to load data from a CSV file.
The simple and common solution (which works for small and medium datasets) is to read the data from a file and save the records one by one to the database.
import csv

from django.db import migrations


def read_csv_file(file_path: str):
    with open(file_path, "r") as people_file:
        return list(csv.reader(people_file))


def import_people_records(apps, schema_editor):
    Person = apps.get_model("people", "Person")
    for name, email in read_csv_file("people.csv"):
        Person.objects.get_or_create(
            name=name,
            email=email,
        )


def delete_people_records(apps, schema_editor):
    apps.get_model("people", "Person").objects.all().delete()


class Migration(migrations.Migration):
    dependencies = [
        ("people", "0001_initial"),
    ]

    operations = [
        migrations.RunPython(
            code=import_people_records,
            reverse_code=delete_people_records,
        ),
    ]
This approach is simple, but it has two serious disadvantages:
- all of the data is loaded into memory, which can exhaust available RAM and in turn cause the process to crash or slow down significantly,
- as many database queries are executed as there are records, which is suboptimal.
Solution 1 for large datasets
As described above, there are two main problems to solve.
In Python, the risk of running out of memory can be addressed by using generators.
Generators in Python are a special type of function that yield a sequence of results instead of a single value, allowing memory-efficient iteration over large datasets. They produce values one at a time on demand, instead of storing all results in memory at once.
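As a minimal, illustrative sketch (separate from the migration code below, with a made-up helper name), a generator that reads a file one line at a time might look like this:

def read_lines(file_path: str):
    # A generator: yields one line at a time instead of building the whole list in memory.
    with open(file_path, "r") as f:
        for line in f:
            yield line.rstrip("\n")

Each call to next() on the resulting iterator reads just one more line, so memory usage stays roughly constant regardless of the file size.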
The number of database queries can be reduced by using bulk operations, i.e. operations on a group of records at once, which leads to faster execution. In other words, many records are inserted into the database table within a single query. In Django, bulk_create can be used for that purpose.
import csv
from itertools import islice

from django.db import migrations

BATCH_SIZE = 1000


def read_csv_file(file_path: str, batch_size: int):
    with open(file_path, "r") as f:
        reader = csv.reader(f)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch


def import_people_records(apps, schema_editor):
    Person = apps.get_model("people", "Person")
    for batch in read_csv_file("people.csv", BATCH_SIZE):
        Person.objects.bulk_create(
            [
                Person(
                    name=name,
                    email=email,
                )
                for name, email in batch
            ],
            batch_size=BATCH_SIZE,
        )
(The entire example can be found here.)
The above solution will do its job properly, but it’s not without flaws. More precisely, such a migration runs in a single database transaction, which is usually fine, but for very large imports it can be dangerous: it can lead to import failure, performance problems, locking and concurrency issues, or even exhaustion of database resources.
Solution 2 for even larger datasets
To avoid the issues described above, caused by writing too many records within a single database transaction, you can follow another approach.
The solution is to split those queries across several database transactions.
To do that, the migration’s atomic transaction must be disabled (an atomic transaction is a set of database operations that must either ALL complete or be rolled back as a whole).
Moreover, autocommit must be disabled (Django has it enabled by default) so that you fully control when records are committed to the database.
from django.db import migrations, transaction

# read_csv_file and BATCH_SIZE are the same as in the previous example.


def import_people_records(apps, schema_editor):
    Person = apps.get_model("people", "Person")
    transaction.set_autocommit(False)
    for batch in read_csv_file("people.csv", BATCH_SIZE):
        Person.objects.bulk_create(
            [
                Person(
                    name=name,
                    email=email,
                )
                for name, email in batch
            ],
            batch_size=BATCH_SIZE,
        )
        # Commit after each batch, so every batch runs in its own transaction.
        transaction.commit()


class Migration(migrations.Migration):
    atomic = False
(The entire example can be found here.)
The above solution works well for very large datasets; however, it has one drawback. Because the groups of operations are performed in separate transactions, a problem during the migration (e.g. losing the connection to the database) may leave only part of the records saved in the database.
Depending on the imported data, this can be safeguarded in Django with the ignore_conflicts parameter of the bulk_create method. With it, a record that conflicts on the primary key (or another unique constraint) is skipped while the rest are saved, so the migration can simply be rerun.
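For illustration, a hedged sketch of the earlier bulk_create call with ignore_conflicts enabled (note that this only helps on rerun if the model actually declares a unique constraint on the conflicting column, e.g. a unique email, which the minimal Person model above does not):

Person.objects.bulk_create(
    [
        Person(
            name=name,
            email=email,
        )
        for name, email in batch
    ],
    batch_size=BATCH_SIZE,
    # Rows that violate a unique constraint are skipped instead of raising an error,
    # so rerunning the migration does not fail on already-imported records.
    ignore_conflicts=True,
)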
With other migration tools, please consult the documentation; looking for a way to execute INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE would resemble Django’s ignore_conflicts behavior.
Conclusion
Large datasets are now becoming more and more common, so it’s important for developers to learn how to work with them effectively.
It’s important to review a migration before running it and adjust it to the size of the dataset it operates on.
The above tips will allow you to avoid issues related to memory shortage and reduce the risk of failure.