Persisting Data with Pickle & S3

Posted on July 28, 2022 by Python - datawookie in Data science

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.

I occasionally write scripts where I need to persist some information between runs. These scripts are often wrapped in a Docker image and deployed on Amazon ECS. This means that there is no persistent storage. I could use a database, but this would be overkill for the volume of data involved. This post describes a simple approach to storing these data on S3 using a pickle file.

Setup

Import the pickle module from the standard library, plus the boto3 and botocore packages (botocore is only required for the ClientError exception).

import pickle
import boto3, botocore

Create an S3 client object.

s3 = boto3.client("s3")

How does authentication work? I store my credentials in ~/.aws/credentials, which holds multiple AWS accounts, each identified by a unique profile name. I set the AWS_PROFILE environment variable to choose a specific account. I also specify a suitable value for the AWS_DEFAULT_REGION environment variable.

export AWS_PROFILE=fathom
export AWS_DEFAULT_REGION=eu-west-1

Now store the S3 bucket name and a name for the pickle file.

BUCKET = "state-persist"
PICKLE = "state.pkl"

Retrieve

First try to load the data. On the first iteration this won’t work because there’s nothing persisted yet. But after you’ve been through the process once, these steps will load the data from the previous iteration.

Attempt to download the pickle file from S3. If it’s not there, handle the error gracefully.

try:
    s3.download_file(BUCKET, PICKLE, PICKLE)
except botocore.exceptions.ClientError:
    # You'll arrive here on the first iteration.
    pass
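
Catching every ClientError also hides problems such as missing permissions or a misspelled bucket name. One way to narrow the handler, sketched here, is to inspect the error code carried in the exception's response attribute and re-raise anything that does not indicate a missing object. The helper name is_missing_key is hypothetical, and the codes it checks ("404" and "NoSuchKey") are what I would expect S3 to report for an absent key, so confirm them against your own setup.

```python
def is_missing_key(error):
    """Return True if a ClientError-like exception reports a missing object.

    Hypothetical helper: it only relies on the `response` dictionary that
    botocore attaches to ClientError instances.
    """
    code = error.response.get("Error", {}).get("Code")
    return code in ("404", "NoSuchKey")


# Intended use with the names from this post (not executed here):
#
# try:
#     s3.download_file(BUCKET, PICKLE, PICKLE)
# except botocore.exceptions.ClientError as error:
#     if not is_missing_key(error):
#         raise  # Permissions, wrong bucket, etc. should not be ignored.
```

With this in place the first run still passes quietly, while any other S3 failure stops the script instead of silently resetting the state.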

Read the pickle file. On failure, set data to None (or some other appropriate default value).

try:
    with open(PICKLE, "rb") as file:
        data = pickle.load(file)
except (FileNotFoundError, EOFError):
    # You'll arrive here on the first iteration.
    data = None

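Since these two steps will normally fail together, it might make sense to move the second step into an else clause on the first try statement, so that the file is only read when the download succeeds, and to set data to its default value in the except clause.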
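
A minimal sketch of that combined retrieve step, using try/except/else. The function name load_state and its parameters are hypothetical. The client and the exception type are passed in as arguments so that the function can be exercised without AWS. With boto3 you would pass the s3 client and botocore.exceptions.ClientError.

```python
import pickle


def load_state(s3, bucket, key, path, missing_error, default=None):
    """Download `key` from `bucket` to `path` and unpickle it.

    Returns `default` when the download raises `missing_error`
    (a hypothetical parameter; botocore.exceptions.ClientError with boto3).
    """
    try:
        s3.download_file(bucket, key, path)
    except missing_error:
        # First run: nothing has been persisted yet.
        return default
    else:
        # Only reached when the download succeeded.
        with open(path, "rb") as file:
            return pickle.load(file)


# Intended use with the names from this post (not executed here):
# data = load_state(s3, BUCKET, PICKLE, PICKLE, botocore.exceptions.ClientError)
```

Unlike the two separate steps, this version does not fall back to the default when the downloaded file is unreadable, so a corrupt pickle raises an error rather than being treated as a first run.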

Store

As the script runs, the state information is assigned to (or updated in) data. At the end we need to persist those data.

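Create or update the pickle file. The with block closes the file, so its contents are flushed to disk before the upload.

with open(PICKLE, "wb") as file:
    pickle.dump(data, file)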

Write that file to S3.

s3.upload_file(PICKLE, BUCKET, PICKLE)
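
The two store steps can also be wrapped in a single function, sketched below. The name save_state and its parameters are hypothetical, and the client is passed in as an argument so the function can be tried out with a stand-in object before pointing it at a real bucket.

```python
import pickle


def save_state(s3, bucket, key, path, data):
    """Pickle `data` to `path`, then upload that file to `bucket` as `key`."""
    # Write inside a with block so the file is closed (and flushed)
    # before the upload reads it.
    with open(path, "wb") as file:
        pickle.dump(data, file)
    s3.upload_file(path, bucket, key)


# Intended use with the names from this post (not executed here):
# save_state(s3, BUCKET, PICKLE, PICKLE, data)
```

Calling this once at the end of the script keeps the write and the upload together, which makes it harder to upload a stale or partially written file.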

Conclusion

This is a simple procedure for persisting information between jobs.

This approach is vulnerable to race conditions if there are multiple instances of the script running simultaneously. You could handle this with a lock file (also stored on S3) or by just being careful to avoid simultaneous execution.
