Retail Data: Scraping & API
I’ve been wanting to gather data on retail prices for quite some time. Finally, just before Christmas 2019, I had some time on my hands, so I started to put something together.
The Plan
This was the plan:
- a fleet of scrapers, each focusing on a specific retailer;
- the scrapers submit data to a service, which validates the data and persists it in a database; and
- expose the data via a simple REST API.
Implementation
This was the tech stack I envisaged: Scrapy for scraping, Celery with a Redis backend for queuing, PostgreSQL for storage and Flask for the API.
I used Scrapy to manage the scraping, via either a Spider, CrawlSpider or SitemapSpider, depending on the site.
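To give a sense of what a per-retailer spider looks like, here's a minimal sketch built on SitemapSpider; the retailer, sitemap URL and CSS selectors are made up for illustration and aren't from any of the actual scrapers.

```python
from scrapy.spiders import SitemapSpider


class RetailerSpider(SitemapSpider):
    """Sketch of a per-retailer spider (retailer and selectors are hypothetical)."""
    name = "example-retailer"
    sitemap_urls = ["https://www.example-retailer.co.za/sitemap.xml"]
    # Only follow sitemap entries that look like product pages.
    sitemap_rules = [("/products/", "parse_product")]

    def parse_product(self, response):
        # Extract the fields the downstream service expects.
        yield {
            "url": response.url,
            "name": response.css("h1.product-name::text").get(),
            "brand": response.css(".product-brand::text").get(),
            "sku": response.css(".product-sku::text").get(),
            "price": response.css(".price::attr(content)").get(),
        }
```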
The data are pushed into a Celery queue (with a Redis backend), from whence they are processed and entered into the PostgreSQL database.
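The post doesn't show the queue plumbing, but a worker task along these lines captures the idea: Celery with a Redis broker, doing a basic validation pass before writing to PostgreSQL. The task name, table and connection details below are assumptions, not the actual schema.

```python
from celery import Celery
import psycopg2

# Redis as the Celery broker (a local development-style setup is assumed here).
app = Celery("retail", broker="redis://localhost:6379/0")


@app.task
def record_price(product_id, price, price_promotion, available):
    """Validate and persist a single scraped price observation."""
    if price is None or price < 0:
        return  # basic validation: drop obviously bad records
    conn = psycopg2.connect("dbname=retail user=retail")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO price (product_id, time, price, price_promotion, available) "
            "VALUES (%s, now(), %s, %s, %s)",
            (product_id, price, price_promotion, available),
        )
    conn.close()
```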
Environments
I ended up with three environments: development, staging and production.
Development
In the development environment I’ve got both Redis and PostgreSQL running as Docker containers.
Staging & Production
The staging and production environments are on EC2 instances (t3a.medium and t3.large respectively). Both have Redis on Docker. A PostgreSQL database hosted on RDS is accessed from both of these environments. Each environment has a separate dedicated schema, with a shared schema for common (static) data.
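One way to point each environment at its own schema while falling back to the shared one is via the PostgreSQL search_path. This is a sketch of that approach; the RDS host, credentials and schema names are placeholders.

```python
from sqlalchemy import create_engine

# Each environment's own schema comes first, with the shared schema as a fallback.
SCHEMAS = {"staging": "staging,shared", "production": "production,shared"}


def make_engine(environment):
    # Hypothetical RDS endpoint and credentials.
    return create_engine(
        "postgresql+psycopg2://user:password@my-rds-host:5432/retail",
        connect_args={"options": f"-c search_path={SCHEMAS[environment]}"},
    )
```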
The API
Once I had the scrapers up and running, reliably sending data through to the database, it was time to allow access via an API. For this I used Flask and Flask-RESTPlus. This combination had the added advantage of immediately generating a Swagger interface for the API.
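Flask-RESTPlus generates the Swagger documentation directly from the resource and model declarations. Here's a stripped-down sketch of what one endpoint might look like; the route, model fields and stub response are assumptions based on the payloads shown further down.

```python
from flask import Flask
from flask_restplus import Api, Resource, fields

app = Flask(__name__)
api = Api(app, title="Retail Price API", doc="/")  # Swagger UI served at the root

product_model = api.model("Product", {
    "id": fields.Integer,
    "retailer_id": fields.Integer,
    "url": fields.String,
    "name": fields.String,
    "brand": fields.String,
    "sku": fields.String,
})


@api.route("/product/<int:product_id>")
class Product(Resource):
    @api.marshal_with(product_model)
    def get(self, product_id):
        # In the real service this would query PostgreSQL; a stub is returned here.
        return {"id": product_id, "retailer_id": 0, "url": "",
                "name": "", "brand": "", "sku": ""}
```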
This is what the Swagger interface looks like at present.
There are endpoints which expose:
- a list of retailers
- a list of products for each retailer and
- historical price information for each product.
An API key is required to access some of these endpoints. Get in touch if you’d like a key.
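Access from Python is straightforward with requests. The host, endpoint paths and header name in this sketch are placeholders for illustration, not the live API's actual values.

```python
import requests

BASE_URL = "https://api.example.com"       # hypothetical host
HEADERS = {"X-API-Key": "your-key-here"}   # hypothetical API-key header

# List retailers, then pull the price history for one product.
retailers = requests.get(f"{BASE_URL}/retailer", headers=HEADERS).json()
history = requests.get(f"{BASE_URL}/product/349324/prices", headers=HEADERS).json()
```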
Here’s data for a specific product:
{ "id": 349324, "retailer_id": 34, "url": "https://www.capeunionmart.co.za/tyrell-short-mens", "name": "Tyrell Men's Shorts", "brand": "Old Khaki", "sku": "153048", "barcodes": [] }
And the associated price history, showing both normal and promotion prices as well as availability.
[ { "product_id": 349324, "time": "2020-03-08T01:22:32+00:00", "price": 599, "price_promotion": 399, "available": true }, { "product_id": 349324, "time": "2020-02-23T02:16:44+00:00", "price": 599, "price_promotion": 399, "available": true }, { "product_id": 349324, "time": "2020-02-16T02:25:35+00:00", "price": 599, "price_promotion": 399, "available": true }, { "product_id": 349324, "time": "2020-02-09T01:47:47+00:00", "price": 599, "price_promotion": null, "available": true }, { "product_id": 349324, "time": "2020-02-02T02:02:42+00:00", "price": 599, "price_promotion": null, "available": true }, { "product_id": 349324, "time": "2020-01-26T02:23:15+00:00", "price": 599, "price_promotion": null, "available": true } ]
The API is useful on its own, but I do most of my analytical work in R, so it made sense to wrap it in an R package. More details on that [here](/blog/2020/03/retail-data-how-not-to-get-screwed-on-valentines-day/).