Articles by Python - datawookie

Cookies & Headers from Selenium

November 7, 2023 | Python - datawookie

One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all ...

[...Read more...]
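
A minimal sketch of the idea in the teaser above, assuming the cookies come from Selenium's `driver.get_cookies()`, which returns a list of dicts with `name` and `value` keys. The helper name `cookies_for_requests` and the sample values are hypothetical, and the post itself may take a different route.

```python
def cookies_for_requests(selenium_cookies):
    """Flatten Selenium-style cookie dicts into a name -> value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Sample data in the shape returned by Selenium's driver.get_cookies().
# The values are made up for illustration.
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "csrftoken", "value": "xyz789", "domain": "example.com", "path": "/"},
]

cookies = cookies_for_requests(sample_cookies)
print(cookies)
```

The resulting dict can be handed to the requests package, for example `requests.get(url, cookies=cookies, headers=headers)`, where `url` and `headers` are whatever the target API expects.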

Cookies & Headers from Selenium

October 31, 2023 | Python - datawookie

[...Read more...]

Mocking S3 from Python tests

August 4, 2023 | Python - datawookie

Code that moves data to and from S3 can slow down testing. A lot. This post demonstrates how you can speed things up by mocking S3.

[...Read more...]
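
As a rough illustration of the teaser above, here is a sketch that keeps S3 out of a test by passing in a stand-in client. It uses the standard library's `unittest.mock` rather than a dedicated S3 mocking library, so it may not match the approach in the post. The function `upload_report`, the bucket name and the key are hypothetical.

```python
from unittest.mock import MagicMock


def upload_report(s3_client, bucket, key, body):
    """Store a report in S3 using whatever client object is passed in."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"


def test_upload_report():
    # A MagicMock stands in for boto3.client("s3"), so no network call is made.
    fake_s3 = MagicMock()
    uri = upload_report(fake_s3, "my-bucket", "reports/daily.csv", b"a,b\n1,2\n")
    fake_s3.put_object.assert_called_once_with(
        Bucket="my-bucket", Key="reports/daily.csv", Body=b"a,b\n1,2\n"
    )
    assert uri == "s3://my-bucket/reports/daily.csv"


test_upload_report()
```

Because the client is injected rather than created inside the function, the test never touches the network, which is where the speed-up comes from.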

Undetected ChromeDriver: Stay Below the Radar

October 26, 2022 | Python - datawookie

There’s one major problem with ChromeDriver: anti-bot services are able to detect that a browser session is being automated (as opposed to being used by a regular meat sack) and will often impose restrictions or deny connections altogether. The Undetected ChromeDriver (undetected-chromedriver) Python package is a patched version of ...

[...Read more...]

Enforcing Style in a Python Project

September 19, 2022 | Python - datawookie

A linter and a styler can help you to write cleaner and more consistent code. In this post we’ll look at how to set up both for a Python project.

[...Read more...]

Historical Weather Data

August 7, 2022 | Python - datawookie

I’m building a model which requires historical weather data from a selection of locations in South Africa. In this post I demonstrate the process of acquiring the data and doing some simple processing.

[...Read more...]

Persisting Data with Pickle & S3

July 28, 2022 | Python - datawookie

I occasionally write scripts where I need to persist some information between runs. These scripts are often wrapped in a Docker image and deployed on Amazon ECS. This means that there is no persistent storage. I could use a database, but this would be overkill for the volume of data ...

[...Read more...]
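
A minimal sketch of persisting state between runs with pickle, under the assumption that the pickled bytes end up as the body of an S3 object. A plain dict stands in for the bucket so the example runs without AWS credentials. The names `dump_state`, `load_state` and `state.pkl` are hypothetical.

```python
import pickle


def dump_state(state):
    """Serialise a Python object to bytes, ready to upload as an S3 object body."""
    return pickle.dumps(state)


def load_state(blob, default=None):
    """Restore state from bytes; fall back to a default on the first run."""
    if blob is None:
        return default
    return pickle.loads(blob)


# A plain dict stands in for the bucket here (key -> object body).
fake_bucket = {}

# First run: nothing stored yet, so start from the default.
state = load_state(fake_bucket.get("state.pkl"), default={"runs": 0})
state["runs"] += 1
fake_bucket["state.pkl"] = dump_state(state)

# Second run: pick up where the previous run left off.
state = load_state(fake_bucket.get("state.pkl"), default={"runs": 0})
state["runs"] += 1
print(state["runs"])
```

With boto3, the bytes could be written with `put_object(Bucket=..., Key=..., Body=blob)` and read back with `get_object(...)["Body"].read()`. Only unpickle data from a source you trust.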

Firing Up Firestore

March 20, 2022 | Python - datawookie

I’ve just started collaborating on a new project, Votela, with Luke. We’re going to be using Firestore for stashing our data. I’ve never worked with Firestore before, so one of my first tasks was just figuring out how to get connected and how to shift some data ...

[...Read more...]

Scrapy with a Rotating Tor Proxy

June 9, 2021 | Python - datawookie

This post shows an approach to using a rotating Tor proxy with Scrapy. I’m using the scrapy-rotating-proxies download middleware package to rotate through a set of proxies, ensuring that my requests are originating from a selection of IP addresses. However, I need to have those IP addresses evolve over ...

[...Read more...]

Selenium Crawler #3: Docker Compose

April 19, 2021 | Python - datawookie

In two previous posts we’ve looked at how to set up a simple scraper which uses Selenium in Docker, communicating via the host network and bridge network. Both of those setups have involved launching separate containers for the scraper and Selenium. In this post we’ll see how to ...