Articles by Python - datawookie

Cookies & Headers from Selenium

November 7, 2023 | Python - datawookie

One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all ...
[...Read more...]

Cookies & Headers from Selenium

October 31, 2023 | Python - datawookie

One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all ...
[...Read more...]

Undetected ChromeDriver: Stay Below the Radar

October 26, 2022 | Python - datawookie

There’s one major problem with ChromeDriver: anti-bot services are able to detect that a browser session is being automated (as opposed to being used by a regular meat sack) and will often impose restrictions or deny connections altogether. The Undetected ChromeDriver (undetected-chromedriver) Python package is a patched version of ...
[...Read more...]

Persisting Data with Pickle & S3

July 28, 2022 | Python - datawookie

I occasionally write scripts where I need to persist some information between runs. These scripts are often wrapped in a Docker image and deployed on Amazon ECS. This means that there is no persistent storage. I could use a database, but this would be overkill for the volume of data ...
[...Read more...]

Firing Up Firestore

March 20, 2022 | Python - datawookie

I’ve just started collaborating on a new project, Votela, with Luke. We’re going to be using Firestore for stashing our data. I’ve never worked with Firestore before, so one of my first tasks was just figuring out how to get connected and how to shift some data ...
[...Read more...]

Scrapy with a Rotating Tor Proxy

June 9, 2021 | Python - datawookie

This post shows an approach to using a rotating Tor proxy with Scrapy. I’m using the scrapy-rotating-proxies download middleware package to rotate through a set of proxies, ensuring that my requests are originating from a selection of IP addresses. However, I need to have those IP addresses evolve over ...
[...Read more...]

Selenium Crawler #3: Docker Compose

April 19, 2021 | Python - datawookie

In two previous posts we’ve looked at how to set up a simple scraper which uses Selenium in Docker, communicating via the host network and bridge network. Both of those setups have involved launching separate containers for the scraper and Selenium. In this post we’ll see how to ...
[...Read more...]