Scrapy with a Rotating Tor Proxy

[This article was first published on Python | datawookie, and kindly contributed to python-bloggers.]

This post shows an approach to using a rotating Tor proxy with Scrapy.

I’m using the scrapy-rotating-proxies download middleware package to rotate through a set of proxies, so that my requests originate from a selection of IP addresses. However, I also need those IP addresses to change over time, so I’m using the Tor network.

Setup

I’ve got the following in the settings.py for my Scrapy project:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST_PATH = 'proxy-list.txt'
ROTATING_PROXY_PAGE_RETRY_TIMES = 5

This (1) specifies where the package middleware fits into the pipeline for processing requests, (2) points to a file, proxy-list.txt, which contains a list of proxies, and (3) sets how many times a page download is retried using a different proxy (five here). There are other settings for the package, but they are not important right now.

Proxy List

The contents of proxy-list.txt look like this:

# Generated by create-proxies script.

http://127.0.0.1:9990
http://127.0.0.1:9991
http://127.0.0.1:9992
http://127.0.0.1:9993

So I’m running four local proxies. How? Well, with Docker, of course!

The scrapy-rotating-proxies package ensures that

  • requests are sent out via these proxies and
  • the proxies are used in rotation, so that requests are spread across the available proxies.

The reason for rotating through a list of proxies is to ensure that at any given time there are multiple proxies (each with a different IP address) available for sending requests.
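
To make the idea of rotation concrete, here is a minimal round-robin sketch. It is an illustration only, not how scrapy-rotating-proxies is implemented (the package also tracks which proxies are good or dead). The helper names parse_proxy_list and rotate are hypothetical, and skipping blank lines and # comments is an assumption about how the file should be read.

```python
from itertools import cycle

# Same content as the proxy-list.txt shown above.
SAMPLE = """\
# Generated by create-proxies script.

http://127.0.0.1:9990
http://127.0.0.1:9991
http://127.0.0.1:9992
http://127.0.0.1:9993
"""


def parse_proxy_list(text):
    """Return proxy URLs from text, skipping blank lines and '#' comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def rotate(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    return cycle(proxies)


proxies = parse_proxy_list(SAMPLE)
rotation = rotate(proxies)

# Six consecutive picks: the fifth pick wraps around to the first proxy.
picks = [next(rotation) for _ in range(6)]
for proxy in picks:
    print(proxy)
```

With four proxies in the list, the fifth request goes out via the first proxy again.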

Tor Proxies

In order to access a truly diverse set of IP addresses I’m tapping into the Tor network via the pickapp/tor-proxy Docker image.

Using Docker Compose it’s easy to spin up a cluster of Tor proxies. This is my docker-compose.yml:

# Generated by create-proxies script.

version: '3'

services:
  tor-bart:
    container_name: 'tor-bart'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9990:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
  tor-homer:
    container_name: 'tor-homer'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9991:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
  tor-marge:
    container_name: 'tor-marge'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9992:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
  tor-lisa:
    container_name: 'tor-lisa'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9993:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always

There are four services defined, each of which maps port 8888 on the container to a specific host port (a sequence of ports starting at 9990 and corresponding to the ports listed in proxy-list.txt). Once the cluster is up, the running containers look like this (abbreviated docker ps output):

CONTAINER ID  IMAGE                   PORTS                   NAMES
98feb5a034e6  datawookie/tor-privoxy  0.0.0.0:9990->8888/tcp  tor-bart
26f05b1deb17  datawookie/tor-privoxy  0.0.0.0:9991->8888/tcp  tor-homer
b856ded83585  datawookie/tor-privoxy  0.0.0.0:9992->8888/tcp  tor-marge
c352aea63eed  datawookie/tor-privoxy  0.0.0.0:9993->8888/tcp  tor-lisa

Setting the IP_CHANGE_SECONDS environment variable to 60 causes the Tor exit node used by a proxy to change every minute.
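
One way to confirm that each proxy is up, and that its exit IP changes over time, is to ask an external service which address it sees. Below is a minimal sketch using only the standard library. The IP echo service URL and the helper names opener_for and exit_ip are assumptions for illustration; they are not part of the Tor proxy image or of Scrapy.

```python
import urllib.request

# Assumption: these match the host ports listed in proxy-list.txt.
PROXIES = [f"http://127.0.0.1:{port}" for port in range(9990, 9994)]

# Assumption: any "what is my IP" service will do; this one replies in plain text.
IP_SERVICE = "https://api.ipify.org"


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def exit_ip(proxy_url, timeout=10):
    """Return the public IP address that IP_SERVICE sees via proxy_url."""
    with opener_for(proxy_url).open(IP_SERVICE, timeout=timeout) as response:
        return response.read().decode().strip()


if __name__ == "__main__":
    for proxy in PROXIES:
        try:
            print(proxy, "->", exit_ip(proxy))
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            print(proxy, "-> failed:", error)
```

Running this a couple of minutes apart should show different addresses for the same proxy if the exit nodes are being refreshed.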

Generating Configuration

To make this setup more flexible I have a script, create-proxies, which generates the contents of proxy-list.txt and docker-compose.yml.

#!/usr/bin/env python3

NAMES = ['bart', 'homer', 'marge', 'lisa']

WARNING = "# Generated by create-proxies script.\n\n"

# Generate docker-compose.yml.
#
with open("docker-compose.yml", "w") as f:
    f.write(WARNING)
    f.write("version: '3'\n\nservices:\n")

    for index, name in enumerate(NAMES):
        f.write(f"  tor-{name}:\n")
        f.write(f"    container_name: 'tor-{name}'\n")
        f.write("    image: 'pickapp/tor-proxy:latest'\n")
        f.write("    ports:\n")
        f.write(f"      - '{9990+index}:8888'\n")
        f.write("    environment:\n")
        f.write("      - IP_CHANGE_SECONDS=60\n")
        f.write("    restart: always\n")

# Generate proxy-list.txt.
#
with open("proxy-list.txt", "w") as f:
    f.write(WARNING)
    
    for index in range(len(NAMES)):
        f.write(f'http://127.0.0.1:{9990+index}\n')

If I want to add or remove proxies then I simply edit the NAMES list, run the script again, restart Docker Compose, and voilà!
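
If the base port or the refresh interval also needs to vary, the same logic can be wrapped in functions that return the file contents as strings. This is a sketch of one possible refactoring, not the script as published; the function names and the base_port and change_seconds parameters are hypothetical.

```python
def compose_yaml(names, base_port=9990, change_seconds=60):
    """Return docker-compose.yml content with one Tor proxy service per name."""
    lines = ["version: '3'", "", "services:"]
    for index, name in enumerate(names):
        lines += [
            f"  tor-{name}:",
            f"    container_name: 'tor-{name}'",
            "    image: 'pickapp/tor-proxy:latest'",
            "    ports:",
            f"      - '{base_port + index}:8888'",
            "    environment:",
            f"      - IP_CHANGE_SECONDS={change_seconds}",
            "    restart: always",
        ]
    return "\n".join(lines) + "\n"


def proxy_list(names, base_port=9990):
    """Return proxy-list.txt content with one local proxy URL per name."""
    return "".join(
        f"http://127.0.0.1:{base_port + index}\n" for index in range(len(names))
    )


# Example: two proxies that refresh their exit node every two minutes.
print(compose_yaml(["bart", "homer"], change_seconds=120))
print(proxy_list(["bart", "homer"]))
```

Returning strings rather than writing files directly makes the output easy to inspect before it replaces the existing configuration.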

Results

This is what an extract from the crawler logs looks like:

Proxies(good: 0, dead: 0, unchecked: 4, reanimated: 0, mean backoff: 0s)
Proxy <http://127.0.0.1:9993> is GOOD
Proxy <http://127.0.0.1:9992> is GOOD
Proxies(good: 2, dead: 0, unchecked: 2, reanimated: 0, mean backoff: 0s)
Proxy <http://127.0.0.1:9991> is GOOD
Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
Proxy <http://127.0.0.1:9990> is GOOD
Proxies(good: 4, dead: 0, unchecked: 0, reanimated: 0, mean backoff: 0s)
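
To keep an eye on proxy health over a long crawl, the summary lines can be turned into numbers. A minimal sketch, assuming the log format is exactly as in the extract above; the parse_summary helper is hypothetical and not part of scrapy-rotating-proxies.

```python
import re

# Pattern derived from the summary lines in the log extract above.
SUMMARY = re.compile(
    r"Proxies\(good: (?P<good>\d+), dead: (?P<dead>\d+), "
    r"unchecked: (?P<unchecked>\d+), reanimated: (?P<reanimated>\d+), "
    r"mean backoff: (?P<backoff>\d+)s\)"
)


def parse_summary(line):
    """Return the counters from a summary line as a dict, or None if no match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


line = "Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)"
print(parse_summary(line))
```

Lines that are not summaries, such as the "is GOOD" messages, simply return None.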

The addresses for the proxies are fixed (sampled from the list in proxy-list.txt). However, each Tor proxy refreshes its exit node every minute. Here are the logs from a slightly updated version of the Tor proxy Docker image:

🔁 HUP → Tor.
📌 exit IP: 109.70.100.50.
🔁 HUP → Tor.
📌 exit IP: 31.7.61.190.
🔁 HUP → Tor.
📌 exit IP: 178.20.55.18.
🔁 HUP → Tor.
📌 exit IP: 185.220.102.242.
🔁 HUP → Tor.
📌 exit IP: 109.70.100.51.

This is happening for each of the proxies, so requests are effectively being sent from a constantly changing set of IP addresses. A good way to stay below the radar!
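
To check how varied the exit nodes really are, the exit IPs can be pulled out of the container logs and counted. A minimal sketch, assuming the log lines have the format shown above; the exit_ips helper is hypothetical.

```python
import re

# The log extract shown above.
LOG = """\
🔁 HUP → Tor.
📌 exit IP: 109.70.100.50.
🔁 HUP → Tor.
📌 exit IP: 31.7.61.190.
🔁 HUP → Tor.
📌 exit IP: 178.20.55.18.
🔁 HUP → Tor.
📌 exit IP: 185.220.102.242.
🔁 HUP → Tor.
📌 exit IP: 109.70.100.51.
"""

# Capture a dotted-quad IPv4 address, leaving out the trailing full stop.
EXIT_IP = re.compile(r"exit IP: (\d{1,3}(?:\.\d{1,3}){3})")


def exit_ips(log_text):
    """Return the exit IPs found in log_text, in order of appearance."""
    return EXIT_IP.findall(log_text)


ips = exit_ips(LOG)
print(len(ips), "rotations,", len(set(ips)), "distinct exit IPs")
```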
