Scrapy with a Rotating Tor Proxy
This post shows an approach to using a rotating Tor proxy with Scrapy.
I’m using the scrapy-rotating-proxies download middleware package to rotate through a set of proxies, ensuring that my requests are originating from a selection of IP addresses. However, I need to have those IP addresses evolve over time too, so I’m using the Tor network.
Setup
I’ve got the following in the settings.py for my Scrapy project:

    DOWNLOADER_MIDDLEWARES = {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    }

    ROTATING_PROXY_LIST_PATH = 'proxy-list.txt'
    ROTATING_PROXY_PAGE_RETRY_TIMES = 5
This (1) specifies where the package middleware fits into the pipeline for processing requests and (2) points to a file, proxy-list.txt, which contains a list of proxies. There are other settings for the package, but they are not important right now.
Proxy List
The contents of proxy-list.txt look like this:
    # Generated by create-proxies script.

    http://127.0.0.1:9990
    http://127.0.0.1:9991
    http://127.0.0.1:9992
    http://127.0.0.1:9993
So I’m running four local proxies. How? Well, with Docker, of course!
The scrapy-rotating-proxies package ensures that
- requests are sent out via these proxies and
- the proxies are used in rotation, so that requests are spread across the working proxies rather than all going through a single one.
The reason for rotating through a list of proxies is to ensure that at any given time there are multiple proxies (each with a different IP address) available for sending requests.
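To make the idea of rotation concrete, here is a minimal sketch that reads a proxy list and cycles through it. It is a simplified round-robin illustration only, not the package's implementation, and the package's own selection logic may differ. The names load_proxies and SAMPLE are made up for this example, and skipping the comment line is a choice made in this sketch.

```python
from itertools import cycle

# Sample contents mirroring proxy-list.txt from this post.
SAMPLE = """\
# Generated by create-proxies script.

http://127.0.0.1:9990
http://127.0.0.1:9991
http://127.0.0.1:9992
http://127.0.0.1:9993
"""


def load_proxies(text):
    """Return proxy URLs from the text, skipping blank and comment lines."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            proxies.append(line)
    return proxies


# Round-robin rotation: after the last proxy we wrap around to the first.
rotation = cycle(load_proxies(SAMPLE))
first_six = [next(rotation) for _ in range(6)]
```

With four proxies in the list, the fifth request in this sketch goes back to the first proxy.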
Tor Proxies
In order to access a truly diverse set of IP addresses I’m tapping into the Tor network via the pickapp/tor-proxy Docker image.
Using Docker Compose it’s easy to spin up a cluster of Tor proxies. This is my docker-compose.yml:

    # Generated by create-proxies script.

    version: '3'

    services:
      tor-bart:
        container_name: 'tor-bart'
        image: 'pickapp/tor-proxy:latest'
        ports:
          - '9990:8888'
        environment:
          - IP_CHANGE_SECONDS=60
        restart: always
      tor-homer:
        container_name: 'tor-homer'
        image: 'pickapp/tor-proxy:latest'
        ports:
          - '9991:8888'
        environment:
          - IP_CHANGE_SECONDS=60
        restart: always
      tor-marge:
        container_name: 'tor-marge'
        image: 'pickapp/tor-proxy:latest'
        ports:
          - '9992:8888'
        environment:
          - IP_CHANGE_SECONDS=60
        restart: always
      tor-lisa:
        container_name: 'tor-lisa'
        image: 'pickapp/tor-proxy:latest'
        ports:
          - '9993:8888'
        environment:
          - IP_CHANGE_SECONDS=60
        restart: always
There are four services defined, each of which maps port 8888 on the container to a specific host port (a sequence of ports starting at 9990 and corresponding to the ports listed in proxy-list.txt).
    CONTAINER ID   IMAGE                    PORTS                    NAMES
    98feb5a034e6   datawookie/tor-privoxy   0.0.0.0:9990->8888/tcp   tor-bart
    26f05b1deb17   datawookie/tor-privoxy   0.0.0.0:9991->8888/tcp   tor-homer
    b856ded83585   datawookie/tor-privoxy   0.0.0.0:9992->8888/tcp   tor-marge
    c352aea63eed   datawookie/tor-privoxy   0.0.0.0:9993->8888/tcp   tor-lisa
Setting the IP_CHANGE_SECONDS environment variable to 60 causes the Tor exit node used by a proxy to change every minute.
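One way to confirm that each proxy really exits from a different address is to ask an IP-echo service through every proxy in turn. The sketch below uses only the standard library. The function names (local_proxies, exit_ip, report) are hypothetical, and the endpoint in CHECK_URL is an assumption: any service that returns JSON with an "IP" field would do.

```python
import json
import urllib.request

# Assumption: an IP-echo endpoint that returns JSON containing an "IP" field.
CHECK_URL = "https://check.torproject.org/api/ip"


def local_proxies(count=4, base_port=9990):
    """Return local proxy URLs on consecutive ports, as in proxy-list.txt."""
    return [f"http://127.0.0.1:{base_port + i}" for i in range(count)]


def exit_ip(proxy, url=CHECK_URL, timeout=30):
    """Return the public IP address seen when requesting `url` via `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["IP"]


def report():
    """Print the current exit IP for each local proxy (needs the containers)."""
    for proxy in local_proxies():
        print(proxy, "->", exit_ip(proxy))
```

Calling report() a couple of minutes apart, with the containers running, should show different addresses if the exit nodes are being refreshed as configured.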
Generating Configuration
To make this setup more flexible I have a script, create-proxies, which generates the contents of proxy-list.txt and docker-compose.yml.
    #!/usr/bin/env python3

    NAMES = ['bart', 'homer', 'marge', 'lisa']

    WARNING = "# Generated by create-proxies script.\n\n"

    # Generate docker-compose.yml.
    #
    with open("docker-compose.yml", "w") as f:
        f.write(WARNING)
        f.write("version: '3'\n\nservices:\n")
        # One service per name, on consecutive host ports starting at 9990.
        for index, name in enumerate(NAMES):
            f.write(f"  tor-{name}:\n")
            f.write(f"    container_name: 'tor-{name}'\n")
            f.write("    image: 'pickapp/tor-proxy:latest'\n")
            f.write("    ports:\n")
            f.write(f"      - '{9990+index}:8888'\n")
            f.write("    environment:\n")
            f.write("      - IP_CHANGE_SECONDS=60\n")
            f.write("    restart: always\n")

    # Generate proxy-list.txt.
    #
    with open("proxy-list.txt", "w") as f:
        f.write(WARNING)
        for index, name in enumerate(NAMES):
            f.write(f'http://127.0.0.1:{9990+index}\n')
If I want to add or remove proxies then I simply edit the NAMES list, run the script again, restart Docker Compose and voila!
Results
This is what an extract from the crawler logs looks like:
    Proxies(good: 0, dead: 0, unchecked: 4, reanimated: 0, mean backoff: 0s)
    Proxy <http://127.0.0.1:9993> is GOOD
    Proxy <http://127.0.0.1:9992> is GOOD
    Proxies(good: 2, dead: 0, unchecked: 2, reanimated: 0, mean backoff: 0s)
    Proxy <http://127.0.0.1:9991> is GOOD
    Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
    Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
    Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
    Proxy <http://127.0.0.1:9990> is GOOD
    Proxies(good: 4, dead: 0, unchecked: 0, reanimated: 0, mean backoff: 0s)
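If you want to keep an eye on proxy health over a long crawl, the summary lines can be pulled out of the log with a regular expression. This is a rough sketch that assumes the summary format is exactly as in the extract above (in particular, a whole number of seconds for the mean backoff). The names STATS_PATTERN and parse_proxy_stats are made up for this example.

```python
import re

# Matches summary lines such as:
# Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
STATS_PATTERN = re.compile(
    r"Proxies\(good: (?P<good>\d+), dead: (?P<dead>\d+), "
    r"unchecked: (?P<unchecked>\d+), reanimated: (?P<reanimated>\d+), "
    r"mean backoff: (?P<backoff>\d+)s\)"
)


def parse_proxy_stats(line):
    """Return the counts from a summary line as a dict, or None if no match."""
    match = STATS_PATTERN.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

Lines reporting on a single proxy (such as "Proxy <http://127.0.0.1:9990> is GOOD") do not match the pattern, so the function returns None for them.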
The addresses for the proxies are fixed (sampled from the list in proxy-list.txt). However, each Tor proxy refreshes its exit node every minute. Here are the logs from a slightly updated version of the Tor proxy Docker image:
    🔁 HUP → Tor.
    📌 exit IP: 109.70.100.50.
    🔁 HUP → Tor.
    📌 exit IP: 31.7.61.190.
    🔁 HUP → Tor.
    📌 exit IP: 178.20.55.18.
    🔁 HUP → Tor.
    📌 exit IP: 185.220.102.242.
    🔁 HUP → Tor.
    📌 exit IP: 109.70.100.51.
This is happening for each of the proxies, so requests are effectively being sent from a constantly changing set of IP addresses. A good way to stay below the radar!