Selenium Crawler #3: Docker Compose
In two previous posts we’ve looked at how to set up a simple scraper which uses Selenium in Docker, communicating via the host network and bridge network. Both of those setups have involved launching separate containers for the scraper and Selenium. In this post we’ll see how to wrap everything up in a single entity using Docker Compose.
First Iteration
This docker-compose.yml looks like it should do the job.
version: '2'
services:
  selenium:
    container_name: 'selenium'
    image: 'selenium/standalone-chrome:3.141'
  google:
    container_name: 'google-selenium'
    image: 'google-selenium-bridge-user-defined'
    depends_on:
      - selenium
We have indicated that the scraper container depends on the Selenium container (via depends_on). However, this only affects the order in which the containers are started. It won’t ensure that Selenium is ready to accept connections before the scraper is launched.
So this is what happens:
- The Selenium container is launched.
- The scraper container is launched.
- Selenium is not yet ready to accept connections, so the scraper dies with an error.
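To make the race concrete, here is a minimal sketch (standard library only) of what "ready to accept connections" means at the TCP level. The function name wait_for_port and its timeout and interval parameters are hypothetical, chosen for illustration; the host name selenium and port 4444 come from the setup above. Note that an open port is only a rough proxy for readiness: it does not guarantee that Selenium can already create a browser session.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; not part of Selenium or Docker.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, host not resolvable yet, or timed out.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example use inside the scraper container (assumed service name and port):
# if not wait_for_port("selenium", 4444):
#     raise SystemExit("Selenium did not come up in time.")
```

A check like this retries at a fixed interval, which is simple but makes no attempt to space out the retries.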
Fault Tolerant Scraper
We need to make the scraper more resilient. We’ll wrap the call to webdriver.Remote() in a function and use a decorator from the backoff package to apply exponential backoff.
import sys
import urllib3
import backoff
from selenium import webdriver
import logging
#
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(message)s'
)
#
logging.getLogger('urllib3').setLevel(logging.ERROR)

SELENIUM_URL = "selenium:4444"

@backoff.on_exception(
    backoff.expo,
    urllib3.exceptions.MaxRetryError,
    max_tries=5,
    jitter=None
)
def selenium_connect(url):
    return webdriver.Remote(url, {'browserName': 'chrome'})

try:
    browser = selenium_connect(f"http://{SELENIUM_URL}/wd/hub")
except urllib3.exceptions.MaxRetryError:
    logging.error("Unable to connect to Selenium.")
    sys.exit(1)

browser.get("https://www.google.com")

logging.info(f"Retrieved URL: {browser.current_url}.")

browser.close()
Let’s see what happens when we run this script without Selenium.
2021-04-19 06:51:14,760 Could not connect to port 4444 on host selenium
2021-04-19 06:51:14,760 Could not get IP address for host: selenium
2021-04-19 06:51:16,436 Backing off selenium_connect(...) for 1.0s
2021-04-19 06:51:17,918 Could not connect to port 4444 on host selenium
2021-04-19 06:51:17,918 Could not get IP address for host: selenium
2021-04-19 06:51:19,618 Backing off selenium_connect(...) for 2.0s
2021-04-19 06:51:22,030 Could not connect to port 4444 on host selenium
2021-04-19 06:51:22,030 Could not get IP address for host: selenium
2021-04-19 06:51:23,756 Backing off selenium_connect(...) for 4.0s
2021-04-19 06:51:28,190 Could not connect to port 4444 on host selenium
2021-04-19 06:51:28,190 Could not get IP address for host: selenium
2021-04-19 06:51:29,946 Backing off selenium_connect(...) for 8.0s
2021-04-19 06:51:38,372 Could not connect to port 4444 on host selenium
2021-04-19 06:51:38,372 Could not get IP address for host: selenium
2021-04-19 06:51:40,074 Giving up selenium_connect(...) after 5 tries
2021-04-19 06:51:40,074 Unable to connect to Selenium.
The above output has been abridged for clarity.
The script makes five attempts to connect to Selenium, waiting progressively longer between attempts (1, 2, 4 and 8 seconds), before ultimately giving up.
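The sequence of waits can be reproduced with a few lines of plain Python. This is a stand-in written for illustration, not the backoff package's own code: the function name expo_delays and its base and factor parameters are assumptions. With five tries there are only four waits, because no wait follows the final failed attempt.

```python
def expo_delays(max_tries, base=2, factor=1):
    """Yield the wait (in seconds) before each retry: factor * base**n.

    Illustrative stand-in for exponential backoff without jitter.
    """
    # max_tries attempts means max_tries - 1 waits between them.
    for n in range(max_tries - 1):
        yield factor * base ** n


delays = list(expo_delays(5))
print(delays)       # [1, 2, 4, 8]
print(sum(delays))  # 15
```

The waits add up to 15 seconds. The log above spans a longer period because each failed connection attempt also takes time.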
Docker with User-Defined Bridge Network
Let’s wrap that up in a Docker image.
FROM python:3.8.5-slim AS base

RUN pip3 install selenium==3.141.0 backoff==1.10.0

COPY google-selenium-robust.py /

CMD python3 google-selenium-robust.py
Now build the image.
docker build -t google-selenium-robust .
We’ll try running that using the user-defined bridge network from the previous post.
docker run --net=google google-selenium-robust
2021-04-19 06:32:25,544 Retrieved URL: https://www.google.com/.
Looks promising.
Second Iteration
Revise docker-compose.yml to use this new image.
version: '2'
services:
  selenium:
    container_name: 'selenium'
    image: 'selenium/standalone-chrome:3.141'
  google:
    container_name: 'google-selenium'
    image: 'google-selenium-robust'
    depends_on:
      - selenium
Let’s try that out.
docker-compose up --abort-on-container-exit
We’ll step through the output to understand what’s happening.
The output below has been abridged for clarity.
Docker Compose kicks off the containers.
Starting selenium ... done
Starting google-selenium ... done
Attaching to selenium, google-selenium
The Selenium container starts.
selenium | 2021-04-19 04:53:05,936 supervisord started with pid 7
The scraper container starts, tries to connect to Selenium but fails. It waits to try again.
google-selenium | 2021-04-19 04:53:06,503 Could not connect to port 4444 on host selenium
google-selenium | 2021-04-19 04:53:06,503 Could not get IP address for host: selenium
google-selenium | 2021-04-19 04:53:06,506 Backing off selenium_connect(...) for 1.0s
The Selenium container continues to start up.
selenium | 2021-04-19 04:53:06,939 spawned: 'xvfb' with pid 9
selenium | 2021-04-19 04:53:06,941 spawned: 'selenium-standalone' with pid 10
selenium | 04:53:07.140 Selenium server version: 3.141.59, revision: e82be7d358
selenium | 2021-04-19 04:53:07,141 success: selenium-standalone entered RUNNING state
selenium | 04:53:07.218 Launching a standalone Selenium Server on port 4444
selenium | 04:53:07.448 Initialising WebDriverServlet
The scraper container tries again. But Selenium is still not ready, so it fails and waits to try again.
google-selenium | 2021-04-19 04:53:07,508 Could not connect to port 4444 on host selenium
google-selenium | 2021-04-19 04:53:07,508 Could not get IP address for host: selenium
google-selenium | 2021-04-19 04:53:07,512 Backing off selenium_connect(...) for 2.0s
The Selenium container is finally up and running. It accepts a connection from the scraper container and creates a new session.
selenium | 04:53:07.524 Selenium Server is up and running on port 4444
selenium | Starting ChromeDriver 89.0.4389.23 on port 27768
selenium | 04:53:10.131 Detected dialect: W3C
selenium | 04:53:10.152 Started new session
The scraper does its thing and exits.
google-selenium | 2021-04-19 04:53:11,686 Retrieved URL: https://www.google.com/.
google-selenium exited with code 0
Docker Compose winds down the Selenium container.
Aborting on container exit...
Stopping selenium ... done
Persistent Selenium
If you run docker-compose up without the --abort-on-container-exit flag, then the Selenium container stays up after the scraper job has finished.
docker-compose up
This means that you can then run the scraper again using the run command for docker-compose.
docker-compose run google
Starting selenium ... done
2021-04-19 06:21:01,107 Retrieved URL: https://www.google.com/.
This is rather handy. All of the infrastructure is in place and you can trigger the scraper whenever it’s required.
What Does the Network Look Like?
Docker Compose creates a user-defined bridge network.
docker network ls
NETWORK ID     NAME             DRIVER    SCOPE
d0ad6cdb74af   google_default   bridge    local
ea5ebd23a086   bridge           bridge    local
bb80a2809880   host             host      local
00b74ecbf970   none             null      local
As a result, we have automatic service discovery and the scraper container is able to connect to the selenium container by name.
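Service discovery here is ordinary name resolution: inside the scraper container the name selenium resolves to the Selenium container's address on the Compose network. A minimal sketch for checking this from Python follows. The helper name resolve_service is hypothetical, and the commented call assumes the code runs in a container attached to the same network.

```python
import socket


def resolve_service(name):
    """Return the IPv4 address for a host name, or None if it cannot be resolved.

    Hypothetical helper for illustration.
    """
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        # Name not known to the resolver, e.g. when run outside the network.
        return None


# Inside the scraper container (assumed service name from docker-compose.yml):
# print(resolve_service("selenium"))
```

Run outside the Compose network, the lookup for selenium would be expected to return None, which corresponds to the "Could not get IP address for host: selenium" messages in the earlier logs.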
Conclusion
I’m not sure that I’d frequently use this setup in practice, but it’s rather convenient that the services are all packaged up nicely and managed by Docker Compose.