Selenium Crawler #3: Docker Compose

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.
In two previous posts we’ve looked at how to set up a simple scraper which uses Selenium in Docker, communicating via the host network and bridge network. Both of those setups have involved launching separate containers for the scraper and Selenium. In this post we’ll see how to wrap everything up in a single entity using Docker Compose.

First Iteration

This docker-compose.yml looks like it should do the job.

version: '2'

services:
  selenium:
    container_name: 'selenium'
    image: 'selenium/standalone-chrome:3.141'
  
  google:
    container_name: 'google-selenium'
    image: 'google-selenium-bridge-user-defined'
    depends_on:
      - selenium

We have indicated that the scraper container depends on the Selenium container (via depends_on). However, this only affects the order in which the containers are started. It won’t ensure that Selenium is ready to accept connections before the scraper is launched.

So this is what happens:

  1. The Selenium container is launched.
  2. The scraper container is launched.
  3. Selenium is not yet ready to accept connections, so the scraper dies with an error.
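
To make "ready to accept connections" concrete, here is a minimal sketch of a readiness check that simply tries to open a TCP connection. The helper names port_is_open() and wait_for_port() are hypothetical (they are not part of the scraper), and an open port is only a rough proxy for Selenium being fully initialised.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False


def wait_for_port(host, port, attempts=10, delay=0.5):
    """Poll host:port until it accepts a connection or attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Inside the scraper container something like wait_for_port("selenium", 4444) could be called before connecting. Rather than polling the port, the rest of this post retries the connection itself.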

Fault Tolerant Scraper

We need to make the scraper more resilient. We’ll wrap the call to webdriver.Remote() in a function and use a decorator from the backoff package to apply exponential backoff.

import sys
import urllib3
import backoff
from selenium import webdriver

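import logging

# Prefix every log message with a timestamp.
logging.basicConfig(
  level=logging.INFO,
  format='%(asctime)s %(message)s'
)
# Only show errors from urllib3, which hides its per-retry warnings.
logging.getLogger('urllib3').setLevel(logging.ERROR)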

SELENIUM_URL = "selenium:4444"

@backoff.on_exception(
  backoff.expo,
  urllib3.exceptions.MaxRetryError,
  max_tries=5,
  jitter=None
)
def selenium_connect(url):
  return webdriver.Remote(url, {'browserName': 'chrome'})

try:
  browser = selenium_connect(f"http://{SELENIUM_URL}/wd/hub")
except urllib3.exceptions.MaxRetryError:
  logging.error("Unable to connect to Selenium.")
  sys.exit(1)

browser.get("https://www.google.com")

logging.info(f"Retrieved URL: {browser.current_url}.")

# quit() ends the WebDriver session explicitly.
browser.quit()

Let’s see what happens when we run this script without Selenium.

2021-04-19 06:51:14,760 Could not connect to port 4444 on host selenium
2021-04-19 06:51:14,760 Could not get IP address for host: selenium
2021-04-19 06:51:16,436 Backing off selenium_connect(...) for 1.0s
2021-04-19 06:51:17,918 Could not connect to port 4444 on host selenium
2021-04-19 06:51:17,918 Could not get IP address for host: selenium
2021-04-19 06:51:19,618 Backing off selenium_connect(...) for 2.0s
2021-04-19 06:51:22,030 Could not connect to port 4444 on host selenium
2021-04-19 06:51:22,030 Could not get IP address for host: selenium
2021-04-19 06:51:23,756 Backing off selenium_connect(...) for 4.0s
2021-04-19 06:51:28,190 Could not connect to port 4444 on host selenium
2021-04-19 06:51:28,190 Could not get IP address for host: selenium
2021-04-19 06:51:29,946 Backing off selenium_connect(...) for 8.0s
2021-04-19 06:51:38,372 Could not connect to port 4444 on host selenium
2021-04-19 06:51:38,372 Could not get IP address for host: selenium
2021-04-19 06:51:40,074 Giving up selenium_connect(...) after 5 tries
2021-04-19 06:51:40,074 Unable to connect to Selenium.

The above output has been abridged for clarity.

The script tries to connect to Selenium five times, waiting progressively longer between attempts (1, 2, 4 and 8 seconds), before ultimately giving up.
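
To see where those delays come from, here is a hand-rolled sketch of the same retry schedule in plain Python. This is not how the backoff package is implemented. The function retry_with_expo() and its parameters are hypothetical and only mirror the settings used in the script (five tries, no jitter).

```python
import time


def retry_with_expo(func, exceptions, max_tries=5, base=2, factor=1, sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_tries):
        try:
            return func()
        except exceptions:
            if attempt == max_tries - 1:
                # Out of attempts: let the last exception propagate.
                raise
            # Waits are factor * base ** attempt: 1, 2, 4, 8, ...
            sleep(factor * base ** attempt)
```

Five tries means only four waits, which is why the log shows delays of 1, 2, 4 and 8 seconds and then gives up. Passing a different sleep function (for example a list's append method) makes the schedule easy to inspect without actually waiting.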

Docker with User-Defined Bridge Network

Let’s wrap that up in a Docker image.

FROM python:3.8.5-slim AS base

RUN pip3 install selenium==3.141.0 backoff==1.10.0

COPY google-selenium-robust.py /

CMD python3 google-selenium-robust.py

Now build the image.

docker build -t google-selenium-robust .

We’ll try running that using the user-defined bridge network from the previous post.

docker run --net=google google-selenium-robust
2021-04-19 06:32:25,544 Retrieved URL: https://www.google.com/.

Looks promising.

Second Iteration

Revise docker-compose.yml to use this new image.

version: '2'

services:
  selenium:
    container_name: 'selenium'
    image: 'selenium/standalone-chrome:3.141'
  
  google:
    container_name: 'google-selenium'
    image: 'google-selenium-robust'
    depends_on:
      - selenium

Let’s try that out.

docker-compose up --abort-on-container-exit

We’ll step through the output to understand what’s happening.

The output below has been abridged for clarity.

Docker Compose kicks off the containers.

Starting selenium ... done
Starting google-selenium ... done
Attaching to selenium, google-selenium

The Selenium container starts.

selenium    | 2021-04-19 04:53:05,936 supervisord started with pid 7

The scraper container starts, tries to connect to Selenium but fails. It waits to try again.

google-selenium | 2021-04-19 04:53:06,503 Could not connect to port 4444 on host selenium
google-selenium | 2021-04-19 04:53:06,503 Could not get IP address for host: selenium
google-selenium | 2021-04-19 04:53:06,506 Backing off selenium_connect(...) for 1.0s

The Selenium container continues to start up.

selenium    | 2021-04-19 04:53:06,939 spawned: 'xvfb' with pid 9
selenium    | 2021-04-19 04:53:06,941 spawned: 'selenium-standalone' with pid 10
selenium    | 04:53:07.140 Selenium server version: 3.141.59, revision: e82be7d358
selenium    | 2021-04-19 04:53:07,141 success: selenium-standalone entered RUNNING state
selenium    | 04:53:07.218 Launching a standalone Selenium Server on port 4444
selenium    | 04:53:07.448 Initialising WebDriverServlet

The scraper container tries again. But Selenium is still not ready, so it fails and waits to try again.

google-selenium | 2021-04-19 04:53:07,508 Could not connect to port 4444 on host selenium
google-selenium | 2021-04-19 04:53:07,508 Could not get IP address for host: selenium
google-selenium | 2021-04-19 04:53:07,512 Backing off selenium_connect(...) for 2.0s

The Selenium container is finally up and running. It accepts a connection from the scraper container and creates a new session.

selenium    | 04:53:07.524 Selenium Server is up and running on port 4444
selenium    | Starting ChromeDriver 89.0.4389.23 on port 27768
selenium    | 04:53:10.131 Detected dialect: W3C
selenium    | 04:53:10.152 Started new session

The scraper does its thing and exits.

google-selenium | 2021-04-19 04:53:11,686 Retrieved URL: https://www.google.com/.
google-selenium exited with code 0

Docker Compose winds down the Selenium container.

Aborting on container exit...
Stopping selenium        ... done

Persistent Selenium

If you run docker-compose without the --abort-on-container-exit flag, the Selenium container stays up after the scraper job has finished.

docker-compose up

This means that you can then run the scraper again using the run command for docker-compose.

docker-compose run google
Starting selenium ... done
2021-04-19 06:21:01,107 Retrieved URL: https://www.google.com/.

This is rather handy. All of the infrastructure is in place and you can trigger the scraper whenever it’s required.
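
If the trigger needs to come from another Python process (a scheduler, say), a thin wrapper around the same command is one option. This is only a sketch: it assumes docker-compose is on the PATH and that it is run from the directory containing docker-compose.yml. The function names are hypothetical, and the --rm flag is an addition that removes the one-off container once it exits.

```python
import subprocess


def compose_run_command(service):
    """Build the docker-compose command that runs a one-off container for service."""
    # --rm removes the one-off container once it exits.
    return ["docker-compose", "run", "--rm", service]


def trigger(service):
    """Run the service once and return its exit code and combined output."""
    result = subprocess.run(
        compose_run_command(service),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode, result.stdout
```

Calling trigger("google") would then be the equivalent of typing docker-compose run --rm google in the shell.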

What Does the Network Look Like?

Docker Compose creates a user-defined bridge network for the project, which shows up here as google_default.

docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
d0ad6cdb74af        google_default      bridge              local
ea5ebd23a086        bridge              bridge              local
bb80a2809880        host                host                local
00b74ecbf970        none                null                local

As a result, we have automatic service discovery and the scraper container is able to connect to the selenium container by name.
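
A quick way to check the name resolution for yourself is to look up the service name from Python. This is a minimal sketch and resolve() is a hypothetical helper, not part of the scraper.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown names.
        return None
```

Run from a container attached to the Compose network, resolve("selenium") should return the Selenium container's address. Run on the host, where that name is normally unknown, it should return None.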

Conclusion

I’m not sure that I’d frequently use this setup in practice, but it’s rather convenient that the services are all packaged up nicely and managed by Docker Compose.
