Scraper Template with Selenium on Docker Host Network

This article was first published on Python | datawookie, and kindly contributed to python-bloggers.

This post will show you how to set up the following:

  • a Selenium instance and
  • a simple script connecting to Selenium.

Both of these will run in Docker containers and will communicate over the host network.

Selenium Service

Create a Selenium container, publishing port 4444 on the host. The -p 4444:4444 flag maps port 4444 in the container directly onto port 4444 on the host, so any request to port 4444 on the host is forwarded to the same port in the container.

docker run -d --rm --name selenium -p 4444:4444 selenium/standalone-chrome:3.141
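The container can take a moment to start accepting sessions. A small stdlib-only helper (not part of the original post, the function name is just a suggestion) can poll the Selenium status endpoint before the scraper connects:

```python
import json
import urllib.request
from urllib.error import URLError


def selenium_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a Selenium server at base_url reports itself ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/wd/hub/status", timeout=timeout) as resp:
            payload = json.loads(resp.read().decode())
    except (URLError, OSError, ValueError):
        # No server listening (or an unparseable response): not ready.
        return False
    # Selenium 3 nests the readiness flag under "value".
    return bool(payload.get("value", {}).get("ready", False))


# An address with nothing listening reports not ready.
print(selenium_ready("http://127.0.0.1:1"))
```

With the container from above running, `selenium_ready("http://localhost:4444")` should return True once the grid is up.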

Scraper Template

First we’ll create the framework for a simple scraper in Python.

from selenium import webdriver

# Address at which the Selenium container is published on the host.
SELENIUM_URL = "localhost:4444"

# Open a remote session using the Chrome browser inside the container.
browser = webdriver.Remote(
    f"http://{SELENIUM_URL}/wd/hub",
    {"browserName": "chrome"},
)

browser.get("https://www.google.com")

print(f"Retrieved URL: {browser.current_url}.")

# Use quit() rather than close() so the session is released on the Selenium server.
browser.quit()

It doesn’t actually do any scraping, but it does fire up a Selenium session and open a URL. Those are the biggest technical hurdles.

The script connects to Selenium at http://localhost:4444. Since localhost maps to the loopback IP address, 127.0.0.1, you can also use http://127.0.0.1:4444.

SELENIUM_URL = "127.0.0.1:4444"

In either case the script will produce the following result:

python3 google-selenium.py
Retrieved URL: https://www.google.com/.
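That localhost and 127.0.0.1 are interchangeable here is easy to confirm from the standard library:

```python
import socket

# localhost resolves to the IPv4 loopback address.
print(socket.gethostbyname("localhost"))  # 127.0.0.1
```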

Scraper Template in Docker

Now we’re going to wrap that script up in its own Docker image. Here’s the Dockerfile:

FROM python:3.8.5-slim AS base

RUN pip3 install selenium==3.141.0

COPY google-selenium.py /

CMD python3 google-selenium.py

Let’s build the image.

docker build -t google-selenium .

Now run it.

docker run --net=host google-selenium
Retrieved URL: https://www.google.com/.

We’ve specified --net=host, so we’re using the host’s network and the scraper container is accessing the Selenium instance via port 4444 on the host.

Precisely the same result as earlier (when running the scraper script directly), but now everything (except some networking) is in Docker.
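One optional refinement, not part of the original script: read the Selenium address from an environment variable so the same image can target a different address without a rebuild (the SELENIUM_URL variable name here mirrors the script; the -e flag usage is just a suggestion):

```python
import os

# Fall back to the host-network address used in this post when the variable
# is not set. Override at run time with, for example:
#   docker run --net=host -e SELENIUM_URL=127.0.0.1:4444 google-selenium
SELENIUM_URL = os.environ.get("SELENIUM_URL", "localhost:4444")

print(f"Connecting to http://{SELENIUM_URL}/wd/hub")
```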

Appendix: Network Details

To learn a little more about using the host network with Docker, let’s scratch beneath the surface. We’ll run a Bash shell using the scraper image.

docker run -it --net=host google-selenium /bin/bash

Now we’re the root user inside the container. It’s a very lightweight image, so we need to install some networking tools.

root@propane:/# apt update && apt install -y iproute2

Now we can check what the network configuration looks like inside the container.

root@propane:/# ip -br -c a
lo               UNKNOWN        127.0.0.1/8 ::1/128 
wlp3s0           UP             10.0.0.8/24 fe80::363:4c7c:a305:62a7/64 
br-f3c6be594433  DOWN           172.19.0.1/16 
docker0          UP             172.17.0.1/16 fe80::42:eeff:fe7f:173f/64 
br-e48eb1ef2d48  DOWN           172.22.0.1/16 
br-81ccbe03027c  DOWN           172.20.0.1/16 fe80::42:b0ff:fe48:4b7f/64 
vethd2e0b5e@if55 UP             fe80::d4a7:3ff:fefa:47e9/64

This is precisely the same network configuration that you’d see on the host: the container is effectively sharing the host’s network stack. This is simple, convenient and performant, but it also means that the container is not isolated from the host network.
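As an aside, the brief output of ip is easy to post-process if you ever need to inspect interfaces from a script. A minimal sketch (the sample lines are abbreviated from the listing above):

```python
def parse_ip_brief(text):
    """Parse `ip -br a` output into (interface, state, [addresses]) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            rows.append((fields[0], fields[1], fields[2:]))
    return rows


sample = """\
lo               UNKNOWN        127.0.0.1/8 ::1/128
docker0          UP             172.17.0.1/16 fe80::42:eeff:fe7f:173f/64
"""

for interface, state, addresses in parse_ip_brief(sample):
    print(interface, state, addresses)
```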

Check out the next post where we’ll use a bridge network to create the same setup.
