Selenium Template #4: Deploying to ECS

Posted on April 25, 2021 by Python | datawookie in Data science

This article was first published on Python | datawookie, and kindly contributed to python-bloggers.

This is part of a series of posts on Selenium templates.

In the previous posts we looked at a few ways to set up the infrastructure for a Selenium crawler, using Docker to run both the crawler and Selenium. In this post we’ll launch this setup in the cloud using AWS Elastic Container Service (ECS).

The earlier posts form an important prelude to deploying on ECS. My strategy is:

get it running locally ✅ then
get it running in Docker ✅ and finally
get it running on ECS 🚀.

There are certainly better and more efficient ways to do this, but that’s not the objective here. We’re just aiming to build something minimal: it’s simple and it works.

Docker Image

We’ll need to enhance the crawler script slightly.

import logging
from time import sleep
from subprocess import Popen, DEVNULL, STDOUT
from selenium import webdriver

logging.basicConfig(
  level=logging.INFO,
  format='%(asctime)s [%(levelname)7s] %(message)s',
)

HOST = "localhost"
PORT = 4444

# Check connection to host and port.
#
def check_connection():
  process = Popen(['nc', '-zv', HOST, str(PORT)], stdout=DEVNULL, stderr=STDOUT)
  #
  if process.wait() != 0:
    logging.warning(f"Unable to communicate with {HOST} on port {PORT}.")
    return False
  else:
    logging.info(f"Can communicate with {HOST} on port {PORT}!")
    return True

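RETRY = 10

# Poll until Selenium is reachable, giving up after RETRY attempts.
for i in range(RETRY):
  if check_connection():
    break
  logging.info("Sleeping.")
  sleep(1)
else:
  # The loop finished without a successful connection.
  raise SystemExit(f"Unable to reach Selenium after {RETRY} attempts.")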

SELENIUM_URL = f"{HOST}:{PORT}"

browser = webdriver.Remote(f"http://{SELENIUM_URL}/wd/hub", {'browserName': 'chrome'})

browser.get("https://www.google.com")

logging.info(f"Retrieved URL: {browser.current_url}.")

# quit() ends the remote session (close() only closes the current window).
browser.quit()

The major changes are:

Using the logging package for enhanced logging.
Polling port 4444 to see if Selenium is up and running (the importance of this will become apparent later on). 💡 This could easily be done in native Python, but I like the nc command line utility and it’s fun to run shell commands from Python. We all have different ideas about what constitutes “fun”.
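
For comparison, the same check in native Python might look something like the sketch below. It uses only the standard library. The function name port_is_open and the default timeout are illustrative choices and are not part of the crawler script.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection() resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling port_is_open("localhost", 4444) would then take the place of the nc subprocess, and the netcat package would no longer be needed in the image.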

The Dockerfile now also installs the netcat package.

FROM python:3.8.5-slim AS base

RUN apt-get update && apt-get install -y netcat

RUN pip3 install selenium==3.141.0

COPY google-selenium.py /

CMD python3 google-selenium.py

Build the Docker image. I’m also tagging it with my username because I’ll be pushing it to Docker Hub.

docker build -t google-selenium -t datawookie/google-selenium .

Check that it works locally. If it doesn’t work on localhost then it’s not going to work on ECS!

docker run --net=host google-selenium

2021-04-24 17:08:49,114 [   INFO] Can communicate with localhost on port 4444!
Retrieved URL: https://www.google.com/.

🎉 Success. Now we push the image to Docker Hub so that it’s available to ECS.

docker login
docker push datawookie/google-selenium

Create a Cluster

The first step towards deploying on ECS is to create a cluster. If you have an existing cluster then you can skip this step.

Log in to AWS and go to the ECS Dashboard. Press .
Select the Networking only template and press .
Choose a name for the cluster and press . If everything went smoothly then you will see a message informing you that the cluster has been successfully created.
Smash the button.

Define a Task

Once we have a cluster we can create a task which specifies one or more containers which will run together.

Click on Task Definitions (menu on left) and then press .
Select the Fargate launch type and press .
Choose a suitable name for the task. We’ll fudge the task size, specifying 1 GB for the task memory and 0.25 vCPU for the CPU.
We’ll be adding two containers. Press and provide the following details:

Container 1: Selenium
- Container name: selenium
- Image: selenium/standalone-chrome:3.141
- Port mappings: 4444
Container 2: Crawler
- Container name: google-selenium
- Image: The location of the crawler image, which I’ve pushed to my Docker Hub repository.
- Port mappings: none (the crawler does not listen on any port)

Scroll to the bottom of the page and press .
Click on Task Definitions again and you should see the freshly created task in the table.
Click on the link to the task definition.
You’ll see a list of revisions. Since we’ve just created this task there will only be a single revision. Each time we make a change to the task a new revision will be created. To review the details of the task, click on the link to the first revision.

The details of the task are captured as JSON.

The configuration file below has been abridged and edited for clarity.

{
  "family": "google-selenium",
  "revision": 1,
  "status": "ACTIVE",
  "cpu": "256",
  "memory": "1024",
  "networkMode": "awsvpc",
  "volumes": [],
  "containerDefinitions": [
    {
      "image": "selenium/standalone-chrome:3.141",
      "name": "selenium",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/google-selenium"
        }
      },
      "portMappings": [
        {
          "hostPort": 4444,
          "protocol": "tcp",
          "containerPort": 4444
        }
      ]
    },
    {
      "image": "datawookie/google-selenium",
      "name": "google-selenium",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/google-selenium"
        }
      }
    }
  ],
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "requiresCompatibilities": ["FARGATE"]
}

Details to note:

The name of the task is given by the family key.
The operating parameters of the task are specified by the cpu and memory keys.
Each container is defined by an object in the containerDefinitions array.
The images, names and port mappings (where applicable) are listed for each container.
There’s a log group associated with the task, /ecs/google-selenium, which can be used to find logs in CloudWatch.
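
To make these points concrete, here is a small sketch that loads a task definition and prints the details listed above. The JSON literal is a trimmed copy of the abridged configuration, and summarise_task is a hypothetical helper name, not something provided by AWS.

```python
import json

TASK_DEFINITION = """
{
  "family": "google-selenium",
  "cpu": "256",
  "memory": "1024",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "selenium",
      "image": "selenium/standalone-chrome:3.141",
      "portMappings": [{"hostPort": 4444, "protocol": "tcp", "containerPort": 4444}]
    },
    {
      "name": "google-selenium",
      "image": "datawookie/google-selenium"
    }
  ]
}
"""


def summarise_task(definition: dict) -> list:
    """Return one readable line per detail of interest in a task definition."""
    lines = [
        f"task: {definition['family']}",
        f"cpu: {definition['cpu']} units, memory: {definition['memory']} MiB",
    ]
    for container in definition["containerDefinitions"]:
        # portMappings is optional, so fall back to an empty list.
        ports = [m["containerPort"] for m in container.get("portMappings", [])]
        lines.append(f"container: {container['name']} ({container['image']}) ports={ports}")
    return lines


for line in summarise_task(json.loads(TASK_DEFINITION)):
    print(line)
```

Note that the cpu value of 256 units corresponds to the 0.25 vCPU chosen earlier, and the memory value of 1024 MiB to the 1 GB of task memory.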

Networking: The task will run using the AWS Fargate serverless compute engine. The network type associated with the task is thus awsvpc, which means that containers within the task can communicate via localhost. This makes communication between containers quite simple and is not dissimilar to using the host network with Docker. The crawler will simply look for the Selenium instance at 127.0.0.1:4444 (or, equivalently, localhost:4444).

Run the Task

The moment of truth: we’re going to run the task.

Click on Clusters (menu on left) and then the link to the cluster.
Select the Tasks tab and press .
Select the Fargate launch type. Choose a cluster VPC and subnet (you can just choose the first on each list).
Whack the button.
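
The console steps above could probably also be scripted. The sketch below is one possible approach using the boto3 ECS client and its run_task call; it is not what was done in this post. The cluster name and subnet ID are hypothetical placeholders, and enabling a public IP is an assumption that the subnet is public (so that the images can be pulled from Docker Hub).

```python
def build_run_task_request(cluster: str, task_definition: str, subnets: list) -> dict:
    """Build keyword arguments for the ECS RunTask API call."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                # Assumption: a public subnet, so a public IP is needed to pull images.
                "assignPublicIp": "ENABLED",
            }
        },
    }


def run_task(request: dict) -> str:
    """Submit the request with boto3 and return the ARN of the started task."""
    import boto3  # Imported lazily so the request can be built without boto3 installed.

    response = boto3.client("ecs").run_task(**request)
    return response["tasks"][0]["taskArn"]


request = build_run_task_request(
    cluster="my-cluster",              # hypothetical cluster name
    task_definition="google-selenium:1",
    subnets=["subnet-00000000"],       # hypothetical subnet ID
)
print(request)
# run_task(request)  # Uncomment to launch; requires AWS credentials.
```

Keeping the request builder separate from the API call makes it possible to inspect the request before anything is launched.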

The running task will be assigned a unique task ID (like eb86fa0e2bff4aeb8e0d69bdc79eea5a). Click on the task ID link. You’ll see a table with the two containers listed, both of which will initially be in the PENDING state. Wait a moment and refresh the page. You should find that both of the containers are RUNNING. Refresh again and they should both be STOPPED. This means that the task has run and we can now inspect the logs. Click the dropdown next to each container name and follow the View logs in CloudWatch link.

Selenium Logs

These are the logs for the Selenium container:

2021-04-25 04:29:22,559 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf"
2021-04-25 04:29:22,561 INFO supervisord started with pid 8
2021-04-25 04:29:23,563 INFO spawned: 'xvfb' with pid 10
2021-04-25 04:29:23,565 INFO spawned: 'selenium-standalone' with pid 11
2021-04-25 04:29:24,566 INFO success: xvfb entered RUNNING state
2021-04-25 04:29:24,566 INFO success: selenium-standalone entered RUNNING state
04:29:25.167 INFO Selenium server version: 3.141.59, revision: e82be7d358
04:29:25.765 INFO Launching a standalone Selenium Server on port 4444
04:29:27.661 INFO Initialising WebDriverServlet
04:29:28.562 INFO Selenium Server is up and running on port 4444
04:29:34.765 INFO Detected dialect: W3C
04:29:35.061 INFO Started new session 04468dff0f93c1e29008b35b019799b1
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2021-04-25 04:29:39,156 WARN received SIGTERM indicating exit request
2021-04-25 04:29:39,156 INFO waiting for xvfb, selenium-standalone to die
2021-04-25 04:29:40,158 INFO stopped: selenium-standalone (terminated by SIGTERM)
2021-04-25 04:29:41,160 INFO stopped: xvfb (terminated by SIGTERM)

The logs above have been abridged and edited for clarity.

The Selenium container was initialised at around 04:29:22 and terminated at 04:29:41. It was ready to receive requests at 04:29:28 and created a single new session at 04:29:35.

Crawler Logs

And these are the logs for the crawler.

2021-04-25 04:29:22,557 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:22,557 [   INFO] Sleeping.
2021-04-25 04:29:23,560 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:23,561 [   INFO] Sleeping.
2021-04-25 04:29:24,656 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:24,656 [   INFO] Sleeping.
2021-04-25 04:29:25,756 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:25,756 [   INFO] Sleeping.
2021-04-25 04:29:26,761 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:26,761 [   INFO] Sleeping.
2021-04-25 04:29:27,766 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:27,767 [   INFO] Sleeping.
2021-04-25 04:29:28,860 [   INFO] Can communicate with localhost on port 4444!
2021-04-25 04:29:38,258 [   INFO] Retrieved URL: https://www.google.com/.

The crawler container was initialised at 04:29:22. This was before the Selenium container was ready to receive requests, so the first few attempts to communicate with Selenium were unsuccessful. However, by 04:29:28 the Selenium container was ready to receive requests (see Selenium logs above) and the crawler was able to establish communication. The crawler then triggered the creation of a new Selenium session (started at 04:29:35 according to the Selenium logs) and retrieved the content of https://www.google.com/. Finally the crawler exited, which in turn caused the Selenium container to terminate.
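
The length of the wait can be read straight off the log timestamps. Here is a minimal sketch, assuming the log format configured in the crawler script. The sample lines are copied from the crawler logs above (intermediate attempts omitted) and parse_timestamp is an illustrative helper name.

```python
from datetime import datetime

CRAWLER_LOG = """\
2021-04-25 04:29:22,557 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:27,766 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:28,860 [   INFO] Can communicate with localhost on port 4444!
"""


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


lines = CRAWLER_LOG.splitlines()
first_attempt = parse_timestamp(lines[0])
connected = parse_timestamp(next(line for line in lines if "Can communicate" in line))
waited = (connected - first_attempt).total_seconds()
print(f"Waited {waited:.3f} seconds for Selenium.")
```

For this run the crawler waited a little over six seconds, comfortably inside the ten retries allowed by the script.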

Conclusion

Waiting for the Selenium service is paramount. If we didn’t wait then the crawler would fail and terminate before Selenium was ready to accept requests. In later posts we’ll see how to create explicit dependencies between containers.
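
An open port does not strictly guarantee that Selenium is ready to create sessions. A slightly stricter check, sketched below, polls the server's status endpoint instead. It assumes that the Selenium 3 standalone server reports readiness as value.ready in the JSON returned from /wd/hub/status. The function names, retry count and timeouts are illustrative.

```python
import http.client
import json
import time
import urllib.request


def selenium_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the status endpoint at url reports value.ready as true."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Server not reachable yet, or the response was not valid JSON.
        return False
    value = payload.get("value") if isinstance(payload, dict) else None
    return isinstance(value, dict) and value.get("ready") is True


def wait_for_selenium(url: str, attempts: int = 10, delay: float = 1.0) -> None:
    """Poll until Selenium is ready; raise TimeoutError if it never is."""
    for _ in range(attempts):
        if selenium_ready(url):
            return
        time.sleep(delay)
    raise TimeoutError(f"Selenium at {url} not ready after {attempts} attempts.")
```

In the crawler this would be called as wait_for_selenium("http://localhost:4444/wd/hub/status") before creating the remote browser.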

So there you have it, a minimal setup to run your containers in a serverless environment on AWS Elastic Container Service. Although I’ve illustrated the principles with a simple Selenium crawler, the same approach can be used for many other container configurations. 🚀
