Cookies & Headers from Selenium

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.

One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all of the complexity around scraping with Selenium. However, the API requests often require a collection of cookies and headers, and those need to be gathered using Selenium.

In this case I have a two-step method:

  1. open the page in Selenium and retrieve the cookies and headers; and
  2. use the required cookies and/or headers to submit further requests using the requests package.

Getting Cookies & Headers

Here’s the function that I use to retrieve the cookies and headers.

import re

def get_cookies_headers(driver):
    # Get cookies from browser & unpack into a dictionary.
    #    
    cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}

    # Use a synchronous request to retrieve response headers.
    #
    script = """
    var xhr = new XMLHttpRequest();
    xhr.open('GET', window.location.href, false);
    xhr.send(null);
    return xhr.getAllResponseHeaders();
    """
    headers = driver.execute_script(script)
    
    # Unpack headers into dictionary.
    #
    headers = headers.splitlines()
    headers = dict([re.split(": +", header, maxsplit=1) for header in headers])

    return cookies, headers

Getting the cookies is relatively simple because the Selenium driver has a get_cookies() method. The object returned by get_cookies() is a list of dictionaries, which we transform into a single dictionary.
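
To make that transformation concrete, here is a small sketch that needs no browser. The list is hand-written in the shape that get_cookies() returns; the second cookie value and the metadata fields shown are illustrative assumptions, not output captured from a real session.

```python
# Hand-written sample in the shape returned by driver.get_cookies():
# one dictionary per cookie, with metadata alongside the name and value.
raw_cookies = [
    {"name": "CONSENT", "value": "PENDING+054", "domain": ".google.com", "path": "/", "secure": True},
    {"name": "AEC", "value": "example-token", "domain": ".google.com", "path": "/", "secure": True},
]

# Keep only the name/value pairs, dropping the metadata.
cookies = {cookie["name"]: cookie["value"] for cookie in raw_cookies}

print(cookies)
```

The metadata (domain, path and so on) is discarded here. If a site sets cookies with the same name for different domains or paths, the later one would overwrite the earlier one in this flat dictionary.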

A little more work is required for the headers. There’s no dedicated method to get the headers, so we run a snippet of JavaScript that makes a synchronous request to the current URL and returns the response headers. The result is returned as a multi-line string, which is then parsed into a dictionary.
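
The parsing step can be tried in isolation. The sketch below applies the same split to a hand-written string in the format that getAllResponseHeaders() produces (one "name: value" pair per line, separated by CRLF). The helper name parse_headers is my own, introduced only for illustration.

```python
import re

# Sample in the format produced by XMLHttpRequest.getAllResponseHeaders():
# one "name: value" pair per line, with lines separated by CRLF.
raw_headers = (
    "cache-control: private, max-age=0\r\n"
    "content-type: text/html; charset=UTF-8\r\n"
    "date: Tue, 31 Oct 2023 14:12:59 GMT\r\n"
)


def parse_headers(raw):
    """Split a raw header block into a {name: value} dictionary."""
    # maxsplit=1 splits only at the first ": ", so values that themselves
    # contain a colon (such as the time in "date") stay intact.
    pairs = (re.split(": +", line, maxsplit=1) for line in raw.splitlines())
    return dict(pairs)


headers = parse_headers(raw_headers)
print(headers["date"])
```

Limiting the split to the first separator matters for headers like date, where the value contains further colons.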

Driver

Let’s hook that up with a driver and see how well it works. I’ve got Selenium running in a Docker container and will access it via port 4444. I’m using version 4.9.0 of the selenium package (selenium==4.9.0), and the script below imports the function above from a local util module.

import atexit

from selenium import webdriver
from selenium.webdriver import ChromeOptions

from util import get_cookies_headers

SELENIUM_SERVER_URL = "http://127.0.0.1:4444/wd/hub"

chrome_options = ChromeOptions()
chrome_options.add_argument("--disable-gpu")

driver = webdriver.Remote(
    command_executor=SELENIUM_SERVER_URL,
    options=chrome_options,
)

atexit.register(lambda: driver.quit())

driver.get("https://www.google.com/")

cookies, headers = get_cookies_headers(driver)

Both cookies and headers are dictionaries, as required for use with the requests package. Dumping a subset of the cookies as JSON gives:

{
  "CONSENT": "PENDING+054",
  "AEC": "Ackid1R8aA4SMd3lRtqdNWfmyuStZ8asnsieORbONgKWNabhDCMFZebYafY"
}

And here are selected headers:

{
  "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
  "cache-control": "private, max-age=0",
  "content-encoding": "br",
  "content-length": "72050",
  "content-type": "text/html; charset=UTF-8",
  "cross-origin-opener-policy": "same-origin-allow-popups; report-to=\"gws\"",
  "date": "Tue, 31 Oct 2023 14:12:59 GMT",
  "expires": "-1",
  "permissions-policy": "unload=()",
  "server": "gws",
  "strict-transport-security": "max-age=31536000",
  "x-frame-options": "SAMEORIGIN",
  "x-xss-protection": "0"
}
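
With both dictionaries in hand, the second step of the method (submitting further requests with the requests package) might look something like the sketch below. This is only an outline under some assumptions: the helper name build_session, the example URL and the x-requested-with header are all hypothetical, and which headers an API actually needs has to be worked out per site. The request is prepared but never sent, so the example runs without network access.

```python
import requests


def build_session(cookies, headers=None):
    """Return a requests.Session preloaded with cookies and chosen headers."""
    session = requests.Session()
    # Cookies added from a plain dictionary carry no domain, so the session
    # includes them in the requests it makes.
    session.cookies.update(cookies)
    # Pass only the headers the API is known to need. Response headers such
    # as content-length should not be replayed as request headers.
    session.headers.update(headers or {})
    return session


# Placeholder values for illustration only.
session = build_session(
    cookies={"CONSENT": "PENDING+054"},
    headers={"x-requested-with": "XMLHttpRequest"},
)

# Prepare (but do not send) a request to a made-up endpoint, then inspect
# the headers that would go over the wire.
request = requests.Request("GET", "https://www.example.com/api/items")
prepared = session.prepare_request(request)

print(prepared.headers.get("Cookie"))
print(prepared.headers.get("x-requested-with"))
```

To actually make the call you would use session.get(url) instead of preparing the request by hand. Note that the headers gathered by get_cookies_headers() are response headers, so they are mostly useful for picking out specific values (a token, for example) rather than for passing along wholesale.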

Conclusion

Being able to retrieve cookies and headers from a dynamic website using Selenium can be handy when the underlying API requires specific cookies and/or headers.
