Cookies & Headers from Selenium

One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all of the complexity around scraping with Selenium. However, it’s often the case that the API requests require a collection of cookies and headers, and those need to be gathered using Selenium.
In this case I have a two-step method:
- open the page in Selenium and retrieve the cookies and headers; then
- use the required cookies and/or headers to submit further requests using the requests package.
Getting Cookies & Headers
Here’s the function that I use to retrieve the cookies and headers.
import re


def get_cookies_headers(driver):
    # Get cookies from browser & unpack into a dictionary.
    cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}

    # Use a synchronous request to retrieve response headers.
    script = """
    var xhr = new XMLHttpRequest();
    xhr.open('GET', window.location.href, false);
    xhr.send(null);
    return xhr.getAllResponseHeaders();
    """
    headers = driver.execute_script(script)

    # Unpack headers into dictionary, splitting each line on the first ": ".
    headers = headers.splitlines()
    headers = dict([re.split(": +", header, maxsplit=1) for header in headers])

    return cookies, headers
Getting the cookies is relatively simple because the Selenium driver has a get_cookies() method. The object returned by get_cookies() is a list of dictionaries, which we transform into a single dictionary.
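To make that transformation concrete, here is a minimal sketch that runs without a browser. The list below only imitates the shape of what get_cookies() returns: the CONSENT value comes from the output shown later in this post, while the session_id cookie and all of the domain, path and flag values are made up for illustration.

```python
# Illustrative stand-in for driver.get_cookies(): a list of dictionaries,
# one per cookie. Everything except the CONSENT name/value is hypothetical.
raw_cookies = [
    {"name": "CONSENT", "value": "PENDING+054", "domain": ".google.com", "path": "/", "secure": True},
    {"name": "session_id", "value": "abc123", "domain": ".google.com", "path": "/", "httpOnly": True},
]

# Same dictionary comprehension as in get_cookies_headers(): keep only the
# name and value of each cookie and drop the remaining attributes.
cookies = {cookie["name"]: cookie["value"] for cookie in raw_cookies}

print(cookies)
# {'CONSENT': 'PENDING+054', 'session_id': 'abc123'}
```

One thing to keep in mind with this flattening: if the browser holds two cookies with the same name (for different domains or paths), only the last one survives in the dictionary.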
A little more work is required for the headers. There’s no dedicated method to get the headers, so we run a snippet of JavaScript that issues a synchronous GET request for the current URL and returns the response headers. The result is returned as a multi-line string, which is then parsed into a dictionary.
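The parsing step can be tried in isolation. The sketch below is a slightly more defensive variant of the split used in the function, not a replacement for it: parse_raw_headers is a hypothetical helper name, the sample string is assembled by hand from header values that appear later in this post, and the x-empty line is invented to show how a line without a space after the colon is handled.

```python
def parse_raw_headers(raw):
    """Parse a getAllResponseHeaders()-style string into a dictionary."""
    headers = {}
    for line in raw.splitlines():
        # Skip blank lines and anything without a name/value separator.
        if ":" not in line:
            continue
        # Split on the first colon only, since values may contain colons.
        name, value = line.split(":", 1)
        headers[name.strip().lower()] = value.strip()
    return headers


# Hand-built sample in the "name: value\r\n" layout used by the browser.
raw = "content-type: text/html; charset=UTF-8\r\nserver: gws\r\nx-empty:\r\n"

print(parse_raw_headers(raw))
# {'content-type': 'text/html; charset=UTF-8', 'server': 'gws', 'x-empty': ''}
```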
Driver
Let’s hook that up with a driver and see how well it works. I’ve got Selenium running in a Docker container and will access it via port 4444. I’m using the selenium==4.9.0 package.
import atexit
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from util import get_cookies_headers
SELENIUM_SERVER_URL = "http://127.0.0.1:4444/wd/hub"
chrome_options = ChromeOptions()
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Remote(
    command_executor=SELENIUM_SERVER_URL,
    options=chrome_options,
)
atexit.register(lambda: driver.quit())
driver.get("https://www.google.com/")
cookies, headers = get_cookies_headers(driver)
Both cookies and headers are dictionaries, as required for use with the requests package. Dumping a subset of the cookies as JSON gives:
{
  "CONSENT": "PENDING+054",
  "AEC": "Ackid1R8aA4SMd3lRtqdNWfmyuStZ8asnsieORbONgKWNabhDCMFZebYafY"
}
And here are selected headers:
{
  "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
  "cache-control": "private, max-age=0",
  "content-encoding": "br",
  "content-length": "72050",
  "content-type": "text/html; charset=UTF-8",
  "cross-origin-opener-policy": "same-origin-allow-popups; report-to=\"gws\"",
  "date": "Tue, 31 Oct 2023 14:12:59 GMT",
  "expires": "-1",
  "permissions-policy": "unload=()",
  "server": "gws",
  "strict-transport-security": "max-age=31536000",
  "x-frame-options": "SAMEORIGIN",
  "x-xss-protection": "0"
}
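With both dictionaries in hand, the second step of the method might look something like the sketch below. It is only an illustration under a few assumptions: make_session and its required_headers parameter are hypothetical names, the x-example-token header is invented, and which headers (if any) a particular API expects has to be worked out per site. The request is prepared but not sent, so the snippet just shows what would go over the wire.

```python
import requests


def make_session(cookies, headers=None, required_headers=()):
    """Build a requests.Session carrying cookies gathered via Selenium.

    required_headers is a hypothetical parameter: the names of captured
    headers that the target API is assumed to need on later requests.
    """
    session = requests.Session()
    session.cookies.update(cookies)
    headers = headers or {}
    for name in required_headers:
        if name in headers:
            session.headers[name] = headers[name]
    return session


# Stand-ins for the dictionaries returned by get_cookies_headers().
# The x-example-token header is made up for this example.
cookies = {"CONSENT": "PENDING+054"}
headers = {"server": "gws", "x-example-token": "abc123"}

session = make_session(cookies, headers, required_headers=["x-example-token"])

# Prepare (but do not send) a request to inspect the outgoing headers.
request = requests.Request("GET", "https://www.google.com/")
prepared = session.prepare_request(request)

print(prepared.headers["Cookie"])
print(prepared.headers["x-example-token"])
```

Using a session rather than passing cookies= and headers= to every call means that any cookies set by later responses are kept and sent automatically on subsequent requests.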
Conclusion
Being able to retrieve cookies and headers from a dynamic website using Selenium can be handy when the underlying API requires specific cookies and/or headers.