One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the `requests` package and I don't need to worry about all of the complexity of scraping with Selenium. However, the API requests often require a collection of cookies and headers, and those need to be gathered using Selenium.
In this case I have a two-step method:

- open the page in Selenium and retrieve the cookies and headers; and
- use the required cookies and/or headers to submit further requests using the `requests` package.
Getting Cookies & Headers
Here’s the function that I use to retrieve the cookies and headers.
```python
import re


def get_cookies_headers(driver):
    # Get cookies from browser & unpack into a dictionary.
    #
    cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}

    # Use a synchronous request to retrieve response headers.
    #
    script = """
    var xhr = new XMLHttpRequest();
    xhr.open('GET', window.location.href, false);
    xhr.send(null);
    return xhr.getAllResponseHeaders();
    """
    headers = driver.execute_script(script)

    # Unpack headers into dictionary.
    #
    headers = headers.splitlines()
    headers = dict([re.split(": +", header, maxsplit=1) for header in headers])

    return cookies, headers
```
Getting the cookies is relatively simple because the Selenium driver has a `get_cookies()` method. The object returned by `get_cookies()` is a list of dictionaries (one per cookie), which we transform into a single dictionary mapping cookie names to values.
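To see what that transformation does, here's a minimal sketch using made-up cookie data in the shape that `get_cookies()` returns (each cookie dictionary also carries fields like `domain` and `path`, which are discarded):

```python
# Hypothetical sample of what driver.get_cookies() returns.
raw_cookies = [
    {"name": "CONSENT", "value": "PENDING+054", "domain": ".google.com", "path": "/"},
    {"name": "AEC", "value": "Ackid1R8", "domain": ".google.com", "path": "/"},
]

# Keep only the name/value pairs, as expected by the requests package.
cookies = {cookie["name"]: cookie["value"] for cookie in raw_cookies}
```

The result is `{"CONSENT": "PENDING+054", "AEC": "Ackid1R8"}`.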
A little more work is required for the headers. There’s no dedicated method to get the headers, so we run a snippet of JavaScript. The result is returned as a multi-line string, which is then parsed into a dictionary.
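For illustration, here's the same parsing applied to a hand-crafted string in the format produced by `getAllResponseHeaders()` (CRLF-delimited `name: value` lines). The `maxsplit=1` is important because header values can themselves contain further punctuation:

```python
import re

# Hypothetical raw header string, as returned by xhr.getAllResponseHeaders().
raw = "content-type: text/html; charset=UTF-8\r\ncache-control: private, max-age=0\r\nserver: gws"

# Split into lines, then split each line on the first ": " only.
headers = dict(re.split(": +", line, maxsplit=1) for line in raw.splitlines())
```

This yields a dictionary with entries like `headers["content-type"] == "text/html; charset=UTF-8"`.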
Driver
Let’s hook that up with a driver and see how well it works. I’ve got Selenium running in a Docker container and will access it via port 4444. I’m using the `selenium==4.9.0` package.
```python
import atexit

from selenium import webdriver
from selenium.webdriver import ChromeOptions

from util import get_cookies_headers

SELENIUM_SERVER_URL = "http://127.0.0.1:4444/wd/hub"

chrome_options = ChromeOptions()
chrome_options.add_argument("--disable-gpu")

driver = webdriver.Remote(
    command_executor=SELENIUM_SERVER_URL,
    options=chrome_options,
)
atexit.register(lambda: driver.quit())

driver.get("https://www.google.com/")

cookies, headers = get_cookies_headers(driver)
```
Both `cookies` and `headers` are dictionaries, as required for use with the `requests` package. Dumping a subset of the cookies as JSON gives:
{ "CONSENT": "PENDING+054", "AEC": "Ackid1R8aA4SMd3lRtqdNWfmyuStZ8asnsieORbONgKWNabhDCMFZebYafY" }
And here are selected headers:
{ "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000", "cache-control": "private, max-age=0", "content-encoding": "br", "content-length": "72050", "content-type": "text/html; charset=UTF-8", "cross-origin-opener-policy": "same-origin-allow-popups; report-to=\"gws\"", "date": "Tue, 31 Oct 2023 14:12:59 GMT", "expires": "-1", "permissions-policy": "unload=()", "server": "gws", "strict-transport-security": "max-age=31536000", "x-frame-options": "SAMEORIGIN", "x-xss-protection": "0" }
Conclusion
Being able to retrieve cookies and headers from a dynamic website using Selenium can be handy when the underlying API requires specific cookies and/or headers.