Want to share your content on python-bloggers? click here.
In a previous post I looked at the HTTP request headers used to manage browser caching. In this post I’ll look at a real-world example. It’s a rather deep dive into something that’s actually quite simple, but I find that picking things apart helps me understand how all of the components fit together.
For my personal edification I’m writing a scraper to gather job listings posted by Cirrus Logic. Openings at the company are posted at https://www.cirrus.com/careers/.
If you click through on the Go button then you get to the listings.
This is what we are after!
Simple Python
In Developer Tools I found the request I was looking for and copied the corresponding curl command.
curl 'https://api.eu.lever.co/v0/postings/cirrus?mode=json' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Origin: https://www.cirrus.com' \
  -H 'Connection: keep-alive' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: cross-site' \
  -H 'If-None-Match: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"' \
  -H 'Priority: u=4'
Translating that directly into Python yields the following little script.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Origin': 'https://www.cirrus.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'If-None-Match': 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"',
    'Priority': 'u=4',
}

params = {
    'mode': 'json',
}

response = requests.get(
    'https://api.eu.lever.co/v0/postings/cirrus',
    params=params,
    headers=headers
)
But when I ran it I got an unexpected status code: rather than 200 (OK) I got 304 (Not Modified).
<Response [304]>
The 304 (Not Modified) code indicates that the server has declined to resend the data. It’s using browser caching to reduce the amount of data sloshing back and forth between the server and the client, effectively telling the browser that the data returned in the previous response has not been modified.
Browser
How does this work in the browser? Suppose it’s the first time that I visit the site (or that I have just cleared the browser cache). The request headers would look like this (some headers omitted for brevity):
GET /v0/postings/cirrus?mode=json HTTP/1.1
Host: api.eu.lever.co
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Accept: */*
Origin: https://www.cirrus.com
And these are the corresponding response headers (also pared down):
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 497872
ETag: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"
The 200 (OK) status code indicates success, and around 500 kB of JSON content was returned in the payload. Also included is an ETag header, which identifies this particular version of the response payload. An ETag in the response gives the browser a validator that it can use later to check whether its cached copy is still current.
A short time later I might refresh the page. The request headers now look like this:
GET /v0/postings/cirrus?mode=json HTTP/1.1
Host: api.eu.lever.co
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Accept: */*
Origin: https://www.cirrus.com
If-None-Match: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"
The value of the ETag header from the previous response is now included as an If-None-Match request header. These are the response headers this time:
HTTP/1.1 304 Not Modified
ETag: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"
The associated response status code is 304 (Not Modified), indicating that the data on the server has not been updated since the last request (it’s still consistent with the specified ETag). The Content-Type and Content-Length headers are absent because the response payload is empty.
You can find the cached data in Firefox by browsing to about:cache. The data are either cached in memory or on disk. I found the data for the above request listed under the items in the disk cache.
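If you want to watch this exchange without touching the real API, the following is a minimal sketch that uses only the standard library. It starts a throwaway local server that answers 200 with an ETag on the first request and 304 when the same ETag comes back in If-None-Match. The payload, the hash-based ETag scheme and the names ETagHandler and fetch are all assumptions made for illustration; this is not a description of how the Lever servers work.

```python
import hashlib
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in payload; the real API returns roughly 500 kB of JSON.
PAYLOAD = b'[{"text": "Applications Engineer"}]'
# Hypothetical ETag scheme: a weak validator built from a hash of the payload.
ETAG = 'W/"' + hashlib.sha1(PAYLOAD).hexdigest() + '"'


class ETagHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            # The client's cached copy is still current: headers only, no body.
            self.send_response(304)
            self.send_header("ETag", ETAG)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.send_header("ETag", ETAG)
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, format, *args):
        pass  # Keep the demo quiet.


def fetch(port, headers=None):
    """Send one GET request and return (status, etag, body)."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", "/v0/postings/cirrus?mode=json", headers=headers or {})
        response = connection.getresponse()
        return response.status, response.getheader("ETag"), response.read()
    finally:
        connection.close()


server = HTTPServer(("127.0.0.1", 0), ETagHandler)  # Port 0: use any free port.
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

first = fetch(port)  # Empty cache: no If-None-Match header.
second = fetch(port, {"If-None-Match": first[1]})  # Revalidate with the ETag.

server.shutdown()
server.server_close()

print(first[0], len(first[2]))
print(second[0], len(second[2]))
```

The first request should come back with status 200 and the full payload, the second with status 304 and an empty body, mirroring the two exchanges shown above.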
Stripping Off the Fluff
There are a load of headers in the original request, most of which are irrelevant. Let’s strip the script down to the minimum required to reproduce the 304 (Not Modified) status code.
import requests

headers = {
    'If-None-Match': 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"'
}

response = requests.get(
    'https://api.eu.lever.co/v0/postings/cirrus?mode=json',
    headers=headers
)
The If-None-Match request header is the source of the problem (we knew this already from our investigation in the browser). It’s being used to pass an ETag to the server. So the solution is simply to drop the If-None-Match header.
import requests

response = requests.get(
    'https://api.eu.lever.co/v0/postings/cirrus?mode=json'
)
Now we get a successful response and the required data in the payload.
<Response [200]>
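If you regularly paste headers copied from Developer Tools into a script, it may be handy to filter out the cache revalidation headers automatically rather than by hand. Here is a small sketch; the helper name drop_conditional_headers is hypothetical, and the inclusion of If-Modified-Since (the timestamp-based counterpart of If-None-Match, not covered in this post) is my own assumption about what you would want to strip.

```python
# Conditional request headers that a browser adds to revalidate its cache.
# If-Modified-Since is not discussed in the post; it is included here as an
# assumption because it can trigger the same 304 response based on timestamps.
CONDITIONAL_HEADERS = {"if-none-match", "if-modified-since"}


def drop_conditional_headers(headers):
    """Return a copy of headers without the cache revalidation headers."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS  # Header names are case-insensitive.
    }


copied = {
    "Accept": "*/*",
    "Origin": "https://www.cirrus.com",
    "If-None-Match": 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"',
}
cleaned = drop_conditional_headers(copied)
print(sorted(cleaned))  # ['Accept', 'Origin']
```

The cleaned dictionary can then be passed as headers= to requests.get() in place of the copied one.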
If we unpack the first few items in the payload then they yield a series of job listings. We’ll just print the job titles.
for job in response.json():
    print(job["text"])
Account Payable Specialist (EH-64000218)
Analog Design Engineer - Power (PC-64000100)
Applications Engineer (DO-64000212)
Applications Engineer (JR-64000232)
Benefits Specialist
Customer Program Manager - PC
Customer Program Manager - PC (NS-64000161)
Device Characterization and Spice Modeling Engineer (SB-TBD)
Director of Corporate Development (LL- 64000095)
Electronic Engineering Internship
The order of the roles is not the same as you see on the site because they are dynamically grouped into categories on the site.
Emulating a Browser Cache in Python
The purpose of the If-None-Match header is to enable your browser to efficiently manage cached versions of the resource. If you’re sending your requests from Python (or your language of choice) then you can simply omit this header and get a fresh response each time.
Sometimes, however, it can be useful to have a request cache because it can speed up development. This is especially the case if you are sending a large volume of requests or have a few slow requests. The requests-cache package is a drop-in substitute for the requests package that implements a local cache. Responses can be cached in a variety of backends. Definitely worth checking out!
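To get a feel for what ETag revalidation involves on the client side, here is a minimal hand-rolled sketch. It is not how requests-cache is implemented, just an illustration of the idea: remember the ETag and body for each URL, send the ETag back as If-None-Match, and reuse the stored body when the server answers 304. The class name ConditionalCache is hypothetical, and the session argument is assumed to behave like requests.Session (a get() method returning an object with status_code, headers, content and raise_for_status()).

```python
class ConditionalCache:
    """Remember ETags and bodies per URL and revalidate with If-None-Match."""

    def __init__(self, session):
        # Anything with a requests.Session-like get() method will do.
        self.session = session
        self._entries = {}  # Maps URL to (etag, body).

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached is not None:
            # Ask the server to send the body only if it has changed.
            headers["If-None-Match"] = cached[0]

        response = self.session.get(url, headers=headers)

        if response.status_code == 304 and cached is not None:
            # Not Modified: the body we stored earlier is still current.
            return cached[1]

        response.raise_for_status()
        etag = response.headers.get("ETag")
        if etag:
            # Store the fresh body so that the next call can revalidate it.
            self._entries[url] = (etag, response.content)
        return response.content
```

With the requests package installed you would create it as ConditionalCache(requests.Session()) and call get() with the postings URL. The first call downloads the full payload; later calls transfer only headers for as long as the server keeps reporting the same ETag. The entries live in memory only, so they are lost when the script exits.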
Conclusion
Although browser caching greatly improves your web experience, making sites significantly more responsive, it can get in the way of your web scraping efforts. Don’t send If-None-Match headers along with your web scraping requests and you’ll avoid getting a 304 (Not Modified) response.