Scraping and Not Modified Responses

Python - datawookie

5 months ago

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.

In a previous post I looked at the HTTP request headers used to manage browser caching. In this post I’ll look at a real-world example. It’s a rather deep dive into something that’s actually quite simple. However, I find it helpful for my understanding to pick things apart and see how all of the components fit together.

For my personal edification I’m writing a scraper to gather job listings posted by Cirrus Logic. Openings at the company are posted at https://www.cirrus.com/careers/.

If you click through on the Go button then you get to the listings.

This is what we are after!

Simple Python

In Developer Tools I found the request I was looking for and copied the corresponding curl command.

curl 'https://api.eu.lever.co/v0/postings/cirrus?mode=json' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Origin: https://www.cirrus.com' \
  -H 'Connection: keep-alive' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: cross-site' \
  -H 'If-None-Match: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"' \
  -H 'Priority: u=4'

Translating that directly into Python yields the following little script.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Origin': 'https://www.cirrus.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'If-None-Match': 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"',
    'Priority': 'u=4',
}

params = {
    'mode': 'json',
}

response = requests.get(
  'https://api.eu.lever.co/v0/postings/cirrus',
  params=params,
  headers=headers
)

But when I ran it, I found that it resulted in an unexpected status code: rather than 200 (OK) I got 304 (Not Modified).

<Response [304]>

The 304 (Not Modified) code indicates that the server has declined to resend the data. It’s using browser caching to reduce the amount of data sloshing back and forth between the server and the client, effectively telling the browser that the data returned in the previous response has not been modified.

Browser

How does this work in the browser? Suppose it’s the first time that I visit the site (or that I have just cleared the browser cache). The request headers would look like this (some headers omitted for brevity):

GET /v0/postings/cirrus?mode=json HTTP/1.1
Host: api.eu.lever.co
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Accept: */*
Origin: https://www.cirrus.com

And these are the corresponding response headers (also pared down):

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 497872
ETag: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"

The 200 (OK) status code indicates success and around 500 kB of JSON content was returned in the payload. Also included is an ETag header, which identifies this particular version of the response payload. The presence of an ETag header in the response allows the browser to cache the data and check later whether the cached copy is still current.

A short time later I might refresh the page. The request headers now look like this:

GET /v0/postings/cirrus?mode=json HTTP/1.1
Host: api.eu.lever.co
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Accept: */*
Origin: https://www.cirrus.com
If-None-Match: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"

The value of the ETag header from the previous response is now included as an If-None-Match request header.

HTTP/1.1 304 Not Modified
ETag: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"

The associated response status code is 304 (Not Modified), indicating that the data on the server has not been updated since the last request (it’s still consistent with the specified ETag). The Content-Type and Content-Length headers are absent because the response payload is empty.
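To make the exchange above concrete, here is a rough sketch of the decision a server makes when it sees an If-None-Match header. This is not Lever's actual implementation: the function names (normalise_etag, conditional_get) are made up for illustration, header lookup is case-sensitive for simplicity, and the list of tags is split naively on commas.

```python
from typing import Dict, Tuple


def normalise_etag(tag: str) -> str:
    """Drop the weak-validator prefix so W/"abc" and "abc" compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_get(
    request_headers: Dict[str, str],
    current_etag: str,
    body: bytes,
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status code, response headers, payload) for a GET request."""
    client_tags = request_headers.get("If-None-Match")
    if client_tags is not None:
        # If-None-Match may carry several comma-separated tags (or "*").
        candidates = [normalise_etag(tag) for tag in client_tags.split(",")]
        if "*" in candidates or normalise_etag(current_etag) in candidates:
            # The client's cached copy is still current: send no payload.
            return 304, {"ETag": current_etag}, b""
    # No tag supplied, or the tag is stale: send the full payload.
    headers = {"ETag": current_etag, "Content-Length": str(len(body))}
    return 200, headers, body
```

Calling conditional_get() with an empty header dictionary gives a 200 with the full payload, while passing the current ETag back as If-None-Match gives a 304 with an empty payload, which mirrors the two browser requests shown above.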

You can find the cached data in Firefox by browsing to about:cache. The data are either cached in memory or on disk. I found the data for the above request listed under the items in the disk cache.

Stripping Off the Fluff

There are a load of headers in the original request, most of which are irrelevant. Let’s strip the script down to the minimum required to reproduce the 304 (Not Modified) status code.

import requests

headers = {
    'If-None-Match': 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"'
}

response = requests.get(
  'https://api.eu.lever.co/v0/postings/cirrus?mode=json',
  headers=headers
)

The If-None-Match request header is the source of the problem (we knew this already from our investigation in the browser). It’s being used to pass an ETag to the server. So the solution is simply to drop the If-None-Match header.

import requests

response = requests.get(
  'https://api.eu.lever.co/v0/postings/cirrus?mode=json'
)

Now we get a successful response and the required data in the payload.

<Response [200]>

If we unpack the first few items in the payload then they yield a series of job listings. We’ll just print the job titles.

# Print the titles of the first ten listings only.
for job in response.json()[:10]:
    print(job["text"])

Account Payable Specialist (EH-64000218)
Analog Design Engineer - Power (PC-64000100)
Applications Engineer (DO-64000212)
Applications Engineer (JR-64000232)
Benefits Specialist
Customer Program Manager - PC
Customer Program Manager - PC (NS-64000161)
Device Characterization and Spice Modeling Engineer (SB-TBD)
Director of Corporate Development (LL- 64000095)
Electronic Engineering Internship

The order of the roles is not the same as you see on the site because they are dynamically grouped into categories on the site.

Emulating a Browser Cache in Python

The purpose of the If-None-Match header is to enable your browser to efficiently manage cached versions of the resource. If you’re sending your requests from Python (or your language of choice) then you can simply omit this header and get a fresh response each time.
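If you routinely paste headers copied from Developer Tools into a script, it might be handy to filter out the conditional ones automatically. The helper below is a small sketch of that idea. The name strip_conditional_headers is my own invention, and I'm only removing the two headers that can trigger a 304 response to a GET request.

```python
from typing import Dict

# Request headers that ask the server to compare against a cached copy.
CONDITIONAL_HEADERS = {"if-none-match", "if-modified-since"}


def strip_conditional_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without cache-validation headers."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS
    }
```

The result can be passed straight to requests.get() via the headers argument. The original dictionary is left untouched.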

Sometimes, however, it can be useful to have a request cache because it can speed up development. This is especially the case if you are sending a large volume of requests or have a few slow requests. The requests-cache package is a drop-in substitute for the requests package that implements a local cache. Responses can be cached in a variety of backends. Definitely worth checking out!
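If you'd rather see the mechanics than reach for a package, the browser's behaviour can be imitated in a few lines. This is only a sketch under some assumptions: the ETagCache class is hypothetical, the cache lives in memory and is keyed on the URL alone, and the session argument is anything with a requests-style get(url, headers=...) method, such as requests.Session().

```python
from typing import Any, Dict, Tuple


class ETagCache:
    """Minimal in-memory cache that revalidates entries with If-None-Match."""

    def __init__(self, session: Any) -> None:
        # Assumed to behave like requests.Session (get, status_code, ...).
        self.session = session
        self._store: Dict[str, Tuple[str, bytes]] = {}  # url -> (etag, payload)

    def get(self, url: str) -> bytes:
        headers: Dict[str, str] = {}
        cached = self._store.get(url)
        if cached is not None:
            # Ask the server to skip the payload if our copy is current.
            headers["If-None-Match"] = cached[0]
        response = self.session.get(url, headers=headers)
        if response.status_code == 304 and cached is not None:
            # Not Modified: serve the payload stored earlier.
            return cached[1]
        response.raise_for_status()
        etag = response.headers.get("ETag")
        if etag is not None:
            self._store[url] = (etag, response.content)
        return response.content
```

With this in place, ETagCache(requests.Session()).get(url) should download the payload on the first call and reuse the stored copy whenever the server answers with 304 (Not Modified).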

Conclusion

Although browser caching greatly improves your web experience, making sites significantly more responsive, it can get in the way of your web scraping efforts. Don’t send an If-None-Match header (with its ETag value) along with your web scraping requests and you’ll avoid getting a 304 (Not Modified) response.
