Scraping and Not Modified Responses

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.

In a previous post I looked at the HTTP request headers used to manage browser caching. In this post I’ll look at a real-world example. It’s a rather deep dive into something that’s actually quite simple. However, I find it helpful to pick things apart and understand how all of the components fit together.

For my personal edification I’m writing a scraper to gather job listings posted by Cirrus Logic. Openings at the company are posted at https://www.cirrus.com/careers/.

If you click through on the Go button then you get to the listings.

This is what we are after!

Simple Python

In Developer Tools I found the request I was looking for and copied the corresponding curl command.

curl 'https://api.eu.lever.co/v0/postings/cirrus?mode=json' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Origin: https://www.cirrus.com' \
  -H 'Connection: keep-alive' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: cross-site' \
  -H 'If-None-Match: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"' \
  -H 'Priority: u=4'

Translating that directly into Python yields the following little script.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Origin': 'https://www.cirrus.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'If-None-Match': 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"',
    'Priority': 'u=4',
}

params = {
    'mode': 'json',
}

response = requests.get(
    'https://api.eu.lever.co/v0/postings/cirrus',
    params=params,
    headers=headers
)

But when I ran it I found that it resulted in an unexpected status code: rather than 200 (OK) I got 304 (Not Modified).

<Response [304]>

The 304 (Not Modified) code indicates that the server has declined to resend the data. It’s using browser caching to reduce the amount of data sloshing back and forth between the server and the client, effectively telling the browser that the data returned in the previous response has not been modified.
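To make the distinction between the two status codes concrete, here is a minimal sketch using only the standard library. The helper names describe_status and has_fresh_payload are my own for illustration and are not part of the original script.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '304 (Not Modified)' for a numeric status code."""
    status = HTTPStatus(code)
    return f"{status.value} ({status.phrase})"


def has_fresh_payload(code: int) -> bool:
    """True only when the server actually sent the data (a 200 response)."""
    return code == HTTPStatus.OK


print(describe_status(200))
print(describe_status(304))
```

A scraper could call has_fresh_payload(response.status_code) before trying to parse the body, since a 304 response carries no payload to parse.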

Browser

How does this work in the browser? Suppose it’s the first time that I visit the site (or that I have just cleared the browser cache). The request headers would look like this (some headers omitted for brevity):

GET /v0/postings/cirrus?mode=json HTTP/1.1
Host: api.eu.lever.co
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Accept: */*
Origin: https://www.cirrus.com

And these are the corresponding response headers (also pared down):

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 497872
ETag: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"

The 200 (OK) status code indicates success, and around 500 kB of JSON content was returned in the payload. Also included is an ETag header, which identifies this particular version of the response payload. The presence of an ETag header in the response gives the browser a validator that it can store alongside the cached data and use later to check whether that copy is still current.
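The ETag value above has two parts: an optional W/ prefix, which marks it as a weak validator, and an opaque tag wrapped in double quotes. Here is a small sketch that pulls the two apart. The function name parse_etag is hypothetical, and the parsing is deliberately simple rather than a complete implementation of the HTTP grammar.

```python
def parse_etag(value: str) -> tuple[bool, str]:
    """Split an ETag header value into (is_weak, opaque_tag)."""
    value = value.strip()
    is_weak = value.startswith('W/')
    if is_weak:
        value = value[2:]
    # The opaque tag must be wrapped in double quotes.
    if len(value) < 2 or not (value.startswith('"') and value.endswith('"')):
        raise ValueError(f"malformed ETag: {value!r}")
    return is_weak, value[1:-1]


is_weak, tag = parse_etag('W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"')
print(is_weak, tag)
```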

A short time later I might refresh the page. The request headers now look like this:

GET /v0/postings/cirrus?mode=json HTTP/1.1
Host: api.eu.lever.co
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0
Accept: */*
Origin: https://www.cirrus.com
If-None-Match: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"

The value of the ETag header from the previous response is now included as an If-None-Match request header.

HTTP/1.1 304 Not Modified
ETag: W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"

The associated response status code is 304 (Not Modified), indicating that the data on the server has not been updated since the last request (it’s still consistent with the specified ETag). The Content-Type and Content-Length headers are absent because the response payload is empty.
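I don't know how the server behind this API is implemented, but the decision it makes can be sketched in a few lines. The function conditional_get below is purely illustrative: it assumes header names are spelled exactly as shown, and it splits multiple tags on commas, which is a simplification.

```python
def conditional_get(current_etag, body, request_headers):
    """Return (status, headers, payload) for a GET that may be conditional."""
    candidates = [
        tag.strip()
        for tag in request_headers.get("If-None-Match", "").split(",")
        if tag.strip()
    ]

    # If-None-Match uses weak comparison, so any W/ prefix is ignored.
    def opaque(tag):
        return tag[2:] if tag.startswith("W/") else tag

    matched = "*" in candidates or any(
        opaque(tag) == opaque(current_etag) for tag in candidates
    )
    if matched:
        # The client's copy is current: send the validator but no payload.
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag, "Content-Length": str(len(body))}, body


etag = 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"'
print(conditional_get(etag, b"[]", {})[0])
print(conditional_get(etag, b"[]", {"If-None-Match": etag})[0])
```

The first call has no If-None-Match header and so gets the full payload, while the second presents a matching tag and gets an empty 304 response.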

You can find the cached data in Firefox by browsing to about:cache. The data are either cached in memory or on disk. I found the data for the above request listed under the items in the disk cache.

Stripping Off the Fluff

There are a load of headers in the original request, most of which are irrelevant. Let’s strip the script down to the minimum required to reproduce the 304 (Not Modified) status code.

import requests

headers = {
    'If-None-Match': 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"'
}

response = requests.get(
    'https://api.eu.lever.co/v0/postings/cirrus?mode=json',
    headers=headers
)

The If-None-Match request header is the source of the problem (we knew this already from our investigation in the browser). It’s being used to pass an ETag to the server. So the solution is simply to drop the If-None-Match header.
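If you would rather keep the other headers copied from the browser, one option is to filter out only the cache validators. This is a sketch under my own assumptions: the helper strip_conditional_headers is hypothetical, and I have included If-Modified-Since because it can trigger a 304 response in the same way.

```python
# Request headers that ask the server to skip the payload if nothing changed.
CONDITIONAL_HEADERS = {"if-none-match", "if-modified-since"}


def strip_conditional_headers(headers):
    """Return a copy of headers without cache-validation request headers."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS
    }


copied = {
    "Accept": "*/*",
    "Origin": "https://www.cirrus.com",
    "If-None-Match": 'W/"798d0-Przpe3guh5i4fDnUEqWOoW2LvUA"',
}
print(strip_conditional_headers(copied))
```

The result can be passed straight to requests.get() as the headers argument.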

import requests

response = requests.get(
    'https://api.eu.lever.co/v0/postings/cirrus?mode=json'
)

Now we get a successful response and the required data in the payload.

<Response [200]>

If we unpack the first few items in the payload then they yield a series of job listings. We’ll just print the job titles.

for job in response.json():
  print(job["text"])

Account Payable Specialist (EH-64000218)
Analog Design Engineer - Power (PC-64000100)
Applications Engineer (DO-64000212)
Applications Engineer (JR-64000232)
Benefits Specialist
Customer Program Manager - PC
Customer Program Manager - PC (NS-64000161)
Device Characterization and Spice Modeling Engineer (SB-TBD)
Director of Corporate Development (LL- 64000095)
Electronic Engineering Internship

The order of the roles is not the same as you see on the site because they are dynamically grouped into categories on the site.
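If you wanted to reproduce that grouping yourself, something like the sketch below might work. It rests on an assumption that is not shown anywhere in this post: that each posting carries a categories object with a team key. The sample records and team names are made up purely for illustration, and group_titles is a hypothetical helper.

```python
from collections import defaultdict


def group_titles(jobs, key="team"):
    """Group job titles by a category field (assumed to live under 'categories')."""
    groups = defaultdict(list)
    for job in jobs:
        category = job.get("categories", {}).get(key, "Uncategorised")
        groups[category].append(job["text"])
    return dict(groups)


# Invented records that mimic the assumed shape of the payload.
sample = [
    {"text": "Example Role A", "categories": {"team": "Example Team 1"}},
    {"text": "Example Role B", "categories": {"team": "Example Team 2"}},
    {"text": "Example Role C", "categories": {"team": "Example Team 1"}},
    {"text": "Example Role D"},
]

for team, titles in group_titles(sample).items():
    print(team, titles)
```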

Emulating a Browser Cache in Python

The purpose of the If-None-Match header is to enable your browser to efficiently manage cached versions of the resource. If you’re sending your requests from Python (or your language of choice) then you can simply omit this header and get a fresh response each time.

Sometimes, however, it can be useful to have a request cache because it can speed up development. This is especially the case if you are sending a large volume of requests or have a few slow requests. The requests-cache package provides a drop-in substitute for the session object from the requests package that implements a local cache. Responses can be cached in a variety of backends. Definitely worth checking out!
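To see what a browser-style cache does with ETags, here is a rough sketch of one written by hand. It is not how any particular caching package works. The class ETagCache and the adapter fetch_with_requests are hypothetical names of my own, entries live only in memory, and the fetch function is passed in so that the cache logic can be exercised without touching the network.

```python
class ETagCache:
    """Keep the last payload per URL and revalidate it with If-None-Match."""

    def __init__(self, fetch):
        # fetch(url, headers) must return (status_code, response_headers, body).
        self._fetch = fetch
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        headers = {}
        if url in self._entries:
            # Present the stored validator, just as the browser does.
            headers["If-None-Match"] = self._entries[url][0]
        status, response_headers, body = self._fetch(url, headers)
        if status == 304 and url in self._entries:
            # The server confirmed that our copy is current, so reuse it.
            return self._entries[url][1]
        etag = response_headers.get("ETag")
        if status == 200 and etag:
            self._entries[url] = (etag, body)
        return body


def fetch_with_requests(url, headers):
    """Adapter that performs the real HTTP request via the requests package."""
    import requests  # imported lazily so the cache itself has no dependencies

    response = requests.get(url, headers=headers, timeout=30)
    return response.status_code, response.headers, response.content


# Possible usage (performs real requests, so it is left commented out):
# cache = ETagCache(fetch_with_requests)
# payload = cache.get('https://api.eu.lever.co/v0/postings/cirrus?mode=json')
```

With this approach a 304 response stops being a problem: the second call for the same URL sends the stored ETag and, if the server replies 304, the cached payload is returned instead of an empty body.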

Conclusion

Although browser caching greatly improves your web experience, making sites significantly more responsive, it can get in the way of your web scraping efforts. Don’t send If-None-Match headers along with your web scraping requests and you’ll avoid getting a 304 (Not Modified) response.
