Bypassing Cloudflare with Cloudscraper

Posted on July 20, 2024 by Python - datawookie in Data science

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.

Cloudflare is a service that aims to improve the performance and security of websites. It operates as a content delivery network (CDN) to ensure faster load times and, consequently, a better user experience. It also protects against online threats by filtering “malicious” traffic.

Web scraping requests are often deemed to be malicious (certainly by Cloudflare!) and thus blocked. There are various approaches to circumventing this, most of which involve running a live browser instance. For some applications, though, this is a big hammer for a small nail. The cloudscraper package provides a lightweight option for dealing with Cloudflare and has an API similar to the requests package.

Sites using Cloudflare

Take a look at the list of sites using Cloudflare. We’ll pick the first item on the list, OpenAI, as a test target.

OpenAI home page.

Setup

Install the cloudscraper and requests Python packages.

beautifulsoup4==4.12.3
cloudscraper==1.2.71
requests==2.32.3

Also throw in beautifulsoup4 so that we can parse a response… when we get one!

Using Requests

Let’s try retrieving the content of the OpenAI homepage using a GET request via requests.

import requests

response = requests.get("https://openai.com/")

print(response.status_code)

Not surprisingly, Cloudflare intervenes and we get a 403 “Forbidden” response.
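
A 403 on its own doesn't tell you who said no. As a rough sketch, a check like the one below guesses whether Cloudflare was responsible by looking at the status code together with the Server and CF-RAY response headers that Cloudflare adds. The function name and the set of status codes are assumptions made for illustration, not anything provided by requests or cloudscraper.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True if a response appears to have been blocked by Cloudflare."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {key.lower(): str(value).lower() for key, value in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in normalised.get("server", "") or "cf-ray" in normalised
    )
    # Assumption: these are the status codes worth treating as a block.
    return status_code in (403, 429, 503) and served_by_cloudflare


# Hand-written example values (not captured from a real response).
blocked = looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "hypothetical-id"})
allowed = looks_like_cloudflare_block(200, {"Server": "cloudflare"})
print(blocked, allowed)
```

With the response from above you would call it as looks_like_cloudflare_block(response.status_code, response.headers).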

Using Cloudscraper

Maybe we’ll have more success using the cloudscraper package?

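import cloudscraper
from bs4 import BeautifulSoup

# CloudScraper subclasses requests.Session, so it offers the same get() method.
scraper = cloudscraper.CloudScraper()

response = scraper.get("https://openai.com/")

print(response.status_code)

soup = BeautifulSoup(response.text, "html.parser")

# Print the first <h2> on the page, if there is one.
banner = soup.select_one("h2")
if banner is not None:
    print(banner.text)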

And the result?

200
ChatGPT on your desktop

Success!
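
Since the scraper exposes the same get() method as a requests session, the two attempts above can be folded into one helper: try the plain session first and only reach for the scraper if that fails. This is just a sketch. The name get_with_fallback and its return shape are made up for illustration, and it assumes both objects have a get(url) method returning something with a status_code attribute.

```python
def get_with_fallback(url, primary, fallback):
    """Try primary.get(url); if the status is not 200, retry with fallback.

    Returns the response together with a label saying which client produced it.
    """
    response = primary.get(url)
    if response.status_code == 200:
        return response, "primary"
    # The first client was refused (or failed), so hand over to the second.
    return fallback.get(url), "fallback"
```

A call might look like get_with_fallback("https://openai.com/", requests.Session(), cloudscraper.CloudScraper()).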

There are lots of options for tweaking the Cloudscraper configuration. For example, you can specify the browser type (Chrome or Firefox), the platform (Linux, Windows, Darwin, Android or iOS) and whether it's a desktop or mobile browser. You can also choose from a selection of JavaScript engines.

import cloudscraper

URL = "http://openai.com"

scraper = cloudscraper.CloudScraper(
    # Browser specifications.
    browser={
        "browser": "firefox",
        "platform": "linux",
        "desktop": True,
        "mobile": False,
    },
    # JavaScript engine.
    interpreter="nodejs",
    # debug=True
)

response = scraper.get(URL)

print(response.status_code)
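
If you build these browser dictionaries in more than one place, a small helper can catch typos early. The sketch below is not part of cloudscraper. The name browser_spec is hypothetical, and the allowed values are simply the ones listed above, written in lower case on the assumption that this is the spelling the package expects (as in the "firefox" and "linux" example).

```python
BROWSERS = {"chrome", "firefox"}
PLATFORMS = {"linux", "windows", "darwin", "android", "ios"}


def browser_spec(browser, platform, mobile=False):
    """Build a dict in the shape passed to the browser argument above."""
    if browser not in BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    # Exactly one of desktop/mobile is switched on.
    return {
        "browser": browser,
        "platform": platform,
        "desktop": not mobile,
        "mobile": mobile,
    }


print(browser_spec("firefox", "linux"))
```

The result can be passed straight in as browser=browser_spec("firefox", "linux").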

Various third-party CAPTCHA solvers, such as 2captcha, anticaptcha and CapSolver, are also supported.
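
As far as I can tell from the cloudscraper README, a solver is configured by passing a captcha dictionary with a provider name and an API key to create_scraper(). The sketch below wraps that in two hypothetical helpers and reads the key from an environment variable. The variable name CAPTCHA_API_KEY is an assumption, and some providers may need extra keys, so check the package documentation for the one you use.

```python
import os


def captcha_config(provider, env_var="CAPTCHA_API_KEY"):
    """Build a solver config, reading the API key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return {"provider": provider, "api_key": api_key}


def make_scraper(provider="2captcha"):
    """Create a scraper that hands CAPTCHAs to the given provider."""
    # Imported here so captcha_config() works without cloudscraper installed.
    import cloudscraper

    return cloudscraper.create_scraper(captcha=captcha_config(provider))
```

Keeping the key out of the source file means the script can be shared or committed without leaking credentials.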
