Accelerating BeautifulSoup Encoding Detection


I’ve noticed that some of my scraper tests are significantly slower than others. On closer examination I discovered that most of the delay is being incurred when BeautifulSoup is parsing HTML. And a significant proportion of that time is spent checking character encoding.

What is Character Encoding?

Computers store and transmit text as a sequence of bytes. Character encodings are the rules that map those raw bytes to human-readable text. To get the correct letters, punctuation, and symbols, we need to know what encoding was used.

In the early days of computing the rules were simple. Everyone used the ASCII character set, which defined 128 characters including basic English letters, digits and punctuation. One byte per character, no ambiguity.

But as computing evolved and people needed to represent accented letters, symbols and non-Latin scripts, ASCII was no longer enough. New encodings were developed to support these characters. And then you had to know which encoding was being used or risk turning perfectly good text into complete nonsense.
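To see why this matters, here’s a small Python sketch showing how decoding the same bytes with the wrong encoding mangles the text:

# The same bytes give different text depending on the encoding used to decode them.
data = "café".encode("utf-8")

print(data)                       # b'caf\xc3\xa9'
print(data.decode("utf-8"))       # café
print(data.decode("iso-8859-1"))  # cafÃ© (mojibake)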

Some well-known encodings include ASCII, ISO-8859-1 (Latin-1), Windows-1252 and UTF-8, with UTF-8 now being by far the most common encoding on the web.

Detecting Character Encoding

HTML documents are found with a variety of different encodings. BeautifulSoup needs to detect the appropriate encoding to ensure that its output is not a garbled mess.

Depending on how it’s done, detecting character encoding can be relatively time-consuming. To get some intuition around this we’ll use the chardet package, which gathers statistics to infer character encoding. On larger HTML documents identifying the character encoding can often take significantly longer than actually parsing the HTML.

CLI Simple Test

Let’s see how this works with a few simple HTML documents. First, a tiny HTML5 document that uses a <meta> tag in the <head> section to specify the character set.

<!DOCTYPE html>
<html lang="fr">
<head>
  <meta charset="UTF-8">
</head>
<body>
  <p>Résumé, naïve, café.</p>
</body>
</html>

The legacy equivalent uses a <meta> tag to simulate a Content-Type HTTP header.

<!DOCTYPE html>
<html lang="fr">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
  <p>Résumé, naïve, café.</p>
</body>
</html>

The chardet package has a CLI client that can be used to quickly check encoding. The result is the same with either of the above files.

$ chardet with-charset-header.html
with-charset-header.html: utf-8 with confidence 0.938125

It identifies UTF-8 encoding with high confidence. What about the same document but without the header?

<!DOCTYPE html>
<html>
<body>
  <p>Résumé, naïve, café.</p>
</body>
</html>
$ chardet without-header.html
without-header.html: utf-8 with confidence 0.938125

For a simple document the presence of a <meta> tag giving the character encoding doesn’t seem to affect the results from chardet. I suspect that chardet ignores the header because it might not provide the correct information (for example, the header says utf-8 but the document is actually encoded with iso-8859-1), in which case a statistical analysis of the HTML contents would give better results anyway. The chardet documentation says as much:

Sometimes you receive text with verifiably inaccurate encoding information.

chardet FAQ

CLI Realistic Test

Let’s try the CLI out with a more realistic HTML file.

$ time chardet cashmere-interior-acrylic-latex.html
cashmere-interior-acrylic-latex.html: utf-8 with confidence 0.99

real    0m0.936s
user    0m0.922s
sys     0m0.013s

The CLI client takes just under 1 second to detect the encoding for this file. Not long. However, if this was happening repeatedly across an extensive suite of tests then this delay would accumulate. Slow tests are a problem because they’re an impediment to rapid development.

Python Tests

Let’s repeat these checks from Python.

import chardet


def find_encoding(filename: str):
    with open(filename, "rb") as f:
        html = f.read()

    print(chardet.detect(html))


find_encoding("with-charset-header.html")
find_encoding("without-header.html")
find_encoding("cashmere-interior-acrylic-latex.html")

The results are consistent with what we got from the CLI.

{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Unicode, Dammit!

BeautifulSoup doesn’t actually use chardet directly. Instead it uses UnicodeDammit (Unicode, Dammit!), a helper class bundled with Beautiful Soup, to detect the encoding and, if necessary, convert to Unicode. UnicodeDammit in turn relies on either chardet (implemented in Python) or the quicker cchardet (implemented in C) to do the actual encoding detection.

from bs4 import UnicodeDammit

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

dammit = UnicodeDammit(html)
print(dammit.original_encoding)

The .original_encoding attribute of the UnicodeDammit object gives the document’s original encoding.

utf-8

This takes around two-thirds of a second.

real    0m0.661s
user    0m0.649s
sys     0m0.012s

HTML Parsing and Encoding Detection

Let’s try parsing the realistic HTML file using BeautifulSoup.

from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Create soup from bytes.
soup = BeautifulSoup(html, "lxml")

print(soup.original_encoding)

I’m reading the file as bytes because this most closely emulates my normal scraping workflow, where I persist downloaded HTML as bytes. Again the .original_encoding attribute gives the document’s original encoding.

utf-8

This takes slightly longer than simply determining the encoding, with the extra time being spent parsing the HTML.

real    0m0.704s
user    0m0.687s
sys     0m0.017s

We can use the cProfile module to generate profiling data, keeping only the biggest contributors below. The total execution time is a little longer because of the overhead of running the profiler, but it’s clear that most of the time is being spent figuring out the correct encoding.
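Something like this will generate the profile, assuming the parsing script above is saved as html-parse-naive.py (the name that appears in the output below):

$ python -m cProfile -s cumtime html-parse-naive.py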

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  68/1    0.000    0.000    1.117    1.117 {built-in method builtins.exec}
     1    0.000    0.000    1.117    1.117 html-parse-naive.py:1(<module>)
     1    0.000    0.000    1.040    1.040 __init__.py:122(__init__)
     1    0.000    0.000    0.923    0.923 _lxml.py:149(prepare_markup)
     1    0.000    0.000    0.923    0.923 dammit.py:407(encodings)
     1    0.000    0.000    0.923    0.923 dammit.py:43(chardet_dammit)
     1    0.000    0.000    0.923    0.923 __init__.py:24(detect)
     1    0.000    0.000    0.923    0.923 universaldetector.py:111(feed)
     2    0.000    0.000    0.703    0.352 charsetgroupprober.py:65(feed)
    14    0.082    0.006    0.469    0.034 sbcharsetprober.py:77(feed)
    13    0.000    0.000    0.387    0.030 charsetprober.py:66(filter_international_words)
  1274    0.387    0.000    0.387    0.000 {method 'findall' of 're.Pattern' objects}
    13    0.000    0.000    0.386    0.030 __init__.py:209(findall)
     1    0.010    0.010    0.218    0.218 latin1prober.py:116(feed)
     1    0.116    0.116    0.211    0.211 utf8prober.py:57(feed)
     1    0.157    0.157    0.209    0.209 charsetprober.py:103(filter_with_english_letters)

Since most HTML documents are UTF-8 encoded this seems like wasted effort. Surely if we know (or are pretty certain of) the correct encoding then we don’t need to check each time?

How can we make that more efficient? Here are some options.

Open as Text with Explicit Encoding

Open the file with the correct encoding.

from bs4 import BeautifulSoup

# The mode is implicitly "rt".
with open("cashmere-interior-acrylic-latex.html", encoding="utf-8") as f:
    html = f.read()

# Create soup from str.
soup = BeautifulSoup(html, "lxml")

print(soup.original_encoding)

A Unicode string (rather than bytes) is being passed to BeautifulSoup, so it doesn’t need to guess the encoding. And the .original_encoding attribute is consequently empty.

None

It’s also much faster.

real    0m0.109s
user    0m0.090s
sys     0m0.019s

Read as Bytes then Decode

Read the file as bytes and then decode using the correct encoding.

from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Decode to str using appropriate encoding.
html = html.decode("utf-8")

# Create soup from str.
soup = BeautifulSoup(html, "lxml")

print(soup.original_encoding)

The .original_encoding attribute is empty again because BeautifulSoup receives a decoded string and doesn’t have to guess the encoding.

None

Execution time is essentially the same as the previous example.

Read as Bytes then Parse with Explicit Encoding

Read the file as bytes and then parse using the correct encoding.

from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Create soup from bytes with specific encoding.
soup = BeautifulSoup(html, "lxml", from_encoding="utf-8")

print(soup.original_encoding)

Now the .original_encoding attribute is populated with the provided encoding. No guessing required.

utf-8

Execution time is similar again.
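If you want to compare the three approaches yourself, here’s a minimal sketch using timeit (assuming the same HTML file is on hand and lxml is installed):

import timeit

from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html_bytes = f.read()

# Decode once up front for the str-based variant.
html_str = html_bytes.decode("utf-8")

def parse_bytes():
    # BeautifulSoup has to detect the encoding itself.
    BeautifulSoup(html_bytes, "lxml")

def parse_bytes_explicit():
    # Encoding supplied up front, so no detection is needed.
    BeautifulSoup(html_bytes, "lxml", from_encoding="utf-8")

def parse_str():
    # Already decoded, so there's nothing to detect.
    BeautifulSoup(html_str, "lxml")

for fn in (parse_bytes, parse_bytes_explicit, parse_str):
    print(fn.__name__, timeit.timeit(fn, number=5))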

Much ado about nothing?

Whether or not encoding overhead is a problem depends on your workflow.

If you retrieve the HTML content (using requests.get() or httpx.get()) and immediately parse it then you probably don’t need to worry about encodings because requests or httpx will do it for you. The .text attribute on the response object will automatically apply the appropriate encoding. 🤞 If the result is not quite what you expected then you can intervene manually by either setting the .encoding attribute on the response object or decoding explicitly.

# Option 1: set the encoding, then let requests/httpx decode.
response.encoding = "utf-8"
html = response.text

# Option 2: decode the raw bytes explicitly.
html = response.content.decode("utf-8")

It’s really more of an issue where your workflow consists of multiple steps like this:

  1. Download the raw HTML content and then persist to file as bytes.
  2. Load bytes from the file then parse.

In this case BeautifulSoup may need to work harder to determine the appropriate encoding to apply. But using any of the three approaches illustrated above should sort this out!
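For reference, here’s a minimal sketch of that two-step workflow, assuming UTF-8 content and using a hypothetical URL and filename:

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/page.html"  # Hypothetical URL.

# Step 1: Download the raw HTML and persist it as bytes.
response = requests.get(URL)
with open("page.html", "wb") as f:
    f.write(response.content)

# Step 2: Later, load the bytes and parse with an explicit encoding
# so that BeautifulSoup doesn't have to guess.
with open("page.html", "rb") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml", from_encoding="utf-8")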
