Python-bloggers

Downloading Files with Selenium

This article was first published on Python - datawookie, and kindly contributed to python-bloggers.

If you use Selenium for browser automation then at some stage you are likely to need to download a file by clicking a button or link on a website. Sometimes this just works. Other times it doesn’t.

When I encounter a stubborn download I have found that adding some specific preferences when I launch Selenium can help.

These are the preferences I apply:

import os

prefs = {
  # Save downloads to the current working directory without a "Save As" prompt.
  "download.default_directory": os.getcwd(),
  "download.prompt_for_download": False,
  "directory_upgrade": True,
  "safebrowsing.enabled": True,
  "profile.default_content_settings.popups": 0,
  # Allow sites to trigger automatic downloads (1 = allow).
  "profile.content_settings.exceptions.automatic_downloads.*.setting": 1,
  "profile.default_content_setting_values.automatic_downloads": 1,
  # Map the XLSX MIME type to itself so the browser handles it as reported.
  "profile.default_content_settings.mimetype_overrides": {
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
  }
}
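For readers who have not passed preferences to Chrome before, here is a minimal sketch of one common way to do it, assuming Chrome and Selenium 4. The helper names build_prefs and launch_chrome are hypothetical and only for illustration, and the dictionary is a shortened version of the one above. This is not necessarily how the original script is structured.

```python
import os


def build_prefs(download_dir=None):
    """Return a shortened version of the preference dictionary shown above."""
    return {
        # Fall back to the current working directory, as in the article.
        "download.default_directory": download_dir or os.getcwd(),
        "download.prompt_for_download": False,
        "safebrowsing.enabled": True,
    }


def launch_chrome(prefs):
    """Start Chrome with the given preferences (needs Selenium and Chrome)."""
    # Imported here so build_prefs() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Chrome picks up user preferences from the "prefs" experimental option.
    options.add_experimental_option("prefs", prefs)
    return webdriver.Chrome(options=options)


# Example usage (launches a real browser, so it is left commented out):
# driver = launch_chrome(build_prefs())
# driver.quit()
```

The launch code stays the same whatever goes into the dictionary, so the preferences themselves are the part worth tuning for a stubborn site.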

What does each of those do?

Of these, the final preference, which specifies how the XLSX MIME type should be handled, is probably the most important. Where does the MIME type come from? It is reported in the content-type response header for the download, which you can inspect in the Network tab of the browser's Developer Tools. Without this setting the browser might apply a generic MIME type (like application/octet-stream) and then prompt the user for how to handle the downloaded file.
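If you would rather check the MIME type from Python than from Developer Tools, a rough sketch using only the standard library might look like this. The function names are hypothetical. It also assumes the server answers a HEAD request the same way it answers the real download, which is not guaranteed (some sites require cookies or only respond properly to GET).

```python
import urllib.request


def content_type_from_headers(raw_headers):
    """Pull the content-type value out of a block of raw response headers."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-type":
            # Drop any parameters such as "; charset=utf-8".
            return value.split(";")[0].strip()
    return None


def fetch_content_type(url):
    """Send a HEAD request and return the MIME type the server reports."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        # get_content_type() falls back to "text/plain" if the header is absent.
        return response.headers.get_content_type()
```

The first helper works on headers you have copied out of the browser, while the second asks the server directly. Whichever value you find is the one to use as the key in the mimetype_overrides preference.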

Take a look at a complete Python script that downloads an XLS file from here. In the interests of full disclosure, this script will work fine without those extra preferences, but it does illustrate what needs to be done for a more stubborn site. The server headers for this download are included below.

HTTP/2 200 
last-modified: Tue, 22 Mar 2022 12:47:49 GMT
content-length: 8704
content-type: application/vnd.ms-excel
date: Sat, 05 Oct 2024 04:15:52 GMT
cache-control: max-age=0
expires: Sat, 05 Oct 2024 04:15:52 GMT
server: Apache

Clearly the browser already knows to save the application/vnd.ms-excel MIME type specified in the content-type header. For comparison, here are the server headers for a download from here:

HTTP/2 200 
last-modified: Thu, 27 Jan 2022 17:47:57 GMT
content-length: 9487759
content-type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
date: Sat, 05 Oct 2024 04:19:11 GMT
cache-control: max-age=86400
expires: Sun, 06 Oct 2024 04:19:11 GMT
server: nginx/1.25.5

Note that this download, an XLSX file, has a different MIME type (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet).

