Want to share your content on python-bloggers? click here.
If you use Selenium for browser automation then at some stage you are likely to need to download a file by clicking a button or link on a website. Sometimes this just works. Other times it doesn’t.
When I encounter a stubborn download I have found that adding some specific preferences when I launch Selenium can help.
These are the preferences I apply:
prefs = {
    "download.default_directory": os.getcwd(),
    "download.prompt_for_download": False,
    "directory_upgrade": True,
    "safebrowsing.enabled": True,
    "profile.default_content_settings.popups": 0,
    "profile.content_settings.exceptions.automatic_downloads.*.setting": 1,
    "profile.default_content_setting_values.automatic_downloads": 1,
    "profile.default_content_settings.mimetype_overrides": {
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    }
}
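As a rough sketch of how a dictionary like this might be handed to Chrome: Selenium's ChromeOptions accepts profile preferences through the "prefs" experimental option. The helper names with_download_dir and launch_chrome are made up for illustration, and the absolute-path step is a precaution I am assuming is sensible rather than something the original script does.

```python
import os


def with_download_dir(prefs, download_dir):
    """Return a copy of prefs with downloads pointed at download_dir.

    The path is made absolute as a precaution (an assumption, not a
    documented requirement stated in this post).
    """
    updated = dict(prefs)
    updated["download.default_directory"] = os.path.abspath(download_dir)
    return updated


def launch_chrome(prefs):
    """Start Chrome with the given preferences (needs the selenium package)."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Chrome profile preferences are passed via the "prefs" experimental option.
    options.add_experimental_option("prefs", prefs)
    return webdriver.Chrome(options=options)
```

Something like launch_chrome(with_download_dir(prefs, "downloads")) would then return a driver that saves files into a downloads folder under the current directory.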
What does each of those do?
download.default_directory: Sets the download directory. Not strictly necessary, but useful to have control over this. Defaults to ~/Downloads.
download.prompt_for_download: Prevents the browser from asking where to save the file.
directory_upgrade: Allows browser to change download directory.
safebrowsing.enabled: Enables the Safe Browsing feature, which protects against phishing, malware, and other malicious content. Again, not strictly necessary, but good to have.
profile.default_content_settings.popups: Block popups. This refers to browser popups, not in-page dialogs or popups.
profile.content_settings.exceptions.automatic_downloads.*.setting: Allow multiple automatic downloads without requiring user intervention.
profile.default_content_setting_values.automatic_downloads: Allow automatic downloads.
profile.default_content_settings.mimetype_overrides: Override MIME type handling for specific file types.
Of these, the final preference, which specifies how the XLSX MIME type should be handled, is probably the most important. Where does the MIME type come from? It should be found in the server headers for the download (so crack open Developer Tools to find it). Without this setting it’s possible that the browser might apply a generic MIME type (like application/octet-stream), and this might cause the browser to prompt the user for how to handle the downloaded file.
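If opening Developer Tools is inconvenient, the MIME type can sometimes be read from a script instead. The sketch below uses only the standard library and sends a HEAD request. The function names are hypothetical, and it assumes the server answers HEAD the same way it answers the real download and that no login session is needed, neither of which is guaranteed.

```python
import urllib.request


def normalize_mime(header_value):
    """Strip parameters such as '; charset=binary' and lowercase the MIME type."""
    return header_value.split(";", 1)[0].strip().lower()


def fetch_content_type(url, timeout=10):
    """Return the MIME type a server reports for url, or '' if it sends none."""
    # HEAD asks for the headers only, so the file itself is not downloaded.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return normalize_mime(response.headers.get("Content-Type", ""))
```

The value returned by fetch_content_type is what would go into the mimetype_overrides preference, used as both key and value as in the dictionary above.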
Take a look at a complete Python script that downloads an XLS file from here. In the interests of full disclosure, this script will work fine without those extra preferences, but it does illustrate what needs to be done for a more stubborn site. The server headers for this download are included below.
HTTP/2 200
last-modified: Tue, 22 Mar 2022 12:47:49 GMT
content-length: 8704
content-type: application/vnd.ms-excel
date: Sat, 05 Oct 2024 04:15:52 GMT
cache-control: max-age=0
expires: Sat, 05 Oct 2024 04:15:52 GMT
server: Apache
Clearly the browser already knows to save the application/vnd.ms-excel MIME type specified in the content-type header. For comparison, here are the server headers for a download from here:
HTTP/2 200
last-modified: Thu, 27 Jan 2022 17:47:57 GMT
content-length: 9487759
content-type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
date: Sat, 05 Oct 2024 04:19:11 GMT
cache-control: max-age=86400
expires: Sun, 06 Oct 2024 04:19:11 GMT
server: nginx/1.25.5
Note that this uses a different MIME type (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) to download an XLSX file.
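Since the two downloads report different MIME types, a script that has to cope with both XLS and XLSX files could register an override for each. This is only a sketch: add_mimetype_overrides is a hypothetical helper, and it simply follows the pattern used in the preferences above of mapping each MIME type to itself.

```python
XLS = "application/vnd.ms-excel"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
OVERRIDES_KEY = "profile.default_content_settings.mimetype_overrides"


def add_mimetype_overrides(prefs, mime_types):
    """Return a copy of prefs with each MIME type mapped to itself."""
    updated = dict(prefs)
    # Copy any existing overrides so the caller's dictionary is left untouched.
    overrides = dict(updated.get(OVERRIDES_KEY, {}))
    for mime in mime_types:
        overrides[mime] = mime
    updated[OVERRIDES_KEY] = overrides
    return updated
```

For example, add_mimetype_overrides(prefs, [XLS, XLSX]) would cover both of the content-type values shown in the headers above.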