In this article we will discuss how to download images from a web page using Python.
Table of Contents
- Get HTML content from URL
- Finding and extracting image links from HTML
- Downloading images from URL
- Complete Object-Oriented Programming Example
Let’s see how we can quickly build our own image scraper using Python.
To follow this tutorial we will need the following Python libraries: httplib2, bs4 and urllib. The first two are third-party packages; urllib is part of the Python standard library and does not need to be installed.
If you don’t have httplib2 and bs4 installed, please open “Command Prompt” (on Windows) and install them using the following commands:
pip install httplib2
pip install bs4
Get HTML content from URL using Python
To begin this part, let’s first import some of the libraries we just installed:
import httplib2
from bs4 import BeautifulSoup
Now, let’s decide on the URL that we would like to extract the images from. As an example, I will extract the images from one of the articles on this blog, https://pyshark.com/principal-component-analysis-in-python/:
url = 'https://pyshark.com/principal-component-analysis-in-python/'
Next, we will create an instance of a class that represents a client HTTP interface:
http = httplib2.Http()
We will need this instance in order to perform HTTP requests to the URLs we would like to extract images from.
Now we will need to perform the following HTTP request:
response, content = http.request(url)
Note that the .request() method returns a tuple of two elements: the first is an instance of the Response class (the status and headers of the response), and the second is the body of the response for the URL we are working with.
We only need the second element, content, which holds the actual HTML of the web page as a bytes object.
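If you want to inspect or print the HTML as text, the bytes have to be decoded first. Below is a minimal sketch of one way to do that with the standard library. The helper name decode_body is hypothetical, and the UTF-8 fallback is an assumption rather than something the page guarantees.

```python
from email.message import Message


def decode_body(content, content_type=""):
    """Decode response bytes using the charset named in a Content-Type value.

    Falls back to UTF-8 (an assumption) when no usable charset is given.
    """
    # Message knows how to parse header parameters such as "charset=...".
    header = Message()
    header["Content-Type"] = content_type or "text/html"
    charset = header.get_content_charset() or "utf-8"
    try:
        return content.decode(charset, errors="replace")
    except LookupError:
        # The server named a charset Python does not know.
        return content.decode("utf-8", errors="replace")


# Stand-in bytes and header value, so the sketch runs without a network call.
html = decode_body(b"<img src='a.png'>", "text/html; charset=ISO-8859-1")
print(html)
```

With the request above, the call would look like decode_body(content, response.get("content-type", "")), since the response object behaves like a dictionary of lowercase header names. BeautifulSoup accepts the raw bytes as well, so this step is optional for the rest of the tutorial.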
Finding and extracting image links from HTML using Python
At this point we have the HTML content of the URL we would like to extract links from. We are only a few steps away from getting all the information we need.
Let’s see how we can extract the image links:
images = BeautifulSoup(content, 'html.parser').find_all('img')
image_links = []
for image in images:
    src = image.get('src')  # some img tags have no src attribute
    if src:
        image_links.append(src)
To begin with, we create a BeautifulSoup object and pass the HTML content to it. It builds a nested representation of the HTML content.
Then, we create an empty list (image_links) that we will use to store the image links that we will extract from the HTML content of the webpage.
As the final step, we need to find the image tags in the HTML content of the web page. To do that, we use the .find_all() method and tell it to return only the img tags, then read the src attribute of each tag, which holds the image link.
Once the script finds the links, it appends them to the image_links list we created before. To check what we found, simply print the contents of the final list:
for link in image_links:
    print(link)
And we should see each image link printed out one by one.
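One thing to watch for is that src values are not always complete URLs. Pages often use relative paths such as /img/a.png, and these cannot be downloaded until they are combined with the address of the page. The sketch below shows one possible way to do that with urljoin from the standard library. The function name resolve_image_links and the example.com addresses are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_image_links(page_url, sources):
    """Turn raw src values into absolute URLs, skipping empty and inline ones."""
    resolved = []
    for src in sources:
        if not src or src.startswith("data:"):
            continue  # nothing to download for missing or embedded images
        resolved.append(urljoin(page_url, src))
    return resolved


# Hypothetical page address and src values.
page_url = "https://example.com/blog/post/"
sources = ["/img/a.png", "b.jpg", "//cdn.example.com/c.gif",
           "https://example.org/d.png", None]
for link in resolve_image_links(page_url, sources):
    print(link)
```

Links that are already absolute pass through unchanged, so it is safe to run every extracted value through the same function.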
Downloading Images from a Web Page using Python
In this step we will use the image links we found in the above steps to download and save the images.
Let’s start with importing the required library:
import urllib.request
Next, we will iterate through the image_links list and download each image:
for link in image_links:
    # keep the last path segment and drop any query string
    filename = link.split("/")[-1].split("?")[0]
    urllib.request.urlretrieve(link, filename=filename)
Note: the string splitting used to build filename may need to be adjusted depending on the format of the original image links.
The images are saved in the current working directory, which is usually the folder you run your Python file from.
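If manual splitting becomes fragile, an alternative is to let the standard library parse the link. The following is only a sketch of that idea. filename_from_url and its default argument are hypothetical names, and the fallback name "image" is an arbitrary choice.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(link, default="image"):
    """Derive a file name from the path part of a URL."""
    path = urlsplit(link).path  # the path excludes ?query and #fragment
    name = posixpath.basename(unquote(path))  # decode %20 and similar escapes
    return name or default  # links ending in "/" have no file name


print(filename_from_url("https://example.com/img/plot.png?w=300#top"))
```

This handles query strings, fragments and percent-encoded characters in one place, at the cost of two extra imports.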
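On a real page some links will fail to download, and two images can share the same file name. Here is a hedged sketch of a slightly more defensive loop, under the assumption that skipping failed links is acceptable. The function name download_images, the folder argument and the index prefix are illustrative choices, not part of the original example.

```python
import os
import urllib.request


def download_images(image_links, folder="images"):
    """Download each link into folder and return (saved_paths, failed_links)."""
    os.makedirs(folder, exist_ok=True)
    saved, failed = [], []
    for index, link in enumerate(image_links):
        # Same split-based naming as the loop above; the index prefix keeps
        # two images with the same name from overwriting each other.
        name = link.split("/")[-1].split("?")[0] or "image"
        target = os.path.join(folder, f"{index}_{name}")
        try:
            urllib.request.urlretrieve(link, filename=target)
        except (OSError, ValueError):
            # URLError and HTTPError are OSError subclasses; ValueError is
            # raised for links that are not valid URLs (for example relative ones).
            failed.append(link)
        else:
            saved.append(target)
    return saved, failed
```

Calling saved, failed = download_images(image_links) would then leave the files in an images folder and give you a list of links to inspect or retry.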
Complete Object-Oriented Programming Example
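import urllib.request

import httplib2
from bs4 import BeautifulSoup


class Extractor:
    def get_links(self, url):
        # fetch the page and collect the src of every img tag
        http = httplib2.Http()
        response, content = http.request(url)
        images = BeautifulSoup(content, 'html.parser').find_all('img')
        image_links = []
        for image in images:
            src = image.get('src')
            if src:
                image_links.append(src)
        return image_links

    def get_images(self, image_links):
        # save each image under the last path segment of its link
        for link in image_links:
            filename = link.split("/")[-1].split("?")[0]
            urllib.request.urlretrieve(link, filename=filename)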
And this is an example of getting images from a web page using the above class:
url = 'https://pyshark.com/principal-component-analysis-in-python/'
myextractor = Extractor()
image_links = myextractor.get_links(url)
myextractor.get_images(image_links)
This article introduced the basics of downloading images from a web page using the Python httplib2, bs4 and urllib libraries, and walked through a complete example.