Download Images from a Web Page using Python

[This article was first published on PyShark, and kindly contributed to python-bloggers].

In this article we will discuss how to download images from a web page using Python.

Table of Contents

  • Introduction
  • Get HTML content from URL
  • Finding and extracting image links from HTML
  • Downloading images from URL
  • Complete Object-Oriented Programming Example
  • Conclusion

Introduction

Let’s see how we can quickly build our own image scraper using Python.

To follow this tutorial we will need the following Python libraries: httplib2, bs4 and urllib. The first two are third-party packages, while urllib is part of the Python standard library and does not need to be installed.

If you don’t have the third-party packages installed, please open “Command Prompt” (on Windows) and install them using the following commands:

pip install httplib2
pip install bs4

Get HTML content from URL using Python

To begin this part, let’s first import some of the libraries we just installed:

import httplib2
from bs4 import BeautifulSoup

Now, let’s decide on the URL that we would like to extract the images from. As an example, I will extract the images from one of the articles on this blog, https://pyshark.com/principal-component-analysis-in-python/:

url = 'https://pyshark.com/principal-component-analysis-in-python/'

Next, we will create an instance of a class that represents a client HTTP interface:

http = httplib2.Http()

We will need this instance in order to perform HTTP requests to the URLs we would like to extract images from.

Now we will need to perform the following HTTP request:

response, content = http.request(url)

Note that the .request() method returns a tuple: the first element is an instance of the Response class (it holds the status code and headers), and the second is the content of the response body for the URL we are working with.

We will only need the content component of the tuple, which holds the actual HTML of the webpage. In Python 3, httplib2 returns it as a bytes object, which BeautifulSoup accepts directly.
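Before moving on, it can help to confirm that the request succeeded and to turn the body into text. The following is a minimal sketch of one way to do that. The helper names decode_body and fetch_html are hypothetical (they are not part of httplib2), and UTF-8 is only an assumed default encoding:

```python
def decode_body(status: int, content: bytes, encoding: str = "utf-8") -> str:
    """Return the response body as text, or raise if the status is not 200."""
    if status != 200:
        raise RuntimeError(f"Request failed with HTTP status {status}")
    # Bytes that cannot be decoded are replaced instead of raising an error.
    return content.decode(encoding, errors="replace")


def fetch_html(url: str) -> str:
    """Fetch a page with httplib2 and return its HTML as a string."""
    import httplib2  # third-party; imported here so decode_body works without it

    http = httplib2.Http()
    response, content = http.request(url)
    # response.status holds the numeric HTTP status code of the reply.
    return decode_body(response.status, content)


# decode_body can be tried on its own, without any network access:
print(decode_body(200, b"<img src='a.png'>"))
```

Decoding is optional for the next step, since BeautifulSoup also accepts the raw bytes, but the status check makes a failed request visible early.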


Finding and extracting image links from HTML using Python

At this point we have the HTML content of the URL we would like to extract links from. We are only a few steps away from getting all the information we need.

Let’s see how we can extract the image links:

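# Parse the HTML with Python's built-in parser and collect all <img> tags
images = BeautifulSoup(content, 'html.parser').find_all('img')

image_links = []

for image in images:
    # Some <img> tags have no src attribute, so skip those
    if image.get('src'):
        image_links.append(image['src'])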

To begin with, we create a BeautifulSoup() object and pass the HTML content to it. It creates a nested representation of the HTML content. We then call the .find_all() method on it and tell it to return only the img tags, which are the tags that hold the image links.

Then, we create an empty list (image_links) that we will use to store the image links extracted from the HTML content of the webpage.

As the final step, we loop over the tags we found and append the src attribute of each one to the image_links list.

To check what we found, simply print out the contents of the final list:

for link in image_links:
    print(link)

And we should see each image link printed out one by one.
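Some of the printed links may be relative (for example /img/a.png) or may not be web addresses at all (for example inline data: URIs). One possible way to normalize them is sketched below using urllib.parse from the standard library. The function name absolute_image_links and the example.com addresses are made up for illustration:

```python
from urllib.parse import urljoin, urlsplit


def absolute_image_links(page_url, srcs):
    """Resolve each src against page_url and keep only http(s) links."""
    resolved = []
    for src in srcs:
        if not src:
            continue  # skip missing or empty src values
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        full = urljoin(page_url, src)
        if urlsplit(full).scheme in ("http", "https"):
            resolved.append(full)
    return resolved


page = "https://example.com/blog/post/"
print(absolute_image_links(page, ["/img/a.png", "b.png", "data:image/gif;base64,AAAA"]))
```

With the list from this tutorial, the call would look like absolute_image_links(url, image_links).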


Downloading Images from a Web Page using Python

In this step we will use the image links we found in the above steps to download and save the images.

Let’s start with importing the required library:

import urllib.request

Next, we will iterate through the image_links list and download each image:

for link in image_links:
    filename = link.split("/")[-1].split("?")[0]
    urllib.request.urlretrieve(link, filename=filename)

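Note: the string splitting used to build filename may need to change depending on the format of the original image links. Also, urlretrieve() expects full URLs, so relative links (for example /images/pic.png) must be turned into absolute ones first.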
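As an alternative to manual string splitting, the filename can be derived by parsing the URL. This is only a sketch: filename_from_url is a hypothetical helper, and the fallback name "image" is an arbitrary choice for links that end in a slash:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(link, default="image"):
    """Return the last path segment of link, ignoring query string and fragment."""
    path = urlsplit(link).path  # the path only, without ?query or #fragment
    name = posixpath.basename(unquote(path))  # decode %20 and similar escapes
    return name or default  # fall back when the path ends with "/"


for example in (
    "https://example.com/a/pic.png?w=300",
    "https://example.com/a/my%20pic.jpg#top",
    "https://example.com/",
):
    print(filename_from_url(example))
```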

You should see the images being saved in the current working directory, which is usually the folder you run your Python file from.
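If you would rather keep the images in a separate folder and not stop at the first broken link, the loop can be wrapped in a small function. The sketch below is one way to do it under a few assumptions: download_images, the folder name "images" and the retrieve parameter are all hypothetical names introduced here, and retrieve exists only so the function can be tried without network access:

```python
import os
import urllib.error
import urllib.request


def download_images(links, folder="images", retrieve=urllib.request.urlretrieve):
    """Download each link into folder and return the paths that were saved."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    saved = []
    for link in links:
        filename = link.split("/")[-1].split("?")[0]
        if not filename:
            continue  # nothing usable after the last slash
        target = os.path.join(folder, filename)
        try:
            retrieve(link, filename=target)
        except (urllib.error.URLError, ValueError) as error:
            # URLError covers network and HTTP failures,
            # ValueError is raised for links that are not full URLs.
            print(f"Skipping {link}: {error}")
            continue
        saved.append(target)
    return saved
```

Calling download_images(image_links) would then save the files into an images folder next to where the script is run.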


Complete Object-Oriented Programming Example

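import httplib2
import urllib.request
from bs4 import BeautifulSoup


class Extractor():

    def get_links(self, url):
        # Request the page and keep the HTML body
        http = httplib2.Http()
        response, content = http.request(url)

        images = BeautifulSoup(content, 'html.parser').find_all('img')

        image_links = []

        for image in images:
            # Skip <img> tags that have no src attribute
            if image.get('src'):
                image_links.append(image['src'])

        return image_links

    def get_images(self, image_links):

        for link in image_links:
            # Use the last part of the URL, without the query string, as the filename
            filename = link.split("/")[-1].split("?")[0]

            urllib.request.urlretrieve(link, filename=filename)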

And this is an example of getting images from a web page using the above class:

url = 'https://pyshark.com/principal-component-analysis-in-python/'

myextractor = Extractor()

image_links = myextractor.get_links(url)

myextractor.get_images(image_links)

Conclusion

This article introduced the basics of downloading images from a web page using the Python libraries httplib2, bs4 and urllib, and walked through a complete example of the process.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.
