Extract Links from a Web Page using Python

[This article was first published on PyShark, and kindly contributed to python-bloggers].

In this article we will discuss how to extract links from a web page using Python.

Table of Contents

  • Introduction
  • Get HTML content from URL
  • Finding and extracting links from HTML
  • Complete Object-Oriented Programming Example
  • Conclusion

Introduction

URL extractors are a popular tool for people working in the digital space, from marketers to SEO professionals. They are also a core building block for web scrapers in the programming community. These scripts range from very simple ones (like the one in this tutorial) to very advanced web crawlers used by industry leaders.

Let’s see how we can quickly build our own URL scraper using Python.

To continue following this tutorial we will need two Python libraries: httplib2 and bs4 (Beautiful Soup, distributed on PyPI as beautifulsoup4).

If you don’t have them installed, please open a terminal (“Command Prompt” on Windows) and install them using the following commands:

pip install httplib2
pip install beautifulsoup4

Get HTML content from URL using Python

To begin this part, let’s first decide on the URL that we would like to extract the links from. As an example, I will extract the links from the homepage of this blog https://pyshark.com/:

url = 'https://pyshark.com/'

Next, we will import httplib2 and create an instance of its Http class, which represents a client HTTP interface:

import httplib2

http = httplib2.Http()

We will need this instance in order to perform HTTP requests to the URLs we would like to extract links from.

Now we will need to perform the following HTTP request:

response, content = http.request(url)

An important note is that the .request() method returns a tuple: the first element is an instance of the Response class, and the second is the content of the body of the URL we are working with.

From here on, we only need the content component of the tuple. It is the actual HTML of the webpage: the entity body of the response, which httplib2 returns as a bytes object in Python 3.
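BeautifulSoup accepts the raw body as it is, but if you want to print or inspect the HTML yourself it helps to turn it into text first. The following is a minimal sketch, assuming the Response object can be read like a dictionary of lowercase header names; the decode_body helper and its UTF-8 fallback are assumptions made for this illustration, not part of httplib2.

```python
from email.message import Message


def decode_body(headers, content, default="utf-8"):
    """Decode a raw response body using the charset from Content-Type."""
    msg = Message()
    content_type = headers.get("content-type")
    if content_type:
        msg["content-type"] = content_type
    # Fall back to the default when the header names no charset
    charset = msg.get_content_charset() or default
    return content.decode(charset, errors="replace")


# Stand-in headers and body, so the example runs without a network request
sample_headers = {"content-type": "text/html; charset=ISO-8859-1"}
sample_text = decode_body(sample_headers, b"<a href='/menu/'>caf\xe9</a>")
print(sample_text)
```

With the variables from the request above, the call would look like decode_body(response, content).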


Finding and extracting links from HTML using Python

At this point we have the HTML content of the URL we would like to extract links from. We are only one step away from getting all the information we need.

Let’s see how we can extract the needed information:

from bs4 import BeautifulSoup

links = []

for link in BeautifulSoup(content, 'html.parser').find_all('a', href=True):
    links.append(link['href'])

To begin with, we import BeautifulSoup and create an empty list (links) that we will use to store the links that we will extract from the HTML content of the webpage.

Then, we create a BeautifulSoup() object and pass it the HTML content together with the name of the parser to use (here Python's built-in 'html.parser'). It builds a nested representation of the HTML content.

As the final step, we need to discover the links in the HTML content of the webpage. To do so, we call the .find_all() method with 'a' and href=True, so that it returns only the <a> tags that actually have an href attribute.
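To see what the href=True filter does without making a network request, here is a small sketch that runs the same search on a hand-written HTML snippet. The snippet and the URLs in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for this example
html = """
<p>
  <a href="/about/">About</a>
  <a name="top">Named anchor without href</a>
  <a href="https://example.com/">External</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that carry an href attribute
sample_links = [a["href"] for a in soup.find_all("a", href=True)]

print(sample_links)
```

The second tag is skipped because it has no href attribute, so only two values end up in the list.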

Once the script discovers the URLs, it will append them to the links list we have created before. In order to check what we found, simply print out the content of the final list:

for link in links:
    print(link)

And we should see each URL printed out one by one.
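Note that the href values come back exactly as they are written in the HTML, so some of them may be relative (for example /about/). If absolute URLs are needed, one option is urljoin from the standard library. This is a minimal sketch; the to_absolute helper, the base URL path, and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical page URL and hrefs, for illustration only
base_url = "https://pyshark.com/blog/"
hrefs = ["/about/", "post-1/", "https://example.com/"]

absolute_links = to_absolute(base_url, hrefs)
for link in absolute_links:
    print(link)
```

Links that are already absolute pass through unchanged, so the helper can be applied to the whole list.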


Complete Object-Oriented Programming Example

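import httplib2
from bs4 import BeautifulSoup


class Extractor:

    def get_links(self, url):
        # Fetch the page; content holds the raw HTML body
        http = httplib2.Http()
        response, content = http.request(url)

        links = []

        # Collect the href of every <a> tag that has one
        for link in BeautifulSoup(content, 'html.parser').find_all('a', href=True):
            links.append(link['href'])

        return links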

And this is an example of getting links from a web page using the above class:

url = 'https://pyshark.com/'

myextractor = Extractor()

links = myextractor.get_links(url)
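A homepage often links to the same page more than once and mixes links to its own site with links elsewhere. As a possible follow-up step, here is a minimal sketch that removes duplicates and separates internal from external links using the standard library's urllib.parse. The split_links helper and the sample links are hypothetical and not part of the Extractor class.

```python
from urllib.parse import urlparse


def split_links(links, base_url):
    """Drop duplicates (keeping first-seen order) and split links by host."""
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for link in dict.fromkeys(links):  # dict keys keep insertion order
        parsed = urlparse(link)
        if parsed.scheme not in ("", "http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        # Relative links have no host, so they belong to the base site
        if parsed.netloc in ("", base_host):
            internal.append(link)
        else:
            external.append(link)
    return internal, external


# Hypothetical sample, standing in for the output of get_links()
sample = [
    "https://pyshark.com/about/",
    "/contact/",
    "https://example.com/",
    "/contact/",
    "mailto:hi@example.com",
]
internal, external = split_links(sample, "https://pyshark.com/")
print(internal)
print(external)
```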

Conclusion

This article introduced the basics of link scraping from web pages using the httplib2 and bs4 libraries, and walked through a complete example of the process.

Feel free to leave comments below if you have any questions or suggestions for edits, and check out more of my Python Programming articles.

The post Extract Links from a Web Page using Python appeared first on PyShark.
