Steps Involved in Web Crawling
To follow this tutorial step by step, you'll need Python 3 installed and configured on your local development machine. Set up everything you need beforehand, then come back and continue.
Creating a Basic Web Scraper
Web scraping is a two-step process:
You send an HTTP request and get back the source code of a web page.
You take that source code and extract information from it.
Both steps can be implemented in many ways and in many languages. Here we will use Python's built-in urllib.request module and the third-party bs4 (BeautifulSoup4) package. Only bs4 needs to be installed:
pip install beautifulsoup4
If you want to install BeautifulSoup4 without using pip, or you run into any issues during installation, you can always refer to the official documentation.
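A quick way to confirm the install worked is to ask Python whether it can locate the package. The sketch below is only an illustration: the helper name is_installed is hypothetical, and it relies on find_spec from the standard library's importlib.util.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# "bs4" is the import name of the beautifulsoup4 distribution.
print("bs4 available:", is_installed("bs4"))
```

If this prints False, the package was probably installed for a different Python interpreter than the one running the script.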
Create a new folder 📂
With bs4 ready to use, let's create a new folder for our lab in any code editor you like (I will be using Microsoft Visual Studio Code). Inside it, create a file named crawler.py, which is the file we will run at the end.
First, we import the request module from urllib, a standard-library package that groups several modules for working with URLs and HTTP. It provides the function we will use to send an HTTP request to the website we want to scrape and get back the complete source code of its web page.
import urllib.request as req
Import BeautifulSoup4 package
Next, we bring in the bs4 package that we installed using pip. Think of bs4 as a specialized package for reading HTML or XML data. It has methods that let us extract data from the page source we give it, but on its own it doesn't know what data to look for or where to look.
We will tell it what to gather from the webpage, and it will return that information to us.
import bs4
Provide the URL for webpage
Finally, we provide the crawler with the URL of the webpage where we want to start gathering data: https://www.indeed.co.in/python-jobs.
If you paste this URL into your browser, you will reach Indeed's search results page, which shows the most relevant jobs out of the 11K jobs that list Python as a required skill.
Next, we will send an HTTP request to this URL.
URL = "https://www.indeed.co.in/python-jobs"
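Before sending a request, it can help to confirm that the URL is well formed. The sketch below splits it into its parts with urlsplit from the standard library's urllib.parse; the variable name parts is just an illustrative choice.

```python
from urllib.parse import urlsplit

URL = "https://www.indeed.co.in/python-jobs"

parts = urlsplit(URL)
print(parts.scheme)  # protocol the request will use
print(parts.netloc)  # host the request is sent to
print(parts.path)    # page requested on that host
```

A typo such as a missing scheme shows up here as an empty scheme or netloc, which is easier to spot than a failed request.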
Making an HTTP request
Now let's request the search results page from Indeed over HTTP(S). You typically do this with urlopen() from Python's urllib.request module. The HTTP response we get back is a file-like object rather than readable page source, so we will hand it over to bs4 to parse. Send a request to the website like this:
response = req.urlopen(URL)
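A request can fail, for example when the server answers with an error status or the network is down. The sketch below shows one way to guard against that. It is not part of the original crawler: the function name fetch_html, the timeout value, and the User-Agent string are all assumptions made for illustration.

```python
import urllib.error
import urllib.request as req


def fetch_html(url, timeout=10):
    """Fetch a page and return its body as text, or None if the request fails."""
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    request = req.Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial crawler)"})
    try:
        with req.urlopen(request, timeout=timeout) as response:
            # Use the charset the server declares, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as error:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("Server returned an error status:", error.code)
    except urllib.error.URLError as error:
        print("Could not reach the server:", error.reason)
    return None
```

Calling fetch_html(URL) would then give you either the page source as a string or None, so the rest of the script can decide what to do when the download fails.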
Extracting the source code
Now let's extract the source code from the response object. You will generally do this by passing the response object to the BeautifulSoup class in the bs4 package. The source code is very large and tedious to read through, so we will want to filter information out of it later on. Hand the response object over to BeautifulSoup by writing the following line:
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
htmlSourceCode = bs4.BeautifulSoup(response, "html.parser")
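To give a feel for the filtering step mentioned above, here is a minimal sketch that pulls the title and the links out of a parsed page. The HTML is a made-up stand-in, not Indeed's real markup, and the names SAMPLE_HTML, page_title and links are hypothetical. The "html.parser" argument selects Python's built-in parser.

```python
import bs4

# Hypothetical stand-in for a downloaded page; real pages are far larger.
SAMPLE_HTML = """
<html>
  <head><title>Python Jobs</title></head>
  <body>
    <a href="/job/1">Backend Developer</a>
    <a href="/job/2">Data Engineer</a>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# The text inside the <title> tag.
page_title = soup.title.get_text()

# One (link text, href) pair for every <a> tag in the document.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

print(page_title)
for text, href in links:
    print(text, href)
```

The same calls work on the object built from a real response. Only the tags you search for would change, depending on how the target page is structured.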
Testing the crawler
Now let's test the code. You can run a Python file by running the python command, followed by the file name, in VS Code's integrated terminal. VS Code also has a graphical play button that runs the file currently open in the editor. For now, run your file with the following command:
python crawler.py
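As written, the script builds htmlSourceCode but prints nothing, so a successful run is silent. One way to see a result is sketched below: it wraps the same steps in a function that returns the page title. The function name page_title is hypothetical, and passing "html.parser" (Python's built-in parser) is an assumption about which parser to use.

```python
import urllib.request as req

import bs4


def page_title(url):
    """Download a page and return the text of its <title>, or None if absent."""
    with req.urlopen(url) as response:
        # BeautifulSoup reads the file-like response while it is still open.
        soup = bs4.BeautifulSoup(response, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None
```

Adding print(page_title(URL)) as the last line of crawler.py would then print the title when you run the command above. Whether Indeed serves the page to a script is not guaranteed, since some sites reject automated requests.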
Read Full Article Here – https://brain-mentors.com/web-crawling-in-python/