Scraping is a technique for collecting data from targeted sites, either to develop an app or to build a source of information. Scraping is mostly driven by the need for data that is not easily available elsewhere. That data can be put to good use or to harmful use, so scrape responsibly. There are three main steps to scrape a website.
Choose your target website
The first thing you need is to find the website you want to scrape. Some sites block scraping to avoid unnecessary page visits or to protect their data, so check the site before you scrape it. I'm building a dataset for the Urdu language for NER, and I wanted the names of boys and girls in Arabic as well as in Urdu. Unfortunately, no such dataset is available yet, so I decided to find and scrape the names.
Understand the structure of HTML
It is important to know the tags of the HTML language. Once you understand the structure of the site you want to scrape, you can pick out the HTML elements that hold the data. The browser's inspection tool is very helpful for walking through the HTML and finding the elements that contain the data you need. You can use element IDs or classes to extract the data.
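To make this concrete, here is a minimal sketch of selecting an element by its class with Beautiful Soup. The markup is hypothetical, but the class name mirrors the one used on the real pages:

```python
from bs4 import BeautifulSoup

# Hypothetical markup of the kind you might see in the inspector.
html = """
<ul class="list_item_wrap">
  <li><a href="/names/1.html">Ahmad - احمد</a></li>
  <li><a href="/names/2.html">Bilal - بلال</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Select the list by its class, then collect the Urdu half of each link's text.
items = soup.find("ul", attrs={"class": "list_item_wrap"})
names = [a.text.split("-")[1].strip() for a in items.find_all("a", href=True)]
print(names)  # ['احمد', 'بلال']
```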
Scrape the site
There are many tools available in Python for scraping a site, such as Scrapy, Beautiful Soup, and Selenium. I've used Beautiful Soup, so I'm sticking to it. Run these commands in the terminal to install Beautiful Soup and NumPy.
pip install beautifulsoup4
pip install numpy
The Code
Import Beautiful Soup, and urllib for opening the web pages.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
The pagination is rendered as a ul element with the class "pagination", with one li per page link.
Get the total count of pages from the pagination list.
names_page = "https://www.urdupoint.com/names/boys-islamic-names-urdu.html"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(names_page, headers=hdr)
page = urlopen(req)
soup_object = BeautifulSoup(page, features='html.parser')
pagination = soup_object.find("ul", attrs={"class": "pagination"})
total_pages = pagination.findAll("li")
total_pages = list(filter(None, [l.text for l in total_pages]))
total_count = int(total_pages[-1]) + 1
print("total pages " + total_pages[-1])
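To see what the filtering step above does, here is the same logic run on a hypothetical list of li texts; the empty strings stand in for list items that hold icons rather than page numbers:

```python
# Hypothetical texts of the pagination's <li> elements.
li_texts = ["1", "2", "3", "", "25", ""]

# filter(None, ...) drops the empty strings; the last remaining
# entry is the highest page number.
total_pages = list(filter(None, li_texts))
total_count = int(total_pages[-1]) + 1  # +1 so range() covers the last page
print(total_count)  # 26
```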
Select the list element using the find method with a class in attrs, then grab all the links and take their text values.
pagination_page = "https://www.urdupoint.com/names/boys-islamic-names-urdu/{}.html"
boys_names = []
for i in range(1, total_count):
    page = pagination_page.format(str(i))
    req = Request(page, headers=hdr)
    page = urlopen(req)
    print("Scraping Page {}".format(i))
    soup_object = BeautifulSoup(page, features='html.parser')
    names_list = soup_object.find("ul", attrs={'class': 'list_item_wrap'})
    names_list = names_list.find_all('a', href=True)
    for name in names_list:
        boys_names.append(name.text.split("-")[1].strip())
Now save the scraped names in CSV format. You can use NumPy or Pandas for this; I'm using NumPy's `savetxt` function to write the data to a CSV file.
import numpy as np

np.savetxt("./boys_names.csv", boys_names, fmt="%s", delimiter=",",
           encoding="utf-8", header="boys_names", comments="")
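If you would rather avoid the NumPy dependency, the standard library's csv module can write the same file. This is a minimal sketch, with two sample names standing in for the scraped list:

```python
import csv

boys_names = ["احمد", "بلال"]  # sample data standing in for the scraped list

with open("boys_names.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["boys_names"])  # header row, like savetxt's header=
    for name in boys_names:
        writer.writerow([name])
```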
To scrape the girls' names, repeat the same steps with these URLs:
names_page = "https://www.urdupoint.com/names/girls-islamic-names-urdu.html"
pagination_page = "https://www.urdupoint.com/names/girls-islamic-names-urdu/{}.html"
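Since the boys' and girls' pages share the same structure, the per-page extraction can be factored into one helper and reused for both. The function name `extract_names` is my own; the parsing mirrors the loop above:

```python
from bs4 import BeautifulSoup

def extract_names(soup_object):
    """Pull the Urdu half of each 'English - Urdu' link on a names page."""
    names_list = soup_object.find("ul", attrs={"class": "list_item_wrap"})
    return [a.text.split("-")[1].strip()
            for a in names_list.find_all("a", href=True)]

# Against a live page, usage would look like:
#   soup_object = BeautifulSoup(urlopen(req), features="html.parser")
#   girls_names.extend(extract_names(soup_object))
```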
This is my first article on scraping, and I believe it just scratches the surface of web scraping.
If you have any questions, feel free to comment or contact me on LinkedIn.