How to Scrape a Site with Beautiful Soup and Urllib


Scraping is a technique where data is collected from targeted sites to be used in developing apps or building a source of information. Scraping is mostly driven by the need for data that is not easily available. The data can be put to good use or to harmful use. There are three main steps to scraping a website.

Choose your target website

The first thing you need is to find the website you want to scrape to collect the data. Some sites prevent scraping to avoid unnecessary page visits or to protect their data, so check the site before you scrape it. I’m building a dataset for the Urdu language for NER, and I wanted the names of boys and girls in Arabic as well as in Urdu. Unfortunately, no such dataset is available yet, so I decided to find and scrape the names.
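Python’s standard library can help with that check. Here is a minimal sketch using urllib.robotparser with made-up robots.txt rules for illustration; in practice you would point it at the site’s real robots.txt with set_url() and read():

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /names/",
]

rp = RobotFileParser()
rp.parse(rules)  # in practice: rp.set_url("https://example.com/robots.txt"); rp.read()

print(rp.can_fetch("*", "/names/boys-islamic-names-urdu.html"))  # → True
print(rp.can_fetch("*", "/private/data.html"))                   # → False
```

can_fetch() takes a user agent and a URL (or path) and tells you whether the rules allow fetching it.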

Understand the structure of HTML

It is important to know the tags of the HTML language. Once you know the structure of the site you want to scrape, it helps you select the HTML elements that hold the data you are after. The browser’s inspection tool is very helpful for walking through the HTML and picking out the elements that contain the necessary data. You can use the IDs or classes of elements to extract the data.
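As a toy illustration (the HTML snippet below is made up, not taken from the actual site), here is how Beautiful Soup selects elements by id or by class:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet for illustration
html = """
<ul id="names" class="list_item_wrap">
  <li><a href="/a">Ali - Name A</a></li>
  <li><a href="/b">Bilal - Name B</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="names")                                # select by id
by_class = soup.find("ul", attrs={"class": "list_item_wrap"})  # select by class

print([a.text for a in by_class.find_all("a", href=True)])
# → ['Ali - Name A', 'Bilal - Name B']
```

Both lookups return the same element here; on a real page you pick whichever attribute the inspection tool shows you.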

Scrape the site

There are many tools available in Python for scraping a site, such as Scrapy, Beautiful Soup, and Selenium. I’ve used Beautiful Soup, so I’m sticking to it. Run these commands in the terminal to install Beautiful Soup and NumPy.

pip install beautifulsoup4
pip install numpy
I’ve selected Urdu Point for getting the names data for boys and girls.

The Code

Import Beautiful Soup, urllib for opening the web pages, and NumPy for saving the results later.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import numpy as np
The pagination markup on the page (screenshot omitted) is a ul element with class "pagination", whose li items hold the page numbers.
Get the total count of pages from the pagination list.

names_page = "https://www.urdupoint.com/names/boys-islamic-names-urdu.html"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(names_page, headers=hdr)
page = urlopen(req)
soup_object = BeautifulSoup(page, features='html.parser')
pagination = soup_object.find("ul", attrs={"class": "pagination"})
total_pages = pagination.findAll("li")
total_pages = list(filter(None, [l.text for l in total_pages]))
total_count = int(total_pages[-1]) + 1
print("total pages " + total_pages[-1])
Now use this total_count variable to scrape each page. The names sit in an HTML list (screenshot omitted): a ul with class "list_item_wrap" whose links carry the name text.
Select list elements using find method with class attrs and then grab all the links and get the text values.
pagination_page = "https://www.urdupoint.com/names/boys-islamic-names-urdu/{}.html"
boys_names = []
for i in range(1, total_count):
    page = pagination_page.format(str(i))
    req = Request(page, headers=hdr)
    page = urlopen(req)
    print("Scraping Page {}".format(i))
    soup_object = BeautifulSoup(page, features='html.parser')
    names_list = soup_object.find("ul", attrs={'class': 'list_item_wrap'})
    names_list = names_list.find_all('a', href=True)
    for name in names_list:
        boys_names.append(name.text.split("-")[1].strip())
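One small post-processing step worth adding here (my addition, not part of the original flow): pagination scrapes sometimes pick up duplicate names, and an order-preserving dedupe is a one-liner:

```python
# Sample values standing in for the scraped list
boys_names = ["Ahmad", "Bilal", "Ahmad", "Danish"]

# dict.fromkeys keeps the first occurrence of each name, in order
unique_names = list(dict.fromkeys(boys_names))
print(unique_names)  # → ['Ahmad', 'Bilal', 'Danish']
```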
Now save the scraped names in CSV format. You can use NumPy or Pandas for this; I’m using NumPy’s `savetxt` function to write the data to a CSV file.
np.savetxt("./boys_names.csv", boys_names, fmt="%s", delimiter=",",
           encoding="utf-8", header="boys_names", comments="")
I’ve used this same code for scraping the girls’ names from these URLs. Just replace the links and variables as needed.

names_page = "https://www.urdupoint.com/names/girls-islamic-names-urdu.html"
pagination_page = "https://www.urdupoint.com/names/girls-islamic-names-urdu/{}.html"
This is my first article on scraping, and I believe it just scratches the surface of web scraping.

If you have any questions, feel free to comment or contact me on LinkedIn.
