
Posts

Showing posts from April, 2019

Remove duplicates from two Pandas DataFrames

During my research on Indeed for classifying job types, I found an issue with multiple labels being assigned to the same job, i.e. the same posting is tagged both temporary and part-time. I used the following code to remove job descriptions that appear under both labels.

import pandas as pd

part_time = pd.read_csv("part_time.csv", index_col=0)
temporary = pd.read_csv("temporary.csv", index_col=0)

# Find the common jobs (A = part_time ∩ temporary) by matching on the description column.
A = part_time[part_time.description.isin(temporary.description)]

# Remove the common jobs from both frames: temporary - A and part_time - A.
# Note: compare against A.description rather than the whole frame A, because
# Series.isin() with a DataFrame argument tests membership in its column labels.
temporary = temporary[~temporary.description.isin(A.description)]
part_time = part_time[~part_time.description.isin(A.description)]

# Now concatenate the two cleaned data frames and save the result.
total = pd.concat([temporary, part_time])
total.to_csv("indeed_jobs.csv", index=False)

I hope this helps anyone who runs into the same issue.
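For readers who want to see the same intersection-and-difference logic in isolation, here is a minimal, self-contained sketch on toy data; the frames and descriptions below are made up for illustration and only the description column matters:

import pandas as pd

# Toy frames standing in for the Indeed CSVs (illustrative values only).
part_time = pd.DataFrame({"description": ["cashier", "driver", "tutor"]})
temporary = pd.DataFrame({"description": ["driver", "cleaner"]})

# A = part_time ∩ temporary on the description column.
A = part_time[part_time.description.isin(temporary.description)]

# Drop the shared descriptions from both frames, then combine.
part_time = part_time[~part_time.description.isin(A.description)]
temporary = temporary[~temporary.description.isin(A.description)]
total = pd.concat([temporary, part_time])

print(total)  # "driver" appears in neither frame anymore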

How to Scrape a Site with Beautiful Soup and urllib

Scraping is a technique for collecting data from targeted sites so it can be used to build the desired apps or to serve as a source of information. It is mostly driven by the need for data that is not easily available, and the collected data can be put to good use or to harmful use. There are three main steps to scrape a website.

Choose your target website
The first thing you need is to find the website you want to scrape to collect the data. Some sites block scraping to avoid unnecessary visits to their pages or to protect their data, so check the site before you scrape it. I'm building a dataset for the Urdu language for NER and I wanted the names of boys and girls in Arabic as well as in Urdu. Unfortunately, there is no such dataset available yet, so I decided to find and scrape the names.

Understand the structure of the HTML
It is important that you know HTML tags. Once you know the structure of the site you want to…
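Here is a minimal sketch of the two libraries named in the title, urllib to fetch a page and Beautiful Soup to pull text out of it. The URL and the CSS selector are placeholders for illustration, not the actual names site from the post:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Placeholder URL; swap in the real page you want to scrape.
url = "https://example.com/baby-names"
html = urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")

# Assume each name sits in an <li> inside a <ul class="names"> (an assumption about the page).
names = [li.get_text(strip=True) for li in soup.select("ul.names li")]
print(names)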

Urdu POS Tagging using MLP

Urdu is a less-resourced language compared to English for natural language processing applications. POS tagging is a simple and very common NLP task, but datasets for training an Urdu POS tagger are scarce. Different POS tagsets, such as Muaz's Tagset and Sajjad's Tagset, are available in the literature. Because datasets are either unavailable or restricted in use, much of the NLP work for Urdu is still in progress. I have developed a dataset for training POS taggers for the Urdu language. It is available on GitHub. It is a small dataset, but more than enough to train a POS tagger. It was built using Sajjad's Tagset because this tagset covers all the words in Urdu literature and has 39 tags. Here are some examples of this tagset. I used Keras to build an MLP model for POS tagging. The data is in tab-separated form and is converted to sentences and tags using utility functions. Here is the format of data.txt: [('اشتتیاق', 'NN'), ('اور', 'CC'), …
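Below is a minimal sketch of a word-level MLP tagger in Keras. It is not the exact model from the post: the toy sentence only mirrors the (word, tag) format shown above, and a real tagger would normally feed a context window of words rather than a single word:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# Toy data in the same (word, tag) format the post describes.
sentences = [[("اشتتیاق", "NN"), ("اور", "CC")]]

# Build word and tag vocabularies (index 0 is reserved for padding).
words = sorted({w for s in sentences for w, _ in s})
tags = sorted({t for s in sentences for _, t in s})
word2idx = {w: i + 1 for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

# One training example per word: the word index in, the tag index out.
X = np.array([[word2idx[w]] for s in sentences for w, _ in s])
y = np.array([tag2idx[t] for s in sentences for _, t in s])

model = Sequential([
    Embedding(input_dim=len(word2idx) + 1, output_dim=32),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(len(tag2idx), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=50, verbose=0)

# Look up the most probable tag for a known word.
print(tags[model.predict(np.array([[word2idx["اور"]]])).argmax()])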