Posts

Showing posts from 2019

How to build an NER dataset for the Urdu language?

Prodigy annotation tool
Named Entity Recognition is one of the most common and important tasks in NLP. There are a lot of resources and prebuilt solutions available for English, but Urdu is a low-resource language and there are no usable datasets available. In this article, I'm going to show you how you can build your own NER dataset with minimal effort by annotating the entities. I'm using the UNER ( https://github.com/mirfan899/Urdu#uner-dataset ) entities for this article. Annotator: There are some good annotators available for annotating text data, such as http://brat.nlplab.org/ , https://prodi.gy/ , https://www.lighttag.io/ , and https://github.com/YuukanOO/tracy . I'm more interested in building a dataset that can also be used for a chatbot in the future, so I've decided to use Prodigy ( https://prodi.gy/ ); you need to purchase a license to use it, or you can apply for an educational research license. Commands: Train ner-...
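
The excerpt above is cut off before the Prodigy commands. As a minimal sketch of the step that follows annotation, the JSONL that Prodigy exports (records with a "text" field and a list of "spans" carrying start, end and label) can be turned into word/tag pairs; the file name, the whitespace tokenization and the BIO scheme here are my assumptions, not necessarily what the full post does.

# A minimal sketch (not the post's exact commands): converting Prodigy-style
# JSONL annotations ({"text": ..., "spans": [{"start", "end", "label"}]})
# into word/tag pairs. The file name and tagging scheme are assumptions.
import json

def spans_to_bio(text, spans):
    """Assign a BIO-style tag to each whitespace token of text."""
    tokens = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        offset = end
        tag = "O"
        for span in spans:
            if start >= span["start"] and end <= span["end"]:
                prefix = "B" if start == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append((word, tag))
    return tokens

with open("uner_annotations.jsonl", encoding="utf-8") as f:  # hypothetical export file
    for line in f:
        record = json.loads(line)
        print(spans_to_bio(record["text"], record.get("spans", [])))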

How to build Word2Vec for the Urdu language.

Vector representation of words
Word to vector is an important feature of a language model. It gives you insights into a language model and how effective that model is. There is no general w2v (word to vector) model for a language that you can just pick up and use; you need to build it for your language depending on the domain and the specific application. You can build one large w2v model, but it creates a heavy performance burden and uses a lot of resources in your applications. So you need to choose the dataset accordingly and build a w2v model for that specific application. For example, if you are building a model for news, choose a news dataset and then build the w2v model from it. In this article, I'm going to build a w2v model from a freely available journalism dataset. COUNTER (COrpus of Urdu News TExt Reuse): This dataset is collected from journalism and can be used for Urdu NLP research. Here is the link to the resource for more information. This ...
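
The excerpt stops before the training code. As a minimal sketch, and not the post's exact script, gensim's Word2Vec can be trained on whitespace-tokenized Urdu sentences; the corpus file name and the hyperparameters are assumptions, and older gensim releases use size= instead of vector_size=.

# A minimal sketch: training a Word2Vec model on tokenized Urdu sentences
# with gensim. The corpus file and parameters are assumptions; older gensim
# versions use `size=` instead of `vector_size=`.
from gensim.models import Word2Vec

# one sentence per line, whitespace-tokenized Urdu text (hypothetical file)
with open("counter_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)
model.save("urdu_w2v.model")

# nearest neighbours of a word, if it appears in the corpus
print(model.wv.most_similar("پاکستان", topn=5))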

Urdu Word and Sentence Similarity using SpaCy

Similarity is a common measure of how close two words or sentences are to each other. There are multiple ways to find the similarity of two documents, and the one most commonly used in NLP is Cosine Similarity. Cosine Similarity is computed over vectors (word2vec) and tells you how close two vectors are in terms of their orientation. Some helpful links to understand the similarity concepts: https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50 https://www.sciencedirect.com/topics/computer-science/cosine-similarity The result mostly depends on the quality of the vectors of the documents: if you want better results, build a better word2vec model. To use the similarity feature of SpaCy, you need to build a language model (you can build one by following my article https://www.urdunlp.com/2019/08/how-to-build-urdu-language-model-in.html ). Here is how I've calculated the cosi...
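
The calculation itself is cut off above. A minimal sketch, assuming an Urdu model with word vectors has already been built and packaged as described in the linked article; the model name below is hypothetical.

# A minimal sketch, assuming an Urdu model with word vectors is available;
# the model name passed to spacy.load() is hypothetical.
import spacy

nlp = spacy.load("ur_model")  # hypothetical packaged Urdu model with vectors

doc1 = nlp("یہ ایک اچھا دن ہے")
doc2 = nlp("آج موسم بہت خوشگوار ہے")

# document-level cosine similarity over the averaged word vectors
print(doc1.similarity(doc2))

# token-level similarity
print(doc1[1].similarity(doc2[1]))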

How to build an Urdu language model in SpaCy

Urdu alphabets
SpaCy is one of the most commonly used NLP libraries for building NLP and chatbot apps. The Urdu language does not have many resources for building chatbot and NLP apps; most of the tools are proprietary or the data is licensed. Having added support for the Urdu language, I'm going to show you how to build an Urdu model which can be used for multiple applications such as word and sentence similarity, chatbots, knowledge bases, etc. Follow the steps to build the model. Step 1: Build word frequencies for Urdu. I've created a script that can be used to build word frequencies. There are multiple resources available for building word frequencies; you can choose whatever you want, but the format should be like this: frequency document_frequency word Here is the script I'm using to build word frequencies for SpaCy. from __future__ import unicode_literals import string import codecs import glob from collections import Counter import re import plac from multiprocessing...
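
Since the script is truncated above, here is a minimal single-process sketch of the same idea: count how often each word occurs overall and in how many documents it appears, then write the tab-separated lines in the format described. The corpus folder and output file name are assumptions.

# A minimal sketch of the word-frequency step (not the full multiprocessing
# script from the post): count total frequency and document frequency for
# every word in a folder of .txt files and write tab-separated lines.
# The corpus folder and output file name are assumptions.
import glob
from collections import Counter

freqs = Counter()
doc_freqs = Counter()

for path in glob.glob("urdu_corpus/*.txt"):
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    freqs.update(words)
    doc_freqs.update(set(words))  # each word counted once per document

with open("urdu_freqs.txt", "w", encoding="utf-8") as out:
    for word, freq in freqs.most_common():
        out.write(f"{freq}\t{doc_freqs[word]}\t{word}\n")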

Named Entity Recognition for Urdu

Urdu is a less developed language compared to English. That's why it lacks research and development resources for natural language processing, speech recognition, and other AI and ML related problems. It took me a long time to build a dataset and enhance it for NLP tasks because the datasets which are available are not enough to do ML, and most of them are proprietary, which restricts researchers and developers. Fortunately, I've made POS and NER datasets publicly available on Github for research and development. This article is about building an NER model with the UNER dataset using Python. Install the necessary packages for training: pip3 install numpy keras matplotlib scikit-learn then import the packages: import ast import codecs import json import matplotlib.pyplot as plt import numpy as np from keras.layers import Dense, InputLayer, Embedding, Activation, LSTM from keras.models import Sequential from keras.optimizers import Adam from...
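
The imports are cut off above. As a sketch of the kind of sequence-labelling network those imports point to, an Embedding + LSTM model over padded sentences can be assembled as below; the vocabulary size, tag count, sequence length and hyperparameters are assumptions rather than the post's actual values.

# A minimal sketch of an Embedding + LSTM tagger consistent with the imports
# shown in the excerpt; the sizes below are assumptions.
from keras.layers import Dense, InputLayer, Embedding, Activation, LSTM, TimeDistributed
from keras.models import Sequential
from keras.optimizers import Adam

MAX_LEN = 50        # padded sentence length (assumed)
VOCAB_SIZE = 20000  # distinct words plus padding/unknown (assumed)
NUM_TAGS = 9        # number of UNER labels including "O" (assumed)

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LEN,)))
model.add(Embedding(VOCAB_SIZE, 128))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(NUM_TAGS)))
model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy",
              optimizer=Adam(0.001),
              metrics=["accuracy"])
model.summary()

# model.fit(X_train, y_train, batch_size=128, epochs=10, validation_split=0.2)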

Urdu Tokenization using SpaCy

SpaCy is an NLP library which supports many languages. It’s fast and has DNNs built in for performing many NLP tasks such as POS and NER. It has extensive support and good documentation, provides GPU support, and can be integrated with Tensorflow, PyTorch, Scikit-Learn, etc. SpaCy provides the easiest way to add support for a new language: a new language can be added by simply following the Adding Languages article. I’ve added the Urdu language with dictionary-based lemmatization, lexical support and stop words ( Urdu ). Here is how you can use the tokenizer for the Urdu language. First, install SpaCy. $ pip install spacy Now import spacy and create a blank object with support for the Urdu language. I’m using blank because there is no proper model available for Urdu yet, but tokenization support is available. import spacy nlp = spacy.blank('ur') doc = nlp(" کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔") print("Urdu Tokeniza...
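
The print statement is truncated above; a short sketch of what iterating over the resulting tokens looks like (the sentence is the one from the excerpt, the loop is illustrative):

# A short sketch of iterating over the tokens of the blank Urdu pipeline;
# the sentence is taken from the excerpt, the loop and printing are illustrative.
import spacy

nlp = spacy.blank("ur")
doc = nlp("کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔")

for token in doc:
    print(token.text)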

Classifying Indeed Jobs using DNNs

Indeed is a top site for posting job ads. Companies and employers post ads for open positions on Indeed to hire people with the required skills, and it helps job seekers find jobs matching their skills and expertise. There are many parameters for finding a job, such as salary range, location, skills, requirements, job type and many more, but for classifying jobs I’ve chosen the requirements and location parameters to predict jobs, using job type as the classes or labels. Step 1: Scraping Indeed Jobs I’ve used Beautiful Soup to scrape Indeed for different types of jobs. Indeed uses some tricks to prevent the site from being scraped. Here is how I did it. import re import pandas as pd from time import sleep from bs4 import BeautifulSoup from nltk.corpus import stopwords from urllib.request import urlopen import nltk FULL_TIME = 'full%20time&l=New%20York' PART_TIME = 'part%20time&l=New%20York' CONTRACT = 'contract&l=New%20York' INTERNSHIP =...
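
The code block above is cut off after the URL fragments. A heavily hedged sketch of the general pattern follows: build a search URL from one of those fragments, fetch it with a browser-like User-Agent, and parse the job cards. Indeed's URL parameters and HTML class names change over time, so the base URL and the CSS class below are placeholders, not what the post used.

# A hedged sketch of the general fetch-and-parse loop, not the post's exact code.
# The base URL and the "jobsearch-SerpJobCard" selector are placeholders to adapt.
from time import sleep
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

BASE = "https://www.indeed.com/jobs?q=python+"  # assumed base query
PART_TIME = "part%20time&l=New%20York"

url = BASE + PART_TIME
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # look like a browser
html = urlopen(request).read()
soup = BeautifulSoup(html, "html.parser")

for card in soup.find_all("div", class_="jobsearch-SerpJobCard"):  # placeholder class
    title = card.find("a")
    if title:
        print(title.get_text(strip=True))

sleep(1)  # be polite between requests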

Remove duplicates from two Pandas DataFrames

During my research on Indeed for classifying job types, I found an issue with multiple labels being assigned to the same job, i.e. temporary and part-time assigned to the same posting. I used the following code to remove the common job descriptions that carry different labels.

import pandas as pd

part_time = pd.read_csv("part_time.csv", index_col=0)
temporary = pd.read_csv("temporary.csv", index_col=0)

# find common jobs using the description column with the isin() function.
# A = intersection of part_time and temporary
A = part_time[part_time.description.isin(temporary.description)]

# remove the common descriptions from both part_time and temporary jobs.
# temporary - A
# part_time - A
temporary = temporary[~temporary.description.isin(A.description)]
part_time = part_time[~part_time.description.isin(A.description)]

# now concat the two data frames and save.
total = pd.concat([temporary, part_time])
total.to_csv("indeed_jobs.csv", index=False)

Hoping it will help those who have the same issue.

How to Scrape a Site with Beautiful Soup and Urllib

Scraping is a technique where data is collected from targeted sites to be used in developing apps or building a source of information. Scraping is mostly driven by the need for data that is not easily available, and that data can be used in good ways or in harmful ways. There are three main steps to scrape a website. Choose your target website The first thing you need is to find the website you want to scrape to collect the data. Some sites prevent scraping to avoid unnecessary visits to pages or to protect their data, so you need to check the site before scraping it. I’m building a dataset for the Urdu language for NER and I wanted to get the names of boys and girls in Arabic as well as in Urdu. Unfortunately, there is no such dataset available yet, so I decided to find and scrape the names. Understand the structure of the HTML It is important that you know the tags of HTML. Once you know the structure of the site you want to ...
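
The excerpt is cut before the code, so here is a minimal sketch of the three steps it describes, using a hypothetical page that lists names inside <li> tags; the URL and the tag structure are assumptions.

# A minimal sketch of the three steps in the post: fetch a target page,
# parse its HTML, and pull out the items you need. The URL and the
# assumption that names sit in <li> tags are hypothetical.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://example.com/urdu-boys-names"  # hypothetical target page
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

names = [li.get_text(strip=True) for li in soup.find_all("li")]
print(names[:10])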

Urdu POS Tagging using MLP

Urdu is a less developed language compared to English for natural language processing applications. POS tagging is a simple and very common natural language processing task, but datasets for training an Urdu POS tagger are scarce. There are different POS tagsets, such as Muaz’s Tagset and Sajjad’s Tagset, available in the literature. Due to the non-availability of datasets and restrictions on their use, much NLP work is still in progress. I’ve developed a dataset for training POS taggers for the Urdu language. It is available on Github. It is a small dataset but more than enough to train a POS tagger. It has been built using Sajjad’s Tagset because this tagset covers all the words in Urdu literature and has 39 tags. Here are some examples of this tag set. I’ve used Keras to build the MLP model for POS. The data is in tab-separated form and is converted to sentences and tags using utility functions. Here is the format of data.txt [('اشتتیاق', 'NN'), ('اور', 'CC...
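
The excerpt ends at the sample line of data.txt. A short sketch of the parsing step it describes, assuming each line holds a Python-literal list of (word, tag) tuples as shown; the post's own utility functions may differ.

# A short sketch of the parsing step: each line of data.txt is assumed to hold
# a Python-literal list of (word, tag) tuples, which ast.literal_eval turns
# back into parallel sentence and tag sequences.
import ast

sentences, tag_sequences = [], []
with open("data.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        pairs = ast.literal_eval(line)  # e.g. [('اشتتیاق', 'NN'), ('اور', 'CC'), ...]
        sentences.append([word for word, tag in pairs])
        tag_sequences.append([tag for word, tag in pairs])

print(sentences[0])
print(tag_sequences[0])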