

Showing posts from May, 2019

Named Entity Recognition for Urdu

Urdu is a low-resource language compared to English: it lacks datasets and tooling for natural language processing, speech recognition, and other AI and ML problems. It took me a long time to build and refine a dataset for NLP tasks because the publicly available datasets are too small for machine learning, and most of the larger ones are proprietary, which restricts researchers and developers. Fortunately, I've made POS and NER datasets publicly available on GitHub for research and development. This article covers building an NER model on the UNER dataset using Python. First, install the necessary packages for training:

```
pip3 install numpy keras matplotlib scikit-learn
```

Then import the packages:

```
import ast
import codecs
import json
import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Dense, InputLayer, Embedding, Activation, LSTM
from keras.models import Sequential
from keras.optimizers import Adam
from keras.prep
```
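Before the Keras layers above can be used, the tagged sentences have to be turned into fixed-length integer arrays. Here is a minimal sketch of that preprocessing step; the two toy sentences and their tags are made-up stand-ins, since the UNER dataset's exact format and tag set are not shown here.

```python
# Toy (token, tag) sentences standing in for UNER data.
sentences = [
    [("انور", "B-PER"), ("لاہور", "B-LOC"), ("گئے", "O")],
    [("وہ", "O"), ("گئے", "O")],
]

# Build vocabularies; index 0 is reserved for padding.
word2idx = {"<PAD>": 0}
tag2idx = {"<PAD>": 0}
for sent in sentences:
    for word, tag in sent:
        word2idx.setdefault(word, len(word2idx))
        tag2idx.setdefault(tag, len(tag2idx))

max_len = max(len(s) for s in sentences)

def encode(sent):
    """Convert one sentence to fixed-length index sequences."""
    x = [word2idx[w] for w, _ in sent] + [0] * (max_len - len(sent))
    y = [tag2idx[t] for _, t in sent] + [0] * (max_len - len(sent))
    return x, y

X, Y = zip(*(encode(s) for s in sentences))
print(list(X))  # padded word-index sequences
print(list(Y))  # padded tag-index sequences
```

Arrays like these feed directly into an `Embedding` layer (with `mask_zero=True` so the padding index is ignored) followed by the `LSTM` and `Dense` layers imported above.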

Urdu Tokenization using SpaCy

SpaCy is an NLP library that supports many languages. It's fast, has deep neural network models built in for tasks such as POS tagging and NER, provides GPU support, can be integrated with TensorFlow, PyTorch, scikit-learn, etc., and has good documentation. SpaCy also provides one of the easiest ways to add support for a new language: simply follow the Adding Languages article. I've added Urdu language support with dictionary-based lemmatization, lexical support, and stop words. Here is how you can use the tokenizer for Urdu. First, install SpaCy:

```
$ pip install spacy
```

Now import spacy and create a blank pipeline with Urdu support. I'm using a blank pipeline because there is no trained model available for Urdu yet, but tokenization support is available:

```
import spacy

nlp = spacy.blank('ur')
# "There are also some countries where the fast lasts up to 20 hours this year."
doc = nlp("کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔")
print("Urdu Tokeniza
```
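As a rough illustration of what the rule-based tokenizer does for a language like Urdu, here is a simplified pure-Python sketch: split on whitespace, then detach trailing punctuation such as the Urdu full stop "۔" into its own token. This is only a toy stand-in; spaCy's real tokenizer also handles prefixes, infixes, and per-language exceptions.

```python
def tokenize_ur(text):
    """Toy tokenizer: whitespace split, then peel trailing Urdu
    punctuation (full stop '۔', comma '،', question mark '؟')
    off each chunk as separate tokens."""
    tokens = []
    for chunk in text.split():
        trailing = []
        while chunk and chunk[-1] in "۔،؟":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens

print(tokenize_ur("کچھ ممالک ایسے بھی ہیں۔"))
```

With the spaCy pipeline above you would instead iterate the `doc` object (`[token.text for token in doc]`) to get the tokens.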

Classifying Indeed Jobs using DNNs

Indeed is a top site for posting job ads. Companies and employers post ads for open positions on Indeed to hire people with the required skills, and it helps job seekers find jobs matching their skills and expertise. A job posting has many parameters, such as salary range, location, skills, requirements, and job type. For classifying jobs, I've chosen the requirements and location parameters as features, with the job type as the class label.

Step 1: Scraping Indeed Jobs

I've used Beautiful Soup to scrape Indeed for different types of jobs. Indeed uses some tricks to prevent the site from being scraped. Here is how I did it:

```
import re
import pandas as pd
from time import sleep
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from urllib.request import urlopen
import nltk

FULL_TIME = 'full%20time&l=New%20York'
PART_TIME = 'part%20time&l=New%20York'
CONTRACT = 'contract&l=New%20York'
INTERNSHIP =
```
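Rather than hand-encoding query fragments like `'full%20time&l=New%20York'`, the search URLs can be built with the standard library. The base URL below is an assumption for illustration; only the query-string construction is the point of the sketch.

```python
from urllib.parse import urlencode, quote

# Hypothetical base URL for illustration.
BASE = "https://www.indeed.com/jobs"

def search_url(job_type, location="New York"):
    """Build a search URL for a job type and location.

    quote_via=quote reproduces the %20 escaping used in the
    hand-written constants (urlencode's default would emit '+').
    """
    return BASE + "?" + urlencode({"q": job_type, "l": location}, quote_via=quote)

print(search_url("full time"))
print(search_url("internship"))
```

This keeps the job-type strings readable and makes it trivial to add new categories or change the location in one place.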