
How to build an NER dataset for the Urdu language?


Prodigy annotation tool


Named Entity Recognition (NER) is one of the most common and important tasks in NLP. There are many resources and prebuilt solutions available for English. Urdu, however, is a low-resource language, and there are no readily usable NER datasets for it. In this article, I'm going to show you how to build your own NER dataset with minimal effort by annotating the entities yourself. I'm using the UNER entity set (https://github.com/mirfan899/Urdu#uner-dataset) for this article.

Annotator:
There are several good tools available for annotating text data, such as brat (http://brat.nlplab.org/), Prodigy (https://prodi.gy/), LightTag (https://www.lighttag.io/), and tracy (https://github.com/YuukanOO/tracy). Since I'm interested in building a dataset that can also be used for a chatbot in the future, I've decided to use Prodigy (https://prodi.gy/). You need to purchase a license to use it, or you can apply for an educational research license.

Commands:
We will collect annotations into a dataset called ner-ur-model, using the SpaCy Urdu model "ur_model" for tokenization. First, you need to build a JSONL file with this structure (one JSON object per line):

{"text": "پیپلزپارٹی کی حکومت اور مسلم لیگ ن کے مابین جاری دوستانہ کشمکش اب حقیقی تناؤ میں متشکل ہونا شروع ہوگئی ہے۔"}
{"text": "اس کھینچا تانی میں قومی سیاسی منظر نامے میں ایک بار پھر دائیں اور بائیں بازور کی سیاست کا ظہور ہوتا نظرآرہاہے۔"}
{"text": "اگر نواز شریف آڑے نہ آتے تو ساری قاف لیگ کب کی مسلم لیگ نون میں ضم ہوچکی ہوتی۔"}
{"text": "اب ایسا محسوس ہوتاہے کہ میاں صاحب کے اعصاب جواب دے رہے ہیں اور رفتہ رفتہ تلخ سیاسی حقائق کا ادراک کررہے ہیں۔"}
{"text": "حکومت کو کوئی بڑا ریلیف ملتانظر نہیں آتاہے لہٰذا یوسف رضا گیلانی کی ساری توجہ فوج کے ساتھ تعلقات کو ہموار رکھنے پر ہے۔"}
{"text": "اپنے ذاتی یا جماعتی مفادات کی خاطر بڑی سے بڑی مقدس روایت کو روندا لیاجاتاہے۔"}


After building the urdu.jsonl data file, you also need a text file listing the entity labels (data/entities.txt), one label per line:
PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME
Now use the following Prodigy command to start annotating into the ner-ur-model dataset:
prodigy ner.manual ner-ur-model ur_model data/urdu.jsonl --label data/entities.txt
Prodigy saves the annotations in a SQLite database. Here is the documentation link for the Prodigy tool (https://prodi.gy/docs/) for reference. Happy annotating! If you have any questions, feel free to ask in the comments.
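Once you have annotated enough examples, you can export them from the SQLite database back to JSONL with Prodigy's db-out recipe (the output file name here is just an example):

prodigy db-out ner-ur-model > ner_annotations.jsonl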

Comments

  1. Great work! Can you share how we can do dependency parsing in Urdu as well?

  2. Dependency parsing is still under development, although you can use the SpaCy Urdu model for this purpose.

  3. Greetings,
    I have tried to install the Urdu model with
    pip install ur_model-0.0.0.tar.gz
    but it showed me this error. Please help me:
    ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: 'C:\\Users\\FIM\\ur_model-0.0.0.tar.gz'

  4. The model archive must be in the directory where you run the pip command.
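    For example, change into the folder that contains the archive before installing (the Downloads path below is just an illustration; use wherever your file actually is):

    cd C:\Users\FIM\Downloads
    pip install ur_model-0.0.0.tar.gz

    Or pass the full path to pip directly:

    pip install C:\Users\FIM\Downloads\ur_model-0.0.0.tar.gz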

  5. Can you tell me the steps to train SpaCy?

