Posts

Showing posts from August, 2019

How to build NER dataset for Urdu language?

[Image: Prodigy annotation tool]

Named Entity Recognition (NER) is one of the most common and important tasks in NLP. There are a lot of resources and prebuilt solutions available for the English language. Urdu is a low-resource language, and there are no usable NER datasets available for it. In this article, I'm going to show you how you can build your own NER dataset with minimal effort by annotating the entities. I'm using the UNER ( https://github.com/mirfan899/Urdu#uner-dataset ) entities for this article.

Annotator: There are some good annotators available for annotating text data, such as http://brat.nlplab.org/ , https://prodi.gy/ , https://www.lighttag.io/ , and https://github.com/YuukanOO/tracy . I'm more interested in building a dataset that can be used for a chatbot in the future, so I've decided to use Prodigy ( https://prodi.gy/ ). You need to purchase a license to use it, or you can apply for a research license.

Commands: Train ner-...
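To make the goal of annotation concrete, here is a minimal sketch of what one annotated example might look like when entities are stored as character-offset spans, a representation many NER tools accept. The helper name `find_entity_spans`, the sample sentence, and the labels `PERSON` and `LOCATION` are assumptions for illustration; they are not taken from the UNER dataset or from Prodigy's output.

```python
from typing import List, Tuple


def find_entity_spans(
    text: str, entities: List[Tuple[str, str]]
) -> List[Tuple[int, int, str]]:
    """Locate each (surface, label) pair in text and return (start, end, label).

    Offsets are character offsets into `text`, with `end` exclusive.
    Only the first occurrence of each surface string is used.
    """
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"entity {surface!r} not found in text")
        spans.append((start, start + len(surface), label))
    return sorted(spans)


# Hypothetical example sentence: "Imran Khan lives in Lahore."
text = "عمران خان لاہور میں رہتے ہیں"
entities = [("عمران خان", "PERSON"), ("لاہور", "LOCATION")]

for start, end, label in find_entity_spans(text, entities):
    # Slicing with the offsets recovers the annotated surface form.
    print(label, text[start:end])
```

Computing the offsets from the surface strings, rather than typing them by hand, avoids off-by-one mistakes that are easy to make in right-to-left text.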

How to build Word 2 Vector for Urdu language.

[Image: Vector representation of words]

Word to vector (w2v) is an important feature of a language model. It gives you insight into a language model and how effective that model is. There is no general-purpose w2v model for the language you want to use. You need to build one for your language, depending on the domain and the specific application. You can build a large w2v model, but it imposes a heavy performance cost and consumes a lot of resources in your applications. So you need to choose the dataset accordingly and build a w2v model for that specific application. For example, if you are building a model for news, choose a news dataset and then build a w2v model using that dataset. In this article, I'm going to build a w2v model for a freely available journalism dataset.

COUNTER (COrpus of Urdu News TExt Reuse): This dataset is collected from journalism and can be used for Urdu NLP research. Here is the link to the resource for more information. This ...

Urdu Word and Sentence Similarity using SpaCy

Similarity is a common measure of how close two words or sentences are to each other. There are multiple ways to find the similarity of two documents, and the one most commonly used in NLP is Cosine Similarity. Cosine Similarity is computed using vectors (word2vector) and tells you how close two vectors are in orientation, that is, the angle between them regardless of their magnitudes.

Some helpful links to understand the similarity concepts:
https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50
https://www.sciencedirect.com/topics/computer-science/cosine-similarity

The result mostly depends on the quality of the vectors of the documents. If you want to get better results, build a better word 2 vector model. To use the similarity feature of SpaCy, you need to build a language model (you can build a language model by following my article https://www.urdunlp.com/2019/08/how-to-build-urdu-language-model-in.html ). Here is how I've calculated the cosi...
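Independently of SpaCy, the measure itself is easy to state: the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula, not the SpaCy-based code from the article; the function name `cosine_similarity` and the toy vectors are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b.

    The result lies in [-1, 1]: 1 for the same direction, 0 for
    orthogonal vectors, -1 for opposite directions. If either vector
    has zero length the angle is undefined, and 0.0 is returned.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "word vectors" (made up for this example).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [-1.0, -2.0, -3.0]  # opposite direction to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Because only the direction matters, `v1` and `v2` score as maximally similar even though their magnitudes differ, which is why the quality of the underlying word vectors drives the quality of the result.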

How to build Urdu language model in SpaCy

[Image: Urdu alphabets]

SpaCy is one of the most commonly used NLP libraries for building NLP and chatbot apps. The Urdu language lacks open resources for building chatbot and NLP apps: most of the tools are proprietary or the data is licensed. Having added support for the Urdu language, I'm going to show you how to build an Urdu model which can be used for multiple applications such as word and sentence similarity, chatbots, knowledge bases, etc. Follow the steps to build the model.

Step 1: Build word frequencies for Urdu. I've created a script that can be used to build word frequencies. There are multiple resources available for building word frequencies; you can choose whichever you want, but the format should be like this:

frequency document_id word

Here is the script I'm using to build word frequencies for SpaCy.

from __future__ import unicode_literals
import string
import codecs
import glob
from collections import Counter
import re
import plac
from multiprocessing...
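This is not the script quoted above. As a minimal sketch of the general idea behind Step 1, the snippet below counts words across a few documents and emits one three-column row per word. Several things here are assumptions: whitespace tokenization, tab-separated columns, the function names `count_words` and `format_rows`, and the reading of the second column as the number of documents that contain the word.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_words(documents: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (word_freq, doc_freq) for whitespace-tokenized documents.

    word_freq counts every occurrence of a word; doc_freq counts the
    number of documents in which the word appears at least once.
    """
    word_freq: Counter = Counter()
    doc_freq: Counter = Counter()
    for doc in documents:
        tokens = doc.split()  # assumption: plain whitespace tokenization
        word_freq.update(tokens)
        doc_freq.update(set(tokens))
    return word_freq, doc_freq


def format_rows(word_freq: Counter, doc_freq: Counter) -> List[str]:
    """Format counts as tab-separated rows, most frequent word first."""
    return [
        f"{freq}\t{doc_freq[word]}\t{word}"
        for word, freq in word_freq.most_common()
    ]


# Two tiny hypothetical documents: "This is a book" / "This is a pen".
docs = ["یہ کتاب ہے", "یہ قلم ہے"]
word_freq, doc_freq = count_words(docs)
for row in format_rows(word_freq, doc_freq):
    print(row)
```

For a real corpus, whitespace splitting is only a rough approximation of Urdu tokenization, so a proper tokenizer would likely give cleaner counts.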