
Named Entity Recognition for Urdu

Urdu is a low-resource language compared to English: it lacks the datasets and tools needed for research and development in natural language processing, speech recognition, and other AI and ML problems. It took me a long time to build and improve a dataset for NLP tasks because the datasets that are publicly available are too small for machine learning. Most datasets are proprietary, which restricts researchers and developers. Fortunately, I've made POS and NER datasets publicly available on GitHub for research and development.

This article shows how to build an NER model in Python using the UNER dataset.

Install the necessary packages for training:
pip3 install numpy keras matplotlib scikit-learn
Then import the packages:
import ast
import codecs
import json
import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Dense, InputLayer, Embedding, Activation, LSTM
from keras.models import Sequential
from keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
Read the dataset and train the model. The dataset has been converted to the BILOU format, laid out like the POS data format; for more information, read my article on it.
tagged_sentences = codecs.open("../data/ner_dnn.txt", encoding="utf-8").readlines()

print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))

sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*ast.literal_eval(tagged_sentence))
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))

(train_sentences,
 test_sentences,
 train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)

words = get_words(sentences)
tags = get_tags(sentence_tags)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1

tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0

train_sentences_x = get_train_sentences_x(train_sentences, word2index)
test_sentences_x = get_test_sentences_x(test_sentences, word2index)

train_tags_y = get_train_tags_y(train_tags, tag2index)
test_tags_y = get_test_tags_y(test_tags, tag2index)

MAX_LENGTH = len(max(train_sentences_x, key=len))
# MAX_LENGTH = 181

train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH,)))
model.add(Embedding(len(word2index), 128))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(len(tag2index)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
history = model.fit(train_sentences_x, to_categorical(train_tags_y, len(tag2index)), batch_size=32, epochs=10,
                    validation_split=0.2).history
model.save("../models/lstm_ner.h5")
model.summary()
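The script above calls several helpers (`get_words`, `get_tags`, `get_train_sentences_x`, `get_test_sentences_x`, `get_train_tags_y`, `get_test_tags_y`) whose definitions do not appear in this post. Below is a minimal sketch of what they could look like, assuming each line of the data file is a Python literal list of `(word, tag)` pairs, which is what `ast.literal_eval` and `zip(*...)` expect. The function bodies, the shared `sentences_to_indices` and `tags_to_indices` names, and the sample line are assumptions for illustration, not the original implementation. They would need to be defined before the training code runs.

```python
import ast


def get_words(sentences):
    """Collect the vocabulary: every distinct word across all sentences."""
    return {word for sentence in sentences for word in sentence}


def get_tags(sentence_tags):
    """Collect every distinct tag across all sentences."""
    return {tag for tags in sentence_tags for tag in tags}


def sentences_to_indices(sentences, word2index):
    """Map words to integer ids; unseen words fall back to the -OOV- id."""
    oov = word2index['-OOV-']
    return [[word2index.get(word, oov) for word in sentence]
            for sentence in sentences]


def tags_to_indices(sentence_tags, tag2index):
    """Map tags to integer ids."""
    return [[tag2index[tag] for tag in tags] for tags in sentence_tags]


# Assumed definitions for the names used in the script: the train and test
# splits go through the same conversion.
get_train_sentences_x = get_test_sentences_x = sentences_to_indices
get_train_tags_y = get_test_tags_y = tags_to_indices

# Hypothetical sample line in the format the loader expects.
line = "[('ڈاکٹر', 'U-DESIGNATION'), ('رحیق', 'B-PERSON'), ('عباسی', 'L-PERSON')]"
sentence, tags = zip(*ast.literal_eval(line))

# sorted() is used here only to make the ids reproducible between runs.
word2index = {w: i + 2 for i, w in enumerate(sorted(get_words([sentence])))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1
tag2index = {t: i + 1 for i, t in enumerate(sorted(get_tags([tags])))}
tag2index['-PAD-'] = 0

# The second "sentence" holds a word outside the vocabulary.
sample_x = sentences_to_indices([sentence, ('نامعلوم',)], word2index)
sample_y = tags_to_indices([tags], tag2index)
print(sample_x)
print(sample_y)
```

If a line in the file is a plain sentence rather than a list of pairs, `ast.literal_eval` raises a `SyntaxError`, so it is worth printing the first line of the file to confirm the format before training.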
Test the accuracy of the model:
scores = model.evaluate(test_sentences_x, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")  # e.g. acc: 96.79990988064631
print(test_sentences[0])
print(test_tags[0])
test_samples = [
    test_sentences[0],
]

test_samples_x = []
for sentence in test_samples:
    sentence_index = []
    for word in sentence:
        try:
            sentence_index.append(word2index[word])
        except KeyError:
            sentence_index.append(word2index['-OOV-'])
    test_samples_x.append(sentence_index)

test_samples_X = pad_sequences(test_samples_x, maxlen=MAX_LENGTH, padding='post')

predictions = model.predict(test_samples_X)
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))
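`logits_to_tokens` is also not defined in this post. A plausible sketch, assuming it takes the `(samples, MAX_LENGTH, number_of_tags)` array returned by `model.predict` and picks the highest-scoring tag at each position, is shown below. The body and the small hand-made scores are assumptions used only to illustrate the idea.

```python
import numpy as np


def logits_to_tokens(sequences, index2tag):
    """Turn per-token class scores into tag strings by taking the argmax."""
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequences.append(
            [index2tag[int(np.argmax(scores))] for scores in categorical_sequence]
        )
    return token_sequences


# Hypothetical scores for one sentence of three positions and three tags.
index2tag = {0: '-PAD-', 1: 'O', 2: 'U-PERSON'}
fake_predictions = np.array([[
    [0.10, 0.80, 0.10],   # highest score at index 1 -> 'O'
    [0.10, 0.20, 0.70],   # highest score at index 2 -> 'U-PERSON'
    [0.90, 0.05, 0.05],   # highest score at index 0 -> '-PAD-'
]])

decoded = logits_to_tokens(fake_predictions, index2tag)
print(decoded)  # [['O', 'U-PERSON', '-PAD-']]

# Trim the padded tail by cutting each prediction to the real sentence length.
real_length = 2
trimmed = decoded[0][:real_length]
print(trimmed)  # ['O', 'U-PERSON']
```

Cutting each prediction to `len(sentence)` removes the trailing `-PAD-` entries and makes the output easier to compare with the original tags.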
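Here is the training output, with the accuracy and loss for each epoch, followed by the model summary. Note that this log comes from a run with 15 epochs, 256-unit embedding and LSTM layers, and an additional ignore_accuracy metric, so its settings differ slightly from the code above.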
Train on 1115 samples, validate on 279 samples
Epoch 1/15
1115/1115 [==============================] - 8s 7ms/step - loss: 1.1546 - acc: 0.7742 - ignore_accuracy: 0.6824 - val_loss: 0.4370 - val_acc: 0.9302 - val_ignore_accuracy: 0.8298
Epoch 2/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.3495 - acc: 0.9360 - ignore_accuracy: 0.8358 - val_loss: 0.3257 - val_acc: 0.9422 - val_ignore_accuracy: 0.8295
Epoch 3/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2865 - acc: 0.9525 - ignore_accuracy: 0.8588 - val_loss: 0.2899 - val_acc: 0.9528 - val_ignore_accuracy: 0.8450
Epoch 4/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2561 - acc: 0.9570 - ignore_accuracy: 0.8592 - val_loss: 0.2644 - val_acc: 0.9542 - val_ignore_accuracy: 0.8463
Epoch 5/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2280 - acc: 0.9582 - ignore_accuracy: 0.8581 - val_loss: 0.2371 - val_acc: 0.9542 - val_ignore_accuracy: 0.8475
Epoch 6/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1982 - acc: 0.9582 - ignore_accuracy: 0.8589 - val_loss: 0.2082 - val_acc: 0.9541 - val_ignore_accuracy: 0.8510
Epoch 7/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1669 - acc: 0.9592 - ignore_accuracy: 0.8676 - val_loss: 0.1852 - val_acc: 0.9550 - val_ignore_accuracy: 0.8569
Epoch 8/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1452 - acc: 0.9625 - ignore_accuracy: 0.8810 - val_loss: 0.1700 - val_acc: 0.9566 - val_ignore_accuracy: 0.8611
Epoch 9/15
1115/1115 [==============================] - 7s 6ms/step - loss: 0.1311 - acc: 0.9647 - ignore_accuracy: 0.8898 - val_loss: 0.1627 - val_acc: 0.9579 - val_ignore_accuracy: 0.8665
Epoch 10/15
1115/1115 [==============================] - 7s 6ms/step - loss: 0.1207 - acc: 0.9667 - ignore_accuracy: 0.8962 - val_loss: 0.1543 - val_acc: 0.9600 - val_ignore_accuracy: 0.8710
Epoch 11/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1103 - acc: 0.9695 - ignore_accuracy: 0.9054 - val_loss: 0.1501 - val_acc: 0.9620 - val_ignore_accuracy: 0.8762
Epoch 12/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1018 - acc: 0.9721 - ignore_accuracy: 0.9132 - val_loss: 0.1420 - val_acc: 0.9643 - val_ignore_accuracy: 0.8823
Epoch 13/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0927 - acc: 0.9749 - ignore_accuracy: 0.9213 - val_loss: 0.1382 - val_acc: 0.9659 - val_ignore_accuracy: 0.8854
Epoch 14/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0839 - acc: 0.9773 - ignore_accuracy: 0.9274 - val_loss: 0.1327 - val_acc: 0.9676 - val_ignore_accuracy: 0.8915
Epoch 15/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0743 - acc: 0.9802 - ignore_accuracy: 0.9358 - val_loss: 0.1281 - val_acc: 0.9702 - val_ignore_accuracy: 0.8981
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_7 (Embedding)      (None, 101, 256)          1234432   
_________________________________________________________________
lstm_7 (LSTM)                (None, 101, 256)          525312    
_________________________________________________________________
dense_7 (Dense)              (None, 101, 31)           7967      
_________________________________________________________________
activation_7 (Activation)    (None, 101, 31)           0         
=================================================================
Total params: 1,767,711
Trainable params: 1,767,711
Non-trainable params: 0
_________________________________________________________________
Original data from the test samples, followed by the test accuracy and the predicted tags:
['پاکستان' 'عوامی' 'تحریک' 'کے' 'قائد' 'ڈاکٹر' 'طاہرالقادری' 'کی' 'وطن'
 'واپسی' 'کے' 'انتظامات' 'کے' 'حوالے' 'سے' 'مرکزی' 'صدر' 'ڈاکٹر' 'رحیق'
 'عباسی' 'کی' 'زیرصدارت' 'خصوصی' 'اجلاس' 'ہوا' '۔']
['B-ORGANIZATION' 'I-ORGANIZATION' 'L-ORGANIZATION' 'O' 'O'
 'U-DESIGNATION' 'U-PERSON' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
 'U-DESIGNATION' 'B-PERSON' 'L-PERSON' 'O' 'O' 'O' 'O' 'O' 'O']
349/349 [==============================] - 1s 2ms/step
acc: 96.79990988064631
[['B-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION', 'O', 'O', 'U-DESIGNATION', 'U-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-DESIGNATION', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-']]
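In the BILOU scheme, `B-`, `I-`, and `L-` mark the beginning, inside, and last token of a multi-token entity, `U-` marks a single-token entity, and `O` marks tokens outside any entity. To read a tag sequence like the one above as entities, the tokens can be grouped into spans. The function below is a small sketch of that grouping; `bilou_to_entities` is a hypothetical helper, not part of the original code, and it silently drops malformed spans.

```python
def bilou_to_entities(words, tags):
    """Group BILOU-tagged tokens into (entity_text, label) pairs."""
    entities, current, current_label = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'U':
            # Single-token entity.
            entities.append((word, label))
            current, current_label = [], None
        elif prefix == 'B':
            # Start of a multi-token entity.
            current, current_label = [word], label
        elif prefix in ('I', 'L') and current and label == current_label:
            current.append(word)
            if prefix == 'L':
                # Last token: close the span.
                entities.append((' '.join(current), current_label))
                current, current_label = [], None
        else:
            # 'O', '-PAD-', or a malformed sequence: drop any open span.
            current, current_label = [], None
    return entities


# First seven tokens of the test sample shown above.
words = ['پاکستان', 'عوامی', 'تحریک', 'کے', 'قائد', 'ڈاکٹر', 'طاہرالقادری']
tags = ['B-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION', 'O', 'O',
        'U-DESIGNATION', 'U-PERSON', '-PAD-', '-PAD-']

entities = bilou_to_entities(words, tags)
for text, label in entities:
    print(label, text)
```

Because `zip` stops at the shorter input, passing the padded prediction together with the unpadded sentence ignores the `-PAD-` tail.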
Plot the model's accuracy and loss using the following code.
plt.plot(history['acc'])
plt.plot(history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')
plt.savefig("../results/accuracy_lstm_ner.png")
plt.clf()

# Plot training & validation loss values
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.savefig("../results/loss_lstm_ner.png")
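Depending on the Keras version, the history dictionary may store the accuracy under `acc` (as in the log in this post) or under `accuracy`. As far as I know, releases from 2.3 onward use the name passed to `metrics`. A small lookup helper avoids a `KeyError` in either case. `pick_metric` and the two sample dictionaries are hypothetical and only illustrate the idea.

```python
def pick_metric(history, *candidates):
    """Return the first metric series found under any of the candidate keys."""
    for key in candidates:
        if key in history:
            return history[key]
    raise KeyError(f"none of {candidates} found in {sorted(history)}")


# Hypothetical history dictionaries in the two naming styles.
old_style = {'acc': [0.77, 0.93], 'val_acc': [0.93, 0.94]}
new_style = {'accuracy': [0.77, 0.93], 'val_accuracy': [0.93, 0.94]}

train_acc = pick_metric(new_style, 'acc', 'accuracy')
val_acc = pick_metric(new_style, 'val_acc', 'val_accuracy')
print(train_acc, val_acc)
```

The plotting calls then become `plt.plot(pick_metric(history, 'acc', 'accuracy'))` and `plt.plot(pick_metric(history, 'val_acc', 'val_accuracy'))`.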
The NER model reaches about 96.8% accuracy on the test set (not bad).
[Figure: accuracy on the training and validation data]
[Figure: loss on the training and validation data]
The LSTM gives good results on the UNER dataset. You can tune the hyperparameters to improve the accuracy further.
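One caveat when reading these numbers: every sentence is padded to `MAX_LENGTH`, and the plain accuracy metric also counts the padded positions, which are easy to predict. In the training log above, `val_ignore_accuracy` ends at 0.8981 while `val_acc` is 0.9702, which points in the same direction. The sketch below computes accuracy over real tokens only, using NumPy on the model's predictions. `accuracy_ignoring_padding` is a hypothetical helper, and it is not necessarily how the `ignore_accuracy` metric in the log was computed.

```python
import numpy as np


def accuracy_ignoring_padding(y_true_ids, y_pred_probs, pad_id=0):
    """Token accuracy computed only where the true tag is not padding."""
    y_true_ids = np.asarray(y_true_ids)
    y_pred_ids = np.argmax(y_pred_probs, axis=-1)
    mask = y_true_ids != pad_id
    return float((y_pred_ids[mask] == y_true_ids[mask]).mean())


# Hypothetical example: one sentence with two real tokens and two pad slots.
y_true = np.array([[1, 2, 0, 0]])
predicted_ids = np.array([[1, 1, 0, 0]])   # second real token is wrong
y_probs = np.eye(3)[predicted_ids]         # one-hot scores, shape (1, 4, 3)

plain = float((np.argmax(y_probs, axis=-1) == y_true).mean())
masked = accuracy_ignoring_padding(y_true, y_probs)
print(plain)   # 0.75
print(masked)  # 0.5
```

With the variables from the training script, the call would be `accuracy_ignoring_padding(test_tags_y, model.predict(test_sentences_x))`.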

That's it. If you have any questions, feel free to ask.

Comments

  1. SyntaxError: invalid character in identifier
    I'm getting this error
    when I'm trying to load the dataset

  2. Hello Dear,

    I was trying to run your code but am facing the same issue, as shown below.
    برطانیہ کے سابق فوجی سربراہ کا کہنا ہے کہ فوجی حکام نے افغانستان میں دہشت گردی کے خلاف جنگ کی شدت کا غلط اندازہ لگایا تھا۔

    Tagged sentences: 1638
    File "", line 1
    برطانیہ کے سابق فوجی سربراہ کا کہنا ہے کہ فوجی حکام نے افغانستان میں دہشت گردی کے خلاف جنگ کی شدت کا غلط اندازہ لگایا تھا۔
    ^
    SyntaxError: invalid syntax

  3. Could you please share the correct code so that I can check it? I will cite your work in my paper as well.
    javaid.ciit@gmail.com

  4. Try to use UTF-8 in Python scripts.

  5. Can you give some idea about preprocessing of Urdu?

    Replies
    1. This is an active research area. There are some resources for stop words and lemmatization, but they are not good enough.

