Named Entity Recognition for Urdu

Urdu is a low-resource language compared to English, which is why it lacks research and development resources for natural language processing, speech recognition, and other AI and ML problems. It took me a long time to build a dataset and enhance it for NLP tasks because the datasets that are available are not sufficient for machine learning. Most of them are proprietary, which restricts researchers and developers. Fortunately, I've made POS and NER datasets publicly available on GitHub for research and development.

This article covers building an NER model on the UNER dataset using Python.

Install the necessary packages for training.

pip3 install numpy keras matplotlib scikit-learn

Then import the packages.

import ast
import codecs
import json
import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Dense, InputLayer, Embedding, Activation, LSTM
from keras.models import Sequential
from keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
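
The training code that follows calls a few helper functions (get_words, get_tags, get_train_sentences_x, get_test_sentences_x, get_train_tags_y, get_test_tags_y and logits_to_tokens) that this post does not define. The block below is a minimal sketch of what they might look like, inferred only from how they are called. The names encode_sentences and encode_tags are hypothetical, and the author's actual implementations may differ.

```python
def get_words(sentences):
    """Return the sorted vocabulary of all words seen in the sentences."""
    words = set()
    for sentence in sentences:
        words.update(sentence)
    # Sorting keeps the word -> index mapping stable between runs.
    return sorted(words)


def get_tags(sentence_tags):
    """Return the sorted list of all distinct tags."""
    tags = set()
    for tag_sequence in sentence_tags:
        tags.update(tag_sequence)
    return sorted(tags)


def encode_sentences(sentences, word2index):
    """Map every word to its integer id, using '-OOV-' for unknown words."""
    oov = word2index['-OOV-']
    return [[word2index.get(word, oov) for word in sentence]
            for sentence in sentences]


def encode_tags(sentence_tags, tag2index):
    """Map every tag to its integer id."""
    return [[tag2index[tag] for tag in tag_sequence]
            for tag_sequence in sentence_tags]


# Assumption: the train and test variants do the same thing.
get_train_sentences_x = encode_sentences
get_test_sentences_x = encode_sentences
get_train_tags_y = encode_tags
get_test_tags_y = encode_tags


def logits_to_tokens(sequences, index):
    """Turn per-token score vectors into tag names via the highest score."""
    token_sequences = []
    for sequence in sequences:
        tokens = []
        for scores in sequence:
            best = max(range(len(scores)), key=lambda i: scores[i])
            tokens.append(index[best])
        token_sequences.append(tokens)
    return token_sequences
```

The argmax in logits_to_tokens is written in plain Python so the sketch has no dependencies; np.argmax on each score vector would do the same job.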

Read the dataset and train the model. The dataset has been converted to BILOU format and is stored in the same layout as the POS data; for more information, read my article on that.
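
The loading code passes each line of the file to ast.literal_eval and unzips the result, so every line has to be a Python list literal of (word, tag) pairs. A line of raw Urdu text fails with a SyntaxError, which looks like the error reported in the comments on this post. The sketch below uses a hypothetical helper, parse_tagged_line, that returns None for such lines instead of raising. The sample line is made up for illustration from words and tags that appear in the output further down.

```python
import ast


def parse_tagged_line(line):
    """Return (words, tags) for one dataset line, or None if it is malformed."""
    # Drop a leading byte-order mark and surrounding whitespace.
    line = line.lstrip("\ufeff").strip()
    if not line:
        return None
    try:
        pairs = ast.literal_eval(line)
    except (SyntaxError, ValueError):
        # Raw text (not a Python literal) ends up here.
        return None
    if not isinstance(pairs, (list, tuple)) or not pairs:
        return None
    if not all(isinstance(p, (list, tuple)) and len(p) == 2 for p in pairs):
        return None
    words, tags = zip(*pairs)
    return list(words), list(tags)


sample = "[('طاہرالقادری', 'U-PERSON'), ('کی', 'O')]"
print(parse_tagged_line(sample))
print(parse_tagged_line("برطانیہ کے سابق فوجی سربراہ"))
```

Counting how many lines come back as None is a quick way to check that the file being loaded is the tagged ner_dnn.txt data and not plain sentences.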

tagged_sentences = codecs.open("../data/ner_dnn.txt", encoding="utf-8").readlines()

print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))

sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*ast.literal_eval(tagged_sentence))
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))

(train_sentences,
 test_sentences,
 train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)

words = get_words(sentences)
tags = get_tags(sentence_tags)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1

tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0

train_sentences_x = get_train_sentences_x(train_sentences, word2index)
test_sentences_x = get_test_sentences_x(test_sentences, word2index)

train_tags_y = get_train_tags_y(train_tags, tag2index)
test_tags_y = get_test_tags_y(test_tags, tag2index)

MAX_LENGTH = len(max(train_sentences_x, key=len))
# MAX_LENGTH = 181

train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH,)))
model.add(Embedding(len(word2index), 128))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(len(tag2index)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
history = model.fit(train_sentences_x, to_categorical(train_tags_y, len(tag2index)), batch_size=32, epochs=10,
                    validation_split=0.2).history
model.save("../models/lstm_ner.h5")
model.summary()

Evaluate the accuracy of the model on the test set and predict the tags for one test sentence.

scores = model.evaluate(test_sentences_x, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")  # acc: 98.39311069478103
print(test_sentences[0])
print(test_tags[0])
test_samples = [
    test_sentences[0],
]

test_samples_x = []
for sentence in test_samples:
    sentence_index = []
    for word in sentence:
        try:
            sentence_index.append(word2index[word])
        except KeyError:
            sentence_index.append(word2index['-OOV-'])
    test_samples_x.append(sentence_index)

test_samples_X = pad_sequences(test_samples_x, maxlen=MAX_LENGTH, padding='post')

predictions = model.predict(test_samples_X)
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))

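Here is the training log, the model summary, and the evaluation output. Note that this log comes from a run with different settings than the code above: 15 epochs instead of 10, 256-unit embedding and LSTM layers instead of 128, a maximum length of 101, and an extra ignore_accuracy metric that the code above does not define. The numbers therefore will not match a run of the code exactly.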

Train on 1115 samples, validate on 279 samples
Epoch 1/15
1115/1115 [==============================] - 8s 7ms/step - loss: 1.1546 - acc: 0.7742 - ignore_accuracy: 0.6824 - val_loss: 0.4370 - val_acc: 0.9302 - val_ignore_accuracy: 0.8298
Epoch 2/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.3495 - acc: 0.9360 - ignore_accuracy: 0.8358 - val_loss: 0.3257 - val_acc: 0.9422 - val_ignore_accuracy: 0.8295
Epoch 3/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2865 - acc: 0.9525 - ignore_accuracy: 0.8588 - val_loss: 0.2899 - val_acc: 0.9528 - val_ignore_accuracy: 0.8450
Epoch 4/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2561 - acc: 0.9570 - ignore_accuracy: 0.8592 - val_loss: 0.2644 - val_acc: 0.9542 - val_ignore_accuracy: 0.8463
Epoch 5/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2280 - acc: 0.9582 - ignore_accuracy: 0.8581 - val_loss: 0.2371 - val_acc: 0.9542 - val_ignore_accuracy: 0.8475
Epoch 6/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1982 - acc: 0.9582 - ignore_accuracy: 0.8589 - val_loss: 0.2082 - val_acc: 0.9541 - val_ignore_accuracy: 0.8510
Epoch 7/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1669 - acc: 0.9592 - ignore_accuracy: 0.8676 - val_loss: 0.1852 - val_acc: 0.9550 - val_ignore_accuracy: 0.8569
Epoch 8/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1452 - acc: 0.9625 - ignore_accuracy: 0.8810 - val_loss: 0.1700 - val_acc: 0.9566 - val_ignore_accuracy: 0.8611
Epoch 9/15
1115/1115 [==============================] - 7s 6ms/step - loss: 0.1311 - acc: 0.9647 - ignore_accuracy: 0.8898 - val_loss: 0.1627 - val_acc: 0.9579 - val_ignore_accuracy: 0.8665
Epoch 10/15
1115/1115 [==============================] - 7s 6ms/step - loss: 0.1207 - acc: 0.9667 - ignore_accuracy: 0.8962 - val_loss: 0.1543 - val_acc: 0.9600 - val_ignore_accuracy: 0.8710
Epoch 11/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1103 - acc: 0.9695 - ignore_accuracy: 0.9054 - val_loss: 0.1501 - val_acc: 0.9620 - val_ignore_accuracy: 0.8762
Epoch 12/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1018 - acc: 0.9721 - ignore_accuracy: 0.9132 - val_loss: 0.1420 - val_acc: 0.9643 - val_ignore_accuracy: 0.8823
Epoch 13/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0927 - acc: 0.9749 - ignore_accuracy: 0.9213 - val_loss: 0.1382 - val_acc: 0.9659 - val_ignore_accuracy: 0.8854
Epoch 14/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0839 - acc: 0.9773 - ignore_accuracy: 0.9274 - val_loss: 0.1327 - val_acc: 0.9676 - val_ignore_accuracy: 0.8915
Epoch 15/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0743 - acc: 0.9802 - ignore_accuracy: 0.9358 - val_loss: 0.1281 - val_acc: 0.9702 - val_ignore_accuracy: 0.8981
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_7 (Embedding)      (None, 101, 256)          1234432   
_________________________________________________________________
lstm_7 (LSTM)                (None, 101, 256)          525312    
_________________________________________________________________
dense_7 (Dense)              (None, 101, 31)           7967      
_________________________________________________________________
activation_7 (Activation)    (None, 101, 31)           0         
=================================================================
Total params: 1,767,711
Trainable params: 1,767,711
Non-trainable params: 0
_________________________________________________________________
Original data from test samples
['پاکستان' 'عوامی' 'تحریک' 'کے' 'قائد' 'ڈاکٹر' 'طاہرالقادری' 'کی' 'وطن'
 'واپسی' 'کے' 'انتظامات' 'کے' 'حوالے' 'سے' 'مرکزی' 'صدر' 'ڈاکٹر' 'رحیق'
 'عباسی' 'کی' 'زیرصدارت' 'خصوصی' 'اجلاس' 'ہوا' '۔']
['B-ORGANIZATION' 'I-ORGANIZATION' 'L-ORGANIZATION' 'O' 'O'
 'U-DESIGNATION' 'U-PERSON' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
 'U-DESIGNATION' 'B-PERSON' 'L-PERSON' 'O' 'O' 'O' 'O' 'O' 'O']
349/349 [==============================] - 1s 2ms/step
acc: 96.79990988064631
[['B-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION', 'O', 'O', 'U-DESIGNATION', 'U-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-DESIGNATION', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-']]
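
The predicted tags above are per-token BILOU labels, still padded to the maximum length. A minimal sketch of turning them into entity spans follows, assuming the usual BILOU convention (B = begin, I = inside, L = last, O = outside, U = unit-length entity). bilou_to_entities is a hypothetical helper, not part of the original code.

```python
def bilou_to_entities(words, tags):
    """Group BILOU-tagged words into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None
    # zip stops at the shorter input, so trailing '-PAD-' tags are ignored.
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "U" and entity_type:
            entities.append((word, entity_type))
            current_words, current_type = [], None
        elif prefix == "B" and entity_type:
            current_words, current_type = [word], entity_type
        elif prefix in ("I", "L") and current_words and entity_type == current_type:
            current_words.append(word)
            if prefix == "L":
                entities.append((" ".join(current_words), current_type))
                current_words, current_type = [], None
        else:
            # 'O', '-PAD-', or an inconsistent tag: drop any open span.
            current_words, current_type = [], None
    return entities


words = ['پاکستان', 'عوامی', 'تحریک', 'کے', 'قائد', 'ڈاکٹر', 'طاہرالقادری']
tags = ['B-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION', 'O', 'O',
        'U-DESIGNATION', 'U-PERSON', '-PAD-', '-PAD-']
for text, entity_type in bilou_to_entities(words, tags):
    print(entity_type, text)
```

With the model in this post, words would be test_sentences[0] and tags the first list returned by logits_to_tokens.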

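Plot the training history using the following code.

# Plot training & validation accuracy values
# (newer Keras versions store these under 'accuracy' and 'val_accuracy')
plt.plot(history['acc'])
plt.plot(history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='lower right')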
plt.savefig("../results/accuracy_lstm_ner.png")
plt.clf()

# Plot training & validation loss values
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.savefig("../results/loss_lstm_ner.png")

The accuracy of the NER model on the test set is about 96.8% (not bad).

Figure: accuracy on the training and validation data.

Figure: loss on the training and validation data.

The LSTM gives good results on the UNER dataset. You can tune the hyperparameters to improve the accuracy further.

That's it. If you have any questions, feel free to ask.
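
One caveat when reading these numbers: the evaluation counts every position in the padded sequences, including '-PAD-', and in the sample above only 26 of 101 positions are real words. The training log also lists an ignore_accuracy value that is consistently lower than acc, which suggests it excludes padding, although the post does not show how it was computed. A minimal sketch of such a measure follows, assuming padding has tag id 0 as in tag2index. accuracy_ignoring_padding is a hypothetical name.

```python
def accuracy_ignoring_padding(true_ids, predicted_ids, pad_id=0):
    """Token-level accuracy over positions whose true tag is not padding."""
    correct = 0
    total = 0
    for true_sequence, predicted_sequence in zip(true_ids, predicted_ids):
        for true_tag, predicted_tag in zip(true_sequence, predicted_sequence):
            if true_tag == pad_id:
                continue  # padded position: do not count it
            total += 1
            if true_tag == predicted_tag:
                correct += 1
    return correct / total if total else 0.0


# Toy example: 2 real tokens, 2 padded positions, 1 mistake.
true_ids = [[1, 2, 0, 0]]
predicted_ids = [[1, 3, 0, 0]]
print(accuracy_ignoring_padding(true_ids, predicted_ids))
```

In the toy example, plain accuracy would be 3 out of 4 while the padding-aware value is 1 out of 2. For the model in this post, true_ids would be test_tags_y and predicted_ids the result of model.predict(test_sentences_x).argmax(axis=-1).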


Comments

Anonymous, 20 May 2020 at 18:02
SyntaxError: invalid character in identifier
I'm getting this error when I try to load the dataset.

Javaid Iqbal, 10 June 2020 at 02:14
Hello,

I was trying to run your code but I'm facing the same issue, as shown below.
برطانیہ کے سابق فوجی سربراہ کا کہنا ہے کہ فوجی حکام نے افغانستان میں دہشت گردی کے خلاف جنگ کی شدت کا غلط اندازہ لگایا تھا۔

Tagged sentences: 1638
File "", line 1
برطانیہ کے سابق فوجی سربراہ کا کہنا ہے کہ فوجی حکام نے افغانستان میں دہشت گردی کے خلاف جنگ کی شدت کا غلط اندازہ لگایا تھا۔
^
SyntaxError: invalid syntax

Javaid Iqbal, 10 June 2020 at 02:15
Could you please share the correct code so that I can run it? I will cite your work in my paper as well.
javaid.ciit@gmail.com

Muhammad Irfan, 22 October 2020 at 21:04
Try to use UTF-8 in Python scripts.

Anonymous, 14 October 2022 at 20:27
Can you give some idea about preprocessing of Urdu?
