Urdu is a low-resource language compared to English: it lacks resources for research and development in natural language processing, speech recognition, and other AI and ML problems. It took me a long time to build a dataset and enhance it for NLP tasks, because the datasets that are publicly available are not large enough for machine learning. Most datasets are proprietary, which restricts researchers and developers. Fortunately, I've made POS and NER datasets publicly available on GitHub for research and development.
This article covers building an NER model on the UNER dataset using Python.
Install the necessary packages for training.
pip3 install numpy keras matplotlib scikit-learn
Then import the packages.
import ast
import codecs
import json
import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Dense, InputLayer, Embedding, Activation, LSTM
from keras.models import Sequential
from keras.optimizers import Adam
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical  # used below to one-hot encode the tags
from sklearn.model_selection import train_test_split
Read the dataset (it has been converted to BILOU format, like the POS data format; for more information, read my article) and train the model.
tagged_sentences = codecs.open("../data/ner_dnn.txt", encoding="utf-8").readlines()
print("Tagged sentences: ", len(tagged_sentences))
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    # Each line is a Python list literal of (word, tag) pairs
    sentence, tags = zip(*ast.literal_eval(tagged_sentence))
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))
(train_sentences, test_sentences, train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)
words = get_words(sentences)
tags = get_tags(sentence_tags)
word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1
tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0
train_sentences_x = get_train_sentences_x(train_sentences, word2index)
test_sentences_x = get_test_sentences_x(test_sentences, word2index)
train_tags_y = get_train_tags_y(train_tags, tag2index)
test_tags_y = get_test_tags_y(test_tags, tag2index)
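The helper functions called above (get_words, get_tags, and the get_..._sentences_x / get_..._tags_y functions) are not defined in this article. Below is a minimal sketch of what they might look like, assuming each sentence is a sequence of word strings and each tag sequence is a sequence of tag strings. The names encode_sentences and encode_tags are hypothetical stand-ins for the separate train and test helpers, and the demo tokens are made up rather than taken from the UNER data.

```python
def get_words(sentences):
    """Collect the set of distinct words across all sentences."""
    words = set()
    for sentence in sentences:
        words.update(sentence)
    return words


def get_tags(sentence_tags):
    """Collect the set of distinct tags across all tag sequences."""
    tags = set()
    for tag_sequence in sentence_tags:
        tags.update(tag_sequence)
    return tags


def encode_sentences(sentences, word2index):
    """Map every word to its integer id, falling back to the -OOV- id."""
    oov = word2index['-OOV-']
    return [[word2index.get(word, oov) for word in sentence]
            for sentence in sentences]


def encode_tags(sentence_tags, tag2index):
    """Map every tag to its integer id."""
    return [[tag2index[tag] for tag in tag_sequence]
            for tag_sequence in sentence_tags]


# Tiny illustrative example with made-up tokens.
demo_sentences = [("a", "b"), ("b", "c")]
demo_tags = [("O", "U-PERSON"), ("U-PERSON", "O")]

# sorted() is used here only to make the demo ids deterministic.
demo_word2index = {w: i + 2 for i, w in enumerate(sorted(get_words(demo_sentences)))}
demo_word2index['-PAD-'] = 0
demo_word2index['-OOV-'] = 1
demo_tag2index = {t: i + 1 for i, t in enumerate(sorted(get_tags(demo_tags)))}
demo_tag2index['-PAD-'] = 0

encoded_x = encode_sentences([("a", "zzz")], demo_word2index)  # "zzz" is unseen
encoded_y = encode_tags(demo_tags, demo_tag2index)
```

Reserving id 0 for -PAD- and id 1 for -OOV- matches the dictionaries built above, so unseen test words map to a single shared index instead of raising a KeyError.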
MAX_LENGTH = len(max(train_sentences_x, key=len))
# MAX_LENGTH = 181
train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')
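To make the padding step concrete, here is a small pure-Python sketch of what padding='post' does. It is an illustration, not the Keras implementation: pad_post is a hypothetical name, and it mimics the Keras default of dropping items from the start of sequences longer than maxlen (truncating='pre'). That default matters here because MAX_LENGTH is computed from the training sentences only, so a longer test sentence would be cut.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right with `value` up to `maxlen` items.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


demo_padded = pad_post([[5, 6], [7, 8, 9, 10]], maxlen=3)
print(demo_padded)
```

The padding value 0 is the id reserved for -PAD- in both word2index and tag2index, which is why the predicted tag sequences later end in long runs of -PAD-.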
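model = Sequential()
model.add(Embedding(len(word2index), 128))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(len(tag2index)))  # one score per tag at every time step
model.add(Activation('softmax'))
# The compile call is missing from this listing; the loss, optimizer, and metric below are assumptions.
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
history = model.fit(train_sentences_x, to_categorical(train_tags_y, len(tag2index)), batch_size=32, epochs=10, validation_split=0.2)  # validation_split is assumed from the 1115/279 split in the log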
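The to_categorical call turns each integer tag id into a one-hot vector, which is the target format a softmax output layer trained with categorical cross-entropy expects. A minimal pure-Python sketch of that transformation, with the hypothetical name one_hot and made-up ids:

```python
def one_hot(tag_id_sequences, num_classes):
    """Replace every tag id with a one-hot list of length num_classes."""
    return [
        [[1 if k == tag_id else 0 for k in range(num_classes)] for tag_id in seq]
        for seq in tag_id_sequences
    ]


# One sentence with tag ids 1, 2 and a padded position (0), three classes.
demo_one_hot = one_hot([[1, 2, 0]], num_classes=3)
print(demo_one_hot)
```

The result has shape (sentences, MAX_LENGTH, number of tags), matching the per-token output of the model.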
Test accuracy of the model
scores = model.evaluate(test_sentences_x, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")  # acc: 98.39311069478103
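test_samples = [
    # tokenized Urdu sentences go here (the sample list is not shown in this listing)
]
test_samples_x = []
for sentence in test_samples:
    sentence_index = []
    for word in sentence:
        try:
            sentence_index.append(word2index[word])
        except KeyError:
            # Words not seen during training map to the -OOV- index
            sentence_index.append(word2index['-OOV-'])
    test_samples_x.append(sentence_index)
test_samples_X = pad_sequences(test_samples_x, maxlen=MAX_LENGTH, padding='post')
predictions = model.predict(test_samples_X)
print(logits_to_tokens(predictions, {i: t for t, i in tag2index.items()}))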
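The logits_to_tokens function is called above but not defined in this article. A plausible sketch, assuming it picks the highest-scoring tag at each position and maps the id back to its tag name; the demo scores and the three-tag index are invented for illustration:

```python
def logits_to_tokens(sequences, index):
    """Convert per-token score vectors into tag names via argmax.

    sequences: iterable of sentences, each a sequence of score vectors.
    index: dict mapping tag id -> tag name.
    """
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for scores in categorical_sequence:
            # Position of the largest score is the predicted tag id
            best = max(range(len(scores)), key=lambda k: scores[k])
            token_sequence.append(index[best])
        token_sequences.append(token_sequence)
    return token_sequences


demo_index = {0: '-PAD-', 1: 'O', 2: 'U-PERSON'}
demo_predictions = [[[0.1, 0.7, 0.2], [0.05, 0.15, 0.8], [0.9, 0.05, 0.05]]]
demo_tokens = logits_to_tokens(demo_predictions, demo_index)
print(demo_tokens)
```

Because the model predicts a tag for every padded position as well, the decoded output ends with -PAD- entries up to MAX_LENGTH.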
Here is the output of the accuracy and loss of the model.
Train on 1115 samples, validate on 279 samples
Epoch 1/15
1115/1115 [==============================] - 8s 7ms/step - loss: 1.1546 - acc: 0.7742 - ignore_accuracy: 0.6824 - val_loss: 0.4370 - val_acc: 0.9302 - val_ignore_accuracy: 0.8298
Epoch 2/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.3495 - acc: 0.9360 - ignore_accuracy: 0.8358 - val_loss: 0.3257 - val_acc: 0.9422 - val_ignore_accuracy: 0.8295
Epoch 3/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2865 - acc: 0.9525 - ignore_accuracy: 0.8588 - val_loss: 0.2899 - val_acc: 0.9528 - val_ignore_accuracy: 0.8450
Epoch 4/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2561 - acc: 0.9570 - ignore_accuracy: 0.8592 - val_loss: 0.2644 - val_acc: 0.9542 - val_ignore_accuracy: 0.8463
Epoch 5/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.2280 - acc: 0.9582 - ignore_accuracy: 0.8581 - val_loss: 0.2371 - val_acc: 0.9542 - val_ignore_accuracy: 0.8475
Epoch 6/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1982 - acc: 0.9582 - ignore_accuracy: 0.8589 - val_loss: 0.2082 - val_acc: 0.9541 - val_ignore_accuracy: 0.8510
Epoch 7/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1669 - acc: 0.9592 - ignore_accuracy: 0.8676 - val_loss: 0.1852 - val_acc: 0.9550 - val_ignore_accuracy: 0.8569
Epoch 8/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1452 - acc: 0.9625 - ignore_accuracy: 0.8810 - val_loss: 0.1700 - val_acc: 0.9566 - val_ignore_accuracy: 0.8611
Epoch 9/15
1115/1115 [==============================] - 7s 6ms/step - loss: 0.1311 - acc: 0.9647 - ignore_accuracy: 0.8898 - val_loss: 0.1627 - val_acc: 0.9579 - val_ignore_accuracy: 0.8665
Epoch 10/15
1115/1115 [==============================] - 7s 6ms/step - loss: 0.1207 - acc: 0.9667 - ignore_accuracy: 0.8962 - val_loss: 0.1543 - val_acc: 0.9600 - val_ignore_accuracy: 0.8710
Epoch 11/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1103 - acc: 0.9695 - ignore_accuracy: 0.9054 - val_loss: 0.1501 - val_acc: 0.9620 - val_ignore_accuracy: 0.8762
Epoch 12/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.1018 - acc: 0.9721 - ignore_accuracy: 0.9132 - val_loss: 0.1420 - val_acc: 0.9643 - val_ignore_accuracy: 0.8823
Epoch 13/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0927 - acc: 0.9749 - ignore_accuracy: 0.9213 - val_loss: 0.1382 - val_acc: 0.9659 - val_ignore_accuracy: 0.8854
Epoch 14/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0839 - acc: 0.9773 - ignore_accuracy: 0.9274 - val_loss: 0.1327 - val_acc: 0.9676 - val_ignore_accuracy: 0.8915
Epoch 15/15
1115/1115 [==============================] - 6s 6ms/step - loss: 0.0743 - acc: 0.9802 - ignore_accuracy: 0.9358 - val_loss: 0.1281 - val_acc: 0.9702 - val_ignore_accuracy: 0.8981
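The log reports an ignore_accuracy metric alongside acc, but its definition is not given in this article. Presumably it is an accuracy that skips padded positions, which would explain why it is lower than acc: correctly predicting -PAD- on the padded tail of every sentence inflates plain accuracy. A minimal sketch of that idea on integer tag ids, with hypothetical function names and made-up ids:

```python
def plain_accuracy(true_ids, predicted_ids):
    """Fraction of positions where the prediction matches, padding included."""
    pairs = [(t, p) for ts, ps in zip(true_ids, predicted_ids) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0


def accuracy_ignoring_padding(true_ids, predicted_ids, pad_id=0):
    """Same as plain_accuracy, but positions whose true tag is padding are skipped."""
    pairs = [(t, p) for ts, ps in zip(true_ids, predicted_ids)
             for t, p in zip(ts, ps) if t != pad_id]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0


# One sentence: three real tokens followed by three padded positions.
demo_true = [[1, 2, 1, 0, 0, 0]]
demo_pred = [[1, 1, 1, 0, 0, 0]]  # one mistake on a real token

demo_plain = plain_accuracy(demo_true, demo_pred)
demo_ignoring = accuracy_ignoring_padding(demo_true, demo_pred)
print(demo_plain, demo_ignoring)
```

In this toy case the single mistake costs one sixth of plain accuracy but one third of the padding-aware accuracy, so the second number is the more honest measure of tagging quality.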
Layer (type)                 Output Shape              Param #
=================================================================
embedding_7 (Embedding)      (None, 101, 256)          1234432
lstm_7 (LSTM)                (None, 101, 256)          525312
dense_7 (Dense)              (None, 101, 31)           7967
activation_7 (Activation)    (None, 101, 31)           0
=================================================================
Total params: 1,767,711
Trainable params: 1,767,711
Non-trainable params: 0
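The parameter counts in the summary can be checked by hand with the standard formulas for embedding, LSTM, and dense layers. Note that the summary reports 256-dimensional layers and a sequence length of 101, which differs from the 128 used in the code listing above; the arithmetic below uses the summary's values. The vocabulary size is not stated in the article and is inferred here from the embedding parameter count.

```python
embedding_dim = 256        # from the summary's output shape
lstm_units = 256           # from the summary's output shape
num_tags = 31              # from the Dense output shape
embedding_params = 1234432  # reported by the summary

# Embedding: one vector of size embedding_dim per vocabulary entry
vocab_size = embedding_params // embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: weight matrix plus one bias per tag
dense_params = lstm_units * num_tags + num_tags

total_params = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total_params)
```

The computed values agree with the LSTM, Dense, and total rows of the summary, and imply a vocabulary of 4,822 entries including the -PAD- and -OOV- slots.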
Original data from test samples
['پاکستان' 'عوامی' 'تحریک' 'کے' 'قائد' 'ڈاکٹر' 'طاہرالقادری' 'کی' 'وطن'
'واپسی' 'کے' 'انتظامات' 'کے' 'حوالے' 'سے' 'مرکزی' 'صدر' 'ڈاکٹر' 'رحیق'
'عباسی' 'کی' 'زیرصدارت' 'خصوصی' 'اجلاس' 'ہوا' '۔']
'U-DESIGNATION' 'U-PERSON' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'
349/349 [==============================] - 1s 2ms/step
acc: 96.79990988064631
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-DESIGNATION', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-', '-PAD-']]
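Plot the model using the following code.
# Plot training & validation accuracy values
# (the keys follow the log above; newer Keras versions may use 'accuracy' and 'val_accuracy')
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.legend(['Train', 'Test'], loc='lower right')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.legend(['Train', 'Test'], loc='upper right')
plt.show()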
The accuracy of the NER model on the test set is about 96.8% (not bad).
Figure: accuracy for the train and test datasets.
Figure: loss for the training and test datasets.
LSTM gives us good results for the UNER dataset. You can tune the hyperparameters to improve accuracy further.
That's it. If you have any questions, feel free to ask.
SyntaxError: invalid character in identifier
I'm getting this error
when I'm trying to load the dataset
Hello Dear,
I was trying to check your code but am facing the same issue, as shown below.
برطانیہ کے سابق فوجی سربراہ کا کہنا ہے کہ فوجی حکام نے افغانستان میں دہشت گردی کے خلاف جنگ کی شدت کا غلط اندازہ لگایا تھا۔
Tagged sentences: 1638
File "", line 1
برطانیہ کے سابق فوجی سربراہ کا کہنا ہے کہ فوجی حکام نے افغانستان میں دہشت گردی کے خلاف جنگ کی شدت کا غلط اندازہ لگایا تھا۔
SyntaxError: invalid syntax
Could you please share the correct code so that I can check it? I will cite your work in the paper as well.
Try to use UTF-8 in Python scripts.
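For readers hitting the SyntaxError above: besides saving scripts as UTF-8, one possible cause (an assumption, the thread does not confirm it) is that the file being read contains raw sentences rather than one Python list literal of (word, tag) pairs per line, which ast.literal_eval cannot parse. A defensive sketch that skips such lines and reports their line numbers; parse_tagged_lines is a hypothetical helper and the demo lines are made up:

```python
import ast


def parse_tagged_lines(lines):
    """Parse lines that are list literals of (word, tag) pairs.

    Returns (parsed, skipped): the parsed lists, and the 1-based numbers
    of non-empty lines that could not be parsed.
    """
    parsed, skipped = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            pairs = ast.literal_eval(line)
        except (SyntaxError, ValueError):
            skipped.append(line_number)
            continue
        parsed.append(pairs)
    return parsed, skipped


demo_lines = [
    "[('a', 'O'), ('b', 'U-PERSON')]\n",
    "plain text that is not a list literal\n",
]
parsed, skipped = parse_tagged_lines(demo_lines)
print(parsed, skipped)
```

With a real file, the lines would come from open(path, encoding="utf-8"), and a non-empty skipped list is a hint that the wrong data file is being loaded.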
Can you give some idea about preprocessing of Urdu?
This is an active research area. There are some resources for stop words and lemmatization, but they are not good enough.