Urdu is less well supported than English in natural language processing (NLP) applications. Part-of-speech (POS) tagging is one of the simplest and most common NLP tasks, but datasets for training Urdu POS taggers are scarce. Several POS tagsets, such as Muaz’s Tagset and Sajjad’s Tagset, are available in the literature. Because datasets are either unavailable or restricted in use, much Urdu NLP work is still in progress.
I’ve used Keras to build the MLP model for POS tagging. The data is in tab-separated form and is converted to sentences and tags using utility functions.
The utility functions used to build the model are listed at the end of the article. With them, you can build a POS MLP model for Urdu as well as Arabic by following this article.
I’ve developed a dataset for training a POS tagger for the Urdu language. It is available on GitHub. It is a small dataset, but more than enough to train the POS tagger. It was built using Sajjad’s Tagset because this tagset covers all the words in Urdu literature and has 39 tags. Here are some examples of this tagset.
Here is the format of data.txt
[('اشتتیاق', 'NN'), ('اور', 'CC'), ('ملائکہ', 'NN'), ('ہی', 'I'), ('ببانگِ', 'NN'), ('دہل', 'PN'), ('موجود', 'ADJ'), ('ہیں', 'VB'), ('اس', 'PD'), ('وقت', 'NN'), ('تو', 'I'), ('۔', 'SM')]
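As a minimal sketch of how one such line can be turned into parallel word and tag sequences, the snippet below parses a shortened copy of the sample line (its first three tokens only). The variable names are illustrative and not part of the original code.

```python
import ast

# A shortened copy of the sample line above (first three tokens only).
line = "[('اشتتیاق', 'NN'), ('اور', 'CC'), ('ملائکہ', 'NN')]"

# literal_eval safely parses the Python literal into a list of (word, tag) tuples.
pairs = ast.literal_eval(line)

# zip(*pairs) transposes the pairs into one tuple of words and one tuple of tags.
words, tags = zip(*pairs)

print(words)
print(tags)  # ('NN', 'CC', 'NN')
```

Each line of the file is therefore one sentence, and the word and tag tuples always have the same length.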
Here is the Kaggle link (https://www.kaggle.com/mirfan899/data-lstm) to download the dataset used in this tutorial.
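import ast
import codecs
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, Dense, Activation
from keras.optimizers import Adam
# In newer Keras releases this import is: from keras.utils import pad_sequences
from keras.preprocessing.sequence import pad_sequences

# Each line of the file holds one sentence as a list of (word, tag) tuples.
with codecs.open("../data/pos.txt", encoding="utf-8") as f:
    tagged_sentences = f.readlines()
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))

sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    # Parse the line, then separate the words from their tags.
    sentence, tags = zip(*ast.literal_eval(tagged_sentence))
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))

(train_sentences,
 test_sentences,
 train_tags,
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)

# get_words and get_tags are utility functions listed later in the article.
words = get_words(train_sentences)
tags = get_tags(train_tags)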
The code above reads the dataset and splits it into train and test sets.
Word and tag indexes are built next. Index 0 is reserved for padding, so that sentences of different lengths can be brought to a common length, and index 1 is reserved for out-of-vocabulary (OOV) words.
word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0
word2index['-OOV-'] = 1
tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0
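To make the role of the two reserved indexes concrete, here is a small sketch that encodes a sentence containing one unseen word. The toy vocabulary and the `encode` helper are hypothetical and only for illustration; the real vocabulary comes from `get_words(train_sentences)`.

```python
# Hypothetical toy vocabulary; a list is used here (rather than a set) so the
# resulting indexes are reproducible from run to run.
toy_words = ["اور", "وقت", "ہیں"]

word2index = {w: i + 2 for i, w in enumerate(toy_words)}
word2index["-PAD-"] = 0  # reserved for padding
word2index["-OOV-"] = 1  # reserved for words unseen in training

def encode(sentence, word2index):
    """Map each word to its index, falling back to the -OOV- index."""
    oov = word2index["-OOV-"]
    return [word2index.get(word, oov) for word in sentence]

# The middle word is not in the toy vocabulary, so it maps to -OOV-.
sentence = [toy_words[1], "موجود", toy_words[2]]
encoded = encode(sentence, word2index)
print(encoded)  # [3, 1, 4]
```

Because real word indexes start at 2, a 0 or 1 in an encoded sentence can only mean padding or an unknown word.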
The train and test sentences and their tags are then converted to sequences of integer indexes for use in the MLP model:
train_sentences_x = get_train_sentences_x(train_sentences, word2index)
test_sentences_x = get_test_sentences_x(test_sentences, word2index)
train_tags_y = get_train_tags_y(train_tags, tag2index)
test_tags_y = get_test_tags_y(test_tags, tag2index)
Finally, padding is added to the train and test sets so that every sequence has the same length before it is fed to the model.
MAX_LENGTH = len(max(train_sentences_x, key=len))
train_sentences_x = pad_sequences(train_sentences_x, maxlen=MAX_LENGTH, padding='post')
test_sentences_x = pad_sequences(test_sentences_x, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')
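The effect of `padding='post'` can be illustrated without Keras. The `pad_post` helper below is a hypothetical pure-Python stand-in, not the Keras implementation. It also mimics the `pad_sequences` default of `truncating='pre'`, which matters because `MAX_LENGTH` is computed from the training set only, so a longer test sentence loses tokens from its start.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items,
    mirroring the truncating='pre' default. Assumes maxlen > 0.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# One short sequence and one that is longer than maxlen.
result = pad_post([[5, 3], [7, 2, 9, 4]], maxlen=3)
print(result)  # [[5, 3, 0], [2, 9, 4]]
```

The padding value 0 matches the `-PAD-` entry in both `word2index` and `tag2index`.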
Here is the MLP architecture:
model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH,)))
model.add(Embedding(len(word2index), 128))
model.add(Dense(128))
model.add(Dense(len(tag2index)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
history = model.fit(train_sentences_x, to_categorical(train_tags_y, len(tag2index)),
                    batch_size=32, epochs=10, validation_split=0.2).history
model.save("../models/mlp.h5")
scores = model.evaluate(test_sentences_x, to_categorical(test_tags_y, len(tag2index)))
model.summary()
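As a rough cross-check of what `model.summary()` reports, the parameter count of each layer can be worked out by hand. In this architecture the Dense layers act on the last axis, so the model outputs one tag distribution per token, with shape (batch, MAX_LENGTH, len(tag2index)). The sketch below uses assumed sizes: a vocabulary of 10,000 entries is hypothetical, and 40 tags assumes all 39 tags plus `-PAD-` occur in the training split.

```python
def mlp_parameter_counts(vocab_size, n_tags, embed_dim=128, hidden=128):
    """Trainable parameters per layer for Embedding -> Dense -> Dense."""
    return {
        "embedding": vocab_size * embed_dim,          # one vector per word index
        "dense_hidden": embed_dim * hidden + hidden,  # weights + biases
        "dense_output": hidden * n_tags + n_tags,     # weights + biases
    }

# Assumed sizes for illustration only; the real values are
# len(word2index) and len(tag2index).
counts = mlp_parameter_counts(vocab_size=10_000, n_tags=40)
total = sum(counts.values())
print(counts)
print(total)  # 1301672
```

With sizes like these, almost all parameters sit in the embedding layer, so the vocabulary size dominates the model size.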
The model obtains 98% accuracy on the training set and 97% accuracy on the test set.
Figure: Loss for the train and test datasets.
Evaluation on the test dataset shows 97% accuracy.
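One caveat when reading these numbers: the Keras `accuracy` metric here is computed over every position in the padded sequences, so correctly predicted `-PAD-` positions count as hits. The sketch below shows one possible way to measure accuracy over real tokens only. The function name and the toy index sequences are hypothetical and are not taken from the original experiment.

```python
def accuracy_ignoring_padding(true_ids, pred_ids, pad_id=0):
    """Fraction of non-padding positions where the predicted tag is correct."""
    correct = total = 0
    for true_seq, pred_seq in zip(true_ids, pred_ids):
        for t, p in zip(true_seq, pred_seq):
            if t == pad_id:
                continue  # skip padded positions entirely
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# Toy example: three real tokens followed by three padding positions.
true_ids = [[3, 5, 2, 0, 0, 0]]
pred_ids = [[3, 4, 2, 0, 0, 0]]

masked = accuracy_ignoring_padding(true_ids, pred_ids)
print(round(masked, 3))  # 0.667
```

In this toy case the unmasked accuracy would be 5/6, while the masked accuracy is 2/3, which shows how padding can inflate the reported figure when sentences are much shorter than `MAX_LENGTH`.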
def logits_to_tokens(sequences, index):
    # Convert per-token probability vectors back to tokens via argmax.
    token_sequences = []
    for categorical_sequence in sequences:
        token_sequence = []
        for categorical in categorical_sequence:
            token_sequence.append(index[np.argmax(categorical)])
        token_sequences.append(token_sequence)
    return token_sequences

def to_categorical(sequences, categories):
    # One-hot encode every index in every sequence.
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)
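def get_words(sentences):
    # Collect the set of distinct words across all sentences.
    words = set([])
    for sentence in sentences:
        for word in sentence:
            words.add(word)
    return words

def get_tags(sentences_tags):
    # Collect the set of distinct tags across all sentences.
    tags = set([])
    for tag in sentences_tags:
        for t in tag:
            tags.add(t)
    return tags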
def get_train_sentences_x(train_sentences, word2index):
    # Map each word to its index; unknown words map to -OOV-.
    train_sentences_x = []
    for sentence in train_sentences:
        sentence_index = []
        for word in sentence:
            try:
                sentence_index.append(word2index[word])
            except KeyError:
                sentence_index.append(word2index['-OOV-'])
        train_sentences_x.append(sentence_index)
    return train_sentences_x

def get_test_sentences_x(test_sentences, word2index):
    # Same mapping for the test set; unseen words map to -OOV-.
    test_sentences_x = []
    for sentence in test_sentences:
        sentence_index = []
        for word in sentence:
            try:
                sentence_index.append(word2index[word])
            except KeyError:
                sentence_index.append(word2index['-OOV-'])
        test_sentences_x.append(sentence_index)
    return test_sentences_x
def get_train_tags_y(train_tags, tag2index):
    # Map each tag to its integer index.
    train_tags_y = []
    for tags in train_tags:
        train_tags_y.append([tag2index[t] for t in tags])
    return train_tags_y

def get_test_tags_y(test_tags, tag2index):
    # Same mapping for the test tags.
    test_tags_y = []
    for tags in test_tags:
        test_tags_y.append([tag2index[t] for t in tags])
    return test_tags_y
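The `logits_to_tokens` utility expects an index that maps integers back to tags, which is the inverse of `tag2index`. The sketch below shows that decoding step on made-up probabilities. `decode_tags` is a hypothetical pure-Python equivalent written for illustration, and the tiny tag set and probability values are invented. With a trained model, the input would instead be the output of `model.predict(test_sentences_x)`.

```python
def decode_tags(prob_sequences, index2tag):
    """Pick the most probable tag index per token and map it back to a tag."""
    decoded = []
    for prob_sequence in prob_sequences:
        tags = []
        for probs in prob_sequence:
            best = max(range(len(probs)), key=probs.__getitem__)  # argmax
            tags.append(index2tag[best])
        decoded.append(tags)
    return decoded

# Hypothetical three-tag index and its inverse.
tag2index = {"-PAD-": 0, "NN": 1, "VB": 2}
index2tag = {i: t for t, i in tag2index.items()}

# Made-up softmax outputs for one sentence of three positions.
fake_probs = [[[0.1, 0.7, 0.2], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]]

decoded = decode_tags(fake_probs, index2tag)
print(decoded)  # [['NN', 'VB', '-PAD-']]
```

Trailing `-PAD-` predictions correspond to padded positions and can be dropped using the original sentence length.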
Thank you for reading. Hit the clap button if you liked the article. If you need help with anything in this article, contact me on LinkedIn.