Transformer Based QA System for Urdu

Question Answer Bot

Question answering (QA) systems are a popular trend in NLP. They broadly fall into two categories.

1 - Open Domain: Open-domain QA covers a vast range of NLP applications, and building such a system takes a huge amount of data and text. I will write a separate blog post about open-domain QA systems later.

2 - Closed Domain:

A closed-domain QA system covers a narrow domain and only answers questions whose answers can be found in that domain. One example is a knowledge-based system. In this tutorial, I will explain the steps to build a knowledge-based QA system.

Knowledge base (KB) question answering is mostly used for FAQs: the user asks a question, and the model returns the answer of the best-matching stored question. It is easy to implement and easy to integrate with chatbots and websites. A KB system works best for small datasets or narrow domains, such as questions about a company's business.

I'm going to use the Transformers library to build a knowledge-based system. You can use a Transformer model such as BERT to build one easily. You need question-answer pairs for such a system; I have a small dataset that contains questions and answers on simple Islamic problems.

Here it goes!!!

Data Format: The data is an array of dictionaries; each dictionary contains an index, a question, and the answer to that question.

{
    "index": 782,
    "question": "بکریوں کے باڑے میں نماز پڑھنا کیسا ہے",
    "answer": "انس بن مالک رضی الله عنہ سے روایت ہے کہ نبی اکرم صلی اللہ علیہ وسلم بکریوں کے باڑے میں نماز پڑھتے تھے۔"
},
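To make the format concrete, here is a minimal sketch of checking such a file before encoding it. The helper name validate_knowledge_base and the inline sample records are illustrative assumptions, not part of the original code; in practice you would pass in the list loaded from your own knowledge base file.

```python
import json

REQUIRED_KEYS = {"index", "question", "answer"}


def validate_knowledge_base(records):
    """Return the records unchanged if each one has the expected keys.

    Hypothetical helper: raises ValueError on the first malformed entry.
    """
    if not isinstance(records, list):
        raise ValueError("knowledge base must be a list of dictionaries")
    for position, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {position} is missing keys: {sorted(missing)}")
        if not str(record["question"]).strip():
            raise ValueError(f"record {position} has an empty question")
    return records


# Illustrative sample in the same shape as the dataset (placeholder text).
sample_json = """
[
  {"index": 0, "question": "sample question 1", "answer": "sample answer 1"},
  {"index": 1, "question": "sample question 2", "answer": "sample answer 2"}
]
"""

records = validate_knowledge_base(json.loads(sample_json))
questions = [record["question"] for record in records]
print(len(questions), "questions loaded")
```

Catching a missing key at this stage is cheaper than discovering it halfway through encoding.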

Encode the Data:

Convert every question in the dataset into a fixed-size vector using a Transformer model. I'm using the bert-base-multilingual-cased model and taking its pooled output as the encoding.

import json
import os

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

# Local copy of the bert-base-multilingual-cased checkpoint.
MODEL_PATH = '/home/irfan/Downloads/bert-base-multilingual-cased'

tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model = BertModel.from_pretrained(MODEL_PATH)
model_gpu = model.to('cuda:0')
model_gpu.eval()  # inference only, so switch off dropout

# The encoded questions only exist after encode_question() has run once,
# so load them only if the files are already there.
questions_enc = None
questions_enc_len = None
if os.path.exists("./data/questions.npy") and os.path.exists("./data/questions_len.npy"):
    questions_enc = np.load("./data/questions.npy")
    questions_enc_len = np.load("./data/questions_len.npy")


def assign_GPU(tokenizer_output):
    # Move each tensor produced by the tokenizer to the same GPU as the model.
    return {key: tokenizer_output[key].to('cuda:0')
            for key in ('input_ids', 'token_type_ids', 'attention_mask')}


def encode(text):
    # Encode one question as BERT's pooled output vector.
    inputs = assign_GPU(
        tokenizer(text, return_tensors="pt", truncation=True, max_length=30, padding="max_length"))
    with torch.no_grad():  # no gradients are needed for inference
        outputs = model_gpu(**inputs)
    return outputs[1].cpu().numpy()[0]


def get_most_similar_question_id(q_question):
    if questions_enc is None:
        raise RuntimeError("Run encode_question() first to create data/questions.npy")
    q_vector = encode(q_question)
    # Cosine similarity between the query vector and every stored question vector.
    score = np.sum(questions_enc * q_vector, axis=1) / (
            questions_enc_len * (np.sum(q_vector * q_vector) ** 0.5))
    best_id = np.argmax(score)
    return best_id, score[best_id]


def encode_question():
    with open("./data/knowledge_base.json", "r", encoding="utf-8") as f:
        data = json.load(f)
    questions = [each['question'] for each in data]
    print("Question size", len(questions))
    print("Encoding Starts....")
    questions_enc = np.array([encode(question) for question in questions])

    np.save("./data/questions", questions_enc)
    print("data/questions.npy created.")
    # Store each vector's length so it is not recomputed for every query.
    questions_enc_len = np.sqrt(np.sum(questions_enc * questions_enc, axis=1))
    np.save("./data/questions_len", questions_enc_len)
    print("data/questions_len.npy created.")


if __name__ == '__main__':
    encode_question()
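
The score computed in get_most_similar_question_id is the cosine similarity between the query vector and every stored question vector: the dot product divided by the product of the two vector lengths. Here is a minimal sketch with small made-up vectors standing in for the real BERT encodings (the values and the name cosine_scores are illustrative assumptions):

```python
import numpy as np


def cosine_scores(question_matrix, query_vector):
    """Cosine similarity between each row of question_matrix and query_vector."""
    row_lengths = np.sqrt(np.sum(question_matrix * question_matrix, axis=1))
    query_length = np.sqrt(np.sum(query_vector * query_vector))
    dot_products = np.sum(question_matrix * query_vector, axis=1)
    return dot_products / (row_lengths * query_length)


# Toy 3-dimensional "encodings" (made up for illustration).
question_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0],
])
query_vector = np.array([0.0, 3.0, 0.0])

scores = cosine_scores(question_matrix, query_vector)
best_id = int(np.argmax(scores))
print(best_id, scores[best_id])
```

The second row points in the same direction as the query, so it gets the highest score even though the two vectors have different lengths. Only direction matters for cosine similarity.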

Predict the Answer:

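I'm using cosine similarity to find the best-matching question and then return its answer. The script below imports get_most_similar_question_id from the encoding script above, so that script needs to be saved as transformer_encoder.py.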

import json

from transformer_encoder import get_most_similar_question_id

# Load the knowledge base; its entries are in the same order as the encoded questions.
with open("./data/knowledge_base.json", "r", encoding="utf-8") as f:
    faq_data = json.load(f)

if __name__ == "__main__":
    idx, score = get_most_similar_question_id("وضو کب واجب ہوتا ہے ")
    print(idx, score)
    # Only answer when the similarity clears the threshold.
    if score >= 0.96:
        # idx is a row position in questions.npy, which matches the position
        # in faq_data (not necessarily the entry's "index" field).
        print(faq_data[idx])
    else:
        print("Sorry, I can't understand your question.")
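
The lookup and threshold logic can be wrapped in a small function so that it is easier to call from a chatbot or website. This is a sketch under assumptions: pick_answer is a hypothetical helper, the scores are made up, and 0.96 is simply the threshold used above, which you would likely need to tune for your own data.

```python
import numpy as np

FALLBACK = "Sorry, I can't understand your question."


def pick_answer(scores, answers, threshold=0.96):
    """Return the answer for the best-scoring question, or a fallback.

    scores[i] is the similarity of the user's question to stored question i,
    and answers[i] is the answer stored at the same position.
    """
    best_id = int(np.argmax(scores))
    if scores[best_id] >= threshold:
        return answers[best_id]
    return FALLBACK


answers = ["answer A", "answer B", "answer C"]

# Best score clears the threshold, so the matching answer is returned.
print(pick_answer(np.array([0.91, 0.97, 0.80]), answers))
# Nothing is close enough, so the fallback message is returned.
print(pick_answer(np.array([0.91, 0.93, 0.80]), answers))
```

A high threshold makes the bot refuse more often but keeps it from returning an unrelated answer to a question the knowledge base does not cover.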

And that's pretty much it. You have a working Knowledge-Based Question Answer system for the Urdu language.

Feel free to ask questions in the comments.


Unknown, 19 July 2021 at 09:55
Great work. I think you are the first one to discuss this topic on the web.
Unknown, 19 July 2021 at 09:59
Please share the configuration steps that need to be performed before running the above code.
Muhammad Irfan, 19 July 2021 at 23:12
Here is the working Colab example:
https://colab.research.google.com/drive/1p6jkvtbBy94wbk5T5D2uRa_AC1Ozh0FZ?usp=sharing
Vickey, 26 February 2022 at 23:05
Can you please share the dataset?
Muhammad Irfan, 8 December 2022 at 06:30
No
