Posts

Fine-Tuning LLaMA 2 (7B) for News Article Summarization in Urdu

With the explosion of natural language processing (NLP) models, fine-tuning large language models like Meta’s LLaMA 2 for specific tasks has become more accessible. In this post, we will guide you through the steps to fine-tune LLaMA 2 (7B) for summarizing news articles in Urdu using the Hugging Face Transformers library.

Why Fine-Tune LLaMA 2 for Urdu News Summarization?

LLaMA 2’s robust architecture makes it a powerful choice for NLP tasks. However, fine-tuning is essential when working with a low-resource language like Urdu. By fine-tuning, you can adapt the model to understand the nuances of Urdu grammar and vocabulary, as well as the specific style of news articles.

Before diving into the fine-tuning process, ensure you have the following:

High-Performance GPU: Training a 7B model requires significant computational resources. Platforms like Google Colab Pro, AWS, or Azure are ideal.
Datasets: A curated dataset of Urdu news articles and their summaries. Ensure the data is cleaned...
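To make the dataset prerequisite concrete, here is a minimal sketch of cleaning article/summary pairs and turning them into training strings. The prompt template, the `article` and `summary` field names, and the `min_article_words` threshold are assumptions for illustration; the post's actual format is not shown in this excerpt.

```python
from typing import Dict, List

# Hypothetical prompt template (an assumption, not the post's own format).
PROMPT_TEMPLATE = (
    "### Article:\n{article}\n\n"
    "### Summary:\n{summary}"
)


def clean_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return " ".join(text.split())


def build_examples(pairs: List[Dict[str, str]], min_article_words: int = 5) -> List[str]:
    """Turn article/summary pairs into training strings, skipping unusable rows."""
    examples = []
    for pair in pairs:
        article = clean_text(pair.get("article", ""))
        summary = clean_text(pair.get("summary", ""))
        # Drop rows with a missing summary or a very short article.
        if not summary or len(article.split()) < min_article_words:
            continue
        examples.append(PROMPT_TEMPLATE.format(article=article, summary=summary))
    return examples
```

Whitespace splitting works for Urdu because words are separated by spaces, but a real pipeline would likely need further normalization.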
Recent posts

Urdu to Braille Translation for Blind People.

Braille is a tactile writing system used by visually impaired people. It can be read on embossed paper or on refreshable braille displays that connect to computers and smartphones. Braille can be written using a slate and stylus, a braille writer, an electronic braille notetaker, or a computer connected to a braille embosser.

Liblouis is an open-source braille translator and back-translator named in honor of Louis Braille. It supports computer and literary braille, contracted and uncontracted translation for many languages, and hyphenation. New languages can easily be added through tables that support a rule- or dictionary-based approach. Tools for testing and debugging tables are also included.

Install liblouis on your system. When building from a git checkout, run autogen.sh first to generate the configure script.

git clone https://github.com/liblouis/liblouis
cd liblouis
./autogen.sh
./configure
make
sudo make install
sudo ldconfig

To use it from Python, install the Python 3 bindings.

cd python
sudo python3 setup.py install

Now you can u...
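To illustrate the dictionary-based idea behind translation tables, here is a toy sketch that maps characters to braille cells. It is not liblouis and does not use its table format; `TOY_TABLE` is a hypothetical three-letter table, and the only facts relied on are the standard dot patterns for a, b, and c and the Unicode braille block, where dot n sets bit n-1 above U+2800.

```python
# Toy dictionary-based table: letter -> raised dots (numbered 1-6).
# Only three Latin letters are listed; a real table covers a full script.
TOY_TABLE = {
    "a": (1,),
    "b": (1, 2),
    "c": (1, 4),
}

BRAILLE_BASE = 0x2800  # Unicode code point of the blank braille cell


def dots_to_cell(dots):
    """Return the Unicode braille character for a collection of dot numbers."""
    mask = 0
    for dot in dots:
        mask |= 1 << (dot - 1)  # dot n sets bit n-1
    return chr(BRAILLE_BASE + mask)


def translate(text, table=TOY_TABLE):
    """Translate character by character; unknown characters pass through unchanged."""
    return "".join(dots_to_cell(table[ch]) if ch in table else ch for ch in text)


print(translate("abc"))
```

A real Urdu table also has to handle contractions and context rules, which is why a dedicated translator is used instead of a plain lookup.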

Generative AI to Summarize the Urdu Text

Generative AI is like a creative machine, using data to dream up new things! It can write poems, paint pictures, even compose music. Imagine feeding it words and getting a story back, or showing it a sketch and having it design a building! It's still young, but its potential is mind-blowing.

To generate summaries for the Urdu language, I trained an LLM. Here are the steps you can follow to train your own generative AI model for summarization.

Install the required packages.

pip install transformers[torch] datasets==2.14.5 evaluate rouge_score --quiet

Import the necessary libraries.

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

Load the dataset.

huggingface_dataset_name = "mirfan899/usummary"
dataset = load_dataset(huggingface_dataset_name)

Load the pretrained LLM.

model_na...
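The install step pulls in `rouge_score`, which suggests the summaries are scored with ROUGE. As a rough sketch of what that metric measures, the function below computes a simplified ROUGE-1 F1 from unigram overlap. It assumes plain whitespace tokenization, so its numbers will not match the `rouge_score` package, which applies its own tokenization.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Precision rewards summaries that avoid extra words, while recall rewards covering the reference; F1 balances the two.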

Text Summarization for Urdu: Part 2

In the last article, I used the extractive method to summarize text. In this article, I'll show how to get an abstractive summary using a transformer-based deep learning model.

Using a pre-trained model.

One model performs better than the others available on the Hugging Face site for the Urdu language. Here is how you can use it to generate an abstractive summary.

import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Raw strings keep the regex escapes from being read as string escapes.
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

article_text = """ چھٹیوں کے دن ختم ہونے میں آخری دس دن باقی تھے اور میں ہمیشہ کی طرح کتابوں کا پہاڑ سامنے رکھ کر بیٹھ گئی تھی اور ہر بار کی طرح اس بار بھی میں نے چھٹیوں کے آغاز میں سوچا تھا کہ سارا کام پہلے ہی کر لوں گی،مگر ہر بار کی طرح اس بار بھی اس پر عمل نہیں کر سکی اور پھر وہی ہوا کہ آخری کے دس دنوں میں کتابوں کا پہاڑ سامنے رکھ کر بیٹھ گئی اور اپنی قسمت کو کوستی اور خو...
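The whitespace handler in the snippet above is easier to reason about as a named function. This sketch does the same job under the hypothetical name `normalize_whitespace`: it trims the text, then replaces newlines and runs of whitespace with single spaces before the text reaches the tokenizer.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Trim the text and replace newlines and whitespace runs with single spaces."""
    return re.sub(r"\s+", " ", re.sub(r"\n+", " ", text.strip()))


print(normalize_whitespace("  first line\n\nsecond   line  "))
```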

Text Summarization for Urdu: Part 1

Text summarization is an important task for getting the gist of large documents. There are two main summarization techniques used in NLP.

Extractive Text Summarization: The name is self-explanatory. The most important sentences or phrases are extracted from the original text, and a short summary is built from them. See the figure for the explanation.

Abstractive Text Summarization: This approach uses more advanced deep learning techniques to generate new sentences by learning from the original text. It is a complex task and requires heavy computing power such as a GPU.

Let's dive into the code for generating the text summary. I'm using Arabic as the language parameter because the contributor did an excellent job of handling a lot of things like stemming, Urdu character support, etc.

from summa.summarizer import summarize

text = """ اسلام آباد : صدر مملکت ڈاکٹر عارف علوی بھی کورونا وائرس کا شکار ہوگئے۔ سما...
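To show the extractive idea in isolation, here is a minimal frequency-based sketch: each sentence is scored by the average frequency of its words, and the top sentences are returned in their original order. This is a simpler technique than the TextRank algorithm that `summa` implements, and the function names are hypothetical. The sentence splitter treats the Urdu full stop (U+06D4) as a sentence boundary.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split after '.', '!', '?', or the Urdu full stop when whitespace follows."""
    parts = re.split(r"(?<=[.!?\u06d4])\s+", text.strip())
    return [p for p in parts if p]


def summarize_by_frequency(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, then restore document order for output.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the output is built only from existing sentences, nothing new is generated, which is the defining property of the extractive approach.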

Transformer Based QA System for Urdu

Question Answer Bot

The question-answer (QA) system is the latest trend in NLP. There are currently two main techniques used for QA systems.

1 - Open Domain: This is a vast field of NLP applications. A huge amount of data and text is used to build such a system. I will write a blog post later about using an open-domain QA system.

2 - Closed Domain: A closed-domain QA system covers a narrow domain and strictly answers questions that can be found in that domain. One example of a closed-domain QA system is a knowledge-based system.

In this tutorial, I will explain the steps to build a knowledge-based QA system. Knowledge base (KB) question answering is mostly used for FAQs, where the user asks a question and the model returns the best-matched answer. It's easy to implement and easy to integrate with chatbots and websites. It is better to use the KB system for small datasets or narrow domains like...
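The best-match lookup at the heart of a KB system can be sketched without any model. The example below scores stored questions by Jaccard word overlap, which is a stand-in for the transformer-based similarity the post's title refers to; the function names, the sample knowledge base, and the `threshold` value are all assumptions for illustration.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_answer(question, knowledge_base, threshold=0.2):
    """Return the answer of the closest stored question, or None if nothing is close."""
    query = tokenize(question)
    best_score, best = 0.0, None
    for stored_question, answer in knowledge_base.items():
        similarity = jaccard(query, tokenize(stored_question))
        if similarity > best_score:
            best_score, best = similarity, answer
    return best if best_score >= threshold else None


# Hypothetical FAQ entries.
kb = {
    "how do i reset my password": "Use the reset link.",
    "what are your opening hours": "9 to 5.",
}
print(best_answer("how can i reset password", kb))
```

Swapping `jaccard` for cosine similarity over sentence embeddings keeps the same structure while matching paraphrases that share few words.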

Urdu News Classification

News classification is the latest buzzword in NLP for identifying the type of a news item and figuring out whether it is fake or not. There is a dataset of Urdu news, extracted from the web, that has multiple classes and can be used for news classification and other purposes.

Preprocessing: The news dataset is spread across multiple Excel files. For classification, we need to convert it to a single CSV file. Here is how I did it.

import pandas as pd
import glob

files = glob.glob("data/*.xlsx")
frames = []
# if you want to use xlrd then 1.2.0 is good to go, openpyxl has a lot of issues.
for file in files:
    excel_file = pd.read_excel(file, index_col=None, na_values=['NA'], usecols=["category", "summery", "title"], engine="xlrd")
    frames.append(excel_file)
# combine all files into one DataFrame
df = pd.concat(frames, ignore_index=True)
df.drop_duplicates(inplace=True)
# use single word for classification.
df.category = df.category.str.replace("weird ne...