
Fine-Tuning LLaMA 2 (7B) for News Article Summarization in Urdu

With the explosion of natural language processing (NLP) models, fine-tuning large language models like Meta’s LLaMA 2 for specific tasks has become more accessible. In this post, we will guide you through the steps to fine-tune LLaMA 2 (7B) for summarizing news articles in Urdu using the Hugging Face Transformers library.

Why Fine-Tune LLaMA 2 for Urdu News Summarization?

LLaMA 2’s robust architecture makes it a powerful choice for NLP tasks. However, fine-tuning is essential when working with a low-resource language like Urdu. By fine-tuning, you can adapt the model to understand the nuances of Urdu grammar and vocabulary, as well as the specific style of news articles.

Before diving into the fine-tuning process, ensure you have the following:

  1. High-Performance GPU: Training a 7B model requires significant computational resources. Platforms like Google Colab Pro, AWS, or Azure are ideal.

  2. Datasets: A curated dataset of Urdu news articles and their summaries. Ensure the data is cleaned and preprocessed.

  3. Python Environment: Python 3.8+ with the necessary libraries installed, including transformers, datasets, and accelerate (a quick environment check is sketched after this list).

  4. Hugging Face Account: Access to the LLaMA 2 weights requires accepting Meta’s license agreement via Hugging Face.
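
Before moving on, it can help to confirm that a GPU is visible and the core libraries import cleanly. The snippet below is a minimal sanity check, assuming a CUDA-capable runtime such as Colab; the memory figure is only a rough guideline.

# Minimal environment sanity check (assumes a CUDA-capable runtime such as Colab)
import torch
import transformers

print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # QLoRA fine-tuning of a 7B model typically needs on the order of 12-16 GB of VRAM
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))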

Dataset Preparation

  1. Collecting Data: I'm using the mirfan899/ur_news_sum dataset (https://huggingface.co/datasets/mirfan899/ur_news_sum), which pairs Urdu news articles with their summaries.

  2. Cleaning Data: Use Python libraries like pandas or nltk to remove HTML tags, normalize text, and handle missing values.

  3. Formatting Data: Convert each article-summary pair into the instruction-style prompt format used to fine-tune the LLaMA 2 model (see the helper functions below).

  4. Splitting Dataset: Divide the data into training, validation, and test sets, typically in a 70:20:10 ratio; a rough sketch of steps 2-4 follows this list.
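
The sketch below illustrates steps 2-4 on this dataset. The regex cleaning rules and the 70:20:10 split sizes are illustrative assumptions; the fine-tuning code later in this post simply uses the dataset's built-in train and test splits.

# Illustrative cleaning and splitting sketch (cleaning rules and split sizes are assumptions)
import re
from datasets import load_dataset

raw = load_dataset("mirfan899/ur_news_sum")

def clean_example(example):
    text = example["text"]
    text = re.sub(r"<[^>]+>", " ", text)      # strip any leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    example["text"] = text
    return example

cleaned = raw["train"].map(clean_example)

# 70:20:10 split into train / validation / test
first_split = cleaned.train_test_split(test_size=0.3, seed=42)
second_split = first_split["test"].train_test_split(test_size=1/3, seed=42)
train_ds, valid_ds, test_ds = first_split["train"], second_split["train"], second_split["test"]
print(len(train_ds), len(valid_ds), len(test_ds))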

Model Fine-Tuning Workflow

The process involves loading LLaMA 2, preprocessing the dataset, and training the model. Below is an overview of the steps:

  1. Install Dependencies: Make sure you have the required Python libraries.

  2. Load LLaMA 2: Use Hugging Face’s Transformers library to load the pre-trained model.

  3. Tokenization: Use a tokenizer compatible with LLaMA 2 to preprocess Urdu text. Custom tokenization might be needed for better results with Urdu.

  4. Training Loop: Use Hugging Face’s Trainer or PyTorch to fine-tune the model.


Dependencies installation
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.41.3 transformers==4.37.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Variables and parameters

# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mirfan899/ur_news_sum"

# Fine-tuned model name
new_model = "llama2-7b-usum"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with the same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}

Load the dataset
dataset = load_dataset(dataset_name)

Helper functions
DEFAULT_SYSTEM_PROMPT = """
Below is a news article written by a human. Write a summary of the news.
""".strip()


def generate_training_prompt(
    article: str, summary: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    # Wrap an article/summary pair in the instruction prompt template
    return f"""### Instruction: {system_prompt}

### Input:
{article.strip()}

### Response:
{summary}
""".strip()

def generate_text(data_point):
    # Build the full training prompt; keep the original article ("news") and summary alongside it
    return {
        "news": data_point["text"],
        "summary": data_point["summary"],
        "text": generate_training_prompt(data_point["text"], data_point["summary"]),
    }

def process_dataset(data):
    # Shuffle and apply the prompt formatting to every example
    return data.shuffle(seed=42).map(generate_text)

dataset["train"] = process_dataset(dataset["train"])
dataset["test"] = process_dataset(dataset["test"])

Fine-tuning


# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

Log in to Hugging Face to publish the model.

from huggingface_hub import login
login(token="your_token")

Push the fine-tuned adapter and tokenizer to Hugging Face.

# The trained LoRA adapter lives in trainer.model; push the adapter and tokenizer
trainer.model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
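
If you also want a standalone model rather than an adapter that must be loaded on top of the base weights, you can merge the LoRA adapter into the base model before publishing. The sketch below is one way to do this; it reloads the 7B base model in fp16 (roughly 14 GB), so you may need to free GPU memory or restart the runtime first, and "llama2-7b-usum-merged" is a hypothetical repository name.

# Optional: merge the LoRA adapter into the base model for standalone deployment
trainer.model.save_pretrained(new_model)  # save the adapter locally

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map=device_map,
)
merged = PeftModel.from_pretrained(base_model, new_model)
merged = merged.merge_and_unload()

# "llama2-7b-usum-merged" is a hypothetical repo name; pick your own
merged.push_to_hub("llama2-7b-usum-merged", use_temp_dir=False)
tokenizer.push_to_hub("llama2-7b-usum-merged", use_temp_dir=False)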

Deployment

Once fine-tuning is complete, the model can be deployed using Hugging Face’s transformers library for inference. You can integrate the model into applications like news aggregators, mobile apps, or chatbots that provide Urdu summaries.
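
As a minimal inference sketch, assuming a merged model published under the hypothetical id "llama2-7b-usum-merged", you can run generation through the transformers pipeline. The prompt must follow the same Instruction / Input / Response template used during training.

import torch
from transformers import pipeline

# "llama2-7b-usum-merged" is a hypothetical model id; replace it with your own
# Hub repo (e.g. <username>/llama2-7b-usum-merged) or a local path
summarizer = pipeline(
    "text-generation",
    model="llama2-7b-usum-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)

article = "..."  # Urdu news article text goes here
prompt = (
    "### Instruction: Below is a news article written by a human. Write a summary of the news.\n\n"
    f"### Input:\n{article.strip()}\n\n### Response:\n"
)
output = summarizer(prompt, max_new_tokens=200, do_sample=False, return_full_text=False)
print(output[0]["generated_text"])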

Conclusion

Fine-tuning LLaMA 2 for Urdu news summarization opens new doors for NLP in low-resource languages. With careful dataset preparation and model optimization, you can achieve impressive results that cater to a growing Urdu-speaking audience.

