Urdu News Classification

News classification is one of the current buzzwords in NLP. It covers identifying the category of a news item and figuring out whether it is fake or not. There is an Urdu News dataset available, extracted from the web, that has multiple classes and can be used for news classification and other purposes.

Preprocessing:

The news dataset comes as multiple Excel files. For classification, we need to merge them into a single CSV file. Here is how I did it:
import glob

import pandas as pd

files = glob.glob("data/*.xlsx")

# xlrd 1.2.0 can read .xlsx files; xlrd 2.0+ only reads .xls, so pin the version.
# I had a lot of issues with openpyxl.
frames = []
for file in files:
    excel_file = pd.read_excel(file, index_col=None, na_values=['NA'], usecols=["category", "summery", "title"], engine="xlrd")
    frames.append(excel_file)

# DataFrame.append was removed in pandas 2.0, so concatenate all frames at once.
df = pd.concat(frames, ignore_index=True)
df = df.drop_duplicates()

# use a single word for each class label.
df["category"] = df["category"].str.replace("weird news", "weird", regex=False)
df.to_csv("data/headlines.csv", header=True, index=False)
This snippet reads every Excel file in the data directory, merges them into a single dataframe, and saves it as a CSV file. I'm using a single-word class label for each news item, so I replace "weird news" with "weird".

If you want to do further preprocessing, you can follow the tutorial here: Sentiment Analysis.

I'm not applying any other preprocessing for now. It is up to you whether to lemmatize the words or remove stop words.
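If you do want to try stop-word removal, here is a minimal sketch in plain Python. The stop-word set below is a tiny illustrative sample that I am assuming for the example, not a complete Urdu list, and `remove_stop_words` is a hypothetical helper name. It only splits on whitespace, so it is not a real Urdu tokenizer.

```python
# Tiny illustrative stop-word set (an assumption, not a complete Urdu list).
URDU_STOP_WORDS = {"کا", "کی", "کے", "اور", "میں", "ہے", "سے", "نے", "کو", "پر"}


def remove_stop_words(text, stop_words=URDU_STOP_WORDS):
    """Split on whitespace and drop every token found in stop_words."""
    tokens = text.split()
    return " ".join(tok for tok in tokens if tok not in stop_words)


title = "پاکستان نے کرکٹ میچ میں فتح حاصل کی"
cleaned = remove_stop_words(title)
print(cleaned)
```

To apply it to the whole dataset, something like `df["title"].astype(str).apply(remove_stop_words)` before saving the CSV should work.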

Count Plot of Classes:

To see the class distribution of the news dataset graphically, I used seaborn's countplot together with matplotlib. Here is the code and the resulting plot.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


data = pd.read_csv("data/headlines.csv", sep=",")
plt.figure(figsize=(8, 5))
sns.countplot(x="category", data=data)
plt.savefig("categories.png")

plt.show()
Figure: Counts of each news category
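The plot shows the counts visually. If you also want the numbers, for example to check how imbalanced the classes are, a small sketch like the following prints the count and share of each class. The labels here are toy values standing in for the `category` column, and `class_distribution` is a hypothetical helper name.

```python
from collections import Counter


def class_distribution(labels):
    """Return (label, count, share) tuples sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Toy labels; with the real data you would pass data["category"] instead.
labels = ["sports", "sports", "business", "weird", "sports", "business"]
distribution = class_distribution(labels)
for label, n, share in distribution:
    print(f"{label:10s} {n:4d} {share:6.1%}")
```

With pandas, `data["category"].value_counts(normalize=True)` gives the same shares in one call.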

Model Training:

I chose SVM and logistic regression for news classification, but you can use whatever classifier you like.

Here is my code for SVM and logistic regression:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.read_csv("data/headlines.csv", sep=",")
X = data["title"].values.astype(str).tolist()
Y = data["category"].values.astype(str).tolist()

# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)

# features: tf-idf, fitted on the training text only
max_feature_num = 10000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the fitted vectorizer so the test set gets the training vocabulary and IDF weights
test_vecs = train_vectorizer.transform(test_text)

# train SVM model
clf = svm.LinearSVC(max_iter=5000)
clf.fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='weighted')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

# train logistic regression model
clf = LogisticRegression().fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='weighted')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
Here are the results.
# SVM
acc 0.8445011854824002
precision 0.8415581287611198
rec 0.8445011854824002
f1 0.8426318990306771

# LR
acc 0.8365858106875798
precision 0.8329923072929364
rec 0.8365858106875798
f1 0.8325166408219948
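Notice that accuracy and recall are identical for both models. That is expected with `average='weighted'`: each class's recall is weighted by its share of the true labels, so the weighted sum works out to the fraction of correct predictions, which is accuracy. Here is a small hand-rolled sketch of the weighted scores to make that visible. The function name and the toy labels are my own assumptions for illustration. It is not how scikit-learn implements it internally.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total         # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / total
    return accuracy, precision, recall, f1


y_true = ["sports", "sports", "business", "weird"]
y_pred = ["sports", "business", "business", "weird"]
acc, pre, rec, f1 = weighted_scores(y_true, y_pred)
print("acc", acc, "precision", pre, "rec", rec, "f1", f1)
```

Because of this, the weighted recall adds no information beyond accuracy. Per-class scores (for example `average=None`) are more useful for spotting weak categories.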
Removing stop words and lemmatizing the text may improve the accuracy of these models.
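One way to experiment with stop words is to pass a list to `TfidfVectorizer` through its `stop_words` parameter and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is fitted on training text only and merely applied to new text. This is a sketch under assumptions: the stop-word list is a tiny illustrative sample, the titles and labels are toy data, and `build_pipeline` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative stop-word list (an assumption, not a complete Urdu list).
URDU_STOP_WORDS = ["کا", "کی", "کے", "اور", "میں", "ہے", "سے", "نے"]


def build_pipeline(stop_words=None, max_features=10000):
    """TF-IDF features followed by a linear SVM, bundled as one estimator."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=max_features, stop_words=stop_words)),
        ("svm", LinearSVC(max_iter=5000)),
    ])


# Toy titles and labels standing in for train_text / train_labels.
toy_titles = [
    "پاکستان نے کرکٹ میچ جیت لیا",
    "ٹیم نے فٹبال میچ میں فتح حاصل کی",
    "حکومت نے نیا بجٹ پیش کیا",
    "اسٹاک مارکیٹ میں تیزی",
]
toy_labels = ["sports", "sports", "business", "business"]

pipe = build_pipeline(stop_words=URDU_STOP_WORDS)
pipe.fit(toy_titles, toy_labels)       # fits the vectorizer and the SVM together
predictions = pipe.predict(["کرکٹ ٹیم کا میچ"])  # new text is only transformed
print(predictions)
```

With the real data you would call `pipe.fit(train_text, train_labels)` and `pipe.predict(test_text)`, then compare the scores with and without the stop-word list.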

If you have any questions, feel free to ask.
