Urdu News Classification

News classification is a hot topic in NLP: identifying the type of a news item and figuring out whether it is fake or not. There is an Urdu news dataset available, extracted from the web. It has multiple classes and can be used for news classification and other purposes.

Preprocessing:

The news dataset is spread across multiple Excel files. For classification, we need to merge them into a single CSV file. Here is how I did it:

import pandas as pd
import glob

files = glob.glob("data/*.xlsx")

# xlrd reads .xlsx only up to version 1.2.0 (2.0 dropped it); I ran into a lot of issues with openpyxl.
frames = []
for file in files:
    excel_file = pd.read_excel(file, index_col=None, na_values=['NA'], usecols=["category", "summery", "title"], engine="xlrd")
    frames.append(excel_file)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat(frames, ignore_index=True)

df.drop_duplicates(inplace=True)

# use single word for classification.
df.category = df.category.str.replace("weird news", "weird")
df.to_csv("data/headlines.csv", header=True, index=False)

This code snippet reads every Excel file in the data directory, merges them into a single dataframe, drops duplicate rows, and saves the result as a CSV file. I'm using a single-word class for each news item, so I replace "weird news" with "weird".

If you want to perform further preprocessing, you can follow the Sentiment Analysis tutorial.

I'm not applying any other preprocessing for now. It is up to you whether you want to lemmatize the words or remove stop words.
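If you do want to try stop word removal, here is a minimal sketch of how it could look. The stop word list below is a tiny, hand-picked set of common Urdu function words that I'm using only for illustration; it does not come from the dataset or from any library, and the function name `remove_stop_words` is my own. A real list would be much longer.

```python
# Illustrative, deliberately tiny stop word list (an assumption, not a standard resource).
URDU_STOP_WORDS = {"کا", "کی", "کے", "میں", "سے", "اور", "ہے", "کو", "پر", "نے"}


def remove_stop_words(text, stop_words=URDU_STOP_WORDS):
    """Split on whitespace and drop every token found in stop_words."""
    tokens = text.split()
    return " ".join(token for token in tokens if token not in stop_words)


# Example on a made-up headline
print(remove_stop_words("پاکستان اور بھارت کا میچ"))
```

On the merged dataframe, something like `df["title"].astype(str).apply(remove_stop_words)` before saving the CSV should do the job, assuming the titles are whitespace-separated.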

Count Plot of Classes:

To see the class distribution of the news dataset graphically, I used seaborn's countplot together with matplotlib. Here is the code and the resulting plot.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


data = pd.read_csv("data/headlines.csv", sep=",")
plt.figure(figsize=(8, 5))
sns.countplot(x="category", data=data)
plt.savefig("categories.png")

plt.show()

Counts of each news category
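If you prefer exact numbers over a plot, pandas can give the same information as a table. This is a small sketch on a made-up dataframe standing in for data/headlines.csv (the rows here are invented for illustration); with the real file you would load it with `pd.read_csv` as above.

```python
import pandas as pd

# Toy stand-in for data/headlines.csv; these rows are made up for illustration.
data = pd.DataFrame({"category": ["sports", "business", "sports", "weird", "sports"]})

# Absolute number of rows per category, sorted from most to least frequent
counts = data["category"].value_counts()

# Same thing as a fraction of all rows, handy for spotting class imbalance
shares = data["category"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```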

Model Training:

I have chosen SVM and logistic regression models for news classification, but you can use whatever model you prefer.

Here is my code for SVM and logistic regression:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.read_csv("data/headlines.csv", sep=",")
X = data["title"].values.astype(str).tolist()
Y = data["category"].values.astype(str).tolist()

# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)

# features: tf-idf vectors, shared by both models
max_feature_num = 10000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the fitted vectorizer so the test text gets the vocabulary and IDF weights learned from training data
test_vecs = train_vectorizer.transform(test_text)

# train model: linear SVM
clf = svm.LinearSVC(max_iter=5000)
clf.fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='weighted')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

# train model: logistic regression
clf = LogisticRegression().fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='weighted')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

Here are the results.

# SVM
acc 0.8445011854824002
precision 0.8415581287611198
rec 0.8445011854824002
f1 0.8426318990306771

# LR
acc 0.8365858106875798
precision 0.8329923072929364
rec 0.8365858106875798
f1 0.8325166408219948
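The precision, recall, and f1 numbers above are computed with average='weighted', which weights each class by its number of samples, so a weak small class barely moves the score. Here is a small sketch with made-up labels (not from the dataset) that compares the weighted average with the macro average, which treats every class equally.

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 8 "sports" items and 2 "weird" items,
# and a lazy classifier that predicts "sports" for everything.
y_true = ["sports"] * 8 + ["weird"] * 2
y_pred = ["sports"] * 10

# Weighted: per-class F1 averaged by class frequency
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Macro: plain mean of per-class F1, so the missed "weird" class counts fully
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("weighted f1", weighted_f1)
print("macro f1", macro_f1)
```

If the two numbers are far apart on your own data, it may be worth looking at the per-class scores, for example with sklearn.metrics.classification_report.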

You may be able to improve the accuracy of these models by removing stop words and applying lemmatization.

If you have any questions, feel free to ask.
