News classification is a popular NLP task: identifying the category of a news item and, in related work, figuring out whether it is fake or not. An Urdu news dataset extracted from the web is available. It has multiple classes and can be used for news classification and other purposes.
The preprocessing snippet reads every Excel file in the data directory, merges these files into a single dataframe, and then saves it as a CSV file. I'm using a single-word class for each news item, so I replace "weird news" with "weird".
If you want to perform further preprocessing, you can follow the Sentiment Analysis tutorial.
I'm not applying any other preprocessing for now. It is up to you whether to lemmatize the words or remove stop words.
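As a rough idea of what stop-word removal could look like, here is a minimal sketch. The stop-word set below is a tiny hand-picked sample that I am assuming for illustration, not a complete Urdu stop-word list, and `remove_stop_words` is a hypothetical helper name, not part of the original code.

```python
# Illustrative only: a tiny hand-picked sample of common Urdu function words.
# A real project would load a fuller, curated stop-word list.
URDU_STOP_WORDS = {"کا", "کی", "کے", "اور", "میں", "ہے", "سے", "نے", "کو", "پر"}


def remove_stop_words(text, stop_words=URDU_STOP_WORDS):
    """Split on whitespace and drop every token found in stop_words."""
    tokens = text.split()
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


headline = "پاکستان نے میچ میں فتح حاصل کی"
cleaned = remove_stop_words(headline)
print(cleaned)
```

The cleaned titles could then be vectorized in place of the raw ones. Whitespace splitting is a simplification, and a proper Urdu tokenizer may behave differently.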
I have chosen SVM and logistic regression models for news classification, but you can use any classifier you prefer.
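If you want to try a different model, one possible approach is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that only the last step needs to change. This is a sketch on toy English stand-in data that I made up for illustration; Naive Bayes is just an example alternative, not a model used in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical). In the post, the texts are Urdu titles
# and the labels come from the "category" column.
train_text = [
    "cricket match win",
    "cricket team score",
    "stock market fall",
    "market price rise",
]
train_labels = ["sports", "sports", "business", "business"]
test_text = ["cricket score", "market price"]

# The pipeline fits tf-idf on the training text and reuses it when predicting.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_text, train_labels)

predictions = model.predict(test_text)
print(list(predictions))
```

Swapping `MultinomialNB()` for another scikit-learn classifier leaves the rest of the code unchanged.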
The code for SVM and logistic regression is in the Model Training section. You may be able to improve the accuracy of these models by removing stop words and applying lemmatization.
If you have any questions, feel free to ask.
Preprocessing:
The news dataset is spread across multiple Excel files. For classification, we need to merge it into a single CSV file. Here is how I did it:
import glob
import pandas as pd

files = glob.glob("data/*.xlsx")
# If you want to use xlrd, version 1.2.0 is good to go (xlrd 2.0+ dropped .xlsx support); openpyxl gave me a lot of issues.
frames = []
for file in files:
    excel_file = pd.read_excel(file, index_col=None, na_values=["NA"], usecols=["category", "summery", "title"], engine="xlrd")
    frames.append(excel_file)
# DataFrame.append was removed in pandas 2.0, so concatenate all frames once.
df = pd.concat(frames, ignore_index=True)
df.drop_duplicates(inplace=True)
# Use a single word for each class.
df.category = df.category.str.replace("weird news", "weird")
df.to_csv("data/headlines.csv", header=True, index=False)
Count Plot of Classes:
To see the classes of the news dataset graphically, I used seaborn's countplot together with matplotlib:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("data/headlines.csv", sep=",")
plt.figure(figsize=(8, 5))
sns.countplot(x="category", data=data)
plt.savefig("categories.png")
plt.show()
Figure: Counts of each news category
Model Training:
Here is my code for SVM and logistic regression:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import pandas as pd
data = pd.read_csv("data/headlines.csv", sep=",")
X = data["title"].values.astype(str).tolist()
Y = data["category"].values.astype(str).tolist()
# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)
# features: tf-idf, fitted on the training text only
max_feature_num = 10000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the fitted vectorizer so the test set shares the training vocabulary and IDF weights
test_vecs = train_vectorizer.transform(test_text)
# train SVM model
clf = svm.LinearSVC(max_iter=5000)
clf.fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='weighted')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
# train logistic regression model
clf = LogisticRegression().fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='weighted')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
Here are the results:
# SVM
acc 0.8445011854824002
precision 0.8415581287611198
rec 0.8445011854824002
f1 0.8426318990306771
# LR
acc 0.8365858106875798
precision 0.8329923072929364
rec 0.8365858106875798
f1 0.8325166408219948
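Notice that recall equals accuracy for both models. That is expected with average='weighted': each class's recall is weighted by its share of the test set, which adds up to the fraction of correct predictions. The sketch below recomputes the weighted scores by hand on made-up labels to show this. `weighted_scores` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), weighting each class by its support."""
    total = len(y_true)
    support = Counter(y_true)    # number of true examples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        # A class that is never predicted gets precision 0.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / total
    return accuracy, precision, recall, f1


# Made-up labels, only to illustrate the calculation.
y_true = ["sports", "sports", "sports", "business", "weird"]
y_pred = ["sports", "sports", "business", "business", "sports"]
acc, pre, rec, f1 = weighted_scores(y_true, y_pred)
print("acc", acc)
print("precision", pre)
print("rec", rec)
print("f1", f1)
```

Because weighted recall always matches accuracy, precision and F1 are the more informative numbers when comparing the two models.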