Sentiment Analysis of Products and Services in Pakistan

This blog post covers a dataset of reviews of products and services provided in Pakistan, and sentiment analysis on that dataset. It is a first step toward sentiment analysis of people's reviews related to the manufacturing industry. Building the dataset took me some time, with the help of a few students. We have used the following products and services provided by the company for analysis.

Let's begin with the implementation of SVM for sentiment analysis.
Import necessary packages.

import re
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn import svm

Read the dataset and convert the sentence and sentiment columns to lists for further processing.

raw_data = pd.read_csv('../data/products_sentiment_urdu.csv')
raw_data.head()
# check the size of the data and its class distribution
print(raw_data.shape)
print(raw_data['sentiment'].value_counts())
sentences = raw_data['sentence'].tolist()
sentiments = raw_data['sentiment'].tolist()
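
Because the cleaning step calls re.sub on every sentence, a missing value in the sentence column would raise a TypeError. Here is a minimal sketch of a sanity check that could run before the columns are converted to lists. The `sentence` and `sentiment` column names come from the post; the rows, the label values, and the choice to drop duplicates are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for products_sentiment_urdu.csv; rows and labels are made up.
toy_data = pd.DataFrame({
    "sentence": ["good product", None, "bad service", "good product", "poor delivery"],
    "sentiment": ["pos", "neg", "neg", "pos", "neg"],
})

# Drop rows with a missing sentence or label, then drop exact duplicate rows.
checked = toy_data.dropna(subset=["sentence", "sentiment"]).drop_duplicates()

print("rows before:", len(toy_data), "rows after:", len(checked))
# Share of each class among the remaining rows.
print(checked["sentiment"].value_counts(normalize=True).round(2))
```

Whether duplicates should be removed depends on how the reviews were collected, so treat that step as optional.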

Plot the distribution of the sentiment classes:

sns.countplot(x='sentiment', data=raw_data)
plt.show()

Now clean the text with the following function:

# text cleaning and pre-processing:
def delete_urdu_english_symbols(sentences):
    cleaned = []
    for sentence in sentences:
        # Remove English and Urdu punctuation. "[", "]" and "-" are escaped so they
        # are treated as literal characters inside the character class.
        text = re.sub(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~؛؟’‘٭ء،۔]+""", " ", sentence)
        # split() collapses repeated whitespace and drops empty tokens.
        cleaned.append(" ".join(text.split()))
    return cleaned


X = delete_urdu_english_symbols(sentences)
Y = sentiments
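
Characters such as "[", "]", "-" and "^" have special meaning inside a regular-expression character class, so a punctuation list pasted in as-is can silently match nothing. One hedged way to avoid this is to build the class with re.escape. The sketch below copies the character list from the post; the names `PUNCTUATION`, `PUNCT_RE` and `clean_sentence`, and the sample sentences, are made up for illustration.

```python
import re

# Punctuation to strip (list copied from the post). re.escape makes every
# character literal, so "[", "]", "-" and "^" cannot change the pattern.
PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[]^_`{|}~؛؟’‘٭ء،۔"
PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]+")


def clean_sentence(sentence):
    """Replace punctuation with spaces and collapse runs of whitespace."""
    text = PUNCT_RE.sub(" ", sentence)
    return " ".join(text.split())


# Illustrative inputs: one English, one Urdu sentence ending in an Urdu full stop.
samples = ["Hello, world!  [test]", "یہ اچھا ہے۔"]
cleaned_samples = [clean_sentence(s) for s in samples]
print(cleaned_samples)
```

Running the cleaner on a handful of known sentences like this is a quick way to confirm that the pattern removes what it is meant to remove.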

Split the data 80/20 into training and test sets and convert the text to TF-IDF vectors. You can tune the model by changing max_feature_num and the train/test split ratio.

# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)

# feature extraction: tf-idf (the SVM classifier is trained in the next step)
max_feature_num = 6000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the fitted vectorizer so the test set gets the training vocabulary and IDF weights
test_vecs = train_vectorizer.transform(test_text)
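
A common pattern with scikit-learn vectorizers is to call fit_transform on the training text and transform on the test text, so that both sets share the vocabulary and IDF weights learned from the training data only. A minimal sketch on made-up English sentences (the `toy_*` names and the data are placeholders, not the post's dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences used only to illustrate the fit/transform split.
toy_train = ["good product", "bad service", "good service", "bad product quality"]
toy_test = ["good quality", "terrible service"]

vectorizer = TfidfVectorizer(max_features=6000)
toy_train_vecs = vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
toy_test_vecs = vectorizer.transform(toy_test)        # reuses them; nothing is re-fitted

# Both matrices have one column per training-vocabulary term.
print(toy_train_vecs.shape, toy_test_vecs.shape)
print(sorted(vectorizer.vocabulary_))
```

Words that appear only in the test text, such as "terrible" here, are ignored because they are not part of the training vocabulary.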

Initialize and train the model.

# train model
clf = svm.LinearSVC()
clf.fit(train_vecs, train_labels)
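
The post mentions tuning max_feature_num by hand. One possible alternative, sketched here under assumptions, is to wrap the vectorizer and classifier in a scikit-learn Pipeline and let GridSearchCV try a few settings with cross-validation. The toy data, the candidate values and the choice to also tune `C` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up data; in the post this would be train_text and train_labels.
toy_text = ["good product", "bad product", "great service", "terrible service",
            "nice quality", "poor quality", "good delivery", "bad delivery"]
toy_labels = ["pos", "neg"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Candidate values are illustrative, not recommendations.
param_grid = {
    "tfidf__max_features": [5, 10],
    "svm__C": [0.1, 1.0],
}

# Each candidate is scored by macro F1 using 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(toy_text, toy_labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is re-fitted on the training folds only, which keeps the validation folds out of the vocabulary and IDF statistics.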

Now predict the sentiments of the test set and compute the metrics:

test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
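
Macro-averaged scores hide how each class performs on its own. A confusion matrix and a per-class report can show that breakdown. This is a minimal sketch with placeholder labels and predictions; in the post these would be test_labels and test_pred, and the actual label values in the dataset may differ.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels and predictions, made up for illustration.
y_true = ["pos", "neg", "pos", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]

labels = ["neg", "pos"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```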

SVM gives the following results:

acc 0.822863403944485
precision 0.8242985375774442
rec 0.819176278364857
f1 0.8214720755500949

Not bad! We got a macro F1-score of about 0.82 with this very simple SVM model.
Save the model for later use.

# save the model and the fitted vectorizer
all_info_want_to_save = {
    'model': clf,
    'vectorizer': train_vectorizer
}
with open("../models/sentiment_urdu_svm.pickle", "wb") as save_path:
    pickle.dump(all_info_want_to_save, save_path)
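
To use a saved model later, the pickle has to contain a fitted vectorizer, because transform needs the IDF weights learned during training. Below is a minimal round-trip sketch under stated assumptions: the tiny training set is made up, and a temporary directory stands in for the ../models/ path used in the post.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up training set, only to produce a fitted model and vectorizer.
texts = ["good product", "great service", "bad product", "terrible service"]
labels = ["pos", "pos", "neg", "neg"]

vectorizer = TfidfVectorizer()
model = svm.LinearSVC()
model.fit(vectorizer.fit_transform(texts), labels)

# Save both fitted objects together, then load them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_svm.pickle")
    with open(path, "wb") as f:
        pickle.dump({"model": model, "vectorizer": vectorizer}, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The loaded vectorizer can transform unseen text without being re-fitted.
new_texts = ["good service", "bad service"]
predictions = loaded["model"].predict(loaded["vectorizer"].transform(new_texts))
print(list(predictions))
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.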

And that's pretty much it. If you have any questions about the article, feel free to ask.

Comments

Anonymous, 30 July 2021 at 04:51
I am unable to locate the dataset can anybody help in this one ?

Muhammad Irfan, 30 July 2021 at 20:55
This dataset is not publicly available.

Unknown, 25 October 2021 at 23:44
Hello Sir,
Can you please share the "products_sentiment_urdu.csv" dataset through email.

ehtishamrehman1@gmail.com

Regards,
Ehtisham

Muhammad Irfan, 28 October 2021 at 08:25
Download from here.
https://github.com/mirfan899/Urdu
