This blog post covers a dataset of Urdu reviews of products and services offered in Pakistan, and sentiment analysis on that dataset. It is a first step toward sentiment analysis of people's reviews related to the manufacturing industry. Building the dataset took me some time, and a few students helped. We have used the following products and services provided by the company for analysis.
Let's begin with the implementation of SVM for sentiment analysis.
Import necessary packages.
import re
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn import svm
raw_data = pd.read_csv('../data/products_sentiment_urdu.csv')
raw_data.head()
# check the size of the data and its class distribution
sentences = raw_data['sentence'].tolist()
sentiments = raw_data['sentiment'].tolist()
sns.countplot(x='sentiment', data=raw_data)
plt.show()
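The comment above mentions the size of the data, which the plot alone does not show. A minimal sketch of checking both size and class balance numerically follows. It uses a made-up four-row frame in place of the real CSV, and the label names `pos` and `neg` are assumptions, since the post does not list the actual label values.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; rows and label names are made up.
toy_data = pd.DataFrame({
    "sentence": [
        "یہ فون بہت اچھا ہے",
        "سروس بہت خراب تھی",
        "کیمرہ زبردست ہے",
        "بیٹری جلدی ختم ہو جاتی ہے",
    ],
    "sentiment": ["pos", "neg", "pos", "neg"],
})

# Number of rows and columns.
n_rows, n_cols = toy_data.shape

# Absolute and relative class frequencies.
class_counts = toy_data["sentiment"].value_counts()
class_share = toy_data["sentiment"].value_counts(normalize=True)

print("rows:", n_rows, "columns:", n_cols)
print(class_counts)
print(class_share)
```

On the real data, the same three calls on `raw_data` would show whether the classes are balanced enough for plain accuracy to be a meaningful metric.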
Now clean the text with the following function.
# text cleaning and pre-processing
def delete_urdu_english_symbols(sentences):
    cleaned = []
    for sentence in sentences:
        # Remove English and Urdu punctuation.
        # "[", "]" and "-" are escaped so the character class is not closed early.
        text = re.sub(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~؛؟’‘٭ء،۔]+""", " ", sentence)
        # Collapse runs of spaces into one.
        text = re.sub(r" +", " ", text)
        text = text.split(" ")
        # Drop empty tokens left over after the substitutions.
        text = [t.strip() for t in text if t.strip()]
        cleaned.append(" ".join(text))
    return cleaned
X = delete_urdu_english_symbols(sentences)
Y = sentiments
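To see what the cleaning step does to a single review, here is a small self-contained sketch. The helper name `strip_punctuation` and the sample sentence are hypothetical. The character class mirrors the one in the function above, with `[`, `]` and `-` escaped.

```python
import re

# Same punctuation set as in the post, with "[", "]" and "-" escaped.
PUNCT = re.compile(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~؛؟’‘٭ء،۔]+""")

def strip_punctuation(sentence):
    """Hypothetical helper: replace punctuation with spaces, then normalise whitespace."""
    text = PUNCT.sub(" ", sentence)
    # str.split() with no argument drops empty tokens and repeated spaces.
    return " ".join(text.split())

# Made-up sample mixing Urdu and English punctuation.
sample = "واہ!!  کیا   بات ہے۔ Great [product]..."
cleaned_sample = strip_punctuation(sample)
print(cleaned_sample)
```

Only punctuation and extra whitespace are removed; the Urdu and English words themselves pass through unchanged.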
# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)
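If the classes turn out to be imbalanced, one option is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below is an illustration on made-up data, not the split used for the results in this post. `stratify` is a standard `train_test_split` parameter; the label names are assumptions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 "pos" and 5 "neg" items.
toy_X = [f"review {i}" for i in range(10)]
toy_Y = ["pos"] * 5 + ["neg"] * 5

# stratify=toy_Y keeps the 50/50 class ratio in both parts of the split.
tr_X, te_X, tr_Y, te_Y = train_test_split(
    toy_X, toy_Y, test_size=0.20, random_state=42, stratify=toy_Y
)

print("train:", Counter(tr_Y))
print("test:", Counter(te_Y))
```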
# feature extraction: TF-IDF, fit on the training text only
max_feature_num = 6000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the fitted vectorizer so the test features share the training vocabulary and IDF weights
test_vecs = train_vectorizer.transform(test_text)
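The difference between fitting and transforming is easier to see on a toy corpus. In this sketch the three training phrases and the test phrase are made up. `fit_transform` learns the vocabulary and IDF weights from the training text, and `transform` applies them to unseen text, ignoring words it has never seen.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
toy_train = ["اچھا فون", "خراب سروس", "اچھا کیمرہ"]
toy_test = ["اچھا سروس نیا"]  # "نیا" does not occur in the training text

vectorizer = TfidfVectorizer(max_features=6000)

# Learns the vocabulary and the IDF weights from the training text.
toy_train_vecs = vectorizer.fit_transform(toy_train)

# Applies the learned vocabulary and weights; nothing is re-learned here.
toy_test_vecs = vectorizer.transform(toy_test)

print(sorted(vectorizer.vocabulary_))
print(toy_train_vecs.shape, toy_test_vecs.shape)
```

Both matrices have the same number of columns, which is what the classifier needs in order to score test rows with weights learned on training rows.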
Initialize and train the model.
# train model
clf = svm.LinearSVC()
clf.fit(train_vecs, train_labels)
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
SVM gives the following results:
acc 0.822863403944485
precision 0.8242985375774442
rec 0.819176278364857
f1 0.8214720755500949
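The precision, recall and F1 values above are macro averages: each metric is computed per class and the per-class values are then averaged with equal weight. The sketch below spells that out in plain Python on made-up labels. The function name `macro_scores` and the label names are hypothetical, and it is meant as an illustration of the averaging, not a replacement for `precision_recall_fscore_support`.

```python
def macro_scores(y_true, y_pred):
    """Hypothetical helper: unweighted mean of per-class precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Made-up labels and predictions, for illustration only.
toy_true = ["pos", "pos", "pos", "neg", "neg"]
toy_pred = ["pos", "pos", "neg", "neg", "pos"]

macro_p, macro_r, macro_f1 = macro_scores(toy_true, toy_pred)
print(macro_p, macro_r, macro_f1)
```

Because every class counts equally, a macro average drops noticeably when a small class is predicted poorly, even if overall accuracy stays high.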
Save the model for later use.
# save model and other necessary modules
all_info_want_to_save = {
    'model': clf,
    # store the fitted vectorizer itself; a new one built from the vocabulary alone has no IDF weights
    'vectorizer': train_vectorizer
}
# the with-block closes the file even if pickling fails
with open("../models/sentiment_urdu_svm.pickle", "wb") as save_file:
    pickle.dump(all_info_want_to_save, save_file)
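For completeness, here is a hedged sketch of the reverse direction: restoring the pickled dictionary and classifying a new sentence. It trains a tiny model on made-up data and round-trips it through `pickle.dumps` and `pickle.loads` in memory instead of a file, so it runs on its own. It assumes the stored vectorizer is a fitted one, and the label names are assumptions.

```python
import pickle
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training data, for illustration only.
toy_text = ["اچھا فون", "بہت اچھا کیمرہ", "خراب سروس", "بہت خراب بیٹری"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_vectorizer = TfidfVectorizer()
toy_clf = svm.LinearSVC()
toy_clf.fit(toy_vectorizer.fit_transform(toy_text), toy_labels)

# Round trip through pickle in memory; a file opened in "rb" mode works the same way with pickle.load.
blob = pickle.dumps({"model": toy_clf, "vectorizer": toy_vectorizer})
restored = pickle.loads(blob)

# Vectorize new text with the restored vectorizer, then predict with the restored model.
new_sentences = ["کیمرہ اچھا ہے"]
new_vecs = restored["vectorizer"].transform(new_sentences)
prediction = restored["model"].predict(new_vecs)
print(prediction)
```

New text should go through the same cleaning function as the training text before it is vectorized. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.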
I am unable to locate the dataset. Can anybody help with this?

This dataset is not publicly available.

Hello Sir,
Can you please share the "products_sentiment_urdu.csv" dataset by email?
ehtishamrehman1@gmail.com
Regards,
Ehtisham

Download from here:
https://github.com/mirfan899/Urdu