
Sentiment Analysis of Products and Services in Pakistan

 This blog post is about a dataset of Urdu reviews of products and services offered in Pakistan, and a sentiment analysis built on it. It's the first step toward sentiment analysis of customer reviews for the manufacturing industry. It took me some time to build this dataset with the help of a few students; the dataset is available at https://github.com/mirfan899/Urdu. We used reviews of a range of products and services for the analysis.


Let's begin with the implementation of SVM for sentiment analysis.
Import the necessary packages.
import re
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn import svm
Read the dataset and convert it to a list for further processing.
raw_data = pd.read_csv('../data/products_sentiment_urdu.csv')
raw_data.head()
# check the size of the data and its class distribution
print(raw_data.shape)
print(raw_data['sentiment'].value_counts())
sentences = raw_data['sentence'].tolist()
sentiments = raw_data['sentiment'].tolist()
Show a plot of the sentiment class distribution:
sns.countplot(x='sentiment', data=raw_data)
plt.show()
Now clean the text with the following function.
# text cleaning and pre-processing
def delete_urdu_english_symbols(sentences):
    cleaned = []
    for sentence in sentences:
        # remove English and Urdu punctuation ([ and ] must be escaped inside the character class)
        text = re.sub(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~؛؟’‘٭ء،۔]+""", " ", sentence)
        # collapse multiple spaces
        text = re.sub(r" +", " ", text)
        text = text.split(" ")
        # drop empty tokens left over after the substitutions
        text = [t.strip() for t in text if t.strip()]
        cleaned.append(" ".join(text))
    return cleaned


X = delete_urdu_english_symbols(sentences)
Y = sentiments
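As a quick sanity check, here is the cleaner applied to a made-up review (the sentence is illustrative and not taken from the dataset):
# illustrative sentence ("This product is very good!"), not from the dataset
sample = ["یہ پروڈکٹ بہت اچھی ہے!"]
print(delete_urdu_english_symbols(sample))
# -> ['یہ پروڈکٹ بہت اچھی ہے']  (punctuation stripped)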
Split the data into an 80/20 train/test ratio and convert the text into TF-IDF vectors. You can tune the model by adjusting max_feature_num and the train/test split ratio; a small grid-search sketch follows the code below.
# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)

# training: tf-idf + linear SVM
max_feature_num = 6000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the fitted vectorizer so the test vectors share the training vocabulary and IDF weights
test_vecs = train_vectorizer.transform(test_text)
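If you want to tune max_feature_num more systematically, here is a minimal sketch using scikit-learn's GridSearchCV over a pipeline. The candidate values in param_grid are illustrative assumptions, not from the original experiment.
# minimal tuning sketch; the max_features candidates are illustrative
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', svm.LinearSVC()),
])
param_grid = {'tfidf__max_features': [2000, 4000, 6000, 8000]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')
search.fit(train_text, train_labels)
print(search.best_params_, search.best_score_)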
Initialize and train the model.
# train model
clf = svm.LinearSVC()
clf.fit(train_vecs, train_labels)
Now predict the sentiments on the test set and compute the evaluation metrics.
test_pred = clf.predict(test_vecs)
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
SVM gives the following results:
acc 0.822863403944485
precision 0.8242985375774442
rec 0.819176278364857
f1 0.8214720755500949
Not bad! We got a macro F1 score of about 0.82 with this very simple SVM model.
Save the model for later use.
# save the model and the fitted vectorizer
all_info_want_to_save = {
    'model': clf,
    'vectorizer': train_vectorizer  # keep the fitted vectorizer, not a fresh unfitted one
}
with open("../models/sentiment_urdu_svm.pickle", "wb") as save_path:
    pickle.dump(all_info_want_to_save, save_path)
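To use the saved model later, load the pickle and transform new text with the saved vectorizer. A minimal sketch, assuming the model was saved as above; the review below is made up for illustration.
# load the saved model and score a new (made-up) review
with open("../models/sentiment_urdu_svm.pickle", "rb") as f:
    saved = pickle.load(f)
new_review = delete_urdu_english_symbols(["سروس بہت خراب تھی"])  # "the service was very bad"
vec = saved['vectorizer'].transform(new_review)
print(saved['model'].predict(vec))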
And that's pretty much it. If you have any questions related to the article, feel free to ask.



