
Urdu Sentiment Classification


Sentiment analysis is a classic task in NLP. There is a lot of research on it for other languages, but comparatively little for Urdu. Although some papers on Urdu sentiment analysis exist, they rarely provide the data or source code needed to reproduce their results. This is my first attempt to run logistic regression on an Urdu dataset.

Let's start coding for logistic regression. I'm using Urdu Corpus V1 for this tutorial. Here is what the data looks like.
# load data and take a quick look
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

raw_data = pd.read_csv('../data/sentiment_urdu.csv')
raw_data.head(5)
The original dataset has 3 classes, but one class has only 20 records. I've removed it because such a small class would cause problems for classification. Here is what the bar chart looks like with all three classes.
# check the size of the data and its class distribution
sentences = raw_data['sentence'].tolist()
sentiments = raw_data['sentiment'].tolist()

sns.countplot(x='sentiment', data=raw_data)
plt.show()

Sentiment Count for each class

I've removed the "O" class from the dataset with the following code.

indexNames = raw_data[raw_data['sentiment'] == "O"].index
raw_data.drop(indexNames, inplace=True)
sns.countplot(x='sentiment', data=raw_data)
plt.show()

Sentiment count for each class after removing "O"

Now, let's perform some pre-processing on the data for better results. Currently, I'm only removing English text, symbols, and numbers.

# text cleaning and pre-processing:
def delete_urdu_english_symbols(sentences):
    cleaned = []
    for sentence in sentences:
        text = re.sub(r"\d+", " ", sentence)
        # English punctuations
        text = re.sub(r"""[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+""", " ", text)
        # Urdu punctuations
        text = re.sub(r"[:؛؟’‘٭ء،۔]+", " ", text)
        # Arabic-Indic digits
        text = re.sub(r"[٠١٢٣٤٥٦٧٨٩]+", " ", text)
        text = re.sub(r"[^\w\s]", " ", text)
        # Remove English characters and digits.
        text = re.sub(r"[a-zA-Z0-9]+", " ", text)
        # Collapse multiple spaces.
        text = re.sub(r" +", " ", text)
        text = text.split(" ")
        # Drop empty tokens.
        text = [t.strip() for t in text if t.strip()]
        cleaned.append(" ".join(text))
    return cleaned

X = delete_urdu_english_symbols(sentences)
Y = sentiments
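
As a quick sanity check, here is the cleaner applied to a made-up sentence that mixes Urdu with English words, digits, and punctuation (the example text is mine, not from the corpus):

# quick check of the cleaner on a made-up mixed Urdu/English sentence
print(delete_urdu_english_symbols(["یہ فلم بہت اچھی تھی! Great movie 100%؟"]))
# expected output (roughly): ['یہ فلم بہت اچھی تھی']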

Now the important part: split the data into training and test sets. I'm using scikit-learn's train_test_split function for convenience.

# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)
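
Since the remaining classes are not perfectly balanced, it can also help to pass stratify=Y so that both splits keep the same class proportions. This is just an optional variation; the results below come from the plain split above.

# optional: a stratified split keeps class proportions identical in train and test
train_text, test_text, train_labels, test_labels = train_test_split(
    X, Y, test_size=0.20, random_state=42, stratify=Y)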
Here is the classifier configuration. 5000 max features is just a rough guess; you can use a different size if you like. I'm feeding TF-IDF features to LogisticRegression. The original data has three classes (two after dropping "O"), and LogisticRegression can handle multi-class data, so let's see how it performs on the Urdu dataset.
# training: tf-idf + logistic regression
max_feature_num = 5000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# transform (not fit) the test text with the vectorizer fitted on the training data,
# so the test set uses the same vocabulary and IDF weights
test_vecs = train_vectorizer.transform(test_text)

# train model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(train_vecs, train_labels)

# test model
test_pred = clf.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
This first attempt is not bad: logistic regression gives us nearly 84% accuracy. You may get better classification results with an SVM. I will try different approaches for this task in the future.
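
As a rough, untested sketch of that idea, a linear SVM from scikit-learn can be trained on the exact same TF-IDF features:

# sketch: linear SVM on the same TF-IDF features (not tuned or benchmarked here)
from sklearn.svm import LinearSVC

svm_clf = LinearSVC().fit(train_vecs, train_labels)
svm_pred = svm_clf.predict(test_vecs)
print('svm acc', accuracy_score(test_labels, svm_pred))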

Now save the model for later use. I'm saving both the model and the fitted TF-IDF vectorizer.
import pickle

# save model and other necessary modules
all_info_want_to_save = {
    'model': clf,
    # save the fitted vectorizer so new text is transformed with the same vocabulary and IDF weights
    'vectorizer': train_vectorizer
}
with open("sentiment_urdu_logistic_regression.pickle", "wb") as save_path:
    pickle.dump(all_info_want_to_save, save_path)
And that's it. If you want to test the model, use the following code for it.
import pickle

# reload the saved model and use it to make predictions on new test text
# adjust this code to point at your own saved model/components
def test_trained_model(model_path, test_text):
    with open(model_path, "rb") as f:
        saved_model_dic = pickle.load(f)
    saved_clf = saved_model_dic['model']
    saved_vectorizer = saved_model_dic['vectorizer']
    print(len(saved_vectorizer.vocabulary_))
    # transform (not fit_transform) so the saved vocabulary and IDF weights are reused
    new_test_vecs = saved_vectorizer.transform(test_text)
    return saved_clf.predict(new_test_vecs)


# load sample test data (adjust the file path and column names to match your own data)
import pandas as pd
test_data = pd.read_csv('coursework1_train.csv')
test_text = test_data['text'].tolist()[-5000:]
test_labels = test_data['sentiment'].tolist()[-5000:]

print('test data size', len(test_labels))

# test model
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
new_test_pred = test_trained_model("sentiment_urdu_logistic_regression.pickle", test_text)
acc = accuracy_score(test_labels, new_test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, new_test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
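
As a final illustration, the reloaded model can also classify a single new sentence. The example sentence here is made up, and it goes through the same cleaning function first:

# predict the sentiment of one made-up sentence with the saved model
new_sentence = delete_urdu_english_symbols(["یہ فلم بہت اچھی تھی"])
print(test_trained_model("sentiment_urdu_logistic_regression.pickle", new_sentence))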




