Let's start coding logistic regression. I'm using Urdu Corpus V1 for this tutorial. Here is what the data looks like.
# load data and take a quick look
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
raw_data = pd.read_csv('../data/sentiment_urdu.csv')
raw_data.head(5)
The original dataset has 3 classes, but one class has only 20 records. I've removed it because it would cause issues for classification. Here is what the bar chart looks like with all three classes.
# check the size of the data and its class distribution
sentences = raw_data['sentence'].tolist()
sentiments = raw_data['sentiment'].tolist()
sns.countplot(x='sentiment', data=raw_data)
plt.show()
Sentiment Count for each Class
I've removed the "O" class from the dataset with the following code.
indexNames = raw_data[raw_data['sentiment'] == "O"].index
raw_data.drop(indexNames, inplace=True)
sns.countplot(x='sentiment', data=raw_data)
plt.show()
Sentiment Count for each Class
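The same filtering can be written without hard-coding the label. The following is a minimal sketch, assuming a toy DataFrame with the same `sentiment` column. The helper `drop_rare_classes`, the `min_count` threshold and the toy rows are hypothetical and only illustrate the idea; they are not taken from the Urdu corpus.

```python
import pandas as pd

# Toy data standing in for the real corpus (made-up rows and labels).
toy = pd.DataFrame({
    "sentence": ["a", "b", "c", "d", "e", "f", "g"],
    "sentiment": ["P", "P", "P", "N", "N", "N", "O"],
})

def drop_rare_classes(df, label_col="sentiment", min_count=2):
    """Return a copy of df without the classes that have fewer than min_count rows."""
    counts = df[label_col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[label_col].isin(keep)].reset_index(drop=True)

filtered = drop_rare_classes(toy)
print(filtered["sentiment"].value_counts())
```

On the real data you would pass `raw_data` and pick a threshold that fits the class sizes you see in the bar chart.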
Now, perform pre-processing on the data for better results. Currently, I'm only removing English text, symbols and numbers.
# text cleaning and pre-processing:
def delete_urdu_english_symbols(sentences):
    cleaned = []
    for sentence in sentences:
        # Western digits
        text = re.sub(r"\d+", " ", sentence)
        # English punctuation
        text = re.sub(r"""[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+""", " ", text)
        # Urdu punctuation
        text = re.sub(r"[:؛؟’‘٭ء،۔]+", " ", text)
        # Arabic numbers
        text = re.sub(r"[٠١٢٣٤٥٦٧٨٩]+", " ", text)
        # any remaining non-word, non-space symbols
        text = re.sub(r"[^\w\s]", " ", text)
        # remove English characters and numbers (A-Z, not A-z)
        text = re.sub(r"[a-zA-Z0-9]+", " ", text)
        # collapse multiple spaces
        text = re.sub(r" +", " ", text)
        text = text.split(" ")
        # drop empty tokens left over after cleaning
        text = [t.strip() for t in text if t.strip()]
        cleaned.append(" ".join(text))
    return cleaned
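# read the columns again so that the dropped "O" rows are not included
X = delete_urdu_english_symbols(raw_data['sentence'].tolist())
Y = raw_data['sentiment'].tolist()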
# Feel free to use different ratios to split the data.
train_text, test_text, train_labels, test_labels = train_test_split(X, Y, test_size=0.20, random_state=42)
# training: tf-idf + logistic regression
max_feature_num = 5000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
# reuse the vocabulary and IDF weights learned on the training text; do not refit on test data
test_vecs = train_vectorizer.transform(test_text)
# train model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(train_vecs, train_labels)
# test model
test_pred = clf.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
The first attempt is not bad. It gives us nearly 84% accuracy with LogisticRegression. An SVM may give better classification results. I will use different approaches in the future for this purpose.
Now save the model for later use. I'm saving the model as well as the TF-IDF vectorizer.
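As a rough illustration of the SVM option mentioned above, here is a minimal sketch that swaps in scikit-learn's `LinearSVC` on top of the same TF-IDF features. The tiny corpus and the "P"/"N" labels are made up (placeholder English words, not the Urdu data), and whether an SVM actually beats logistic regression on the Urdu corpus is something you would need to measure yourself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up corpus; placeholder words, not taken from the Urdu dataset.
toy_train_text = ["good great nice", "great good", "nice good great",
                  "bad awful poor", "poor bad", "awful bad poor"]
toy_train_labels = ["P", "P", "P", "N", "N", "N"]
toy_test_text = ["good nice", "awful poor"]

toy_vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights on the training text only.
toy_train_vecs = toy_vectorizer.fit_transform(toy_train_text)
# Reuse them unchanged for the test text.
toy_test_vecs = toy_vectorizer.transform(toy_test_text)

svm_clf = LinearSVC().fit(toy_train_vecs, toy_train_labels)
svm_pred = svm_clf.predict(toy_test_vecs)
print(svm_pred)
```

In the tutorial's own script, the equivalent change would be to replace `LogisticRegression()` with `LinearSVC()` and keep the rest of the evaluation code as it is.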
import pickle
# save the model and the fitted vectorizer together
all_info_want_to_save = {
    'model': clf,
    # keep the fitted vectorizer so its vocabulary and IDF weights are reused at prediction time
    'vectorizer': train_vectorizer
}
with open("sentiment_urdu_logistic_regression.pickle", "wb") as save_file:
    pickle.dump(all_info_want_to_save, save_file)
And that's it. If you want to test the model, use the following code.
import pickle
# reload your model and use it to make predictions for test text
# you should adjust the code so as to load your saved model/components
def test_trained_model(model_path, test_text):
    with open(model_path, "rb") as model_file:
        saved_model_dic = pickle.load(model_file)
    saved_clf = saved_model_dic['model']
    saved_vectorizer = saved_model_dic['vectorizer']
    print(len(saved_vectorizer.vocabulary_))
    # transform only: refitting here would replace the IDF weights learned on the training data
    new_test_vecs = saved_vectorizer.transform(test_text)
    return saved_clf.predict(new_test_vecs)
# load sample test data (replace the file name and column names with those of your own data)
import pandas as pd
test_data = pd.read_csv('coursework1_train.csv')
test_text = test_data['text'].tolist()[-5000:]
test_labels = test_data['sentiment'].tolist()[-5000:]
print('test data size', len(test_labels))
# test model
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
new_test_pred = test_trained_model("sentiment_urdu_logistic_regression.pickle", test_text)
acc = accuracy_score(test_labels, new_test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, new_test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)
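An alternative to saving the model and the vectorizer as two separate entries is to bundle them in a scikit-learn `Pipeline` and pickle that single object, so the vectorizer can never be refit or swapped by accident. The following is a minimal sketch under that assumption. The toy corpus, the "P"/"N" labels and the file name `toy_sentiment_pipeline.pickle` are made up for illustration and are not part of the tutorial's data.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; placeholder words, not taken from the Urdu dataset.
toy_text = ["good great nice", "great good", "nice good great",
            "bad awful poor", "poor bad", "awful bad poor"]
toy_labels = ["P", "P", "P", "N", "N", "N"]

# The pipeline fits the vectorizer and the classifier in one call.
pipe = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
pipe.fit(toy_text, toy_labels)

# Save the whole fitted pipeline to a temporary location (hypothetical file name).
path = os.path.join(tempfile.gettempdir(), "toy_sentiment_pipeline.pickle")
with open(path, "wb") as f:
    pickle.dump(pipe, f)

# Load it back and predict directly on raw text.
with open(path, "rb") as f:
    loaded = pickle.load(f)

new_text = ["good nice", "awful poor"]
loaded_pred = loaded.predict(new_text)
print(loaded_pred)
```

With this layout the loading code only needs `loaded.predict(texts)`; the cleaning function would still have to be applied to the raw sentences first. As with any pickle file, only load files you created yourself or otherwise trust.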
I'm getting a "KeyError: 'sentence'" in the second part of the code.
Which dataset are you using?
Are stopwords necessary for Urdu/Pashto datasets or not?
It depends on how the text is used.
Kindly share your dataset.
Which dataset are you using in the example? Thanks.