Urdu Word and Sentence Similarity using SpaCy

Similarity is a common measure of how close two words or sentences are to each other. There are several ways to measure the similarity of two documents; the most common in NLP is cosine similarity. Cosine similarity is computed on vectors (for example, word2vec embeddings) and tells you how close two vectors are in orientation, regardless of their magnitude.
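
As a minimal sketch of the measure described above, the following computes cosine similarity for plain Python lists. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; real word vectors usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical values, not from a real model).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 even when their lengths differ, and perpendicular vectors score 0, which is why the measure is said to capture orientation.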

Some helpful links to understand the similarity concepts:
  1. https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50
  2. https://www.sciencedirect.com/topics/computer-science/cosine-similarity
The results depend mostly on the quality of the document vectors. If you want better results, train a better word2vec model. To use spaCy's similarity feature, you first need a language model with vectors (you can build one by following my article https://www.urdunlp.com/2019/08/how-to-build-urdu-language-model-in.html).

Here is how I calculated the cosine similarity of words and sentences.

import spacy

nlp = spacy.load('ur_model')

doc1 = nlp("عمران")
doc2 = nlp("عرفان")
print("Cosine Similarity of words.")

cosine_similarity = doc1.similarity(doc2)
print(cosine_similarity)

print("Cosine Similarity of sentences.")
doc3 = nlp("میں کھیلتا ہوں")
doc4 = nlp("میں کام کرتا ہوں")

cosine_similarity = doc3.similarity(doc4)
print(cosine_similarity)
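
By default, spaCy represents a `Doc` as the average of its token vectors, and `Doc.similarity` takes the cosine between those averages. The sketch below imitates that behaviour without spaCy, using hand-made one-hot vectors. The vectors, the helper names `cosine` and `average`, and the token variables are all hypothetical and only illustrate how shared words pull sentence scores upward.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def average(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Tokens from the two example sentences.
MAIN = "میں"
KHELTA = "کھیلتا"
HOON = "ہوں"
KAAM = "کام"
KARTA = "کرتا"

# Hypothetical toy vectors: every word gets its own direction.
toy_vectors = {
    MAIN:   [1.0, 0.0, 0.0, 0.0, 0.0],
    HOON:   [0.0, 1.0, 0.0, 0.0, 0.0],
    KHELTA: [0.0, 0.0, 1.0, 0.0, 0.0],
    KAAM:   [0.0, 0.0, 0.0, 1.0, 0.0],
    KARTA:  [0.0, 0.0, 0.0, 0.0, 1.0],
}

sentence_1 = [MAIN, KHELTA, HOON]
sentence_2 = [MAIN, KAAM, KARTA, HOON]

doc_vector_1 = average([toy_vectors[t] for t in sentence_1])
doc_vector_2 = average([toy_vectors[t] for t in sentence_2])

word_sim = cosine(toy_vectors[KHELTA], toy_vectors[KAAM])
sentence_sim = cosine(doc_vector_1, doc_vector_2)
print("content words:", word_sim)
print("sentences:", sentence_sim)
```

Even though the content words are completely dissimilar in this toy setup, the two shared words are enough to give the sentences a clearly positive score. With real vectors the effect varies, but it is one reason averaged sentence vectors can look more similar than expected.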
If you have any questions, feel free to ask in the comments.


Comments

  1. Salam, are there any tutorials for URDU_NLP available on YouTube?

  2. OSError: [E050] Can't find model 'ur_model'. It doesn't seem to be a Python package or a valid path to a data directory.

    Replies
    1. Download the model from here:

      https://github.com/mirfan899/Urdu

    2. How can I add ur_model in a Jupyter notebook?

    3. Just install it as a package, then load it with spaCy:
      pip install ur_model-0.0.0.tar.gz

  3. I think that with the spaCy model above, cosine similarity gives high scores even when two sentences are not very similar.

  4. Yes, to get better results you need to train a large model with vectors.
