each other. There are multiple ways to find out the similarity of two documents and the most common being used in NLP is Cosine Similarity. Cosine Similarity is counted using vectors (word2vector) and provides information about how much two vectors are close in the context of orientation.
Some helpful links to understand the similarity concepts:
- https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50
- https://www.sciencedirect.com/topics/computer-science/cosine-similarity
It mostly depends on the quality of the vectors of the documents. If you want to get better results, build a better word 2 vector model. To use the similarity feature of SpaCy, you need to build a language model (you can build a language model by following my article https://www.urdunlp.com/2019/08/how-to-build-urdu-language-model-in.html).
Here is how I've calculated the cosine similarity of the words and sentences.
Here is how I've calculated the cosine similarity of the words and sentences.
import spacy
nlp = spacy.load('ur_model')
doc1 = nlp("عمران")
doc2 = nlp("عرفان")
print("Cosine Similarity of words.")
cosine_similarity = doc1.similarity(doc2)
print(cosine_similarity)
print("Cosine Similarity of sentences.")
doc3 = nlp("میں کھیلتا ہوں")
doc4 = nlp("میں کام کرتا ہوں")
cosine_similarity = doc3.similarity(doc4)
print(cosine_similarity)
If you have any questions, feel free to ask in comments.
Salam , is there any tutorials available for URDU_NLP in youtube
ReplyDeleteWslam!
DeleteNot yet.
OSError: [E050] Can't find model 'ur_model'. It doesn't seem to be a Python package or a valid path to a data directory.
ReplyDeleteDownload model from here
Deletehttps://github.com/mirfan899/Urdu
how can add ur_model in jupyter notebook?
DeleteJust install it as package then load it with spacy.
Deletepip install ur_model-0.0.0.tar.gz
This comment has been removed by a blog administrator.
ReplyDeleteThis comment has been removed by a blog administrator.
ReplyDeleteI think by using above spacy module, cosine similarity gives high similarity results even two sentences are not much similar.
ReplyDeleteYes, to get better results you need to train large model with vectors.
ReplyDelete