spaCy makes it straightforward to add support for a new language: you follow the Adding Languages article. I've added Urdu with dictionary-based lemmatization, lexical attributes and Urdu stop words. Here is how you can use the tokenizer for Urdu.
First, install spaCy.
$ pip install spacy
Now import spacy and create a blank pipeline for Urdu. I'm using blank because there is no trained model available for Urdu yet, but tokenization support is available.
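import spacy
# Blank Urdu pipeline: tokenizer only, no trained components
nlp = spacy.blank('ur')
doc = nlp("کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔")
print("Urdu Tokenization using SpaCy")
# Iterating over the Doc yields one token at a time
for word in doc:
    print(word)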
Here is the output:
کچھ
ممالک
ایسے
بھی
ہیں
جہاں
اس
برس
روزے
کا
دورانیہ
20
گھنٹے
تک
ہے
۔
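Each token also carries lexical flags, which is a quick way to see how the tokenizer treats Urdu punctuation and digits. The snippet below is a minimal sketch, assuming a recent spaCy release in which `spacy.blank("ur")` is available; it only reads the standard `is_punct`, `like_num` and `is_stop` attributes and makes no claim about how the Urdu rules are implemented internally.

```python
import spacy

# Blank Urdu pipeline: tokenizer plus lexical attributes, no trained model.
nlp = spacy.blank("ur")
doc = nlp("کچھ ممالک ایسے بھی ہیں جہاں اس برس روزے کا دورانیہ 20 گھنٹے تک ہے۔")

# Print each token with its punctuation, number-like and stop-word flags.
for token in doc:
    print(token.text, token.is_punct, token.like_num, token.is_stop)
```

Tokens flagged by `is_punct` or `is_stop` can then be filtered out before any further processing.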
Note that Urdu has its own punctuation marks, such as ۔ and ،, and that Urdu text also uses English digits such as 12. For the sentence above, every token is split correctly. If you have any questions, feel free to ask in the comments.
Can you do a tutorial on Urdu lemmatization using spaCy, please?
Greetings, I am curious about how to create detection/annotation for numeric and date expressions in Urdu, like ۱۹۹۳, اکتوبر۳ or ۹۹روپے. Your blog uses English digits [1-9]; how about Urdu digits [۰-۹]? Please do tell me. Thanks for your blog; because of it I learned about Urdu natural language processing. Keep up the good work.
For detection or annotation you need to train an NER model.
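For the narrower task of simply finding runs of Urdu digits (not full date or currency annotation, which still needs the trained NER model mentioned above), a regular expression may be enough. This is a rough sketch using only the Python standard library; the function name `find_urdu_numbers` is made up for illustration.

```python
import re

# Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic (U+06F0-U+06F9) digits.
URDU_DIGITS = re.compile("[\u0660-\u0669\u06F0-\u06F9]+")

def find_urdu_numbers(text):
    """Return (matched_text, integer_value) pairs for each run of Urdu digits."""
    # int() accepts any Unicode decimal digit, so no manual mapping is needed.
    return [(m.group(), int(m.group())) for m in URDU_DIGITS.finditer(text)]

for text in ["۱۹۹۳", "اکتوبر۳", "۹۹روپے"]:
    print(text, find_urdu_numbers(text))
```

The matched spans could later serve as seed annotations when preparing NER training data.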
Thanks for sharing this useful information. I wanted to ask what the data type of the returned tokens is. Sorry for this question, I'm new to NLP.
It's a spaCy Doc.
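To make the types concrete: calling `nlp(...)` returns a `Doc`, and iterating over it yields `Token` objects whose `.text` is a plain Python string. A small sketch, assuming spaCy is installed with Urdu support, using a shortened version of the example sentence:

```python
import spacy
from spacy.tokens import Doc, Token

nlp = spacy.blank("ur")
doc = nlp("روزے کا دورانیہ 20 گھنٹے تک ہے۔")

print(type(doc))          # the container returned by nlp(...)
print(type(doc[0]))       # each item yielded when iterating the Doc
print(type(doc[0].text))  # the token's text as a Python str

# Plain list of strings, if that is all you need downstream.
words = [token.text for token in doc]
print(words)
```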
Hi, can you do a tutorial on Urdu text summarization using spaCy, please?
Yes, sure. I will do it in the future. I'm currently working on a Q&A system.
Hi, how can we handle missing white space between Urdu words, like روزےکادورانیہ?
Use word segmentation. This is a very difficult problem and can only be done by training a model on a large corpus.
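To illustrate what word segmentation means here, the following is a toy dictionary-based greedy longest-match sketch. It is not the corpus-trained approach recommended above, and it breaks down as soon as a word is missing from the vocabulary or a split is ambiguous. The function name `max_match` and the three-word vocabulary are assumptions made up for this example.

```python
def max_match(text, vocab):
    """Greedy longest-match segmentation against a known word list."""
    longest = max(map(len, vocab), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary hit.
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in vocab:
                break
        else:
            j = i + 1  # unknown character: emit it on its own
        words.append(text[i:j])
        i = j
    return words

# Hypothetical mini vocabulary; a real one would come from a large, clean corpus.
vocab = {"روزے", "کا", "دورانیہ"}
joined = "".join(["روزے", "کا", "دورانیہ"])  # the words with the spaces removed
print(max_match(joined, vocab))
```

A trained segmenter replaces the fixed word list with statistics learned from data, which is why a clean Urdu corpus is the first requirement.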
Can you please help with the Urdu word segmentation problem?
This is a challenging problem. You need to get a clean Urdu dataset, then I can guide you about it.
A language model for Urdu is currently unavailable in spaCy. I was wondering what else you can do with spaCy on Urdu text, other than tokenization.
Sentiment, NER, lemmatization, pretty much everything spaCy provides.