
How to build an Urdu language model in SpaCy

Urdu alphabets


SpaCy is one of the most widely used NLP libraries for building NLP and chatbot apps. Urdu, however, has few open resources for such apps: most tools are proprietary, and most data is licensed. Having added Urdu language support to SpaCy, I'm going to show you how to build an Urdu model that can be used for applications such as word and sentence similarity, chatbots, knowledge bases, etc. Follow the steps below to build the model.

Step 1: Build word frequencies for Urdu.
I've created a script that can be used to build word frequencies. There are multiple resources available for building word frequencies; you can choose whichever you want, but the output should be tab-separated in this format:
frequency    document_frequency    word
Here is the script I'm using to build word frequencies for SpaCy.
from __future__ import unicode_literals
import string
import codecs
import glob
from collections import Counter
import re
import plac
from multiprocessing import Pool
from tqdm import tqdm

# Map Extended Arabic-Indic digits (U+06F0-U+06F9, used in Urdu) to ASCII digits
table = {
    1776: 48,  # 0
    1777: 49,  # 1
    1778: 50,  # 2
    1779: 51,  # 3
    1780: 52,  # 4
    1781: 53,  # 5
    1782: 54,  # 6
    1783: 55,  # 7
    1784: 56,  # 8
    1785: 57,  # 9
}


def count_words(fpath):
    with codecs.open(fpath, encoding="utf8") as f:
        sentences = f.read()
        # sentences = sentences.translate(table)
        sentences = re.sub(r"\d+", " ", sentences)
        # English punctuations
        sentences = re.sub(r"""[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+""", " ", sentences)
        # Urdu punctuations
        sentences = re.sub(r"[:؛؟’‘٭ء،۔]+", " ", sentences)
        # Arabic numbers
        sentences = re.sub(r"[٠-٩]+", " ", sentences)
        sentences = re.sub(r"[^\w\s]", " ", sentences)
        # Remove English characters and numbers.
        sentences = re.sub(r"[a-zA-Z0-9]+", " ", sentences)
        # remove multiple spaces.
        sentences = re.sub(r" +", " ", sentences)
        words = sentences.split()
        counter = Counter(words)
    return counter
    

@plac.annotations(
    input_glob="Glob pattern matching the input text files",
    out_loc="Location of output file",
    workers=("Number of workers", "option", "n", int)
)
def main(input_glob, out_loc, workers=4):
    p = Pool(processes=workers)
    counts = p.map(count_words, tqdm(list(glob.glob(input_glob))))
    df_counts = Counter()
    word_counts = Counter()
    for wc in tqdm(counts):
        df_counts.update(wc.keys())
        word_counts.update(wc)
    with codecs.open(out_loc, "w", encoding="utf-8") as f:
        for word, df in df_counts.items():
            f.write(u"{freq}\t{df}\t{word}\n".format(word=word, df=df, freq=word_counts[word]))


if __name__ == "__main__":
    plac.call(main)
This script takes a glob pattern matching your text files (for example, "corpus/*.txt") and writes the word frequencies to a single tab-separated text file.
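To make the three-column output concrete, here is a minimal sketch that applies the same counting idea to two in-memory documents instead of files. The sample documents and the helper name `frequency_rows` are hypothetical; only the column order (total frequency, document frequency, word) follows the script above.

```python
from collections import Counter


def frequency_rows(documents):
    """Return (frequency, document_frequency, word) rows for whitespace-tokenized texts."""
    word_counts = Counter()  # total occurrences of each word
    df_counts = Counter()    # number of documents that contain each word
    for text in documents:
        counts = Counter(text.split())
        word_counts.update(counts)
        df_counts.update(counts.keys())
    return [(word_counts[w], df_counts[w], w) for w in sorted(df_counts)]


# Hypothetical sample documents, assumed to be already cleaned.
docs = ["یہ کتاب ہے", "یہ قلم ہے یہ"]
rows = frequency_rows(docs)
lines = ["{}\t{}\t{}".format(freq, df, word) for freq, df, word in rows]
print("\n".join(lines))
```

Each printed line has the same tab-separated layout that the script writes to its output file.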

Step 2: Build clusters of your data.
SpaCy can use Brown clusters when building a language model. I'm using the Brown clustering implementation at https://github.com/percyliang/brown-cluster to build clusters from my raw dataset. Clone and compile it, then build 50 clusters, which is more than enough for the Urdu language.
git clone https://github.com/percyliang/brown-cluster.git
cd brown-cluster
make
# Clusters input.txt into 50 clusters:
./wcluster --text input.txt --c 50
# Output in input-c50-p1.out/paths
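According to the brown-cluster README, each line of the `paths` file holds a cluster bit string, a word, and the word's count, separated by tabs. The sketch below reads that layout from an in-memory string so you can inspect which words share a cluster. The sample lines and the helper names `parse_paths` and `group_by_cluster` are hypothetical.

```python
from collections import defaultdict


def parse_paths(text):
    """Parse brown-cluster `paths` content into {word: (bit_path, count)}."""
    clusters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        bit_path, word, count = line.split("\t")
        clusters[word] = (bit_path, int(count))
    return clusters


def group_by_cluster(clusters):
    """Group words by their cluster bit string."""
    grouped = defaultdict(list)
    for word, (bit_path, _count) in clusters.items():
        grouped[bit_path].append(word)
    return dict(grouped)


# Hypothetical excerpt of a paths file (not real output).
sample = "0010\tکتاب\t12\n0010\tقلم\t7\n110\tہے\t95\n"
clusters = parse_paths(sample)
grouped = group_by_cluster(clusters)
for bit_path, words in grouped.items():
    print(bit_path, words)
```

To use it on real output, read `input-c50-p1.out/paths` with UTF-8 encoding and pass the text to `parse_paths`.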
Step 3: Dataset for the Urdu language
We also need an annotated dataset for the Urdu language to build the model. You can use your own data if it is annotated with Universal Dependencies relations and POS tags. For the initial model, I'm using the https://github.com/UniversalDependencies/UD_Urdu-UDTB dataset.
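Universal Dependencies treebanks are distributed in CoNLL-U format: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with `#`, and a blank line between sentences. Here is a minimal sketch for reading that format before training. The sample sentence and its annotations are hypothetical, not taken from the treebank, and `read_conllu` is an illustrative helper.

```python
def read_conllu(text):
    """Yield sentences as lists of (form, upos, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        sentence.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Hypothetical three-token sentence with made-up annotations.
token_rows = [
    ["1", "یہ", "یہ", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "کتاب", "کتاب", "NOUN", "_", "_", "0", "root", "_", "_"],
    ["3", "ہے", "ہے", "AUX", "_", "_", "2", "cop", "_", "_"],
]
sample = "# text = یہ کتاب ہے\n" + "\n".join("\t".join(r) for r in token_rows) + "\n\n"
sentences = list(read_conllu(sample))
print(sentences[0])
```

The train command in Step 5 expects JSON files; in SpaCy 2.x those are typically produced from the `.conllu` files with `python -m spacy convert`.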

Step 4: Word to vector representation
Word2vec is a well-established algorithm for learning word vectors, which give the model a lot of information about words. There are multiple resources available on the internet, and you can choose any of them to build the vectors; SpaCy can load these vectors and integrate them with the model. Some common options are gensim (https://radimrehurek.com/gensim/models/word2vec.html), TensorFlow (https://www.tensorflow.org/tutorials/representation/word2vec), or sense2vec from the SpaCy developers (https://github.com/explosion/sense2vec). Here is how I did it.
from __future__ import print_function, unicode_literals, division

import codecs
import io
import logging
import re
from os import path
import os
import plac
try:
    import ujson as json
except ImportError:
    import json
from gensim.models import Word2Vec
from preshed.counter import PreshCounter

logger = logging.getLogger(__name__)


class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()
        self.strings = {}
        self.min_freq = min_freq

    def count_doc(self, doc):
        # Get counts for this document
        for word in doc:
            self.counts.inc(word.orth, 1)
        return len(doc)

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with io.open(text_loc, 'r', encoding='utf8', errors='ignore') as file_:
                text = file_.read()
            yield text


def iter_dir(loc):
    for fn in os.listdir(loc):
        if path.isdir(path.join(loc, fn)):
            for sub in os.listdir(path.join(loc, fn)):
                yield path.join(loc, fn, sub)
        else:
            yield path.join(loc, fn)

@plac.annotations(
    in_dir=("Location of input directory"),
    out_loc=("Location of output file"),
    n_workers=("Number of workers", "option", "n", int),
    size=("Dimension of the word vectors", "option", "d", int),
    window=("Context window size", "option", "w", int),
    min_count=("Min count", "option", "m", int),
    negative=("Number of negative samples", "option", "g", int),
    nr_iter=("Number of iterations", "option", "i", int),
)
def main(in_dir, out_loc, negative=5, n_workers=4, window=5, size=100, min_count=5, nr_iter=2):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    corpus = Corpus(in_dir)
    sentences = []
    for text_no, text_loc in enumerate(iter_dir(corpus.directory)):
        file_sentences = []
        with codecs.open(text_loc, "r", encoding='utf-8', errors='ignore') as file_:
            lines = file_.readlines()
            for line in lines:
                text = re.sub(r"\d+", " ", line)
                # English punctuations
                text = re.sub(r"""[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]+""", " ", text)
                # Urdu punctuations
                text = re.sub(r"[:؛؟’‘٭ء،۔]+", " ", text)
                # Arabic numbers
                text = re.sub(r"[٠-٩]+", " ", text)
                text = re.sub(r"[^\w\s]", " ", text)
                # Remove English characters and numbers.
                text = re.sub(r"[a-zA-Z0-9]+", " ", text)
                # remove multiple spaces.
                text = re.sub(r" +", " ", text)
                # split() without arguments drops the empty strings and newlines
                # that split(" ") would leave in the vocabulary.
                file_sentences.extend(text.split())
        sentences.append(file_sentences)

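    # Passing `sentences` to the constructor builds the vocabulary and trains
    # the model, so no separate model.train() call is needed.
    # Note: `size` and `iter` are the gensim 3.x names; gensim 4.x renamed
    # them to `vector_size` and `epochs`.
    model = Word2Vec(
        sentences=sentences,
        size=size,
        window=window,
        min_count=min_count,
        workers=n_workers,
        sample=1e-5,
        negative=negative,
        iter=nr_iter,  # use the command-line option instead of a hard-coded value
    )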
    model.wv.save_word2vec_format(out_loc, binary=False)

if __name__ == '__main__':
    plac.call(main)
Now we have all the necessary files and dataset to build a language model.
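Before moving on, it can help to sanity-check the vectors file. With `binary=False`, `save_word2vec_format` writes plain text: a header line with the vocabulary size and the vector dimension, then one line per word followed by its space-separated components. The sketch below parses that layout in pure Python and compares two words by cosine similarity. The three tiny vectors are made up for illustration, and the helper names are hypothetical.

```python
import math


def parse_word2vec_text(text):
    """Parse word2vec text format into {word: [float, ...]}."""
    lines = text.strip().splitlines()
    vocab_size, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        values = [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for {!r}".format(parts[0]))
        vectors[parts[0]] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the file body")
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional vectors, only to show the file layout.
sample = "3 2\nکتاب 1.0 0.0\nقلم 0.8 0.6\nہے 0.0 1.0\n"
vectors = parse_word2vec_text(sample)
similarity = cosine(vectors["کتاب"], vectors["قلم"])
print(round(similarity, 3))
```

For a real file, gensim's `KeyedVectors.load_word2vec_format` reads the same format and is the more practical choice; the pure-Python version is here only to show what the file contains.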

Step 5: Train and build the model
SpaCy provides CLI commands for its core functionality. Here are the two commands I used to build the model.

python -m spacy init-model ur output/model/ur_model_sm/ -f urdu/freqs/urdu.freqs -c urdu/cluster/input-c50-p1.out/paths -v urdu/vectors/urdu_w2v.txt
python -m spacy train ur output/model/final_model/ urdu/data/universal_dependencies_urdu/ur-ud-train.json urdu/data/universal_dependencies_urdu/ur-ud-dev.json -p 'tagger,parser' -v output/model/ur_model_sm/
And that's it. These commands will build the Urdu language model. You can also package the model with the CLI for use in your Urdu NLP apps. Here are the commands for packaging and installing the built model.
python -m spacy package /input /output
cd /output/ur_model_sm-0.0.0
python setup.py sdist
pip install dist/ur_model_sm-0.0.0.tar.gz
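Once the package is installed, a quick way to check it is to load it and print the predicted tags and dependency labels. This is a minimal sketch that assumes the package name is `ur_model_sm`, as the directory name above suggests; `format_tokens` and `demo` are hypothetical helpers, and the sample sentence is arbitrary.

```python
def format_tokens(doc):
    """Return one tab-separated line per token: text, tag, dependency label."""
    return ["{}\t{}\t{}".format(tok.text, tok.tag_, tok.dep_) for tok in doc]


def demo(model_name="ur_model_sm", text="یہ کتاب ہے"):
    """Load the installed model and print its analysis of `text`."""
    # Imported here so format_tokens can be used without SpaCy installed.
    import spacy

    nlp = spacy.load(model_name)  # assumes the package installed with pip above
    for line in format_tokens(nlp(text)):
        print(line)
```

Call `demo()` from a Python shell after installing the package. If the tags and labels come out empty, the tagger and parser were probably not saved with the model.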
Feel free to ask questions.

Comments

  1. I have trained my own model of urdu in spacy. Can you help me how to test that model?

  2. Bro i am new to NLP. can you tell me how to use fastext vec to create spacy model! for urdu

    Replies
    1. just replace the vector file in init spacy command.

  3. What files are you using in step1 script!!!!!!!!

  4. When I try to link the urdu model from git (https://github.com/mirfan899/Urdu/tree/master/spacy) using the following command:

    python -m spacy link ur-model ur

    the following error is thrown:
    Can't locate model data
    The data should be located in ur-model

  5. How to use dataset having Conell format, (single token with a tag by tab seperated), Following is an example.

    اہور B-LOC
    ( O
    این O
    این O
    آئی O
    ) O
    اداکار O
    وسیم B-PER
    عباس I-PER
    بھی O
    عارضی O
    طور O
    پر O
    کراچی B-LOC
    شفٹ O
    ہو O
    گئے O

  6. You have to convert the dataset to spacy format for training.

  7. Hello dear,

    I am trying to replicate this article. Could you please guide me about using Brown Cluster Algorithm? I windows 10 OS and when I compile it, it gives me error message
    The system cannot find the path specified.
    g++ -Wall -g -std=c++0x -O3 -o wcluster basic/opt.o -lpthread
    c:/mingw/bin/../lib/gcc/mingw32/9.2.0/../../../../mingw32/bin/ld.exe: cannot find -lpthread
    collect2.exe: error: ld returned 1 exit status
    make: *** [wcluster] Error 1

  8. It's related to GCC compiler. Somehow MingW can't find the path to libraries. You need to reinstall it or install visual studio.

  9. Hye Sir, can we use it to build a model that converts anything written in Urdu to English?

  10. Its not a good idea to use this model for language translation. You should use DL model for language translation.

  11. Use this for translation
    https://nlp.johnsnowlabs.com/2021/01/03/translate_ur_en_xx.html

  12. How to do a sentence segmentation in urdu?

    Replies
    1. You need to train a model using sentencepiece.
      check it out. https://github.com/google/sentencepiece

  13. I need to find similarity between two urdu news And then the similarity for all the news available in database, do you have any idea how can this be done?

  14. Yes, sure. We can use BERT to find similar sentences.

  15. can you please share the txt file used by the script for building word frequencies for spacy

  16. You can use https://github.com/mirfan899/Urdu repository for this. Use counter corpus.

  17. bro can i get ur whatsapp number ??

  18. i worked on urdu news recommdation i need ur help contact me on 03045235184 whatsapp

  19. Hi,
    Thank you for a great post. Can you please let me know step 5 for spacy 3.x as init_model is no longer supported by the latest versions. Thank you

    Replies
    1. You need to follow the latest training documentation.
      https://spacy.io/usage/training

    2. Thanks, I'll have a go. BTW, I wish the spaCy documentation was as good as the code.

  20. Hi Sir!
    Can you please explain Step #05 that how to execute it?
    Like I want to know about what parameters we are passing to it

    Thank you for your time.

  21. Check this.

    https://github.com/mirfan899/SpaCy3Urdu

  22. Hi dear.. i want urdu text summarization code

  23. can we use TensorFlow to make a pos tagging model from scratch?

  24. sir i am new in NLP and want to work on sentence similarity.
    suppose i have a sentence "وزیراعظم نے سرمایہ کاروں کو بھرپور سہولتیں فراہم کرنے کی ہدایت کی"and i want to shuffle each word and make sentences and then i want to extract the information which sentence is giving me a proper sense so how we can find it?

  25. You need to train a new model for this type of problem.

  26. Traceback (most recent call last):
    File "<frozen runpy>", line 189, in _run_module_as_main
    File "<frozen runpy>", line 148, in _get_module_details
    File "<frozen runpy>", line 112, in _get_module_details
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\__init__.py", line 14, in <module>
    from . import pipeline # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\pipeline\__init__.py", line 1, in <module>
    from .attributeruler import AttributeRuler
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\pipeline\attributeruler.py", line 6, in <module>
    from .pipe import Pipe
    File "spacy\pipeline\pipe.pyx", line 8, in init spacy.pipeline.pipe
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\training\__init__.py", line 11, in <module>
    from .callbacks import create_copy_from_base_model # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\training\callbacks.py", line 3, in <module>
    from ..language import Language
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\language.py", line 25, in <module>
    from .training.initialize import init_vocab, init_tok2vec
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\training\initialize.py", line 14, in <module>
    from .pretrain import get_tok2vec_ref
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\training\pretrain.py", line 16, in <module>
    from ..schemas import ConfigSchemaPretrain
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\spacy\schemas.py", line 157, in <module>
    class TokenPatternString(BaseModel):
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\pydantic\main.py", line 369, in __new__
    cls.__signature__ = ClassAttribute('__signature__', generate_model_signature(cls.__init__, fields, config))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "D:\Forbmax User Data\Zohaibb\urdu_ner\myenv\Lib\site-packages\pydantic\utils.py", line 231, in generate_model_signature
    merged_params[param_name] = Parameter(
    ^^^^^^^^^^
    File "C:\Program Files\Python311\Lib\inspect.py", line 2725, in __init__
    raise ValueError('{!r} is not a valid parameter name'.format(name))
    ValueError: 'in' is not a valid parameter name

  27. hey buddy, inform if i can create an URDU LLM Using Spacy?

