Urdu Baby Name Generation Using AI

Common Urdu Names.
Text generation is an active area of AI in which a model learns from a text corpus and then produces new text in the same style. The same approach has been used to generate books, poems, songs, and even research papers.

How do you generate short text such as names? You are in the right place: this tutorial shows how to create unique Urdu baby names. The first step is to get a list of baby names. I've written a separate tutorial on scraping baby names from a website; see Baby Names.
I've also created a Git repository, urdu-baby-names, that holds the names; check it out.

Let's start.

First, import the libraries we are going to use:
import numpy as np
import pandas as pd
from keras.callbacks import LambdaCallback
from keras.layers import LSTM, Dense
from keras.models import Sequential
Read the names file and build two dictionaries: one mapping every character that occurs in the names to an index, and one mapping each index back to its character. A "۔" (the Urdu full stop) is appended to every name to mark where it ends. I'm using boys_names.csv for this tutorial; you can use girls_names.csv for girl names.
names = pd.read_csv("../data/boys_names.csv")
full_names = names["boys_names"].tolist()
# Append the end-of-name marker to every name.
full_names = [name + "۔" for name in full_names]
# Vocabulary: every distinct character (joining with " " also adds the space).
chars = sorted(set(" ".join(full_names)))
print("total chars:", len(chars))
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for i, c in enumerate(chars)}
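To make the two dictionaries concrete, here is a minimal sketch on a toy list. The Latin-letter names in `toy_names` are hypothetical stand-ins (they are not from boys_names.csv), and "." stands in for the "۔" end marker.

```python
# Hypothetical toy data; the tutorial reads real names from boys_names.csv.
toy_names = ["ali", "bilal"]

# Append an end-of-name marker, as the tutorial does with "۔".
toy_names = [name + "." for name in toy_names]

# Joining with a space puts the space character into the vocabulary too.
toy_chars = sorted(set(" ".join(toy_names)))

toy_char_to_index = {c: i for i, c in enumerate(toy_chars)}
toy_index_to_char = {i: c for i, c in enumerate(toy_chars)}

print(toy_chars)

# Encode a name to indices and decode it back again.
encoded = [toy_char_to_index[c] for c in "ali."]
decoded = "".join(toy_index_to_char[i] for i in encoded)
print(encoded)
print(decoded)
```

The round trip from characters to indices and back is what lets the model work with numbers while the final output is still readable text.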
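Now get the length of the longest name, the number of names, and the vocabulary size.
max_char = len(max(full_names, key=len))  # longest name, end marker included

m = len(full_names)  # number of training names
char_dim = len(char_to_index)  # number of distinct characters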
Create the X and Y arrays that will hold the one-hot encoded training data: X for the input characters and Y for the target characters.
X = np.zeros((m, max_char, char_dim))
Y = np.zeros((m, max_char, char_dim))
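Fill X and Y. At each position, X holds the current character and Y holds the character that follows it, so the model learns to predict the next character.
for i in range(m):
    name = list(full_names[i])
    for j in range(len(name)):
        # Input: one-hot encoding of the character at position j.
        X[i, j, char_to_index[name[j]]] = 1
        if j < len(name) - 1:
            # Target: one-hot encoding of the character at position j + 1.
            Y[i, j, char_to_index[name[j + 1]]] = 1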
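The shift between X and Y is easier to see on a tiny example. The sketch below repeats the same loop on two hypothetical names, with "." standing in for "۔", and decodes the targets for the first name. All `toy_` names are illustrative only.

```python
import numpy as np

# Hypothetical toy data; "." stands in for the end marker "۔".
toy_names = ["ab.", "b."]
toy_chars = sorted(set("".join(toy_names)))  # ['.', 'a', 'b']
toy_index = {c: i for i, c in enumerate(toy_chars)}

toy_m = len(toy_names)
toy_max_char = max(len(n) for n in toy_names)
toy_char_dim = len(toy_chars)

toy_X = np.zeros((toy_m, toy_max_char, toy_char_dim))
toy_Y = np.zeros((toy_m, toy_max_char, toy_char_dim))
for i, name in enumerate(toy_names):
    for j, ch in enumerate(name):
        toy_X[i, j, toy_index[ch]] = 1
        if j < len(name) - 1:
            toy_Y[i, j, toy_index[name[j + 1]]] = 1

print(toy_X.shape)

# Decode the target at each timestep of "ab."; all-zero rows have no target.
targets = [toy_chars[row.argmax()] if row.any() else None for row in toy_Y[0]]
print(targets)
```

For "ab." the targets are "b", then ".", then nothing: the last character has no successor, and positions past the end of a shorter name stay all zero as padding.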
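Build and train the model: two stacked LSTM layers with 128 units each, followed by a softmax layer that outputs a probability for every character at each position.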
model = Sequential()
model.add(LSTM(128, input_shape=(max_char, char_dim), return_sequences=True))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(char_dim, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
model.fit(X, Y, batch_size=32, epochs=100, verbose=0)
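As a rough illustration of what the softmax output and the categorical cross-entropy loss compute, here is a plain NumPy sketch. It is not Keras's actual implementation; the `softmax` and `categorical_crossentropy` helpers and the toy numbers are assumptions made up for this example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(target, probs, eps=1e-7):
    """Cross-entropy per timestep for one-hot targets."""
    return -(target * np.log(np.clip(probs, eps, 1.0))).sum(axis=-1)

# Hypothetical raw scores for 2 timesteps over a 3-character vocabulary.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

# One-hot target for the first step; the second row is padding (all zeros).
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
loss = categorical_crossentropy(target, probs)

print(probs.round(3))
print(loss.round(3))
```

Each row of `probs` sums to 1, which is what makes it usable as a sampling distribution later. In this formula an all-zero target row gives a loss of zero, so padded positions add nothing to the sum.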
After training, you can generate names from the model with the following function.
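def generate_name(model):
    """Sample one name character by character; print and return it."""
    name = []
    # One-hot input, filled in one timestep at a time as characters are sampled.
    # Note: the first prediction is made from an all-zero input.
    x = np.zeros((1, max_char, char_dim))
    end = False
    i = 0

    while not end:
        # Predicted distribution over the next character at timestep i.
        # Cast to float64 and renormalize so the probabilities sum to 1.
        probs = model.predict(x, verbose=0)[0, i].astype("float64")
        probs = probs / np.sum(probs)
        index = np.random.choice(range(char_dim), p=probs)
        if i == max_char - 2:
            # Force the end marker so the name cannot outgrow the input window.
            character = "۔"
            end = True
        else:
            character = index_to_char[index]
        name.append(character)
        # Feed the sampled character back in as the next timestep's input.
        x[0, i + 1, index] = 1
        i += 1
        if character == "۔":
            end = True

    result = "".join(name)
    print(result)
    return result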

for i in range(5):
    generate_name(model)
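The sampling step in generate_name draws the next character at random according to the predicted probabilities instead of always taking the most likely one, which is why repeated calls can give different names. A minimal sketch of that step follows; the three-character vocabulary and the probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical softmax output for one timestep over a 3-character vocabulary.
toy_chars = [".", "a", "b"]
probs = np.array([0.1, 0.7, 0.2], dtype=np.float32)

# Cast to float64 and renormalize before sampling.
probs = probs.astype(np.float64)
probs /= probs.sum()

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
samples = rng.choice(len(toy_chars), size=1000, p=probs)

# Count how often each character was drawn.
counts = np.bincount(samples, minlength=len(toy_chars))
for ch, count in zip(toy_chars, counts):
    print(ch, count)
```

The most likely character is drawn most often, but the others still appear from time to time, and that is the source of variety in the generated names.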
The model generates both new, unique names and names that already appear in the file. Choose the name you like best for your baby, or suggest it to someone else.
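If you only want names that are not already in the file, you can compare the output against the training list. The sketch below assumes the generated names have been collected into a list (for example, by having generate_name return the string). The helper `split_by_novelty` and the sample data are hypothetical.

```python
def split_by_novelty(generated, known):
    """Split generated names into new ones and ones already in `known`."""
    known_set = set(known)
    novel = [n for n in generated if n not in known_set]
    seen = [n for n in generated if n in known_set]
    return novel, seen

# Hypothetical stand-ins for full_names and for collected model output.
known_names = ["ali.", "bilal."]
generated_names = ["ali.", "balil.", "bali."]

novel, seen = split_by_novelty(generated_names, known_names)
print(novel)
print(seen)
```

Both lists must use the same convention for the end marker, otherwise no generated name will ever match a known one.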
