
Classifying Indeed Jobs using DNNs



Indeed is one of the top sites for posting job ads. Companies and employers post ads on Indeed to hire people with the skills they need, and it helps job seekers find jobs that match their skills and expertise. A posting has many parameters, such as salary range, location, skills, requirements, and job type. For this classification task, I’ve chosen the requirements and location parameters to predict jobs, using the job type as the class label.

Step 1: Scraping Indeed Jobs

I’ve used Beautiful Soup to scrape Indeed for different types of jobs. Indeed uses some tricks to keep the site from being scraped. Here is how I did it.

import re
import pandas as pd
from time import sleep
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from urllib.request import urlopen
import nltk

# Download the stopword list once if it is not already available locally.
# nltk.download('stopwords')

# URL-encoded query fragments: the job type plus the location (New York).
FULL_TIME = 'full%20time&l=New%20York'
PART_TIME = 'part%20time&l=New%20York'
CONTRACT = 'contract&l=New%20York'
INTERNSHIP = 'internship&l=New%20York'
TEMPORARY = 'temporary&l=New%20York'
COMMISSION = 'commission&l=New%20York'
def get_clean_text(website):
    """
    Scrapes the job page and returns the cleaned description text.
    :param website: URL of the job posting
    :return: cleaned text, or None if the page could not be read
    """
    try:
        site = urlopen(website).read()
    except:
        return

    soup_obj = BeautifulSoup(site, features='html5lib')
    text = soup_obj.find("div", attrs={"id": "jobDescriptionText", "class": "jobsearch-jobDescriptionText"})

    if text:
        text = text.get_text()

        lines = (line.strip() for line in text.splitlines())
        text = " ".join(line for line in lines if line)
        text = text.lower().split()

        stop_words = set(stopwords.words("english"))
        text = [w for w in text if w not in stop_words]
        return " ".join(text)
    else:
        return
    

def get_jobs_by_type(job_type=None):
    """
    Extracts job descriptions from Indeed search results for the given job_type.
    :param job_type: URL-encoded query fragment (job type and location)
    :return: list of cleaned job descriptions
    """
    final_site_list = ['http://www.indeed.com/jobs?q=', job_type]
    final_site = "".join(final_site_list)

    base_url = "http://www.indeed.com"

    try:
        # Open up the front page of our search first
        html = urlopen(final_site).read()
    except:
        print("That city/state combination did not have any jobs. Exiting . . .")
        return
    soup = BeautifulSoup(html, features="html5lib")

    # Find jobs count
    num_jobs_area = soup.find(id='searchCount').text

    job_numbers = re.findall(r'\d+', num_jobs_area)

    if len(job_numbers) >= 3:
        total_num_jobs = (int(job_numbers[1]) * 1000) + int(job_numbers[2])
    else:
        total_num_jobs = int(job_numbers[1])

    # Total jobs
    print("There were", total_num_jobs, "jobs found,")
    num_pages = int(total_num_jobs / 10)
    job_descriptions = []

    for i in range(1, num_pages + 1):
        print('Getting page', i)
        start_num = str(i * 10)
        current_page = ''.join([final_site, '&start=', start_num])

        html_page = urlopen(current_page).read()

        page_obj = BeautifulSoup(html_page, features="html5lib")
        job_link_area = page_obj.find(id='resultsCol')

        job_URLS = [base_url + link.get('href') for link in
                    job_link_area.find_all('a', href=True)]

        job_URLS = list(filter(lambda x: 'clk' in x, job_URLS))

        for j in range(0, len(job_URLS)):
            final_description = get_clean_text(job_URLS[j])
            if final_description:
                job_descriptions.append(final_description)
            sleep(3)

    print("There were", len(job_descriptions), "jobs successfully found.")

    return job_descriptions
ft = get_jobs_by_type(job_type=FULL_TIME)
ft_count = ["FULL_TIME"] * len(ft)
full_time = pd.DataFrame({"description": ft, "job_type": ft_count})
full_time.to_csv("full_time.csv", index=False)

pt = get_jobs_by_type(job_type=PART_TIME)
pt_count = ["PART_TIME"] * len(pt)
part_time = pd.DataFrame({"description": pt, "job_type": pt_count})
part_time.to_csv("part_time.csv", index=False)

cont = get_jobs_by_type(job_type=CONTRACT)
cont_count = ["CONTRACT"] * len(cont)
contract = pd.DataFrame({"description": cont, "job_type": cont_count})
contract.to_csv("contract.csv", index=False)

intern = get_jobs_by_type(job_type=INTERNSHIP)
intern_count = ["INTERN"] * len(intern)
internship = pd.DataFrame({"description": intern, "job_type": intern_count})
internship.to_csv("internship.csv", index=False)

temp = get_jobs_by_type(job_type=TEMPORARY)
temp_count = ["TEMPORARY"] * len(temp)
temporary = pd.DataFrame({"description": temp, "job_type": temp_count})
temporary.to_csv("temporary.csv", index=False)

comm = get_jobs_by_type(job_type=COMMISSION)
comm_count = ["COMMISSION"] * len(comm)
commission = pd.DataFrame({"description": comm, "job_type": comm_count})
commission.to_csv("commission.csv", index=False)

frames = [full_time, part_time, contract, internship, temporary, commission]
total = pd.concat(frames)

total.to_csv("indeed_jobs.csv", index=False)
After scraping, I noticed that some of the jobs overlap across categories. For example, the part-time and temporary searches return some of the same postings. To remove these common jobs, I’ve published an article on finding and removing common rows in two different data frames.
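That article walks through the details; as a rough sketch of the idea (assuming the description text is unique enough to identify a posting, which is an assumption on my part), the overlap between two frames can be dropped like this:

# Minimal sketch: drop temporary postings whose description also appears
# in the part-time frame. Assumes the description text identifies a job ad.
import pandas as pd

part_time = pd.read_csv("part_time.csv")
temporary = pd.read_csv("temporary.csv")

temporary_only = temporary[~temporary["description"].isin(part_time["description"])]
temporary_only.to_csv("temporary.csv", index=False)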

Step 2: Preprocessing the data

Next, I applied preprocessing to the job ads to remove punctuation, numbers, URLs, and so on.

import pandas as pd
import string
import re

data = pd.read_csv("indeed_jobs.csv", encoding='utf-8')

# Remove decimal numbers and lowercase the text.
data.description = data.description.apply(lambda x: re.sub(r"\d+\.\d+", ' ', x))
data.description = data.description.apply(lambda x: x.lower())

# Strip digits and email addresses.
data.description = data.description.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
data.description = data.description.apply(lambda x: re.sub(r'[\w\.-]+@[\w\.-]+', ' ', x))

# Replace punctuation with spaces, then strip any digits that remain.
# https://gist.github.com/mirfan899/fc7cf1d1f49fbf435c43fdb12f299e60
data.description = data.description.apply(lambda x: x.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))))
data.description = data.description.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))

# Remove URLs, non-word characters, and extra whitespace.
data.description = data.description.apply(lambda x: re.sub(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""", ' ', x))
data.description = data.description.apply(lambda x: re.sub(r'[^\w]', ' ', x))
data.description = data.description.apply(lambda x: re.sub(r'\s+', ' ', x))
data.description = data.description.apply(lambda x: x.strip())

data.to_csv("indeed_jobs_cleaned.csv", index=False)

Step 3: Classification of Jobs

Now we are ready to train a model on the Indeed jobs. I’m using Keras for this task.

import pandas as pd
from keras import Sequential
from keras.layers import Dense, Activation
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelBinarizer

vocab = 15000
max_len = 200

documents = pd.read_csv("indeed_jobs_cleaned.csv")

# shuffle data
documents = documents.sample(frac=1)

employment_type = documents["job_type"].values.astype(str)
num_classes = len(set(employment_type))

descriptions = documents["description"].values.astype(str)
train_size = int(len(descriptions) * .8)

x_train_texts = descriptions[:train_size]
y_train = list(employment_type[:train_size])
x_test_texts = descriptions[train_size:]
y_test = list(employment_type[train_size:])

tokenizer = Tokenizer(num_words=vocab)
tokenizer.fit_on_texts(x_train_texts)

x_train = tokenizer.texts_to_matrix(x_train_texts, mode="tfidf")
x_test = tokenizer.texts_to_matrix(x_test_texts, mode="tfidf")

encoder = LabelBinarizer(sparse_output=False)
encoder.fit(y_train + y_test)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)

print(x_train.shape[1], y_train.shape[1])
print(x_test.shape[1], y_test.shape[1])

model = Sequential()
model.add(Dense(256, input_shape=(vocab,)))
model.add(Activation('relu'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=10,
                    validation_split=0.1)

score = model.evaluate(x_test, y_test, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
I’m getting 87% accuracy on the test data set using the MLP model.
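The per-epoch training logs come from the history object returned by model.fit. If you want to plot the curves yourself, a quick sketch looks like this (the metric keys are 'accuracy'/'val_accuracy' on recent Keras versions and 'acc'/'val_acc' on older ones):

import matplotlib.pyplot as plt

# Plot the training and validation accuracy stored in history.history.
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()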


I’ve used an MLP because I scraped only 3000 jobs, which is not enough to train RNNs such as LSTM or BiLSTM. To train an LSTM model, you should scrape at least 500 jobs for each class to get good accuracy.
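For reference, here is a minimal sketch of what an LSTM classifier could look like once enough data is available. This is not the model used in this article; it reuses the tokenizer, max_len, and one-hot labels from above, but it feeds padded integer sequences instead of a TF-IDF matrix, and the layer sizes are illustrative.

from keras import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.preprocessing.sequence import pad_sequences

# Convert the texts to padded integer sequences for the recurrent model.
x_train_seq = pad_sequences(tokenizer.texts_to_sequences(x_train_texts), maxlen=max_len)
x_test_seq = pad_sequences(tokenizer.texts_to_sequences(x_test_texts), maxlen=max_len)

lstm_model = Sequential()
lstm_model.add(Embedding(vocab, 128, input_length=max_len))
lstm_model.add(LSTM(64))
lstm_model.add(Dense(num_classes, activation='softmax'))

lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.fit(x_train_seq, y_train, batch_size=32, epochs=10, validation_split=0.1)
score = lstm_model.evaluate(x_test_seq, y_test, verbose=1)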

Results
The accuracy of the MLP model is good for a small dataset.
The loss is acceptable, but more data is needed to build a better model.
The gap between training and validation performance can be shortened by adding more data and tuning hyperparameters (a sketch of one such tweak follows below).
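As an example of the kind of tuning I mean, here is a hedged sketch that adds dropout to the same MLP to narrow that gap; the dropout rate and layer size are illustrative, not tuned values.

from keras import Sequential
from keras.layers import Dense, Dropout

# Hypothetical variant of the MLP above with dropout as a regularizer.
tuned_model = Sequential()
tuned_model.add(Dense(256, activation='relu', input_shape=(vocab,)))
tuned_model.add(Dropout(0.5))
tuned_model.add(Dense(num_classes, activation='softmax'))

tuned_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
tuned_model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.1)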

Thank you for reading the article. Hit the clap button if you liked it. If you need help with anything in this article, feel free to comment or contact me on LinkedIn.
