Indeed is a top site for posting job ads. Companies and employers post ads for open positions on Indeed to hire people with the required skills. It helps job seekers find jobs that match their skills and expertise. There are many parameters for finding a job, such as salary range, location, skills, requirements, job type, and more. For the classification of jobs, I’ve chosen the requirements and location parameters to predict jobs, using job type as the classes or labels.
Step 1: Scraping Indeed Jobs
I’ve used Beautiful Soup to scrape Indeed for different types of jobs. Indeed uses some tricks to prevent the site from being scraped. Here is how I did it.
import re
import pandas as pd
from time import sleep
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from urllib.request import urlopen
import nltk
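Note: the NLTK stopword list has to be available locally before the scraper can filter stopwords. If it has not been downloaded before, a one-time setup line like this is needed:
nltk.download("stopwords")  # downloads the English stopword list used below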
FULL_TIME = 'full%20time&l=New%20York'
PART_TIME = 'part%20time&l=New%20York'
CONTRACT = 'contract&l=New%20York'
INTERNSHIP = 'internship&l=New%20York'
TEMPORARY = 'temporary&l=New%20York'
COMMISSION = 'commission&l=New%20York'
def get_clean_text(website):
    """
    This function scrapes the page and cleans the required text.
    :param website: URL of the job posting
    :return: cleaned job description text, or None if the page could not be read
    """
    try:
        site = urlopen(website).read()
    except Exception:
        return
    soup_obj = BeautifulSoup(site, features='html5lib')
    text = soup_obj.find("div", attrs={"id": "jobDescriptionText", "class": "jobsearch-jobDescriptionText"})
    if text:
        text = text.get_text()
        # strip empty lines and join the description into a single string
        lines = (line.strip() for line in text.splitlines())
        text = " ".join(line for line in lines if line)
        # lowercase, split on whitespace, and remove English stopwords
        text = text.lower().split()
        stop_words = set(stopwords.words("english"))
        text = [w for w in text if w not in stop_words]
        return " ".join(text)
    else:
        return
def get_jobs_by_type(job_type=None):
    """
    This function extracts jobs from the Indeed result pages for the given job_type.
    :param job_type: query string for the job type and location
    :return: list of cleaned job descriptions
    """
    final_site_list = ['http://www.indeed.com/jobs?q=', job_type]
    final_site = "".join(final_site_list)
    base_url = "http://www.indeed.com"
    try:
        # Open up the front page of our search first
        html = urlopen(final_site).read()
    except Exception:
        print("That city/state combination did not have any jobs. Exiting . . .")
        return
    soup = BeautifulSoup(html, features="html5lib")
    # Find the total job count
    num_jobs_area = soup.find(id='searchCount').text
    job_numbers = re.findall(r'\d+', num_jobs_area)
    if len(job_numbers) >= 3:
        total_num_jobs = (int(job_numbers[1]) * 1000) + int(job_numbers[2])
    else:
        total_num_jobs = int(job_numbers[1])
    # Total jobs
    print("There were", total_num_jobs, "jobs found.")
    num_pages = int(total_num_jobs / 10)
    job_descriptions = []
    for i in range(1, num_pages + 1):
        print('Getting page', i)
        start_num = str(i * 10)
        current_page = ''.join([final_site, '&start=', start_num])
        html_page = urlopen(current_page).read()
        page_obj = BeautifulSoup(html_page, features="html5lib")
        job_link_area = page_obj.find(id='resultsCol')
        job_URLS = [base_url + link.get('href') for link in
                    job_link_area.find_all('a', href=True)]
        # keep only links that point to actual job postings
        job_URLS = list(filter(lambda x: 'clk' in x, job_URLS))
        for j in range(0, len(job_URLS)):
            final_description = get_clean_text(job_URLS[j])
            if final_description:
                job_descriptions.append(final_description)
            # pause between requests so Indeed does not block the scraper
            sleep(3)
    print("There were", len(job_descriptions), "jobs successfully found.")
    return job_descriptions
ft = get_jobs_by_type(job_type=FULL_TIME)
ft_count = ["FULL_TIME"] * len(ft)
full_time = pd.DataFrame({"description": ft, "job_type": ft_count})
full_time.to_csv("full_time.csv", index=False)
pt = get_jobs_by_type(job_type=PART_TIME)
pt_count = ["PART_TIME"] * len(pt)
part_time = pd.DataFrame({"description": pt, "job_type": pt_count})
part_time.to_csv("part_time.csv", index=False)
cont = get_jobs_by_type(job_type=CONTRACT)
cont_count = ["CONTRACT"] * len(cont)
contract = pd.DataFrame({"description": cont, "job_type": cont_count})
contract.to_csv("contract.csv", index=False)
intern = get_jobs_by_type(job_type=INTERNSHIP)
intern_count = ["INTERN"] * len(intern)
internship = pd.DataFrame({"description": intern, "job_type": intern_count})
internship.to_csv("internship.csv", index=False)
temp = get_jobs_by_type(job_type=TEMPORARY)
temp_count = ["TEMPORARY"] * len(temp)
temporary = pd.DataFrame({"description": temp, "job_type": temp_count})
temporary.to_csv("temporary.csv", index=False)
comm = get_jobs_by_type(job_type=COMMISSION)
comm_count = ["COMMISSION"] * len(comm)
commission = pd.DataFrame({"description": comm, "job_type": comm_count})
commission.to_csv("commission.csv", index=False)
frames = [full_time, part_time, contract, internship, temporary, commission]
total = pd.concat(frames)
total.to_csv("indeed_jobs.csv", index=False)
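The six nearly identical blocks above could also be collapsed into a single loop. A minimal sketch, assuming the same query strings as above (the per-type CSV file names are simplified to the class labels):
# Hypothetical refactor of the repeated per-type blocks into one loop.
job_queries = {
    "FULL_TIME": FULL_TIME, "PART_TIME": PART_TIME, "CONTRACT": CONTRACT,
    "INTERN": INTERNSHIP, "TEMPORARY": TEMPORARY, "COMMISSION": COMMISSION,
}
frames = []
for label, query in job_queries.items():
    # scrape one job type and label every description with its class
    descriptions = get_jobs_by_type(job_type=query) or []
    df = pd.DataFrame({"description": descriptions, "job_type": [label] * len(descriptions)})
    df.to_csv(label.lower() + ".csv", index=False)
    frames.append(df)
pd.concat(frames).to_csv("indeed_jobs.csv", index=False)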
After scraping, I noticed that some of the jobs overlap with each other. For example, the part-time and temporary categories share some common postings. To remove these common jobs, I’ve published an article on finding and removing common jobs in two different data frames.
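As a simpler alternative, if overlapping postings end up with the exact same cleaned description, they can be dropped directly with pandas. A minimal sketch:
import pandas as pd
# drop postings whose cleaned description appears more than once,
# keeping only the first occurrence
jobs = pd.read_csv("indeed_jobs.csv")
jobs = jobs.drop_duplicates(subset="description", keep="first")
jobs.to_csv("indeed_jobs.csv", index=False)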
Step 2: Preprocessing the data
Now, I applied data preprocessing to the job ads to remove punctuation, numbers, URLs, etc.
import pandas as pd
import string
import re
data = pd.read_csv("indeed_jobs.csv", encoding='utf-8')
# remove decimal numbers such as salary figures
data.description = data.description.apply(lambda x: re.sub(r"\d+\.\d+", ' ', x))
# lowercase everything
data.description = data.description.apply(lambda x: x.lower())
# remove digits
data.description = data.description.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
# remove email addresses
data.description = data.description.apply(lambda x: re.sub(r'[\w\.-]+@[\w\.-]+', ' ', x))
# https://gist.github.com/mirfan899/fc7cf1d1f49fbf435c43fdb12f299e60
# replace punctuation with spaces
data.description = data.description.apply(lambda x: x.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))))
# remove any digits left after stripping punctuation
data.description = data.description.apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
# remove URLs
data.description = data.description.apply(lambda x: re.sub(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""", ' ', x))
# remove any remaining non-word characters
data.description = data.description.apply(lambda x: re.sub(r'[^\w]', ' ', x))
# collapse repeated whitespace and trim
data.description = data.description.apply(lambda x: re.sub(r'\s+', ' ', x))
data.description = data.description.apply(lambda x: x.strip())
data.to_csv("indeed_jobs_cleaned.csv", index=False)
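Before training, it is worth checking how many examples each class ended up with; a quick sanity check on the cleaned file:
cleaned = pd.read_csv("indeed_jobs_cleaned.csv")
print(cleaned["job_type"].value_counts())   # examples per class
print(cleaned["description"].head())        # a peek at the cleaned text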
Step 3: Classification of Jobs
Now, we are ready to train the model on the Indeed jobs. I’m using Keras to train the model for this task.
import pandas as pd
from keras import Sequential
from keras.layers import Dense, Activation
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelBinarizer
vocab = 15000
max_len = 200
documents = pd.read_csv("indeed_jobs_cleaned.csv")
# shuffle data
documents = documents.sample(frac=1)
employment_type = documents["job_type"].values.astype(str)
num_classes = len(set(employment_type))
descriptions = documents["description"].values.astype(str)
train_size = int(len(descriptions) * .8)
x_train_texts = descriptions[:train_size]
y_train = list(employment_type[:train_size])
x_test_texts = descriptions[train_size:]
y_test = list(employment_type[train_size:])
tokenizer = Tokenizer(num_words=vocab)
tokenizer.fit_on_texts(x_train_texts)
x_train = tokenizer.texts_to_matrix(x_train_texts, mode="tfidf")
x_test = tokenizer.texts_to_matrix(x_test_texts, mode="tfidf")
encoder = LabelBinarizer(sparse_output=False)
# fit on all labels once, then only transform each split so the class
# ordering stays consistent between train and test
encoder.fit(y_train + y_test)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)
print(x_train.shape[1], y_train.shape[1])
print(x_test.shape[1], y_test.shape[1])
model = Sequential()
model.add(Dense(256, input_shape=(vocab,)))
model.add(Activation('relu'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
history = model.fit(x_train, y_train,
batch_size=32,
epochs=10,
validation_split=0.1)
score = model.evaluate(x_test, y_test, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
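To classify a new posting with the trained model, the same tokenizer and label encoder can be reused. A minimal sketch (predict_job_type is a hypothetical helper, not part of the original script):
def predict_job_type(description):
    # vectorize the cleaned description exactly like the training data
    x = tokenizer.texts_to_matrix([description], mode="tfidf")
    probs = model.predict(x)
    # map the highest-probability column back to its job_type label
    return encoder.inverse_transform(probs)[0]

print(predict_job_type("seeking part time sales associate weekend shifts retail store"))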
I’m getting 87% accuracy on the test dataset using the MLP model.
I’ve used an MLP because I scraped about 3,000 jobs, which is not enough to train RNNs such as LSTM or BiLSTM. To train an LSTM model, you can scrape at least 500 jobs for each class and get good accuracy.
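For reference, a sequence model along those lines could look like the sketch below, assuming padded integer sequences (using the max_len defined earlier) instead of the tf-idf matrices used above:
from keras.layers import Embedding, LSTM
from keras.preprocessing.sequence import pad_sequences

# keep word order instead of tf-idf counts
x_train_seq = pad_sequences(tokenizer.texts_to_sequences(x_train_texts), maxlen=max_len)
x_test_seq = pad_sequences(tokenizer.texts_to_sequences(x_test_texts), maxlen=max_len)

lstm_model = Sequential()
lstm_model.add(Embedding(vocab, 128, input_length=max_len))
lstm_model.add(LSTM(64))
lstm_model.add(Dense(num_classes, activation='softmax'))
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])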
Results
The accuracy of the MLP model is good for a small dataset.
The gap between training and test accuracy can be narrowed by adding more data and tuning the model’s parameters.
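One common tweak along those lines is adding dropout to the MLP so it generalizes better; a sketch (the dropout rate is an assumption, not something tuned on this data):
from keras.layers import Dropout

model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(vocab,)))
model.add(Dropout(0.5))  # randomly drop activations during training to reduce overfitting
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])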
Thank you for reading the article. Hit the clap button if you liked it. If you need help with anything in this article, feel free to comment or contact me on LinkedIn.