marianne hoogeveen demo1
Post on 16-Apr-2017
28 Views
Preview:
TRANSCRIPT
Searching for data science jobs is tiring and depressing
‘Data Scientist’ Ranges from Excel pusher to software engineer
! MANY jobs, but how to find the right kind?
! What IS the right kind (for me)?
Data: 11000 job postings from Indeed
Specific data challenges:Buzzwords (“world-class”, “driven”, “exciting”, “mission”, “opportunity”)
Difficult to validate: what is ground truth?
Look for “Data Scientist”, “Data Engineer”, “Data Analyst” job descriptions
US Wide
Compute text similarity
Remove low-information words (stop words, dates, locations, numbers, common verbs, …)
Count occurrence of words (and bigrams), weighted negatively if they are common in the corpus of all documents (TF-IDF)
Compare similarity between these weighted TF-IDF vectors using cosine similarity
What are the job titles of top-100 most similar job postings?
Search term: Data Scientist Search term: Data EngineerSearch term: Data Analyst
DA DA DS DEDSDS DE DEDAother other other
Using cosine similarity on TF-IDF vectors, after removing buzzwords; find 100 most similar and compare job titles
Which job titles can we predict from job description?
Removing buzzwords
False positive rate0.0 1.00.4 0.6 0.80.2
1.0 1.0
False positive rate0.0 1.00.4 0.6 0.80.2
Keeping buzzwords
True
pos
itive
rate
True
pos
itive
rate
(“world-class”, “exciting”, “mission”, “opportunity”)
Why is this useful?
Less irrelevant results
Don’t miss similar jobs that don’t have the right job title
About me
PhD Theoretical Physics, King’s College London
Data Science Internship at Cytora Ltd:
Recognising street addresses in newspaper articles
Validation
200 random docs
Compare with same docs cut in half
Compute pairwise cosine similarities
0
0.2
0.4
0.6
0.8
1
Orig
inal
doc
umen
ts s
orte
d by
ID
Half documents sorted by original’s ID
Topic modellingSalient terms:
ClientMachineModelSystemStatusFinancialResearchRiskAnalysisStatisticalEngineerEmploymentSupportBig
LDA topics
e.g. “analysis”, “report”, “statistical” , “excel”, “sas”, “insight”
e.g. “big”, “technology”, “system” , “engineer”, “design”, “build”, “hadoop”, “platform”
top related