=Paper=
{{Paper
|id=Vol-2967/paper3
|storemode=property
|title=Job Posting-Enriched Knowledge Graph for Skills-based
Matching
|pdfUrl=https://ceur-ws.org/Vol-2967/paper_3.pdf
|volume=Vol-2967
|authors=Maurits de Groot,Jelle Schutte,David Graus
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/GrootSG21
}}
==Job Posting-Enriched Knowledge Graph for Skills-based Matching==
Maurits de Groot∗, Leiden University, Leiden, The Netherlands, maurits.degroot@live.nl
Jelle Schutte, Randstad, Diemen, The Netherlands, jelle.schutte@randstad.com
David Graus, Randstad Groep Nederland, Diemen, The Netherlands, david.graus@randstadgroep.nl
ABSTRACT
The labor market is constantly evolving. Occupations are changing, being added, or disappearing to fit the needs of today’s market. In recent years the pace of this change has accelerated, due to factors such as globalization, digitization, and the shift to working from home. Different factors are relevant when selecting employment, e.g., cultural fit, compensation, and the degree of freedom provided. To successfully fulfill an occupation, the gap between required (by the job) and possessed (by the job seeker) skills needs to be as small as possible. Decreasing this skill gap improves the fit between a job candidate and an occupation.
In this paper we propose a custom-built Skills & Occupation Knowledge Graph (KG) that fits the dynamic nature of the labor market described above, by leveraging existing skills and occupation taxonomies enriched with external job posting data.
We leverage this KG and explore several applications for skills-based matching of jobs to job seekers. First, we study link prediction as a means to quantify the relevance of skills to occupations, which can help in prioritizing learning and development of employees. Next, we study node similarity methods and shortest path algorithms for career pathfinding. Finally, we leverage a term weighting method for identifying which skills are most “distinctive” for different (types of) occupations.
CCS CONCEPTS
• Computing methodologies → Ontology engineering; • Theory of computation → Graph algorithms analysis; • Information systems → Content analysis and feature selection.
KEYWORDS
labor market, skill matching, knowledge graphs
1 INTRODUCTION
In recent years the number of people changing jobs has been increasing [9], the average duration of a position has become shorter [16], and the total working population is growing [17]. Due to increasing globalization, the number of possible job candidates per position is higher, and candidates enjoy, on average, a higher level of education than a number of years ago [1]. This results in a rapidly increasing number of potential job candidates, and the labor market is more competitive than it has ever been [2].
In addition, with the demand for skills changing over time, having the correct skills for specific occupations is more crucial than ever. Increasing digitization has made computer skills more valuable [20]. The COVID-19 pandemic has resulted in a double-disruption effect where technological adoption is accelerated and companies lay off employees [10]. Most aging workers do not possess the newly required technical skills, which leads to fewer job opportunities [5]. Not only technical skills are important; having good people skills is becoming increasingly important as well [4].
The volatility in the labor market results in changing occupations with newly required skills, and keeping up with the latest developments is a challenge. To find relevant vacancies and job postings, individuals can use external services to match their skills with their desired work. In 2019, employment agencies were responsible for filling 10% of the available jobs in the Netherlands [6].
As explained above, in recent years the labor market has become more competitive, and requirements more dynamic. As a result, there is a rising interest in skill-based matching of candidates to jobs [10], as the desired profiles for a given occupation are no longer static and unambiguous.
1.1 Problem Statement
To facilitate candidate to job posting matching, it is important to know which skills are relevant, in demand, and in supply. Here, the need for a flexible data representation for skills arises. This representation should facilitate various tasks, such as a skills similarity metric to quantify the likeness between skills; skills-to-occupation similarity metrics, to help people navigate the labor market and find new occupations; and an understanding of which skills relate to which occupations, to inform which skills are needed for desired occupations. Since relations between skills and occupations are not static, the representation also needs robust and accurate updating methods to ensure the information does not become outdated.
In this paper we address the task of skills and occupation graph construction, which we describe in Section 2, and apply this data representation to the following set of use cases: link prediction for identifying novel skills-occupation relations in Section 3, skills-based occupational similarity for career pathfinding in Section 4, and identifying distinctive skills per occupational group for learning & development in Section 5.
∗ Work done while on internship at Randstad Groep Nederland.
RecSys in HR 2021, October 1, 2021, Amsterdam, the Netherlands
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 KNOWLEDGE GRAPH CONSTRUCTION
Our Skills & Occupational KG is based on existing structured data: more specifically, we combine the ISCO (occupations) and ESCO (skills) taxonomies (bottom row in Figure 1). Next, we enrich this existing data with information from noisy, unstructured job postings (top row in Figure 1) to ensure our KG represents the current state of the labor market.
Figure 1: Knowledge Graph creation flow (diagram elements: Job Postings, Extractor, Skills, Skill Matching, Occupations, Skills, Merge Skills + Occupations, Knowledge Graph)
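The flow in Figure 1 can be sketched end to end in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the matching step uses character n-gram Jaccard similarity with the 0.66 threshold reported in Section 2.2.3, but the n-gram size of 3, the lowercase and whitespace normalization, the use of raw co-occurrence counts as edge weights, and all function names and toy inputs are hypothetical. Skill mentions are assumed to be already extracted from the postings.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a normalized string (n=3 is an assumption)."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_skill(mention, esco_skills, threshold=0.66):
    """Return the most similar ESCO skill name, or None below the threshold."""
    grams = char_ngrams(mention)
    best, best_score = None, 0.0
    for skill in esco_skills:
        score = jaccard_similarity(grams, char_ngrams(skill))
        if score > best_score:
            best, best_score = skill, score
    return best if best_score >= threshold else None


def enrich_graph(taxonomy_edges, postings, esco_skills):
    """Count (ISCO code, ESCO skill) co-occurrences over job postings.

    taxonomy_edges: (isco_code, skill) pairs taken from ISCO/ESCO.
    postings: (isco_code, [skill mentions]) pairs, one per job posting.
    """
    weights = Counter({edge: 0 for edge in taxonomy_edges})
    for isco_code, mentions in postings:
        for mention in mentions:
            skill = match_skill(mention, esco_skills)
            if skill is not None:
                weights[(isco_code, skill)] += 1
    return weights


# Hypothetical toy input, for illustration only.
esco_skills = ["feed livestock", "assist animal birth", "promote animal welfare"]
taxonomy_edges = [("6121", "feed livestock"), ("6121", "promote animal welfare")]
postings = [
    ("6121", ["Feed livestock", "Python programming"]),
    ("6121", ["feed  livestock", "Assist animal birth"]),
]
edge_weights = enrich_graph(taxonomy_edges, postings, esco_skills)
```

Mentions that match no ESCO skill are simply dropped, which mirrors the coverage loss discussed in Section 2.3.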
Figure 2: The structure of the occupations pillar [8]
2.1 Occupations (ISCO) and skills (ESCO)
The first step involves constructing a shared Skills & Occupational Knowledge Graph by combining the existing ISCO and ESCO taxonomies.
2.1.1 ISCO (occupations). The International Standard Classification of Occupations (ISCO) is ordered as a taxonomy of occupational groups with four granularity levels across ten different major groups. An occupation is defined as “a set of jobs whose main tasks and duties are characterized by a high degree of similarity”, where a job is defined as “a set of tasks and duties performed, or meant to be performed, by one person, including for an employer or in self-employment.” [14] Take, for example, the occupation “computer programmer,” which is defined by the level 4 ISCO code 2132. The occupation belongs to the level 3 group “computing professionals” (ISCO code 213), which in turn belongs to the level 2 group “computing, engineering and science professionals” (ISCO code 21), which, finally, falls in the level 1 group “professionals” (ISCO code 2).
Group Number | Major Group Name
1 | Managers
2 | Professionals
3 | Technicians and associate professionals
4 | Clerical support workers
5 | Service and sales workers
6 | Skilled agricultural, forestry and fishery workers
7 | Craft and related trades workers
8 | Plant and machine operators, and assemblers
9 | Elementary occupations
10 | Armed forces occupations
Table 1: The 10 major job groups of the ISCO-08
2.1.2 ESCO (skills). We define our initial high-level occupation groups by using the ISCO standard. For skills, we turn to the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy [8]. ESCO defines a skill as follows:
Skill: “the ability to apply knowledge and use know-how to complete tasks and solve problems”
ESCO covers 13,485 skills, connected to 2,942 occupations (in 27 languages).
We link our ISCO occupations to ESCO by using the direct links that ESCO defines between ISCO level 4 groups (the most fine-grained, lowest level of the taxonomy) and ESCO concepts. These links between ESCO and ISCO are not (necessarily) 1-to-1, as multiple ESCO occupations can be linked to a single (level 4) ISCO group.
In Figure 2 we illustrate this connection between ISCO and ESCO. ESCO occupations are shown in blue, with ISCO occupation groups in purple. In addition to the ESCO occupations shown in the image, ESCO also defines skills (not shown), e.g., the ESCO occupation “Cattle breeder” has skills linked to it such as “feed livestock” and “assist animal birth.”
2.2 KG enrichment through job posting data
Now that we have our high-level KG structure based on ISCO and ESCO, which defines occupations and skills as nodes, and edges as links between ESCO and ISCO objects, we turn to job posting data to account for the dynamic nature of associations between skills and occupations, as described in Section 1. To make sure our KG reflects the current status of the labor market, we use information from job postings to enrich the structure of our KG. More specifically, we create additional edges by identifying and extracting ESCO skills for each job posting’s ISCO occupation group, and assign weights to edges by relying on co-occurrence statistics of skills and occupations.
This second step of our process revolves around extracting skills from job postings. We describe our job posting dataset in Section 2.2.1, our approach for skill extraction in Section 2.2.2, and how we match extracted skills to ESCO skills in Section 2.2.3.
2.2.1 Vacancy data. Our vacancy dataset consists of a sample of 600,000 Dutch vacancies collected by Jobdigger [11]; each job posting is labeled with a level 4 ISCO code. Our sample was chosen
by selecting a uniform distribution of ISCO level 1 occupations, to make sure our set covers the entire breadth of the labor market. Prior to sampling our set at ISCO level 1, the initial dataset was cleaned by discarding low-quality and noisy job postings, such as postings that represented multiple occupations, or job postings that contained a low number of sentences. Here, we treat vacancy data as a proxy for the demand in the job market. By doing so, internal promotions, career paths, and informal channels are not taken into account.
2.2.2 Skill Extraction. For skill extraction we rely on the industry-standard Textkernel Extract [22] parser. For each vacancy text, Textkernel Extract returns a JSON object with corresponding skills, represented by the surface form identified in the job posting (skill mention), a unique identifier representing the skill (skill id), and finally, a confidence score that quantifies the likelihood of the extracted skill being correct.
2.2.3 Skill Matching. Given the skills extracted by Textkernel, we match them to the skill nodes in our KG by relying on the surface forms of the skills (skill mentions). More specifically, we leverage the Jaccard similarity of character 𝑛-grams between the normalized skill mention and the normalized ESCO skill names. We set the similarity threshold to 0.66, which was empirically determined to be optimal using a smaller set of our 39,758,827 Textkernel skills to ESCO skill-mappings. The high-level process is shown in Figure 3.
Figure 3: Overview of skill matching process (steps shown: normalize skill, create n-gram, calculate Jaccard distance against the n-gram of the normalized ESCO skill, compare against threshold, yielding Match or No Match)
2.3 Final Skills & Occupational Knowledge Graph
Our final KG, resulting from the process shown in Figure 1 and described in the previous section, consists of 1,220 nodes, of which 983 represent (ESCO) skills and 237 represent (ISCO) occupations. These nodes are connected through 3,910 edges, with an average node degree of 6.4.
This KG is a subset of the full ESCO (13,485 skills) and ISCO (436 occupations) taxonomies. There are several reasons why our KG does not span the entirety of the ISCO and ESCO taxonomies.
First, it is conceivable that not all ISCO occupations are in current demand, e.g., we found that there were no vacancies for ISCO occupation code 8111: “mining-plant operators,” which is not surprising as there are currently no mines in operation in the Netherlands. Next, it is likely we are dealing with coverage issues, from (i) the likely incomplete coverage of the Textkernel Extract method we use for skill extraction, and (ii) our skills matching methodology further reducing the number of identified skills. As the focus of this paper is on downstream applications, we consider matching out of scope, and rely on our naive but solid character 𝑛-gram-based method.
3 KG COMPLETION USING LINK PREDICTION
One of the challenges of modeling skills and occupations is the dynamic nature of the labor market. In this section we explore the first downstream application of our data-driven, dynamically constructed Skills & Occupation Knowledge Graph: matching occupations to skills. We focus on discovering novel connections between skills and occupations by leveraging the structure of our knowledge graph enriched with job posting data.
More specifically, in this section we compare link prediction algorithms to quantify the relatedness between a skill node and an occupation node, in order to discover novel connections between skills and occupations that are not present in our initial KG. We describe our two link prediction methods in the following sections: the first, Preferential Attachment, is described in Section 3.2.1; the second, Node2Vec, is described in Section 3.2.2.
3.1 Experimental setup
We employ link prediction to estimate the relatedness between skills and occupation nodes. To evaluate and reliably compare different methods, we first split our KG into train, test, and validation sets. More specifically, we sample 55% of all edges for training the link prediction algorithms (where applicable), leaving 30% for testing and 15% for validation. For each existing pair of occupation and skill nodes, which we consider a positive sample in our train, test and validation sets, we randomly generate a negative sample (i.e., a pair of skill and occupation nodes that is not connected in our KG). An overview of the number of edges in each set is shown in Table 2.
Set | Positive | Negative
Training edges | 2151 | 2151
Validation edges | 586 | 586
Test edges | 1173 | 1173
Total | 3910 | 3910
Table 2: Number of positive and negative edges with a training (55%), validation (15%), test (30%) split
3.2 Link Prediction Methods
3.2.1 Method 1: Preferential Attachment (PA). The first link prediction method is preferential attachment [15]. This method takes a pair of nodes, i.e., node 𝑢 and node 𝑣, and calculates a closeness (𝐶) between the two nodes:
𝐶(𝑢, 𝑣) = |Γ(𝑢)| × |Γ(𝑣)|, (1)
where Γ(𝑢) denotes the set of neighbors of 𝑢.
A higher score corresponds to a larger probability that the nodes are connected. The intuition is that if both nodes have a high number of neighbors, they might function as hubs, and most graphs have the property that hubs have a higher chance of being connected.
To compute all scores, we represent our KG as a matrix in which each node is represented as a row and a column. Note that this matrix is symmetric, since the value at row 𝑢 and column 𝑣 is equal to the value at row 𝑣 and column 𝑢. At the intersecting cell of two nodes, we store the preferential attachment score. We normalize this matrix by dividing each score by the maximum closeness score, to ensure that each value is between 0 and 1. We consider the resulting normalized closeness score as the probability that the corresponding nodes are related.
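Equation 1 and the normalization step can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the paper: the function name `preferential_attachment`, the adjacency-dictionary representation (used here instead of a full matrix), and the toy node names are assumptions.

```python
from itertools import combinations


def preferential_attachment(adjacency):
    """Normalized preferential attachment score for every unordered node pair.

    adjacency: dict mapping each node to the set of its neighbors.
    Returns a dict keyed by (u, v) tuples with u < v in sorted order.
    """
    # Raw score from Equation 1: product of the two neighborhood sizes.
    raw = {
        (u, v): len(adjacency[u]) * len(adjacency[v])
        for u, v in combinations(sorted(adjacency), 2)
    }
    max_score = max(raw.values(), default=0)
    if max_score == 0:
        return {pair: 0.0 for pair in raw}
    # Divide by the maximum so that every score lies between 0 and 1.
    return {pair: score / max_score for pair, score in raw.items()}


# Hypothetical toy graph: two occupations and two skills.
toy_graph = {
    "occ:dentist": {"skill:dentistry", "skill:communication"},
    "occ:nurse": {"skill:communication"},
    "skill:dentistry": {"occ:dentist"},
    "skill:communication": {"occ:dentist", "occ:nurse"},
}
pa_scores = preferential_attachment(toy_graph)
```

Because the score depends only on node degrees, two well-connected nodes receive a high score regardless of whether they share any neighborhood structure, which is worth keeping in mind when comparing PA against embedding-based methods.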
3.2.2 Method 2: Node2Vec (N2V). The second link prediction method we use is the Node2Vec algorithm [12]. This algorithm can be configured in a number of ways. For this paper we use the following parameters:
• dimensions = 1024
• walk length = 4
• number of walks = 2500
• 𝑝 (return parameter) = 1
• 𝑞 (in-out parameter) = 1
These parameters were selected after a grid search over a large number of possible parameter combinations.
3.3 Results
Table 3 shows the performance of both Preferential Attachment (PA) and Node2Vec (N2V).
Method | Class | Precision | Recall | F1-score
PA | 0.0 | 0.83 | 0.64 | 0.72
PA | 1.0 | 0.71 | 0.87 | 0.78
N2V | 0.0 | 0.66 | 0.90 | 0.76
N2V | 1.0 | 0.84 | 0.53 | 0.65
Table 3: Precision, recall and F1-scores of multiple link prediction algorithms with an equal number of positive and negative edges used for training
When the number of positive and negative edges in the test set is equal, PA outperforms the more complex N2V method, with an F1-score for the positive class of 0.78 against 0.65. In most realistic situations, however, we may want to explore how a node can be linked to any other node, making the number of comparisons, or edges to predict, 1-to-(N-1), i.e., each node is compared with every other node (excluding itself). To approximate this real-world performance, the ratio of negative to positive edges should reflect these more realistic proportions. To do so, we compute the F1-score at increasing ratios of negative-to-positive edges, ranging from 1 (as shown in Table 3) to 7. Results are shown in Figure 4. The figure shows that up to a ratio of 3:1, N2V is on par with PA, but as ratios increase, N2V outperforms PA, suggesting N2V is better suited for most real-world situations.
Figure 4: Comparison of Node2Vec and Preferential Attachment for different ratios of negative edges / positive edges (y-axis: F1-score of the positive class)
3.4 Analysis
Now that it has been established that N2V is more suitable for our task, we aim to employ this algorithm to predict the relationships between occupations and skills. When doing so we need to realize that the graph we use as input is imperfect in terms of correctness and completeness [19].
Looking at the false positives of the algorithm, we can identify skills that are, according to our dataset, incorrectly linked to occupations. For KG completion, we aim to identify those skills that are not linked to occupations, but should be. Table 4 shows a random sample of false positives: it reinforces our intuition that link prediction can be employed for KG completion, as some of the predicted edges make sense, e.g., the skill “prepare materials for dental procedures” is shown as a relevant skill for the occupation “dentist.” By consulting domain experts, such skills can be efficiently added to enrich the current graph.
To further explore these intuitions, in Figure 5 we show the edges to skill nodes predicted by N2V for the node representing ISCO code 2611: “Lawyers.” The y-axis shows skills, and the x-axis shows the link prediction probabilities, for all predictions with a probability > 0.5 (i.e., positive predictions by the method). The green bars denote true positives (i.e., correctly predicted edges between the skill and occupation), and blue bars depict false positives (skills that are predicted to have an edge with the occupation, but do not exist in our KG). The figure shows “education law” and “investigation research methods” as newly identified skills for lawyers, found neither in the original ESCO taxonomy nor in co-occurrences in job postings.
4 CAREER PATHFINDING USING SHORTEST PATH ALGORITHMS
According to recent data (2019), 1.1 million people switched occupation in the Netherlands [6]. When transitioning from one job to another, the gap between both jobs cannot be too large. This gap can be considered too large if the required skills for one differ too
ISCO-Code | Occupation | Predicted Skill
1341 | Child care services managers | children’s physical development
2261 | Dentists | prepare materials for dental procedures
3251 | Dental assistants and therapists | dentistry science
4110 | General office clerks | demonstrate professional attitude to clients
5411 | Fire fighters | safety engineering
6121 | Livestock and dairy producers | promote animal welfare
7132 | Spray painters and varnishers | spray pesticides
8344 | Lifting truck operators | hazardous materials transportation
9111 | Domestic cleaners and helpers | provide lawn care
Table 4: False positives: edges predicted by N2V that do not exist in our KG
Figure 5: Predictions of the Node2Vec algorithm for ISCO group 2611 (Lawyers). Plot title: “Skills prediction for a lawyer”; x-axis: link prediction probability (0.0 to 1.0); y-axis: skills (international law, intellectual property law, provide legal advice, civil process order, environmental legislation, joint ventures, property law, moderate in negotiations, commercial law, employment law, contract law, think analytically, international trade, mergers and acquisitions, observe confidentiality, negotiate in legal cases, legal case management, show responsibility, education law, tax legislation, investigation research methods)
Figure 6: Jaccard distance in a graph where nodes {A, B} are occupations and nodes {C, D, E, F} are skills. Solid lines denote direct connections, dashed lines denote Jaccard distance.
much from the other. Consequently, occupations that share a large number of skills should be easier to transfer between. In this section we focus on leveraging skills to better inform transitions between occupations. More specifically, we aim to leverage the KG structure for matching occupations with occupations, to identify how an individual can change jobs in the most efficient way.
4.1 Skills-based Occupation Similarity
To determine the feasibility of an occupation transfer, we propose to model the distance between occupations with Jaccard distance. We compute Jaccard distances between occupations by representing each occupation as the set of its required skills (which we extract from our KG), and computing the overlap between two sets of skills. See Figure 6 for an illustration.
In our KG a total of 120,952 links can be made between pairs of skills and pairs of occupations. Of these pairs, 89.3% are between skills and 10.7% between occupations. To gain insight into the overall similarity of skills and occupations, we study the distribution of Jaccard distances in Figure 7.
Figure 7: Distribution of the Jaccard distance, where the orange color represents the skills and the blue color represents the occupations (x-axis: Jaccard distance; y-axis: count)
Looking at the distribution of Jaccard distances, one can see that, on average, skills are more similar to one another than occupations. This becomes apparent when looking at the median of both distributions (Table 5): for occupations the median is 0.96, and for skills around 0.88. Over 99% of occupations have a Jaccard distance between 0.8 and 1, meaning that occupations require distinct skill sets. Both distributions are skewed to the left, meaning that the mean (average of the observations) is left of the mode (most observed value).
In the distribution we see a number of spikes, which can be explained by the prevalence of some fractions over others, e.g., if half of the neighbors are shared, the Jaccard distance will be 1/2, which can be achieved in a number of different ways. Other spikes occur at additional common fractions such as 2/3 and 3/4.
In Table 5 we show a description of the distance distributions. For both skills and occupations the minimum distance is 0, meaning that two skills are connected to exactly the same occupations, or that two occupations share every skill. An example is “Food service counter attendants” and “Hotel receptionists,” which share the same skill set and thus have a Jaccard distance of 0. Skills with a distance of 0 are, for example, “Lop trees” and “Pruning techniques.”
Statistic | Skill | Occupation | Total
count | 107959 | 12993 | 120952
mean | 0.825 | 0.938 | 0.837
std | 0.163 | 0.070 | 0.160
min | 0.000 | 0.000 | 0.000
25% | 0.800 | 0.928 | 0.800
50% | 0.875 | 0.960 | 0.888
75% | 0.923 | 0.977 | 0.933
max | 0.985 | 0.993 | 0.993
Table 5: Statistics of the Jaccard distance distributions
The highest distance found in the dataset is 0.993, which corresponds to the occupations “Electronics engineers” and “Policy administration professionals.” They share at least one skill but are, apart from the shared skill, completely different. The common skill in this example is “perform project management.”
4.2 Career Pathfinding using Dijkstra’s algorithm
With the distances between each pair of occupations and between each pair of skills, we can proceed to identify the most efficient transition between every pair of occupations. This is done by assigning the Jaccard distance scores as edge weights between nodes in our graph, to enable computational methods for finding the most efficient path between a start node (the current occupation) and an end node (the desired occupation). We show an example of such a transition in Figure 8: here we set a threshold for the maximum possible distance at 0.8. This threshold was chosen by eyeballing and comparing different cutoff points. If two occupations are further apart than 0.8 we consider the step too large.
Figure 8: Distance between the occupations {𝑊, 𝑋, 𝑌, 𝑍}. Black lines denote distances lower than 0.8. Red lines denote distances higher than 0.8. (Edge labels via 𝑋: 0.2 and 0.6; via 𝑌: 0.4 and 0.5; direct 𝑊 to 𝑍: 0.9)
In this example we start at node 𝑊 and want to go to node 𝑍. We are not able to transition directly between 𝑊 and 𝑍 because the occupations are not similar enough (0.9 > 0.8).
4.2.1 Method. Finding the most efficient path in an undirected weighted graph can be done by applying shortest path algorithms. For this paper we turn to Dijkstra’s algorithm [7], because of its proven speed and the widespread availability of implementations. According to Dijkstra’s algorithm, the shortest allowed path between 𝑊 and 𝑍 in Figure 8 is via node 𝑋.
We show a real-world example in Figure 9. Due to the COVID-19 pandemic many people find themselves out of a job, especially individuals who work in restaurants. Using the described model we can calculate which occupation has the smallest distance to the occupation “cook.” Dijkstra’s algorithm yields “bakers, pastry-cooks and confectionery makers” as the most feasible transition.
Figure 9: The shortest path between the occupation “Cook” and the closest connected occupation, in this case “Bakers, pastry-cooks and confectionery makers.”
5 MOST RELEVANT SKILLS PER OCCUPATION GROUP
Next to fine-grained analysis of occupations and skills, gaining macro-level insights is an important task for monitoring and understanding the labor market. The ISCO taxonomy provides multiple levels of granularity, which allows us to aggregate the information contained in our KG at different levels, too. In this section we explore a method for identifying the most relevant skills for occupations (ISCO level 4) and aggregations of occupations (ISCO level 1-3). More specifically, we match skills to occupations at an aggregated level.
As we have seen in the previous section, different occupations may share skills. Several skills, such as teamwork, are commonly required for a large number of occupations, and can be considered generic or sector-independent skills. On the other side, we may have highly specialized skills that are only required for specific occupations or occupation groups. Whether a skill is specifically or generically important can be quantified in different ways. For a skill to be specific to an occupation or occupational group, we define two criteria:
• A skill needs to be frequently required within its context (occupation or occupation group).
• A skill needs to be characteristic for its context.
5.1 Method
The two criteria described above fit naturally to the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme
for terms [21]. This statistic is chosen as it directly models the desired criteria described in the previous section. More specifically, TF-IDF is used to assign weights to words in a corpus of documents, where a word is deemed more important if it (i) is observed frequently within the document but (ii) not frequently across different documents in the corpus.
$$\text{TF-IDF}(t, d) = tf_{t,d} \times \log \frac{N}{df_t + 1}, \qquad (2)$$
where $tf_{t,d}$ denotes the term frequency of $t$ in $d$, $df_t$ denotes the number of documents containing $t$, and $N$ denotes the total number of documents in the corpus.
We “transplant” this TF-IDF weighting scheme from terms in documents to skills associated with occupations. TF-IDF consists of two parts: Term Frequency (TF) is the frequency of a word (skill) used in a given document (observed with an occupation); Inverse Document Frequency (IDF) is a way to discount highly common terms, i.e., it is high when a word (skill) appears in a smaller number of documents (observed with a low number of occupations). Common terms (skills) will thus yield a lower IDF score.
Figure 10: Three most relevant skills for multiple levels in major ISCO group 2. Skills shown per group:
• Professionals (2): Network marketing; Manage online communications; Communication
• Health professionals (22): Coordinate care; Have computer literacy; Citizen involvement in healthcare
• Teaching professionals (23): Communication sciences; Communication studies; ICT communications protocols
• Nursing and midwifery professionals (222): Coordinate care; Have computer literacy; Solve problems in healthcare
• Other health professionals (226): Radiofarmaceutica; Work analytically; Analytical chemistry
• Other teaching professionals (235): Communication; Communication disorders; Microsoft Visio
• Nursing professionals (2221): Coordinate care; Have computer literacy; Solve problems in healthcare
• Dentists (2261): Dental studies; Lead the dental team; Handle payments in dentistry
• Pharmacists (2262): Radiopharmaceuticals; Work analytically; Analytical chemistry
• Special needs teachers (2352): Education law; Communication disorders; Pedagogy
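Equation 2, applied to skills and ISCO groups instead of terms and documents, can be sketched as follows. This is an illustrative sketch only: the function name `skill_tf_idf`, the toy counts, and the group names are hypothetical, and computing term frequency as a relative frequency within a group is an assumption suggested by the TF percentages reported in Section 5.2.1.

```python
import math
from collections import Counter


def skill_tf_idf(skill_counts_per_group):
    """TF-IDF weight of each skill within each occupation group.

    skill_counts_per_group: dict mapping a group name to a Counter of
    skill -> number of times the skill was found in that group's postings.
    """
    n_groups = len(skill_counts_per_group)
    # Document frequency: in how many groups does each skill occur?
    group_frequency = Counter()
    for counts in skill_counts_per_group.values():
        group_frequency.update(skill for skill, c in counts.items() if c > 0)

    scores = {}
    for group, counts in skill_counts_per_group.items():
        total = sum(counts.values())
        scores[group] = {
            # Relative frequency times the smoothed IDF of Equation 2.
            skill: (count / total) * math.log(n_groups / (group_frequency[skill] + 1))
            for skill, count in counts.items()
            if count > 0
        }
    return scores


# Hypothetical counts for four occupation groups.
toy_counts = {
    "managers": Counter({"microsoft office": 9, "leadership": 6, "teamwork": 5}),
    "clerical": Counter({"microsoft office": 5, "typing": 10, "teamwork": 5}),
    "health": Counter({"coordinate care": 12, "teamwork": 8}),
    "technicians": Counter({"troubleshooting": 14, "teamwork": 6}),
}
tf_idf_scores = skill_tf_idf(toy_counts)
top_manager_skill = max(tf_idf_scores["managers"], key=tf_idf_scores["managers"].get)
```

Note that with the `+ 1` in the denominator, a skill that occurs in every group receives a negative weight, so ubiquitous skills such as "teamwork" in this toy input rank below every group-specific skill.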
For our TF-IDF-based model, we treat skills identified in job postings as terms, and we model a document as the collection of job postings belonging to an ISCO group. The counts of skills, which model term frequency, correspond to the number of times a skill is found in job postings associated with a certain ISCO code.
studies,” are more commonly observed in level 4 ISCO groups. A possible explanation for this is that specialized skills only appear at specialized occupations.
5.2 Results and analysis
5.2.1 Level 1 ISCO groups. The resulting score provides us with skills that are common for a given occupation (group) but uncommon in all other occupations (groups). Table 6 shows the top 5 skills for the level 1 ISCO groups.
In this table Microsoft Office appears in both the Managers and Clerical support workers groups. For this skill to score high in multiple contexts (occupation groups), the frequencies need to be substantial in both, to compensate for the IDF component of the metric. In the Managers group, Microsoft Office has a TF of 9% and in Clerical support workers a TF of 5%.
5.2.2 Multiple ISCO levels. ISCO level 1 helps us to understand which skills are relevant at the least granular level; to deepen our understanding we look at multiple levels of ISCO group 2 in Figure 10. Here, we show the 3 most relevant skills for several ISCO levels of the “Professionals” ISCO group.
We notice the following: First, communication-related skills appear in multiple forms across occupation groups. The terms communication, communication sciences, communication studies, ICT communication protocols, manage online communications and communication disorders seem to be closely related. Because these are defined as distinct skills, each skill receives its own ranking, so the same concept can appear multiple times.
Next, “Nursing professionals” and “Nursing and midwifery professionals” share the same set of relevant skills, which are highly
6 CONCLUSION
In recent years the labor market has changed drastically. This is mostly due to increased globalization, a growing working population and disappearing jobs due to digitalization. The COVID-19 pandemic has accelerated this change. This paper aims to explore algorithmic and data-driven methods for exploring and improving the fit between job seekers and vacancies by modeling skills and occupation data in a knowledge graph. Modeling and leveraging relationships between occupations and skills can provide insights for job seekers with existing skill sets.
After constructing our knowledge graph by relying on the existing ISCO and ESCO taxonomies for occupations and skills, we enrich our KG by relying on job posting data.
We explore our final KG using three different applications.
First, we study link prediction methods for quantifying the relatedness between skills and occupations in Section 3. We compare and evaluate two different link prediction methods, and find that “Node2Vec” performs best. Next to quantifying relatedness between occupations and skills for, e.g., ranking skills for an occupation or use as edge weights in our KG, we explore Node2Vec for identifying skills-to-occupation links that are not present in the original KG.
Next, in Section 4 we explore our KG for finding efficient job transitions. When an individual is searching for a job, knowing which occupations can help the search process. In our next application we explore shortest path finding algorithms for identifying
similar to those of their parent group “Health professionals”. Skills potential careerpath prediction. We use a skills-based Jaccard simi-
that appear in those groups are the most frequent skills in the parent larity metric to model distance between occupations. Furthermore,
group. we show examples of job transitions and study properties of our
Finally, the further down the figure we go, the more specialized KG by analyzing the distribution of distances between skills and
the skills appear to be, and more specialized skills, such as “dental occupations.
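The skills-based distance and shortest-path search summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the distance between two occupations is taken to be one minus the Jaccard similarity of their skill sets, occupations without any shared skill are assumed to have no direct transition, and the occupation names and skill sets are hypothetical toy data.

```python
import heapq


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as maximally distant."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def shortest_transition(skills_by_occupation, source, target):
    """Dijkstra over occupations; edges join occupations sharing at least one skill."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, occ = heapq.heappop(queue)
        if occ == target:
            break
        if dist > best[occ]:
            continue  # stale queue entry
        for other, other_skills in skills_by_occupation.items():
            if other == occ:
                continue
            step = jaccard_distance(skills_by_occupation[occ], other_skills)
            if step >= 1.0:
                continue  # assumption: no shared skills means no direct transition
            candidate = dist + step
            if candidate < best.get(other, float("inf")):
                best[other] = candidate
                previous[other] = occ
                heapq.heappush(queue, (candidate, other))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


# Hypothetical toy data, not taken from the paper's KG.
skills_by_occupation = {
    "cashier": {"handle payments", "customer service"},
    "receptionist": {"customer service", "perform clerical duties", "microsoft office"},
    "office clerk": {"perform clerical duties", "microsoft office", "execute administration"},
}
path, cost = shortest_transition(skills_by_occupation, "cashier", "office clerk")
print(path, cost)
```

In this toy example the cashier and the office clerk share no skills, so the only route runs through the receptionist, which overlaps with both.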
Rank | Managers | Professionals | Technicians and associate professionals
1 | Microsoft Office | Network Marketing | Marker Making
2 | Service-oriented Modelling | Manage Online Communications | Electronic Communication
3 | Communication Principles | Communication | Service-oriented Modelling
4 | Electronic Communication | Explain Accounting Records | Education Administration
5 | Coordinate Patrols | Accounting | Manage Standard ERP System

Rank | Clerical support workers | Service and sales workers | Skilled agricultural, forestry and fishery workers
1 | Execute Administration | Security Panels | Leadership Principles
2 | Perform Clerical Duties | Electronic Communication | Agricultural Information Systems and Databases
3 | Microsoft Office | Create Solutions to Problems | Pruning Techniques
4 | Education Administration | Execute Administration | Spray Pesticides
5 | Human Resource Management | Recreation Activities | Lop Trees

Rank | Craft and related trades workers | Plant and machine operators, and assemblers | Elementary occupations
1 | Attend to Detail in Casting Processes | Mechatronics | Inventory Management Rules
2 | Attention to Detail | Mechanical Engineering | Have Computer Literacy
3 | Adobe Illustrator | Electrical Engineering | Carpentry
4 | Adobe Photoshop | Operate Soldering Equipment | Place Concrete Forms
5 | ML (computer programming) | Act Reliably | Operate on-board Computer Systems
Table 6: Five most relevant skills per major ISCO group based on the TF-IDF metric
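A ranking of the kind shown in Table 6 can be sketched as follows, treating each ISCO group as a document and each identified skill as a term. This is a minimal sketch under stated assumptions, not the paper's implementation: the exact TF-IDF variant is not specified in the text, so term frequency is assumed to be a skill's share of all skill mentions in a group and IDF is assumed to be the natural logarithm of the number of groups divided by the number of groups containing the skill. The postings are hypothetical toy data.

```python
import math
from collections import Counter


def tfidf_by_group(postings):
    """postings: iterable of (isco_group, [skill, ...]) pairs, one per job posting."""
    counts = {}  # group -> Counter of skill mentions across its postings
    for group, skills in postings:
        counts.setdefault(group, Counter()).update(skills)

    n_groups = len(counts)
    # Document frequency: number of groups in which a skill occurs at least once.
    df = Counter(skill for c in counts.values() for skill in c)

    scores = {}
    for group, c in counts.items():
        total = sum(c.values())
        scores[group] = {
            skill: (n / total) * math.log(n_groups / df[skill])
            for skill, n in c.items()
        }
    return scores


def top_skills(scores, group, k=5):
    """Return the k highest-scoring skills; ties are broken alphabetically."""
    ranked = sorted(scores[group].items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, _ in ranked[:k]]


# Hypothetical toy postings, not taken from the paper's data set.
postings = [
    ("Managers", ["microsoft office", "communication"]),
    ("Managers", ["microsoft office", "coordinate patrols"]),
    ("Clerical support workers", ["microsoft office", "perform clerical duties"]),
    ("Professionals", ["communication", "accounting"]),
]
scores = tfidf_by_group(postings)
for group in scores:
    print(group, top_skills(scores, group, k=3))
```

Under these assumptions a skill that occurs in every group receives a score of zero, while a skill shared by only some groups can still rank highly if its frequency within a group is large enough to offset the lower IDF.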
Finally, in Section 5 we study a method to determine which skills are most relevant to different
levels of aggregated occupations, using the ISCO taxonomy. The skill relevance to an ISCO (group)
is calculated by combining the frequency of the skill being required for an ISCO (group) with the
uniqueness of the skill in the overall ISCO taxonomy. Here, the uniqueness is high if a skill
occurs more often in one group compared to the other groups. The metric that reflects this
intuition is called “TF-IDF.” By doing so we construct a bird's-eye view of the labor market.
The findings from the three sections described above are all variations on the same theme: finding
or enabling the perfect fit between a job seeker and a vacancy by leveraging skills.
7 DISCUSSION & FUTURE RESEARCH
In this paper we present different KG-driven applications for skills-based job matching. In
principle, the methods presented are data-agnostic, as long as similar concepts (occupations and
skills) and data (job postings with identified occupations and skills) are available. More
specifically, we leverage the ISCO and the ESCO taxonomies, which are available in a large number
of languages, and are considered standards that are freely available. Other frameworks could be
used as well: where ESCO is widely used in Europe, the O*NET framework [18] is often referred to
as the de facto standard in the United States.
The outcome of any research is heavily dependent on the available data. In the case of this
research this data is preprocessed in a number of steps, one of which is the skill matching step
described in Section 2.2.3. We opted for a naive character 𝑛-gram based method for matching
surface forms found in a job posting with skill names in ESCO. Obviously, more refined methods can
be employed, e.g., by considering additional representation of the skill in the job posting
(contextual words, occupation), and at the same time additional context at the side of the KG
(e.g., skill descriptions, associated occupations, etc.). In general, this problem of matching can
be considered an entity linking task, which is considered out of scope for this application paper.
Having a flawed knowledge graph as a result of sub-optimal preprocessing does not invalidate the
methods used. Whichever approach is used to create a knowledge graph, the outcome will never be
perfect [3].
Finally, two out of three applications of our KG are not validated empirically: for both our
shortest path finder (Section 4) and identifying the most relevant skill per ISCO group
(Section 5), we focused on the analysis and interpretation of results, omitting a more formal
evaluation methodology. For future research it would be interesting to benchmark the current
approach against different career path prediction models. Validating whether, e.g., the
discovered paths between occupations are indeed the shortest ones in practice requires additional
data. Unfortunately, no such data was available at the time of writing. One way to acquire such
data is, e.g., by collecting historic career paths. However, collecting and composing such data
was deemed out of scope for this work. The same issue arose for the method for quantifying the
relevance of skills per ISCO group; these aggregated insights were difficult to validate. We could
imagine involving human expert annotators to annotate which skills they deem (most) relevant to a
certain ISCO (group). However, similarly to the above, collecting and analyzing such data did not
fit in the scope of our present work. In summary, our paper revolves around studying algorithmic
methods that aim to help both jobseekers and recruiters find a better match between individuals
and occupations; we consider studies with actual end-users out of scope [13].
ACKNOWLEDGMENTS
A special thanks to the thesis supervisors for the project, Niels van Weeren and Prof. Aske Plaat,
as well as everybody at Randstad involved in this project.
REFERENCES
[1] İ. Semih Akçomak, Lex Borghans, and Bas ter Weel. 2011. Measuring and Interpreting Trends in the Division of Labour in the Netherlands. De Economist 159, 4 (01 Dec 2011), 435–482. https://doi.org/10.1007/s10645-011-9168-3
[2] Pol Antràs, Luis Garicano, and Esteban Rossi-Hansberg. 2005. Offshoring in a Knowledge Economy. Working Paper 11094. National Bureau of Economic Research. https://doi.org/10.3386/w11094 http://www.nber.org/papers/w11094
[3] Antoine Bordes and Evgeniy Gabrilovich. 2014. Constructing and Mining Web-Scale Knowledge Graphs: KDD 2014 Tutorial. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD ’14). Association for Computing Machinery, New York, NY, USA, 1967. https://doi.org/10.1145/2623330.2630803
[4] Lex Borghans, Bas Ter Weel, and Bruce A. Weinberg. 2014. People Skills and the Labor-Market Outcomes of Underrepresented Groups. ILR Review 67, 2 (2014), 287–334. https://doi.org/10.1177/001979391406700202
[5] Nicole Bosch and Bas Weel. 2013. Labour-Market Outcomes of Older Workers in the Netherlands: Measuring Job Prospects Using the Occupational Age Structure. De Economist 161 (06 2013). https://doi.org/10.1007/s10645-013-9202-8
[6] Centraal Bureau voor de Statistiek. [n.d.]. De arbeidsmarkt in cijfers. https://www.cbs.nl/-/media/_pdf/2020/18/dearbeidsmarktincijfers2019.pdf
[7] Edsger W Dijkstra et al. 1959. A note on two problems in connexion with graphs. Numerische mathematik 1, 1 (1959), 269–271.
[8] European Commission. [n.d.]. ESCO handbook. https://ec.europa.eu/esco/portal/document/en/0a89839c-098d-4e34-846c-54cbd5684d24
[9] Eurostat. [n.d.]. Labour market transitions – annual data. https://ec.europa.eu/eurostat/web/lfs/data/database
[10] World Economic Forum. 2020. The Future of Jobs Report 2020. World Economic Forum, Geneva, Switzerland.
[11] Burning Glass. [n.d.]. Vacancy data. https://www.jobdigger.nl/
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
[13] Francisco Gutiérrez, Sven Charleer, Robin De Croon, Nyi Nyi Htun, Gerd Goetschalckx, and Katrien Verbert. 2019. Explaining and Exploring Job Recommendations: A User-Driven Approach for Interacting with Knowledge-Based Job Recommender Systems. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 60–68. https://doi.org/10.1145/3298689.3347001
[14] International Labour Office. [n.d.]. International Standard Classification of Occupations. https://www.ilo.org/public/english/bureau/stat/isco/
[15] David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American society for information science and technology 58, 7 (2007), 1019–1031.
[16] OECD. [n.d.]. Employment by job tenure intervals - average tenure. https://stats.oecd.org/Index.aspx?DataSetCode=TENURE_AVE
[17] OECD. [n.d.]. FTPT employment based on national definitions. https://stats.oecd.org/Index.aspx?DataSetCode=FTPTN_D
[18] O*NET. [n.d.]. O*NET OnLine. https://www.onetonline.org/
[19] Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web 8, 3 (2017), 489–508.
[20] Gang Peng. 2017. Do computer skills affect worker employment? An empirical study from CPS surveys. Computers in Human Behavior 74 (2017), 26–34. https://doi.org/10.1016/j.chb.2017.04.013 http://www.sciencedirect.com/science/article/pii/S0747563217302510
[21] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008. Introduction to information retrieval. Vol. 39. Cambridge University Press Cambridge.
[22] Textkernel. [n.d.]. Extract. https://www.textkernel.com/nl/solution/extract/