=Paper=
{{Paper
|id=Vol-2967/paper3
|storemode=property
|title=Job Posting-Enriched Knowledge Graph for Skills-based Matching
|pdfUrl=https://ceur-ws.org/Vol-2967/paper_3.pdf
|volume=Vol-2967
|authors=Maurits de Groot,Jelle Schutte,David Graus
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/GrootSG21
}}
==Job Posting-Enriched Knowledge Graph for Skills-based Matching==
<pdf width="1500px">https://ceur-ws.org/Vol-2967/paper_3.pdf</pdf>
<pre>
           Job Posting-Enriched Knowledge Graph for Skills-based
                                 Matching
               Maurits de Groot∗                                           Jelle Schutte                                David Graus
             maurits.degroot@live.nl                               jelle.schutte@randstad.com                  david.graus@randstadgroep.nl
                Leiden University                                            Randstad                            Randstad Groep Nederland
             Leiden, The Netherlands                                Diemen, The Netherlands                      Diemen, The Netherlands
ABSTRACT                                                                                In addition, with the demand for skills changing over time, having
The labor market is constantly evolving. Occupations are changing,                   the correct skills for specific occupations is more crucial than ever.
being added, or disappearing to fit the needs of today’s market. In                  The increasing amount of digitization has made computer skills
recent years the pace of this change has accelerated, due to factors                 more valuable [20]. The COVID-19 pandemic has resulted in a
such as globalization, digitization, and the shift to working from                   double-disruption effect where technological adoption is acceler-
home. Different factors are relevant when selecting employment,                      ated and companies lay off employees [10]. Most aging workers
e.g., cultural fit, compensation, provided degree of freedom. To                     do not possess the newly required technical skills, which leads to
successfully fulfill an occupation, the gap between required (by the                 fewer job opportunities [5]. Not only technical skills are important;
job) and possessed (by the job seeker) skills needs to be as small as                having good people skills is becoming increasingly important as
possible. Decreasing this skill-gap improves the fit between a job                   well [4].
candidate and occupation.                                                               The volatility in the labor market results in a change of occu-
   In this paper we propose a custom-built Skills & Occupation                       pations with new required skills, and being able to keep up with
Knowledge Graph (KG) that fits the above described dynamic nature                    the latest developments is a challenge. To find relevant vacancies
of the labor market, by leveraging existing skills and occupation                    and job postings, individuals can use external services to match
taxonomies enriched with external job posting data.                                  their skills with their desired work. In 2019, employment agen-
   We leverage this KG and explore several applications for skills-                  cies were responsible for fulfilling 10% of the available jobs in the
based matching of jobs to job seekers. First, we study link prediction               Netherlands [6].
as a means to quantify relevance of skills to occupations, which can                    As explained above, in recent years the labor market has become
help in prioritizing learning and development of employees. Next,                    more competitive, and requirements more dynamic. As a result of
we study node similarity methods and shortest path algorithms for                    this, there is a rising interest in skill-based matching of candidates
career pathfinding. Finally, we leverage a term weighting method                     to jobs [10], as the desired profiles for a given occupation are no
for identifying which skills are most “distinctive” for different (types             longer static and unambiguous.
of) occupations.
                                                                                     1.1    Problem Statement
CCS CONCEPTS                                                                         To facilitate candidate to job posting matching, it is important to
• Computing methodologies → Ontology engineering; • Theory                           know which skills are relevant, in demand, and in supply. Here, the
of computation → Graph algorithms analysis; • Information                            need for a flexible data representation for skills arises. This repre-
systems → Content analysis and feature selection.                                    sentation should facilitate various tasks, such as a skills similarity
                                                                                     metric to be able to quantify the likeness between skills, skills-to-
KEYWORDS                                                                             occupation similarity metrics, to help people navigate the labor
                                                                                     market and find new occupations, and understanding which skills
labor market, skill matching, knowledge graphs
                                                                                     relate to which occupations to inform which skills are needed for
                                                                                     desired occupations. Moreover, since relations between skills and
1    INTRODUCTION                                                                    occupations are not static, the representation needs robust and accurate
In recent years the number of people that change their job is in-                    updating methods to ensure the information does not get outdated.
creasing [9], the average duration of a position is shorter [16] and                    In this paper we address the task of skills and occupation graph
the total working population is growing [17]. Due to increasing                      construction which we describe in Section 2, and apply this data
globalization, the number of possible job candidates per position                    representation to the following set of use-cases: link prediction
is higher. And candidates enjoy, on average, a higher level of ed-                   for identifying novel skills-occupation relations in Section 3, skills-
ucation compared to a number of years ago [1]. This results in a                     based occupational similarity for career pathfinding in Section 4,
rapidly increasing number of potential job candidates and the labor                  and identifying distinctive skills per occupational group for learning
market is more competitive than it has ever been [2].                                & development in Section 5.
∗ Work done while on internship at Randstad Groep Nederland.
                                                                                     2     KNOWLEDGE GRAPH CONSTRUCTION
                                                                                     Our Skills & Occupational KG is based on existing structured data,
RecSys in HR 2021, October 1, 2021, Amsterdam, the Netherlands
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons   more specifically, we combine the ISCO (occupations) and ESCO
License Attribution 4.0 International (CC BY 4.0).                                   (skills) taxonomies (bottom row in Figure 1). Next, we enrich this


existing data with information from noisy, unstructured job post-
ings (top row in Figure 1) to ensure our KG represents the current
state of the labor market.


[Figure 1 flowchart. Top row: Job Postings -> Extractor -> Skills. Bottom row: Occupations and Skills -> Merge ->
Skills + Occupations. Both rows feed Skill Matching, which produces the Knowledge Graph.]

                       Figure 1: Knowledge Graph creation flow


                                                                                                          Figure 2: The structure of the occupations pillar [8]
2.1    Occupations (ISCO) and skills (ESCO)
The first step involves constructing a shared Skills & Occupational
Knowledge Graph, through combining the existing ISCO and ESCO                                          Skill “the ability to apply knowledge and use know-how to com-
taxonomies.                                                                                              plete tasks and solve problems”
2.1.1 ISCO (occupations). The International Standard Classifica-                                    The ESCO covers 13,485 skills, connected to 2,942 occupations
tion of Occupations (ISCO) is ordered as a taxonomy of occupa-                                   (in 27 languages).
tional groups with four granularity levels across ten different major                               We link our ISCO occupations to ESCO by using the direct
groups. An occupation is defined as “a set of jobs whose main tasks                              links that are defined between ISCO level 4 groups (most fine-
and duties are characterized by a high degree of similarity”, where                              grained/lowest level of the taxonomy) and ESCO concepts, in the
a job is defined as “a set of tasks and duties performed, or meant to                            ESCO. These links between ESCO and ISCO are not (necessarily)
be performed, by one person, including for an employer or in self-                               1-to-1, as multiple ESCO occupations can be linked to a single (level
employment.” [14] Take, for example: the occupation “computer                                    4) ISCO group.
programmer,” which is defined by the level 4 ISCO code: 2132. The                                   In Figure 2 we illustrate this connection between ISCO and ESCO.
occupation then belongs to the level 3 group “computing profes-                                  ESCO occupations are shown in blue, with ISCO occupation groups
sionals” (ISCO-code 213), which in turn belongs to the level 2 group                             in purple. In addition to the ESCO occupations shown in the image,
“computing, engineering and science professionals” (ISCO-code 21),                               ESCO also defines skills (not shown), e.g., the ESCO occupation
which, finally, falls in the level 1 group “professionals” (ISCO-code                            “Cattle breeder” has skills linked to it such as “feed livestock”
2).                                                                                              and “assist animal birth.”
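The nesting in the “computer programmer” example (2132 in 213, in 21, in 2) suggests that the enclosing groups of an ISCO code can be read off as its shorter prefixes. The following is a minimal sketch under that assumption; the function name isco_ancestors is hypothetical and not part of our pipeline.

```python
def isco_ancestors(code: str) -> list[str]:
    """Return the enclosing ISCO groups of a code, most specific first."""
    if not code.isdigit() or len(code) not in range(1, 5):
        raise ValueError(f"not an ISCO-style code: {code!r}")
    # Each shorter prefix is taken to be the code of the enclosing group.
    return [code[:n] for n in range(len(code) - 1, 0, -1)]


# Level 4 code from the running example: computer programmer.
print(isco_ancestors("2132"))
```

Rolling a level 4 code up to its level 1 group in this way is what a per-major-group sample of job postings would rely on.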

Group Number                 Major Group Name                                                    2.2     KG enrichment through job posting data
                                                                                                 Now that we have our high-level KG structure based on ISCO and
1                            Managers
                                                                                                 ESCO, which defines occupations and skills as nodes, and edges as
2                            Professionals
                                                                                                 links between ESCO and ISCO objects, we turn to job posting data
3                            Technicians and associate professionals
                                                                                                 to account for the dynamic nature of associations between skills and
4                            Clerical support workers
                                                                                                 occupations, as described in Section 1. To make sure our KG reflects
5                            Service and sales workers
                                                                                                 the current status of the labor market, we use information from
6                            Skilled agricultural, forestry and fishery workers
                                                                                                 job postings to enrich the structure of our KG. More specifically,
7                            Craft and related trades workers
                                                                                                 we create additional edges by identifying and extracting ESCO
8                            Plant and machine operators, and assemblers
                                                                                                 skills for each job posting’s ISCO occupation group, and assign
9                            Elementary occupations
                                                                                                 weights to edges by relying on co-occurrence statistics of skills and
10                           Armed forces occupations
                                                                                                 occupations.
                    Table 1: The 10 major job groups of the ISCO-08                                 This second step of our process revolves around extracting skills
                                                                                                 from job postings. We describe our job posting dataset in Section
                                                                                                 2.2.1, our approach for skill extraction in Section 2.2.2, and how we
                                                                                                 match extracted skills to ESCO skills in 2.2.3.
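To illustrate how co-occurrence statistics could turn job postings into weighted edges, the sketch below counts how many postings of an ISCO occupation group mention each matched ESCO skill. Using raw counts as weights is an assumption made for illustration only, since the exact weighting is not specified here; the function name and the toy postings are hypothetical.

```python
from collections import Counter


def cooccurrence_edges(postings):
    """Count postings in which an (ISCO code, ESCO skill) pair co-occurs.

    `postings` is an iterable of (isco_code, matched_esco_skills) pairs,
    one per job posting. Raw counts are an assumed weighting scheme.
    """
    weights = Counter()
    for isco_code, skills in postings:
        # set() so that a skill repeated within one posting counts once.
        for skill in set(skills):
            weights[(isco_code, skill)] += 1
    return weights


# Toy postings, invented for illustration.
postings = [
    ("2132", ["use software libraries", "debug software"]),
    ("2132", ["debug software", "debug software"]),
    ("6121", ["feed livestock"]),
]
edges = cooccurrence_edges(postings)
for (isco_code, skill), weight in sorted(edges.items()):
    print(isco_code, skill, weight)
```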
2.1.2 ESCO (skills). We define our initial high-level occupation
groups by using the ISCO standard. For skills, we turn to the Euro-                              2.2.1 Vacancy data. Our vacancy dataset consists of a sample of
pean Skills, Competences, Qualifications and Occupations (ESCO)                                  600,000 Dutch vacancies collected by Jobdigger [11]; each job post-
taxonomy [8]. ESCO defines a skill as follows:                                                   ing is labeled with a level 4 ISCO code. Our sample was chosen


                                                                                          likely incomplete coverage of the TextKernel Extract method we
[Figure 3 flowchart: skill from job posting -> normalize skill -> create n-gram ->        use for skill extraction, and (ii) our skills matching methodology
calculate Jaccard distance to the n-gram of the normalized ESCO skill ->                  further reducing the number of identified skills. As the focus of this
distance greater than threshold? No: match; yes: no match]                                paper is on downstream applications, we consider matching out
                                                                                          of scope, and rely on our naive but solid character 𝑛-grams-based
                                                                                          method.

                                                                                          3     KG COMPLETION USING LINK
                                                                                                PREDICTION
                                                                                          One of the challenges of modeling skills and occupations is the
                                                                                          dynamic nature of the labor market. In this section we explore
               Figure 3: Overview of skill matching process                               our first down-stream application of our data-driven dynamically
                                                                                          constructed Skills & Occupation Knowledge Graph: matching oc-
                                                                                          cupations to skills. We focus on discovering novel connections
by selecting a uniform distribution of ISCO level 1 occupations, to                       between skills and occupations through leveraging the structure of
make sure our set covers the entire breadth of the labor market.                          our knowledge graph enriched with job posting data.
Prior to sampling our set at the ISCO level 1, the initial dataset was                       More specifically, in this section we compare link prediction
cleaned by discarding low quality and noisy job postings, such as                         algorithms, to quantify the relatedness between a skill and occu-
postings that represented multiple occupations, or job postings that                      pation node, in order to discover novel connections between skills
contained a low number of sentences. Here, we treat vacancy data                          and occupations, not present in our initial KG. We describe our two
as a proxy for the demand in the job market. By doing so, internal                        link prediction methods in the following sections: Preferential
promotions and career paths and informal channels are not taken                           Attachment in Section 3.2.1 and Node2Vec in Section 3.2.2.
into account.
2.2.2 Skill Extraction. For skill extraction we rely on the industry-
standard Textkernel Extract [22] parser. For each vacancy text,
                                                                                          3.1    Experimental setup
Textkernel Extract returns a json object with corresponding skills,                       We employ link prediction to estimate the relatedness between skills
represented by the surface form identified in the job posting (skill                      and occupation nodes. To evaluate and reliably compare different
mention), a unique identifier representing the skill (skill id), and                      methods, we first split our KG into train, test, and validation sets.
finally, a confidence score that quantifies the likelihood of the ex-                     More specifically, we sample 55% of all edges for training the link
tracted skill to be correct.                                                              prediction algorithms (where applicable), leaving 30% for
                                                                                          testing, and 15% for validation. For each existing pair of occupation
2.2.3 Skill Matching. Given the skills extracted by Textkernel, we                        and skills node — which we consider a positive sample in our train,
match them to the skill nodes in our KG, by relying on the surface                        test and validation sets — we randomly generate a negative sample
forms of the skills (skill mentions). More specifically, we leverage                      (i.e., a pair of skills and occupation nodes that do not exist in our
character 𝑛-grams Jaccard similarity between the normalized skill                         KG). An overview of the number of edges in each set is shown in
mention and the normalized ESCO skill names. We set the similarity                        Table 2.
threshold to 0.66, which was empirically determined to be optimal
using a smaller set of our 39,758,827 Textkernel skills to ESCO
                                                                                                                               Positive    Negative
skill-mappings. The high-level process is shown in Figure 3.
                                                                                                        Training edges             2151          2151
2.3     Final Skills & Occupational Knowledge                                                           Validation edges            586           586
        Graph                                                                                           Test edges                 1173          1173
Our final KG, resulting from the process shown in Figure 1 and                                          Total                      3910          3910
described in the previous section, consists of 1,220 nodes, of which                      Table 2: Number of positive and negative edges with a training (55%),
983 represent (ESCO) skills, and 237 (ISCO) occupations. These                            validation (15%), test (30%) split
nodes are connected through 3,910 edges, with an average node
degree of 6.4.
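The skill matching step of Section 2.2.3 can be sketched as follows. This is a minimal illustration and not our production code: the n-gram size (n = 3), the normalization (lowercasing and whitespace collapsing), and treating a similarity equal to the threshold as a match are all assumptions; only the 0.66 threshold is taken from the text. All function names are hypothetical.

```python
def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams; short strings yield themselves."""
    if n > len(text):
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_skill(mention, esco_skills, threshold=0.66, n=3):
    """Return (best ESCO skill or None, best similarity) for a skill mention."""
    grams = char_ngrams(normalize(mention), n)
    best_name, best_score = None, 0.0
    for name in esco_skills:
        score = jaccard(grams, char_ngrams(normalize(name), n))
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


esco_skills = ["feed livestock", "assist animal birth"]
print(match_skill("Feed  Livestock", esco_skills))
print(match_skill("feeding livestock", esco_skills))
```

The second call shows the cost of a strict threshold: a close paraphrase can fall below it and be dropped, which is one source of the coverage loss discussed in this section.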
   This KG is a subset of the full ESCO (13,485 skills), and ISCO
(436 occupations) taxonomies. There are several reasons why our
KG is a subset and does not span the entirety of the ISCO and ESCO                        3.2    Link Prediction Methods
taxonomies.                                                                               3.2.1 Method 1: Preferential Attachment (PA). The first link predic-
   First, it is conceivable that not all ISCO occupations are in cur-                     tion method is preferential attachment [15]. This method takes a
rent demand, e.g., we found that there were no vacancies for ISCO                         pair of nodes, 𝑢 and 𝑣, and calculates a closeness score (𝐶)
occupation code 8111: “mining-plant operators,” which is not sur-                         between them:
prising, as there are currently no mines in operation in the Netherlands.
Next, it is likely we are dealing with coverage issues, from (i) the                                            𝐶 (𝑢, 𝑣) = |Γ(𝑢)| × |Γ(𝑣)|,                        (1)


where Γ(𝑢) denotes the neighbors of 𝑢.
   A higher score here corresponds to a higher probability that the nodes
are connected. The intuition behind this is that if both nodes have
many neighbors, they may function as hubs.
Most graphs have the property that hubs have a higher chance to
be connected.
   To compute all scores, we represent our KG as a matrix, where
each node is represented as a row and a column. Note that this
matrix is symmetric since the value for row 𝑢 and column 𝑣 is equal
to the value at row 𝑣 and column 𝑢. At the intersecting cell of two
nodes, we store the preferential attachment score. We normalize this
matrix by dividing each score by the maximum Closeness score, to
ensure that each value is between 0 and 1. We consider the resulting      [Figure 4 plot: F1-score of the positive class (y-axis, 0.0 to 1.0)
normalized Closeness score as the probability the corresponding           against the ratio of negative edges / positive edges (x-axis, 1 to
nodes are related.                                                        10); series: Preferential Attachment, Node2Vec]
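As a minimal sketch of Equation (1) and the normalization step, the code below scores every unordered node pair of a toy graph and divides by the maximum score. Storing scores in a dictionary instead of a full symmetric matrix and skipping self-pairs are assumptions made for brevity; the graph and the function name are hypothetical.

```python
def preferential_attachment(adjacency):
    """Normalized preferential attachment score for every unordered node pair.

    `adjacency` maps each node to the set of its neighbors.
    """
    nodes = sorted(adjacency)
    degree = {u: len(adjacency[u]) for u in nodes}
    # C(u, v) = |neighbors(u)| * |neighbors(v)|, as in Equation (1).
    raw = {
        (u, v): degree[u] * degree[v]
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
    }
    top = max(raw.values(), default=0)
    if top == 0:
        return {pair: 0.0 for pair in raw}
    # Dividing by the maximum keeps every score between 0 and 1.
    return {pair: score / top for pair, score in raw.items()}


# Toy bipartite graph, invented for illustration.
adjacency = {
    "occ:A": {"skill:x", "skill:y"},
    "occ:B": {"skill:y"},
    "skill:x": {"occ:A"},
    "skill:y": {"occ:A", "occ:B"},
}
scores = preferential_attachment(adjacency)
print(scores[("occ:B", "skill:x")])
```

Because the score depends only on node degrees, two well-connected nodes always receive a high score regardless of what they are connected to.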
3.2.2 Method 2: Node2Vec (N2V). The second link prediction method
we use is the Node2Vec algorithm [12]. This algorithm can have
a number of configurations. For this paper we use the following           Figure 4: Comparison of Node2Vec and Preferential Attachment for
parameters:                                                               different ratios of negative to positive edges

     • dimensions = 1024
     • walk length = 4
     • number of walks = 2500
                                                                          3.4    Analysis
     • 𝑝 (return parameter) = 1                                           Now that it has been established that N2V is more suitable for our
     • 𝑞 (in-out parameter) = 1                                           task, we aim to employ this algorithm to predict the relationships
   These parameters were selected after a grid search on a large          between occupations and skills. When doing so we need to realize
number of possible combinations of parameters.                            that the graph which we use as input is imperfect in terms of
                                                                          correctness and completeness [19].
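With 𝑝 = 1 and 𝑞 = 1, the walk bias of Node2Vec [12] is the same for every neighbor, so on an unweighted graph the sampled walks are plain uniform random walks. The sketch below generates such walks with the walk length and number of walks listed above; ignoring edge weights is an assumption, and the step that learns 1024-dimensional embeddings from the walks is not shown. The toy graph and the function name are hypothetical.

```python
import random


def uniform_walks(adjacency, walk_length=4, num_walks=2500, seed=0):
    """Sample `num_walks` uniform random walks of `walk_length` nodes per start node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(num_walks):
            walk = [start]
            for _ in range(walk_length - 1):
                neighbors = sorted(adjacency[walk[-1]])
                if not neighbors:
                    break  # dead end: keep the shorter walk
                # p = q = 1: every neighbor is equally likely.
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy bipartite graph, invented for illustration.
adjacency = {
    "occ:A": {"skill:x", "skill:y"},
    "occ:B": {"skill:y"},
    "skill:x": {"occ:A"},
    "skill:y": {"occ:A", "occ:B"},
}
walks = uniform_walks(adjacency)
print(len(walks), walks[0])
```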
3.3     Results                                                              Looking at the false positives of the algorithm, skills that are —
                                                                          according to our dataset — incorrectly linked to occupations can
Table 3 shows the performance of both Preferential Attachment
                                                                          be identified. For KG completion, we aim to identify those skills
(PA) and Node2Vec (N2V).
                                                                          that are not linked to occupations, but should be. Table 4 shows a
                                                                          random sample of False Positives: it reinforces our intuition that
                     class    precision      recall    f1-score
                                                                          link prediction can be employed for KG completion, as some of the
                     0.0      0.83           0.64      0.72               predicted edges make sense, e.g., the skill: “preparing materials for
             PA
                     1.0      0.71           0.87      0.78               dental procedures” is shown as a relevant skill for the occupation:
                     0.0      0.66           0.90      0.76               “dentist.” By consulting domain experts, skills can be efficiently
             N2V                                                          added to enrich the current graph.
                     1.0      0.84           0.53      0.65
                                                                             To further explore these intuitions, in Figure 5 we show the edges
Table 3: Precision, recall and F1-scores of multiple link prediction
                                                                          to skill nodes predicted by N2V, for the node representing ISCO
algorithms with an equal number of positive and negative edges
                                                                          code 2611: “Lawyers.” The y-axis shows skill edges, and the x-axis
used for training
                                                                          shows the link prediction probabilities, for all predictions with a
                                                                          probability>0.5 (i.e., positive predictions by the method). The green
                                                                          bars denote True Positives (i.e., correctly predicted edges between
   When the number of positive and negative edges in the test set
                                                                          the skill and occupation), and blue bars depict False Positives (skills
is equal, PA outperforms the more complex N2V method, with an
                                                                          that are predicted to have an edge with the occupation, but do not
f1-score for the positive class of 0.78 against 0.65. In most realistic
                                                                          exist in our KG). The figure shows “education law” and “investiga-
situations however, we may want to explore how a node can be
                                                                          tion research methods” as newly identified skills for lawyers, not
linked to any other node, making the number of comparisons, or
                                                                          found in the original ESCO taxonomy nor in co-occurrences in job
edges to predict 1-to-(N-1), i.e., each node is compared with every other
                                                                          postings.
node (excluding itself). To approximate this real-world performance,
the ratio of negative to positive edges should reflect these more
realistic proportions. To do so we compute F1-score at increasing
                                                                          4    CAREER PATHFINDING USING SHORTEST
ratios of negative-to-positive edges, ranging from 1 (as shown in              PATH ALGORITHMS
Table 3) to 7. Results are shown in Figure 4. The figure shows that       According to recent data (2019), 1.1 million people switched occupa-
up to a ratio of 3:1, N2V is on par with PA, but as ratios increase,      tion in the Netherlands [6]. When transitioning from one job to
N2V outperforms PA, suggesting N2V is better suited for most real-        another, the gap between both jobs cannot be too large. This gap
world situations.                                                         can be considered too large if the required skills for one differ too


                                               ISCO-Code         Occupation                          Predicted Skill
                                               1341              Child care services managers        children’s physical development
                                               2261              Dentists                            prepare materials for dental procedures
                                               3251              Dental assistants and therapists    dentistry science
                                               4110              General office clerks               demonstrate professional attitude to clients
                                               5411              Fire fighters                       safety engineering
                                               6121              Livestock and dairy producers       promote animal welfare
                                               7132              Spray painters and varnishers       spray pesticides
                                               8344              Lifting truck operators             hazardous materials transportation
                                               9111              Domestic cleaners and helpers       provide lawn care
                                                       Table 4: False positives: edges predicted by N2V that do not exist in our KG


[Figure 5: horizontal bar chart titled “Skills prediction for a lawyer”;
horizontal axis from 0.0 to 1.0. Skills listed, top to bottom:
international law, intellectual property law, provide legal advice,
civil process order, environmental legislation, joint ventures,
property law, moderate in negotiations, commercial law, employment law,
contract law, think analytically, international trade, mergers and
acquisitions, observe confidentiality, negotiate in legal cases, legal
case management, show responsibility, education law, tax legislation,
investigation research methods.]
Figure 5: Predictions of the Node2Vec algorithm for ISCO group 2611
(Lawyers)

much from the other. Consequently, occupations that share a large
number of skills should be easier to transfer between. In this section
we focus on leveraging skills to better inform transitions between
occupations. More specifically, we aim to leverage the KG structure for
matching occupations with occupations, to identify how an individual
can change jobs in the most efficient way.

4.1          Skills-based Occupation Similarity
To determine the feasibility of an occupation transfer, we propose to
model the distance between occupations with Jaccard distance. We
compute Jaccard distances between occupations by representing each
occupation as the set of its required skills (which we extract from our
KG), and computing one minus the overlap (the size of the intersection
divided by the size of the union) of the two skill sets. See Figure 6
for an illustration.

[Figure 6: graph with occupation nodes A and B and skill nodes C, D, E,
and F; two edges are labeled 0.5.]
Figure 6: Jaccard distance in a graph where nodes {A, B} are
occupations and nodes {C, D, E, F} are skills. Solid lines denote direct
connections, dashed lines denote Jaccard distance.

   In our KG a total of 120,952 links can be made between pairs of
skills and pairs of occupations. Of these pairs, 89.3% are between
skills and 10.7% between occupations. To gain insight into the overall
similarity of skills and occupations, we study the distribution of
Jaccard distances in Figure 7.

[Figure 7: histogram; horizontal axis “Jaccard Distance” from 0.0 to
1.0; vertical axis “Count” from 0 to 17500; legend “Type”: occupation,
skill.]
Figure 7: Distribution of the Jaccard distance, where orange represents
the skills and blue represents the occupations

   Looking at the distribution of Jaccard distance one can see that, on
average, skills are more similar to one another than occupations. This
becomes apparent when comparing the central values of both
distributions in Table 5: for occupations the mean is 0.938 (median
0.960), and for skills the mean is 0.825 (median 0.875). Over 99% of
occupation pairs have a Jaccard distance between 0.8 and 1, meaning
that occupations require distinct skillsets. Both distributions are
skewed to the left, meaning that the mean (average of the observations)
is left of the mode (most observed value).
   In the distribution we see a number of spikes, which can be
explained by the prevalence of some fractions over others, e.g., if
half of the neighbors are shared, the Jaccard distance will be 1/2,
which can be achieved in a number of different ways. Other spikes occur
at additional common fractions such as 2/3 and 3/4.
   Table 5 summarizes the distance distributions. For both skills and
occupations the minimum distance is 0, meaning that two skills are
connected to exactly the same occupations, or that two occupations
share every skill. An example is “Food service counter attendants” and
“Hotel receptionists,” which share the same skillset and thus have a
Jaccard distance of 0. Examples of skills with a distance of 0 are
“Lop trees” and “Pruning techniques.”
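
To make the distance concrete, the following minimal Python sketch computes the Jaccard distance between two skill sets and, anticipating Section 4.2, runs Dijkstra's algorithm over a small graph in which steps longer than the 0.8 threshold are not allowed. The function names, the toy skill sets, and the assignment of the edge labels of Figure 8 to specific node pairs are illustrative assumptions; this is not the implementation used for the paper.

```python
import heapq


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def shortest_transition(distances, start, goal, max_step=0.8):
    """Dijkstra over an undirected graph given as {(u, v): distance}.

    Edges with a distance above max_step are treated as too large a step
    and are left out. Returns (total_cost, path) or None if no path exists.
    """
    graph = {}
    for (u, v), d in distances.items():
        if d <= max_step:
            graph.setdefault(u, []).append((v, d))
            graph.setdefault(v, []).append((u, d))

    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, d in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + d, neighbour, path + [neighbour]))
    return None


# Hypothetical skill sets, for illustration only (not taken from the KG).
cook = {"food hygiene", "prepare ingredients", "use kitchen equipment"}
baker = {"food hygiene", "prepare ingredients", "operate oven"}
print(jaccard_distance(cook, baker))  # 2 shared skills out of 4 in total

# Edge weights as read from Figure 8; which label belongs to which pair
# (other than W-Z = 0.9, stated in the text) is an assumption.
FIGURE_8_DISTANCES = {
    ("W", "X"): 0.2, ("X", "Z"): 0.6,
    ("W", "Y"): 0.4, ("Y", "Z"): 0.5,
    ("W", "Z"): 0.9,
}
print(shortest_transition(FIGURE_8_DISTANCES, "W", "Z"))
```

Under these assumed weights the direct step from W to Z is excluded by the threshold, and the route via X (total distance 0.8) is preferred over the route via Y (total distance 0.9), which agrees with the outcome reported for Figure 8.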
                              Skill   Occupation        Total
                count        107959        12993       120952
                mean          0.825        0.938        0.837
                std           0.163        0.070        0.160
                min           0.000        0.000        0.000
                25%           0.800        0.928        0.800
                50%           0.875        0.960        0.888
                75%           0.923        0.977        0.933
                max           0.985        0.993        0.993
             Table 5: Statistics of the Jaccard distance distributions

The highest distance found in the dataset is 0.993; it corresponds to
the occupations “Electronics engineers” and “Policy administration
professionals.” They share at least one skill but are otherwise
completely different. The common skill in this example is “perform
project management.”

4.2     Career Pathfinding using Dijkstra’s algorithm
With the distances between each pair of occupations and between each
pair of skills, we can proceed to identify the most efficient
transition between every pair of occupations. This is done by assigning
the Jaccard distance scores as edge weights between nodes in our graph,
to enable computational methods for finding the most efficient path
between a start node (the current occupation) and an end node (the
desired occupation). We show an example of such a transition in
Figure 8: here we set a threshold for the maximum possible distance at
0.8. This threshold was chosen informally, by inspecting and comparing
different cutoff points. If two occupations are further apart than 0.8
we consider the step too large.

[Figure 8: graph over the occupations W, X, Y, and Z with edge labels
0.2, 0.6, 0.9, 0.4, and 0.5; the edge between W and Z has distance 0.9.]
Figure 8: Distance between the occupations {𝑊, 𝑋, 𝑌, 𝑍}. Black lines
denote distances lower than 0.8. Red lines denote distances higher
than 0.8.

   In this example we start at node 𝑊 and want to go to node 𝑍. We are
not able to directly transition between 𝑊 and 𝑍 because the occupations
are not similar enough (0.9 > 0.8).
4.2.1 Method. Finding the most efficient path in an undirected weighted
graph can be done by applying shortest path algorithms. For this paper
we turn to Dijkstra’s algorithm [7], because of its proven speed and
the widespread availability of implementations. According to Dijkstra’s
algorithm, the shortest allowed path between 𝑊 and 𝑍 in Figure 8 is via
node 𝑋.
   We show a real-world example in Figure 9. Due to the COVID-19
pandemic many people find themselves out of a job, especially
individuals who work in restaurants. Using the described model we can
calculate which occupation has the smallest distance to the occupation
“cook.” Dijkstra’s algorithm yields “bakers, pastry-cooks and
confectionery makers” as the most feasible transition.

Figure 9: The shortest path between the occupation “Cook” and the
closest connected occupation, in this case “Bakers, pastry-cooks and
confectionery makers.”

5     MOST RELEVANT SKILLS PER OCCUPATION GROUP
Next to fine-grained analysis of occupations and skills, gaining
macro-level insights is an important task for monitoring and
understanding the labor market. The ISCO taxonomy provides multiple
levels of granularity, which allows us to aggregate the information
contained in our KG at different levels, too. In this section we
explore a method for identifying the most relevant skills for
occupations (ISCO level 4) and aggregations of occupations (ISCO
levels 1-3). More specifically, we match skills to occupations at an
aggregated level.
   As we have seen in the previous section, different occupations may
share skills. Several skills, such as teamwork, are commonly required
for a large number of occupations, and can be considered generic or
sector-independent skills. On the other hand, we may have highly
specialized skills that are only required for specific occupations or
occupation groups. Whether a skill is specifically or generically
important can be quantified in different ways. For a skill to be
specific to an occupation or occupational group, we define two
criteria:
      • A skill needs to be frequently required within its context
        (occupation or occupation group).
      • A skill needs to be characteristic for its context.

5.1     Method
The two criteria described above fit naturally to the Term
Frequency–Inverse Document Frequency (TF-IDF) weighting scheme


for terms [21]. This statistic is chosen as it directly models the
criteria described above. More specifically, TF-IDF is used to assign
weights to words in a corpus of documents, where a word is deemed more
important if it (i) is observed frequently within the document but
(ii) not frequently across different documents in the corpus.

    \mathrm{TF\text{-}IDF}(t, d) = tf_{t,d} \times \log \frac{N}{df_t + 1},    (2)

where $tf_{t,d}$ denotes the Term Frequency of $t$ in $d$, $df_t$
denotes the number of documents containing $t$, and $N$ denotes the
total number of documents in the corpus.

[Figure 10: tree of ISCO groups, each listed with its three most
relevant skills; indentation shows the hierarchy.]
Professionals (2): Network marketing; Manage online communications;
    Communication
  Health professionals (22): Coordinate care; Have computer literacy;
      Citizen involvement in healthcare
    Nursing and midwifery professionals (222): Coordinate care; Have
        computer literacy; Solve problems in healthcare
      Nursing professionals (2221): Coordinate care; Have computer
          literacy; Solve problems in healthcare
    Other health professionals (226): Radiofarmaceutica; Work
        analytically; Analytical chemistry
      Dentists (2261): Dental studies; Lead the dental team; Handle
          payments in dentistry
      Pharmacists (2262): Radiopharmaceuticals; Work analytically;
          Analytical chemistry
  Teaching professionals (23): Communication sciences; Communication
      studies; ICT communications protocols
    Other teaching professionals (235): Communication; Communication
        disorders; Microsoft Visio
      Special needs teachers (2352): Education law; Communication
          disorders; Pedagogy
Figure 10: Three most relevant skills for multiple levels in major
ISCO group 2

   We “transplant” this TF-IDF weighting scheme from terms in documents
to skills associated to occupations. TF-IDF consists of two parts: Term
Frequency (TF) is the frequency of a word (skill) used in a given
document (observed with an occupation); Inverse Document Frequency
(IDF) is a way to discount highly common terms, i.e., it is high when a
word (skill) appears in a smaller number of documents (observed with a
low number of occupations). Common terms (skills) will thus yield a
lower IDF score.
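
As an illustration of Equation 2 applied to skills and occupation groups, the sketch below scores each skill per group. It assumes that term frequency is the relative frequency of a skill within a group and that the logarithm is natural; neither choice is confirmed by the text. The function names and the counts are hypothetical and do not come from the job posting data.

```python
import math
from collections import Counter


def tf_idf(skill_counts_per_group):
    """Score skills per group with tf * log(N / (df + 1)), as in Equation 2.

    skill_counts_per_group maps a group name to {skill: count}.
    tf is taken as the skill's share of all skill mentions in the group
    (an assumption); df is the number of groups in which the skill occurs.
    """
    n_groups = len(skill_counts_per_group)
    df = Counter()
    for counts in skill_counts_per_group.values():
        df.update(counts.keys())  # each group counts once per skill

    scores = {}
    for group, counts in skill_counts_per_group.items():
        total = sum(counts.values())
        scores[group] = {
            skill: (count / total) * math.log(n_groups / (df[skill] + 1))
            for skill, count in counts.items()
        }
    return scores


def top_skills(scores, group, k=3):
    """Return the k highest-scoring skills for one group."""
    ranked = sorted(scores[group].items(), key=lambda item: item[1], reverse=True)
    return [skill for skill, _ in ranked[:k]]


# Hypothetical counts, for illustration only.
TOY_COUNTS = {
    "Health professionals": {"coordinate care": 6, "communication": 4},
    "Teaching professionals": {"pedagogy": 5, "communication": 5},
    "Managers": {"microsoft office": 7, "communication": 3},
}

scores = tf_idf(TOY_COUNTS)
print(top_skills(scores, "Health professionals"))
```

In this toy example “communication” occurs in every group, so with the +1 in the denominator its IDF term is log(3/4), which is negative, and it ranks below the group-specific skill regardless of its frequency.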
   For our TF-IDF-based model, we consider skills identified in job
postings as terms, and documents are modeled as the collection of job
postings belonging to an ISCO group. The counts of skills, which model
term frequency, correspond to the number of times a skill is found in a
job posting associated to a certain ISCO code.

5.2     Results and analysis
5.2.1 Level 1 ISCO groups. The resulting score provides us with skills
that are common for a given occupation (group) but uncommon in all
other occupations (groups). Table 6 shows the top 5 skills for the
level 1 ISCO groups.
   In this table Microsoft Office appears both in the Managers and
Clerical support workers groups. For this skill to score high in
multiple contexts (occupation groups) the frequencies need to be
substantial in both, to be able to compensate for the IDF component of
the metric. In the Managers group, Microsoft Office has a TF of 9% and
in Clerical support workers a TF of 5%.

5.2.2 Multiple ISCO levels. ISCO level 1 helps us to understand which
skills are relevant for the least granular level; to deepen our
understanding we look at the development across multiple levels of
ISCO group 2 in Figure 10. Here, we show the 3 most relevant skills for
several ISCO levels of the “Professionals” ISCO group.
   We notice the following: First, communication-related skills appear
in multiple forms across occupation groups. The terms communication,
communication sciences, communication studies, ICT communications
protocols, manage online communications and communication disorders
seem to be closely related. Because these skills are defined as
distinct skills, each skill receives its own ranking, so the same
underlying concept can appear multiple times.
   Next, “Nursing professionals” and “Nursing and midwifery
professionals” share the same set of relevant skills, which are highly
similar to those of their parent group “Health professionals”. Skills
that appear in those groups are the most frequent skills in the parent
group.
   Finally, the further down the figure we go, the more specialized the
skills appear to be: more specialized skills, such as “dental studies,”
are more commonly observed in level 4 ISCO groups. A possible
explanation for this is that specialized skills only appear at
specialized occupations.

6       CONCLUSION
In recent years the labor market has changed drastically. This is
mostly due to increased globalization, a growing working population and
disappearing jobs due to digitalization. The COVID-19 pandemic has
accelerated this change. This paper aims to explore algorithmic and
data-driven methods for exploring and improving the fit between job
seekers and vacancies by modeling skills and occupation data in a
knowledge graph. Modeling and leveraging relationships between
occupations and skills can provide insights for job seekers with
existing skill sets.
   After constructing our knowledge graph by relying on the existing
ISCO and ESCO taxonomies for occupations and skills, we enrich our KG
by relying on job posting data.
   We explore our final KG using three different applications.
   First, we study link prediction methods for quantifying the
relatedness between skills and occupations in Section 3. We compare and
evaluate two different link prediction methods, and find that
“Node2Vec” performs best. Next to quantifying relatedness between
occupations and skills for, e.g., ranking skills for an occupation or
using as edge weights in our KG, we explore Node2Vec for identifying
skills-to-occupation links that are not present in the original KG.
   Next, in Section 4 we explore our KG for finding efficient job
transitions. When an individual is searching for a job, knowing which
occupations are within reach can help the search process. In this
application we explore shortest path finding algorithms for identifying
potential career paths. We use a skills-based Jaccard distance to model
the distance between occupations. Furthermore, we show examples of job
transitions and study properties of our KG by analyzing the
distribution of distances between skills and occupations.


    Managers                                        Professionals                                    Technicians and associate professionals
1   Microsoft Office                                Network Marketing                                Marker Making
2   Service-oriented Modelling                      Manage Online Communications                     Electronic Communication
3   Communication Principles                        Communication                                    Service-oriented Modelling
4   Electronic Communication                        Explain Accounting Records                       Education Administration
5   Coordinate Patrols                              Accounting                                       Manage Standard ERP System
    Clerical support workers                        Service and sales workers                        Skilled agricultural, forestry and fishery workers
1   Execute Administration                          Security Panels                                  Leadership Principles
2   Perform Clerical Duties                         Electronic Communication                         Agricultural Information Systems and Databases
3   Microsoft Office                                Create Solutions to Problems                     Pruning Techniques
4   Education Administration                        Execute Administration                           Spray Pesticides
5   Human Resource Management                       Recreation Activities                            Lop Trees
    Craft and related trades workers                Plant and machine operators, and assemblers      Elementary occupations
1   Attend to Detail in Casting Processes           Mechatronics                                     Inventory Management Rules
2   Attention to Detail                             Mechanical Engineering                           Have Computer Literacy
3   Adobe Illustrator                               Electrical Engineering                           Carpentry
4   Adobe Photoshop                                 Operate Soldering Equipment                      Place Concrete Forms
5   ML (computer programming)                       Act Reliably                                     Operate on-board Computer Systems
                                  Table 6: Five most relevant skills per major ISCO group based on the TF-IDF metric


   Finally, in Section 5 we study a method to determine which skills
are most relevant to different levels of aggregated occupations, using
the ISCO taxonomy. The skill relevance to an ISCO (group) is calculated
by combining the frequency of the skill being required for an ISCO
(group) with the uniqueness of the skill in the overall ISCO taxonomy.
Here, the uniqueness is high if a skill occurs more often in one group
compared to the other groups. The metric that reflects this intuition
is called “TF-IDF.” By doing so we construct a bird’s-eye view of the
labor market.
   The findings from the three sections described above are all
variations on the same theme of finding or enabling the perfect fit
between a job seeker and a vacancy, by leveraging skills.

7    DISCUSSION & FUTURE RESEARCH
In this paper we present different KG-driven applications for
skills-based job matching. In principle, the methods presented are
data-agnostic, as long as similar concepts (occupations and skills) and
data (job postings with identified occupations and skills) are
available. More specifically, we leverage the ISCO and the ESCO
taxonomies, which are available in a large number of languages, and are
considered standards that are freely available. Other frameworks could
be used as well: where ESCO is widely used in Europe, the O*NET
framework [18] is often referred to as the de facto standard in the
United States.
   The outcome of any research is heavily dependent on the available
data. In the case of this research this data is preprocessed in a
number of steps, one of which is the skill matching step described in
Section 2.2.3. We opted for a naive character 𝑛-gram based method for
matching surface forms found in a job posting with skill names in ESCO.
Obviously, more refined methods can be employed, e.g., by considering
additional representation of the skill in the job posting (contextual
words, occupation), and at the same time additional context at the side
of the KG (e.g., skill descriptions, associated occupations, etc.). In
general, this problem of matching can be considered an entity linking
task, which is considered out of scope for this application paper.
Having a flawed knowledge graph as a result of sub-optimal
preprocessing does not invalidate the methods used. Whichever approach
is used to create a knowledge graph, the outcome will never be
perfect [3].
   Finally, two out of three applications of our KG are not validated
empirically: for both our shortest path finder (Section 4) and
identifying the most relevant skill per ISCO group (Section 5), we
focused on the analysis and interpretation of results, omitting a more
formal evaluation methodology. For future research it would be
interesting to benchmark the current approach against different career
path prediction models. Validating whether, e.g., the discovered paths
between occupations are indeed the shortest ones in practice requires
additional data. Unfortunately, no such data was available at the time
of writing. One way to acquire such data is, e.g., by collecting data
on historic career paths. However, collecting and composing such data
was determined out of scope for this work. The same holds for the
method for quantifying the relevance of skills per ISCO group; these
aggregated insights were difficult to validate. We could imagine
involving human expert annotators to annotate which skills they deem
(most) relevant to a certain ISCO (group). However, similarly to the
above, collecting and analyzing such data did not fit in the scope of
our present work. In summary, our paper revolves around studying
algorithmic methods that aim to help both jobseekers and recruiters
find a better match between individuals and occupations; we consider
studies with actual end-users out of scope [13].

ACKNOWLEDGMENTS
A special thanks to the thesis supervisors for the project, Niels van
Weeren and Prof. Aske Plaat, as well as everybody at Randstad involved
in this project.


REFERENCES
 [1] İ. Semih Akçomak, Lex Borghans, and Bas ter Weel. 2011. Measuring and Interpreting
     Trends in the Division of Labour in the Netherlands. De Economist 159, 4 (01 Dec 2011),
     435–482. https://doi.org/10.1007/s10645-011-9168-3
 [2] Pol Antràs, Luis Garicano, and Esteban Rossi-Hansberg. 2005. Offshoring in a Knowledge
     Economy. Working Paper 11094. National Bureau of Economic Research.
     https://doi.org/10.3386/w11094 http://www.nber.org/papers/w11094
 [3] Antoine Bordes and Evgeniy Gabrilovich. 2014. Constructing and Mining Web-Scale
     Knowledge Graphs: KDD 2014 Tutorial. In Proceedings of the 20th ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD ’14).
     Association for Computing Machinery, New York, NY, USA, 1967.
     https://doi.org/10.1145/2623330.2630803
 [4] Lex Borghans, Bas Ter Weel, and Bruce A. Weinberg. 2014. People Skills and the
     Labor-Market Outcomes of Underrepresented Groups. ILR Review 67, 2 (2014), 287–334.
     https://doi.org/10.1177/001979391406700202
 [5] Nicole Bosch and Bas Weel. 2013. Labour-Market Outcomes of Older Workers in the
     Netherlands: Measuring Job Prospects Using the Occupational Age Structure.
     De Economist 161 (06 2013). https://doi.org/10.1007/s10645-013-9202-8
 [6] Centraal Bureau voor de Statistiek. [n.d.]. De arbeidsmarkt in cijfers.
     https://www.cbs.nl/-/media/_pdf/2020/18/dearbeidsmarktincijfers2019.pdf
 [7] Edsger W Dijkstra et al. 1959. A note on two problems in connexion with graphs.
     Numerische mathematik 1, 1 (1959), 269–271.
 [8] European Commission. [n.d.]. ESCO handbook.
     https://ec.europa.eu/esco/portal/document/en/0a89839c-098d-4e34-846c-54cbd5684d24
 [9] Eurostat. [n.d.]. Labour market transitions – annual data.
     https://ec.europa.eu/eurostat/web/lfs/data/database
[10] World Economic Forum. 2020. The Future of Jobs Report 2020. World Economic Forum,
     Geneva, Switzerland.
[11] Burning Glass. [n.d.]. Vacancy data. https://www.jobdigger.nl/
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for
     networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
     discovery and data mining. 855–864.
[13] Francisco Gutiérrez, Sven Charleer, Robin De Croon, Nyi Nyi Htun, Gerd Goetschalckx,
     and Katrien Verbert. 2019. Explaining and Exploring Job Recommendations: A User-Driven
     Approach for Interacting with Knowledge-Based Job Recommender Systems. In Proceedings
     of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19).
     Association for Computing Machinery, New York, NY, USA, 60–68.
     https://doi.org/10.1145/3298689.3347001
[14] International Labour Office. [n.d.]. International Standard Classification of
     Occupations. https://www.ilo.org/public/english/bureau/stat/isco/
[15] David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social
     networks. Journal of the American society for information science and technology 58, 7
     (2007), 1019–1031.
[16] OECD. [n.d.]. Employment by job tenure intervals - average tenure.
     https://stats.oecd.org/Index.aspx?DataSetCode=TENURE_AVE
[17] OECD. [n.d.]. FTPT employment based on national definitions.
     https://stats.oecd.org/Index.aspx?DataSetCode=FTPTN_D
[18] O*NET. [n.d.]. O*NET OnLine. https://www.onetonline.org/
[19] Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and
     evaluation methods. Semantic web 8, 3 (2017), 489–508.
[20] Gang Peng. 2017. Do computer skills affect worker employment? An empirical study from
     CPS surveys. Computers in Human Behavior 74 (2017), 26–34.
     https://doi.org/10.1016/j.chb.2017.04.013
     http://www.sciencedirect.com/science/article/pii/S0747563217302510
[21] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008. Introduction to
     information retrieval. Vol. 39. Cambridge University Press Cambridge.
[22] Textkernel. [n.d.]. Extract. https://www.textkernel.com/nl/solution/extract/

</pre>