Job Posting-Enriched Knowledge Graph for Skills-based Matching

Maurits de Groot∗ (maurits.degroot@live.nl), Leiden University, Leiden, The Netherlands
Jelle Schutte (jelle.schutte@randstad.com), Randstad, Diemen, The Netherlands
David Graus (david.graus@randstadgroep.nl), Randstad Groep Nederland, Diemen, The Netherlands

∗ Work done while on internship at Randstad Groep Nederland.

RecSys in HR 2021, October 1, 2021, Amsterdam, the Netherlands. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
The labor market is constantly evolving. Occupations are changing, being added, or disappearing to fit the needs of today’s market. In recent years the pace of this change has accelerated, due to factors such as globalization, digitization, and the shift to working from home. Many factors are relevant when selecting employment, e.g., cultural fit, compensation, and the degree of freedom provided. To successfully fulfill an occupation, the gap between required (by the job) and possessed (by the job seeker) skills needs to be as small as possible. Decreasing this skill gap improves the fit between a job candidate and an occupation.

In this paper we propose a custom-built Skills & Occupation Knowledge Graph (KG) that fits the dynamic nature of the labor market described above, by leveraging existing skills and occupation taxonomies enriched with external job posting data.

We leverage this KG and explore several applications for skills-based matching of jobs to job seekers. First, we study link prediction as a means to quantify the relevance of skills to occupations, which can help in prioritizing learning and development of employees. Next, we study node similarity methods and shortest path algorithms for career pathfinding. Finally, we leverage a term weighting method for identifying which skills are most “distinctive” for different (types of) occupations.

CCS CONCEPTS
• Computing methodologies → Ontology engineering; • Theory of computation → Graph algorithms analysis; • Information systems → Content analysis and feature selection.

KEYWORDS
labor market, skill matching, knowledge graphs

1 INTRODUCTION
In recent years the number of people changing jobs has increased [9], the average duration of a position has shortened [16], and the total working population has grown [17]. Due to increasing globalization, the number of possible job candidates per position is higher. Candidates also have, on average, a higher level of education than a number of years ago [1]. This results in a rapidly increasing number of potential job candidates, and the labor market is more competitive than it has ever been [2].

In addition, with the demand for skills changing over time, having the correct skills for specific occupations is more crucial than ever. The increasing amount of digitization has made computer skills more valuable [20]. The COVID-19 pandemic has resulted in a double-disruption effect where technological adoption is accelerated and companies lay off employees [10]. Most aging workers do not possess the newly required technical skills, which leads to fewer job opportunities [5]. Not only technical skills are important; having good people skills is becoming increasingly important as well [4].

The volatility in the labor market results in a change of occupations with new required skills, and being able to keep up with the latest developments is a challenge. To find relevant vacancies and job postings, individuals can use external services to match their skills with their desired work. In 2019, employment agencies were responsible for filling 10% of the available jobs in the Netherlands [6].

As explained above, in recent years the labor market has become more competitive, and requirements more dynamic. As a result, there is a rising interest in skill-based matching of candidates to jobs [10], as the desired profiles for a given occupation are no longer static and unambiguous.

1.1 Problem Statement
To facilitate matching candidates to job postings, it is important to know which skills are relevant, in demand, and in supply. Here, the need for a flexible data representation for skills arises. This representation should facilitate various tasks: a skill similarity metric that quantifies the likeness between skills; skills-to-occupation similarity metrics that help people navigate the labor market and find new occupations; and an understanding of which skills relate to which occupations, to inform which skills are needed for desired occupations. Since relations between skills and occupations are not static, robust and accurate updating methods are needed to ensure the information does not become outdated.

In this paper we address the task of skills and occupation graph construction, which we describe in Section 2, and apply this data representation to the following set of use cases: link prediction for identifying novel skills-occupation relations in Section 3, skills-based occupational similarity for career pathfinding in Section 4, and identifying distinctive skills per occupational group for learning & development in Section 5.

2 KNOWLEDGE GRAPH CONSTRUCTION
Our Skills & Occupational KG is based on existing structured data; more specifically, we combine the ISCO (occupations) and ESCO (skills) taxonomies (bottom row in Figure 1). Next, we enrich this existing data with information from noisy, unstructured job postings (top row in Figure 1) to ensure our KG represents the current state of the labor market.
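The two-stage construction described above can be sketched as a weighted bipartite edge list: taxonomy links first, then edges observed in job postings. This is only an illustrative sketch, not the authors' pipeline. The function name `build_kg`, the toy data, and the use of raw co-occurrence counts as edge weights (with weight 0 for taxonomy links never seen in a posting) are all assumptions made for the example.

```python
from collections import defaultdict

def build_kg(taxonomy_links, posting_skills):
    """Return {(occupation, skill): weight} for a bipartite skills/occupation graph.

    taxonomy_links: iterable of (isco_code, esco_skill) pairs from the taxonomies.
    posting_skills: iterable of (isco_code, [esco_skill, ...]), one entry per job posting.
    """
    weights = defaultdict(int)
    # Step 1: seed the graph with taxonomy links (weight 0 = not yet observed in postings).
    for occupation, skill in taxonomy_links:
        weights[(occupation, skill)] += 0
    # Step 2: enrich with job postings; each co-occurrence adds 1 to the edge weight.
    for occupation, skills in posting_skills:
        for skill in set(skills):  # count a skill at most once per posting
            weights[(occupation, skill)] += 1
    return dict(weights)

# Hypothetical toy input: one ISCO group, two postings.
taxonomy = [("2261", "dental studies"), ("2261", "lead the dental team")]
postings = [
    ("2261", ["dental studies", "communication"]),
    ("2261", ["dental studies"]),
]
kg = build_kg(taxonomy, postings)
```

In this toy run the posting data both reinforces an existing taxonomy edge and adds a new one ("communication"), which is the enrichment effect the section describes.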
Figure 1: Knowledge Graph creation flow (diagram boxes, top row: Job Postings, Extractor, Skills, Skill Matching; bottom row: Occupations and Skills, Merge, Skills + Occupations; output: Knowledge Graph)

2.1 Occupations (ISCO) and skills (ESCO)
The first step involves constructing a shared Skills & Occupational Knowledge Graph by combining the existing ISCO and ESCO taxonomies.

2.1.1 ISCO (occupations). The International Standard Classification of Occupations (ISCO) is ordered as a taxonomy of occupational groups with four granularity levels across ten different major groups. An occupation is defined as “a set of jobs whose main tasks and duties are characterized by a high degree of similarity”, where a job is defined as “a set of tasks and duties performed, or meant to be performed, by one person, including for an employer or in self-employment.” [14] Take, for example, the occupation “computer programmer,” which is defined by the level 4 ISCO code 2132. The occupation belongs to the level 3 group “computing professionals” (ISCO code 213), which in turn belongs to the level 2 group “computing, engineering and science professionals” (ISCO code 21), which, finally, falls in the level 1 group “professionals” (ISCO code 2).

Group Number   Major Group Name
1              Managers
2              Professionals
3              Technicians and associate professionals
4              Clerical support workers
5              Service and sales workers
6              Skilled agricultural, forestry and fishery workers
7              Craft and related trades workers
8              Plant and machine operators, and assemblers
9              Elementary occupations
10             Armed forces occupations
Table 1: The 10 major job groups of the ISCO-08

2.1.2 ESCO (skills). We define our initial high-level occupation groups by using the ISCO standard. For skills, we turn to the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy [8]. ESCO defines a skill as follows:

Skill: “the ability to apply knowledge and use know-how to complete tasks and solve problems”

ESCO covers 13,485 skills, connected to 2,942 occupations (in 27 languages).

We link our ISCO occupations to ESCO by using the direct links that ESCO defines between ISCO level 4 groups (the most fine-grained, lowest level of the taxonomy) and ESCO concepts. These links between ESCO and ISCO are not (necessarily) 1-to-1, as multiple ESCO occupations can be linked to a single (level 4) ISCO group.

In Figure 2 we illustrate this connection between ISCO and ESCO. ESCO occupations are shown in blue, with ISCO occupation groups in purple. In addition to the ESCO occupations shown in the image, ESCO also defines skills (not shown), e.g., the ESCO occupation “Cattle breeder” has skills linked to it such as “feed livestock” and “assist animal birth.”

Figure 2: The structure of the occupations pillar [8]

2.2 KG enrichment through job posting data
Our high-level KG structure, based on ISCO and ESCO, defines occupations and skills as nodes, and links between ESCO and ISCO objects as edges. We now turn to job posting data to account for the dynamic nature of associations between skills and occupations, as described in Section 1. To make sure our KG reflects the current status of the labor market, we use information from job postings to enrich the structure of our KG. More specifically, we create additional edges by identifying and extracting ESCO skills for each job posting’s ISCO occupation group, and assign weights to edges by relying on co-occurrence statistics of skills and occupations.

This second step of our process revolves around extracting skills from job postings. We describe our job posting dataset in Section 2.2.1, our approach for skill extraction in Section 2.2.2, and how we match extracted skills to ESCO skills in Section 2.2.3.

2.2.1 Vacancy data. Our vacancy dataset consists of a sample of 600,000 Dutch vacancies collected by Jobdigger [11]; each job posting is labeled with a level 4 ISCO code. Our sample was chosen by selecting a uniform distribution of ISCO level 1 occupations, to make sure our set covers the entire breadth of the labor market. Prior to sampling our set at ISCO level 1, the initial dataset was cleaned by discarding low-quality and noisy job postings, such as postings that represented multiple occupations, or job postings that contained a low number of sentences. Here, we treat vacancy data as a proxy for the demand in the job market. As a consequence, internal promotions, career paths, and informal channels are not taken into account.

2.2.2 Skill Extraction. For skill extraction we rely on the industry-standard Textkernel Extract [22] parser. For each vacancy text, Textkernel Extract returns a JSON object with the corresponding skills, each represented by the surface form identified in the job posting (skill mention), a unique identifier representing the skill (skill id), and a confidence score that quantifies the likelihood that the extracted skill is correct.

2.2.3 Skill Matching. Given the skills extracted by Textkernel, we match them to the skill nodes in our KG by relying on the surface forms of the skills (skill mentions). More specifically, we compute the Jaccard similarity between the character n-grams of the normalized skill mention and those of the normalized ESCO skill names. We set the similarity threshold to 0.66, which was empirically determined to be optimal using a smaller set of our 39,758,827 Textkernel skills to ESCO skill mappings. The high-level process is shown in Figure 3.

Figure 3: Overview of skill matching process (flowchart: the skill from the job posting is normalized, its n-grams are created and compared with the n-grams of the normalized ESCO skill by calculating the Jaccard distance; decision “Distance greater than threshold?”: No leads to Match, Yes leads to No Match)

2.3 Final Skills & Occupational Knowledge Graph
Our final KG, resulting from the process shown in Figure 1 and described in the previous section, consists of 1,220 nodes, of which 983 represent (ESCO) skills and 237 represent (ISCO) occupations. These nodes are connected through 3,910 edges, with an average node degree of 6.4. This KG is a subset of the full ESCO (13,485 skills) and ISCO (436 occupations) taxonomies. There are several reasons why our KG does not span the entirety of the ISCO and ESCO taxonomies.

First, it is conceivable that not all ISCO occupations are in current demand, e.g., we found that there were no vacancies for ISCO occupation code 8111, “mining-plant operators,” which is not surprising as there are currently no mines in operation in the Netherlands. Next, it is likely we are dealing with coverage issues, from (i) the likely incomplete coverage of the Textkernel Extract method we use for skill extraction, and (ii) our skills matching methodology further reducing the number of identified skills. As the focus of this paper is on downstream applications, we consider matching out of scope, and rely on our naive but solid character n-gram-based method.

3 KG COMPLETION USING LINK PREDICTION
One of the challenges of modeling skills and occupations is the dynamic nature of the labor market. In this section we explore the first downstream application of our data-driven, dynamically constructed Skills & Occupation Knowledge Graph: matching occupations to skills. We focus on discovering novel connections between skills and occupations by leveraging the structure of our knowledge graph enriched with job posting data.

More specifically, in this section we compare link prediction algorithms to quantify the relatedness between a skill node and an occupation node, in order to discover novel connections between skills and occupations that are not present in our initial KG. We describe our two link prediction methods in the following sections: Preferential Attachment in Section 3.2.1 and Node2Vec in Section 3.2.2.

3.1 Experimental setup
We employ link prediction to estimate the relatedness between skill and occupation nodes. To evaluate and reliably compare different methods, we first split our KG into train, test, and validation sets. More specifically, we sample 55% of all edges for training the link prediction algorithms (where applicable), leaving 30% for testing and 15% for validation. For each existing pair of occupation and skill nodes, which we consider a positive sample in our train, test, and validation sets, we randomly generate a negative sample (i.e., a pair of skill and occupation nodes that is not connected in our KG). An overview of the number of edges in each set is shown in Table 2.

                   Positive   Negative
Training edges     2151       2151
Validation edges   586        586
Test edges         1173       1173
Total              3910       3910
Table 2: Number of positive and negative edges with a training (55%), validation (15%), test (30%) split

3.2 Link Prediction Methods
3.2.1 Method 1: Preferential Attachment (PA). The first link prediction method is preferential attachment [15]. This method takes a pair of nodes, u and v, and calculates a closeness (C) between them:

\[ C(u, v) = |\Gamma(u)| \times |\Gamma(v)| \tag{1} \]

where Γ(u) denotes the set of neighbors of u.

A higher score corresponds to a larger probability that the nodes are connected. The intuition is that if both nodes have a large number of neighbors, they might function as hubs, and in most graphs hubs have a higher chance of being connected.

To compute all scores, we represent our KG as a matrix, where each node is represented as a row and a column. Note that this matrix is symmetric, since the value for row u and column v is equal to the value at row v and column u. At the intersecting cell of two nodes, we store the preferential attachment score. We normalize this matrix by dividing each score by the maximum closeness score, to ensure that each value is between 0 and 1. We consider the resulting normalized closeness score as the probability that the corresponding nodes are related.

3.2.2 Method 2: Node2Vec (N2V). The second link prediction method we use is the Node2Vec algorithm [12]. This algorithm can have a number of configurations.
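Most of Node2Vec's configuration governs how random walks over the graph are sampled before embeddings are learned from them. As a rough illustration (not the implementation used in the paper), the sketch below samples such walks in plain Python. Here `walk_length` is assumed to count the nodes in a walk, `num_walks` is the number of walks started from each node, `p` is the return parameter and `q` the in-out parameter. With p = q = 1 every neighbor is equally likely, so the walk is an unbiased random walk. The toy graph and the helper names `biased_walk` and `sample_walks` are hypothetical.

```python
import random

def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=random):
    """Sample one Node2Vec-style second-order random walk over an adjacency dict."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # the first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:           # step back to the previous node
                weights.append(1.0 / p)
            elif nxt in adj[prev]:    # stay at distance 1 from the previous node
                weights.append(1.0)
            else:                     # move outward, away from the previous node
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

def sample_walks(adj, num_walks, walk_length, p=1.0, q=1.0, seed=0):
    """Start `num_walks` walks from every node, using a seeded generator."""
    rng = random.Random(seed)
    return [biased_walk(adj, node, walk_length, p, q, rng)
            for _ in range(num_walks) for node in sorted(adj)]

# Hypothetical toy bipartite graph: two occupations, two skills.
adj = {
    "cook": {"prepare food", "teamwork"},
    "baker": {"prepare food"},
    "prepare food": {"cook", "baker"},
    "teamwork": {"cook"},
}
walks = sample_walks(adj, num_walks=2, walk_length=4)
```

In the full algorithm the sampled walks are treated like sentences and passed to a skip-gram model, whose vector size corresponds to the dimensions setting.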
For this paper we use the following parameters:
• dimensions = 1024
• walk length = 4
• number of walks = 2500
• p (return parameter) = 1
• q (in-out parameter) = 1
These parameters were selected after a grid search over a large number of possible parameter combinations.

3.3 Results
Table 3 shows the performance of both Preferential Attachment (PA) and Node2Vec (N2V).

Method   Class   Precision   Recall   F1-score
PA       0.0     0.83        0.64     0.72
PA       1.0     0.71        0.87     0.78
N2V      0.0     0.66        0.90     0.76
N2V      1.0     0.84        0.53     0.65
Table 3: Precision, recall and F1-scores of multiple link prediction algorithms with an equal number of positive and negative edges used for training

When the number of positive and negative edges in the test set is equal, PA outperforms the more complex N2V method, with an F1-score for the positive class of 0.78 against 0.65. In most realistic situations, however, we may want to explore how a node can be linked to any other node, making the number of comparisons, or edges to predict, 1-to-(N-1), i.e., each node is compared with every other node (excluding itself). To approximate this real-world performance, the ratio of negative to positive edges should reflect these more realistic proportions. To do so we compute the F1-score at increasing ratios of negative to positive edges, ranging from 1 (as shown in Table 3) to 7. Results are shown in Figure 4. The figure shows that up to a ratio of 3:1, N2V is on par with PA, but as ratios increase, N2V outperforms PA, suggesting N2V is better suited for most real-world situations.

Figure 4: Comparison of Node2Vec and Preferential Attachment for different ratios of negative to positive edges (line plot; x-axis: ratio of negative edges to positive edges; y-axis: F1-score of the positive class)

3.4 Analysis
Having established that N2V is more suitable for our task, we employ this algorithm to predict the relationships between occupations and skills. When doing so we need to realize that the graph we use as input is imperfect in terms of correctness and completeness [19].

Looking at the false positives of the algorithm, we can identify skills that are, according to our dataset, incorrectly linked to occupations. For KG completion, we aim to identify those skills that are not linked to occupations, but should be. Table 4 shows a random sample of false positives. It reinforces our intuition that link prediction can be employed for KG completion, as some of the predicted edges make sense, e.g., the skill “prepare materials for dental procedures” is shown as a relevant skill for the occupation “Dentists.” By consulting domain experts, such skills can be efficiently added to enrich the current graph.

To further explore these intuitions, in Figure 5 we show the edges to skill nodes predicted by N2V for the node representing ISCO code 2611, “Lawyers.” The y-axis shows the skill edges, and the x-axis shows the link prediction probabilities, for all predictions with a probability > 0.5 (i.e., positive predictions by the method). The green bars denote true positives (i.e., correctly predicted edges between the skill and occupation), and blue bars depict false positives (skills that are predicted to have an edge with the occupation, but for which no edge exists in our KG). The figure shows “education law” and “investigation research methods” as newly identified skills for lawyers, found neither in the original ESCO taxonomy nor in co-occurrences in job postings.

ISCO-Code   Occupation                          Predicted Skill
1341        Child care services managers        children’s physical development
2261        Dentists                            prepare materials for dental procedures
3251        Dental assistants and therapists    dentistry science
4110        General office clerks               demonstrate professional attitude to clients
5411        Fire fighters                       safety engineering
6121        Livestock and dairy producers       promote animal welfare
7132        Spray painters and varnishers       spray pesticides
8344        Lifting truck operators             hazardous materials transportation
9111        Domestic cleaners and helpers       provide lawn care
Table 4: False positives: edges predicted by N2V that do not exist in our KG

Figure 5: Predictions of the Node2Vec algorithm for ISCO group 2611 occupation (Lawyers) (horizontal bar chart; x-axis: link prediction probability, 0.0 to 1.0, labeled “Skills prediction for a lawyer”; y-axis skills: international law, intellectual property law, provide legal advice, civil process order, environmental legislation, joint ventures, property law, moderate in negotiations, commercial law, employment law, contract law, think analytically, international trade, mergers and acquisitions, observe confidentiality, negotiate in legal cases, legal case management, show responsibility, education law, tax legislation, investigation research methods)

4 CAREER PATHFINDING USING SHORTEST PATH ALGORITHMS
According to recent data (2019), 1.1 million people switched occupation in the Netherlands [6]. When transitioning from one job to another, the gap between both jobs cannot be too large. This gap can be considered too large if the required skills for one differ too much from those of the other. Consequently, occupations that share a large number of skills should be easier to transfer between.

Figure 6: Jaccard distance in a graph where nodes {A, B} are occupations and nodes {C, D, E, F} are skills. Solid lines denote direct connections, dashed lines denote Jaccard distance.
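One simple way to make "sharing a large number of skills" measurable is set overlap. The sketch below computes the Jaccard distance that Section 4.1 adopts, for occupations represented as sets of skills. The skill sets are hypothetical toy data, and returning 0 for two empty sets is a convention chosen for the example, not something the paper specifies.

```python
def jaccard_distance(skills_a, skills_b):
    """Return 1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint sets."""
    a, b = set(skills_a), set(skills_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty skill sets
    return 1.0 - len(a & b) / len(union)

# Hypothetical toy data: each occupation is the set of its required skills.
occupation_skills = {
    "cook": {"prepare food", "food hygiene", "teamwork"},
    "baker": {"prepare food", "food hygiene", "bake goods"},
    "lawyer": {"contract law", "teamwork"},
}
d_cook_baker = jaccard_distance(occupation_skills["cook"], occupation_skills["baker"])
d_cook_lawyer = jaccard_distance(occupation_skills["cook"], occupation_skills["lawyer"])
```

Here "cook" and "baker" share two of four distinct skills (distance 0.5), while "cook" and "lawyer" share one of four (distance 0.75), so the first transition would count as the smaller step.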
In this section we focus on leveraging skills to better inform transitions between occupations. More specifically, we aim to leverage the KG structure for matching occupations with occupations, to identify how an individual can change jobs in the most efficient way.

4.1 Skills-based Occupation Similarity
To determine the feasibility of an occupation transfer, we propose to model the distance between occupations with the Jaccard distance. We compute Jaccard distances between occupations by representing each occupation as the set of its required skills (which we extract from our KG), and computing the overlap between two sets of skills. See Figure 6 for an illustration.

In our KG, Jaccard distances can be computed for a total of 120,952 pairs of skills and pairs of occupations. Of these pairs, 89.3% are between skills and 10.7% between occupations. To gain insight into the overall similarity of skills and occupations, we study the distribution of Jaccard distances in Figure 7.

Figure 7: Distribution of the Jaccard distance, where the orange color represents the skills and the blue color represents the occupations (histogram; x-axis: Jaccard distance, 0.0 to 1.0; y-axis: count)

Looking at the distribution of the Jaccard distance, one can see that, on average, skills are more similar to one another than occupations. This becomes apparent when comparing the central values of both distributions in Table 5: for occupations the median is 0.96 (mean 0.94), and for skills the median is around 0.88 (mean 0.83). Over 99% of occupation pairs have a Jaccard distance between 0.8 and 1, meaning that occupations require distinct skill sets. Both distributions are skewed to the left, meaning that the mean (the average of the observations) lies to the left of the mode (the most observed value).

In the distribution we see a number of spikes, which can be explained by the prevalence of some fractions over others, e.g., if half of the neighbors are shared, the Jaccard distance will be 1/2, which can be achieved in a number of different ways. Other spikes occur at additional common fractions such as 2/3 and 3/4.

In Table 5 we show a description of the distance distributions. For both skills and occupations the minimum distance is 0, meaning that two skills are connected to exactly the same set of occupations, or that two occupations share every skill. An example is “Food service counter attendants” and “Hotel receptionists,” which share the same skill set and thus have a Jaccard distance of 0. Skills with a distance of 0 are, for example, “Lop trees” and “Pruning techniques.”

         Skill     Occupation   Total
count    107959    12993        120952
mean     0.825     0.938        0.837
std      0.163     0.070        0.160
min      0.000     0.000        0.000
25%      0.800     0.928        0.800
50%      0.875     0.960        0.888
75%      0.923     0.977        0.933
max      0.985     0.993        0.993
Table 5: Statistics of the Jaccard distribution

The highest distance found in the dataset is 0.993, which corresponds to the occupations “Electronics engineers” and “Policy administration professionals.” They share at least one skill but are, apart from the shared skill, completely different. The common skill in this example is “perform project management.”

4.2 Career Pathfinding using Dijkstra’s algorithm
With the distances between occupations and between skills, we can proceed to identify the most efficient transition between every pair of occupations. This is done by assigning the Jaccard distance scores as edge weights between nodes in our graph, to enable computational methods for finding the most efficient path between a start node (the current occupation) and an end node (the desired occupation). We show an example of such a transition in Figure 8; here we set a threshold for the maximum possible distance at 0.8. This threshold was chosen by visually inspecting and comparing different cutoff points. If two occupations are further apart than 0.8 we consider the step too large.

Figure 8: Distance between the occupations {W, X, Y, Z}. Black lines denote distances lower than 0.8. Red lines denote distances higher than 0.8. (Edge weights shown in the figure: W–X 0.2, X–Z 0.6, W–Y 0.4, Y–Z 0.5, W–Z 0.9)

In this example we start at node W and want to go to node Z. We are not able to transition directly between W and Z because the occupations are not similar enough (0.9 > 0.8).

4.2.1 Method. Finding the most efficient path in an undirected weighted graph can be done by applying shortest path algorithms. For this paper we turn to Dijkstra’s algorithm [7], because of its proven speed and the widespread availability of implementations. According to Dijkstra’s algorithm, the shortest allowed path between W and Z in Figure 8 is via node X.

We show a real-world example in Figure 9. Due to the COVID-19 pandemic many people find themselves out of a job, especially individuals who work in restaurants. Using the described model we can calculate which occupation has the smallest distance to the occupation “cook.” Dijkstra’s algorithm yields “bakers, pastry-cooks and confectionery makers” as the most feasible transition.

Figure 9: The shortest path between the occupation “Cook” and the closest connected occupation, in this case “Bakers, pastry-cooks and confectionery makers.”

5 MOST RELEVANT SKILLS PER OCCUPATION GROUP
Next to fine-grained analysis of occupations and skills, gaining macro-level insights is an important task for monitoring and understanding the labor market. The ISCO taxonomy provides multiple levels of granularity, which allows us to aggregate the information contained in our KG at different levels, too. In this section we explore a method for identifying the most relevant skills for occupations (ISCO level 4) and for aggregations of occupations (ISCO levels 1-3). More specifically, we match skills to occupations at an aggregated level.

As we have seen in the previous section, different occupations may share skills. Several skills, such as teamwork, are commonly required for a large number of occupations, and can be considered generic or sector-independent skills. On the other hand, we may have highly specialized skills that are only required for specific occupations or occupation groups. Whether a skill is specifically or generically important can be quantified in different ways. For a skill to be specific to an occupation or occupational group, we define two criteria:
• A skill needs to be frequently required within its context (occupation or occupation group).
• A skill needs to be characteristic for its context.

5.1 Method
The two criteria described above fit naturally to the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme for terms [21]. This statistic is chosen as it directly models the desired criteria described in the previous section. More specifically, TF-IDF is used to assign weights to words in a corpus of documents, where a word is deemed more important if it (i) is observed frequently within the document but (ii) not frequently across different documents in the corpus.
$$\mathrm{TF\text{-}IDF}(t, d) = tf_{t,d} \times \log \frac{N}{df_t + 1}, \qquad (2)$$

where $tf_{t,d}$ denotes the term frequency of $t$ in $d$, $df_t$ denotes the number of documents containing $t$, and $N$ denotes the total number of documents in the corpus.

We “transplant” this TF-IDF weighting scheme from terms in documents to skills associated with occupations. TF-IDF consists of two parts. Term Frequency (TF) is the frequency of a word (skill) used in a given document (observed with an occupation). Inverse Document Frequency (IDF) discounts highly common terms, i.e., it is high when a word (skill) appears in a small number of documents (is observed with a small number of occupations). Common terms (skills) will thus yield a lower IDF score.

For our TF-IDF-based model, we treat skills identified in job postings as terms, and model a document as the collection of job postings belonging to an ISCO group. The counts of skills, which model term frequency, correspond to the number of times a skill is found in a job posting associated with a certain ISCO code.

5.2 Results and analysis

5.2.1 Level 1 ISCO groups. The resulting score provides us with skills that are common for a given occupation (group) but uncommon in all other occupations (groups). Table 6 shows the top 5 skills for the level 1 ISCO groups.

In this table Microsoft Office appears in both the Managers and the Clerical support workers groups. For this skill to score high in multiple contexts (occupation groups), its frequency needs to be substantial in both, to compensate for the IDF component of the metric. In the Managers group, Microsoft Office has a TF of 9%, and in Clerical support workers a TF of 5%.

5.2.2 Multiple ISCO levels. ISCO level 1 helps us to understand which skills are relevant at the least granular level; to deepen our understanding we look at the development across multiple layers of ISCO group 2 in Figure 10. Here, we show the 3 most relevant skills for several ISCO levels of the “Professionals” ISCO group.

Figure 10: Three most relevant skills for multiple levels in major ISCO group 2.

Level 2:
- Health professionals (22): Coordinate care; Have computer literacy; Citizen involvement in healthcare
- Teaching professionals (23): Communication sciences; Communication studies; ICT communications protocols

Level 3:
- Nursing and midwifery professionals (222): Coordinate care; Have computer literacy; Solve problems in healthcare
- Other health professionals (226): Radiopharmaceuticals; Work analytically; Analytical chemistry
- Other teaching professionals (235): Communication; Communication disorders; Microsoft Visio

Level 4:
- Nursing professionals (2221): Coordinate care; Have computer literacy; Solve problems in healthcare
- Dentists (2261): Dental studies; Lead the dental team; Handle payments in dentistry
- Pharmacists (2262): Radiopharmaceuticals; Work analytically; Analytical chemistry
- Special needs teachers (2352): Education law; Communication disorders; Pedagogy

We notice the following. First, communication-related skills appear in multiple forms across occupation groups. The terms communication, communication sciences, communication studies, ICT communication protocols, manage online communications and communication disorders seem to be closely related. Because these are defined as distinct skills, each skill receives its own ranking, so the same underlying concept can appear multiple times.

Next, “Nursing professionals” and “Nursing and midwifery professionals” share the same set of relevant skills, which are highly similar to those of their parent group “Health professionals”. Skills that appear in those groups are the most frequent skills in the parent group.

Finally, the further down the figure we go, the more specialized the skills appear to be: more specialized skills, such as “dental studies,” are more commonly observed in level 4 ISCO groups. A possible explanation for this is that specialized skills only appear at specialized occupations.

6 CONCLUSION

In recent years the labor market has changed drastically. This is mostly due to increased globalization, a growing working population and disappearing jobs due to digitalization. The COVID-19 pandemic has accelerated this change. This paper aims to explore algorithmic and data-driven methods for exploring and improving the fit between job seekers and vacancies by modeling skills and occupation data in a knowledge graph. Modeling and leveraging relationships between occupations and skills can provide insights for job seekers with existing skill sets.

After constructing our knowledge graph by relying on the existing ISCO and ESCO taxonomies for occupations and skills, we enrich our KG by relying on job posting data.

We explore our final KG using three different applications.

First, we study link prediction methods for quantifying the relatedness between skills and occupations in Section 3. We compare and evaluate two different link prediction methods, and find that “Node2Vec” performs best. Next to quantifying relatedness between occupations and skills for, e.g., ranking skills for an occupation or using the scores as edge weights in our KG, we explore Node2Vec for identifying skills-to-occupation links that are not present in the original KG.

Next, in Section 4 we explore our KG for finding efficient job transitions. When an individual is searching for a job, knowing which occupations are within reach can help the search process. In this application we explore shortest-path algorithms for identifying potential career paths. We use a skills-based Jaccard similarity metric to model distance between occupations. Furthermore, we show examples of job transitions and study properties of our KG by analyzing the distribution of distances between skills and occupations.

RecSys in HR 2021, October 1, 2021, Amsterdam, the Netherlands de Groot, et al.
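To make the skill-relevance weighting of Equation (2) concrete, the following is a minimal sketch, not the authors' implementation. It treats each ISCO group as a document and each skill as a term. The function names, the input format (a list of `(isco_group, skills)` pairs) and the toy postings are assumptions made for illustration. The sketch also assumes that term frequency is a raw count in which each posting contributes at most once per skill.

```python
import math
from collections import Counter


def tf_idf_by_group(postings):
    """Score skills per ISCO group with TF-IDF(t, d) = tf * log(N / (df + 1)).

    `postings` is a hypothetical list of (isco_group, skills) pairs, one per
    job posting. Each ISCO group plays the role of a document d and each
    skill the role of a term t.
    """
    # tf[group][skill]: number of postings in the group that mention the skill
    # (assumption: a posting counts once per skill, hence the set()).
    tf = {}
    for group, skills in postings:
        tf.setdefault(group, Counter()).update(set(skills))

    n_groups = len(tf)  # N: total number of documents (ISCO groups)
    # df[skill]: number of groups in which the skill occurs at least once.
    df = Counter(skill for counts in tf.values() for skill in counts)

    return {
        group: {
            skill: count * math.log(n_groups / (df[skill] + 1))
            for skill, count in counts.items()
        }
        for group, counts in tf.items()
    }


def top_skills(scores, group, k=3):
    """Return the k highest-scoring skills for one group (ties by name)."""
    ranked = sorted(scores[group].items(), key=lambda item: (-item[1], item[0]))
    return [skill for skill, _ in ranked[:k]]


# Toy data, invented for illustration only.
postings = [
    ("Managers", ["Microsoft Office", "Communication"]),
    ("Managers", ["Microsoft Office", "Coordinate patrols"]),
    ("Clerical support workers", ["Microsoft Office", "Perform clerical duties"]),
    ("Clerical support workers", ["Perform clerical duties", "Communication"]),
    ("Professionals", ["Accounting", "Communication"]),
    ("Elementary occupations", ["Carpentry", "Communication"]),
]

scores = tf_idf_by_group(postings)
for group in scores:
    print(group, top_skills(scores, group, k=2))
```

In this toy corpus a skill that occurs in every group receives a negative weight, because the `+ 1` in the denominator makes the logarithm's argument smaller than one. A skill shared by only two of the four groups can still rank highly in a group where its count is large, which mirrors the Microsoft Office observation in Section 5.2.1.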
| Rank | Managers | Professionals | Technicians and associate professionals |
|---|---|---|---|
| 1 | Microsoft Office | Network Marketing | Marker Making |
| 2 | Service-oriented Modelling | Manage Online Communications | Electronic Communication |
| 3 | Communication Principles | Communication | Service-oriented Modelling |
| 4 | Electronic Communication | Explain Accounting Records | Education Administration |
| 5 | Coordinate Patrols | Accounting | Manage Standard ERP System |

| Rank | Clerical support workers | Service and sales workers | Skilled agricultural, forestry and fishery workers |
|---|---|---|---|
| 1 | Execute Administration | Security Panels | Leadership Principles |
| 2 | Perform Clerical Duties | Electronic Communication | Agricultural Information Systems and Databases |
| 3 | Microsoft Office | Create Solutions to Problems | Pruning Techniques |
| 4 | Education Administration | Execute Administration | Spray Pesticides |
| 5 | Human Resource Management | Recreation Activities | Lop Trees |

| Rank | Craft and related trades workers | Plant and machine operators, and assemblers | Elementary occupations |
|---|---|---|---|
| 1 | Attend to Detail in Casting Processes | Mechatronics | Inventory Management Rules |
| 2 | Attention to Detail | Mechanical Engineering | Have Computer Literacy |
| 3 | Adobe Illustrator | Electrical Engineering | Carpentry |
| 4 | Adobe Photoshop | Operate Soldering Equipment | Place Concrete Forms |
| 5 | ML (computer programming) | Act Reliably | Operate on-board Computer Systems |

Table 6: Five most relevant skills per major ISCO group based on the TF-IDF metric.

Finally, in Section 5 we study a method to determine which skills are most relevant to different levels of aggregated occupations, using the ISCO taxonomy. The relevance of a skill to an ISCO (group) is calculated by combining the frequency with which the skill is required for that ISCO (group) with the uniqueness of the skill in the overall ISCO taxonomy. Here, the uniqueness is high if a skill occurs more often in one group than in the other groups. The metric that reflects this intuition is called “TF-IDF.” By doing so we construct a bird's-eye view of the labor market.

The findings from the three sections described above are all variations on the same theme: finding or enabling the perfect fit between a job seeker and a vacancy by leveraging skills.

7 DISCUSSION & FUTURE RESEARCH

In this paper we present different KG-driven applications for skills-based job matching. In principle, the methods presented are data-agnostic, as long as similar concepts (occupations and skills) and data (job postings with identified occupations and skills) are available. More specifically, we leverage the ISCO and the ESCO taxonomies, which are available in a large number of languages, and are considered standards that are freely available. Other frameworks could be used as well: whereas ESCO is widely used in Europe, the O*NET framework [18] is often referred to as the de facto standard in the United States.

The outcome of any research is heavily dependent on the available data. In the case of this research the data is preprocessed in a number of steps, one of which is the skill matching step described in Section 2.2.3. We opted for a naive character $n$-gram based method for matching surface forms found in a job posting with skill names in ESCO. Obviously, more refined methods can be employed, e.g., by considering additional representations of the skill in the job posting (contextual words, occupation), and at the same time additional context on the side of the KG (e.g., skill descriptions, associated occupations, etc.). In general, this matching problem can be considered an entity linking task, which is out of scope for this application paper. Having a flawed knowledge graph as a result of sub-optimal preprocessing does not invalidate the methods used. Whichever approach is used to create a knowledge graph, the outcome will never be perfect [3].

Finally, two out of three applications of our KG are not validated empirically: for both our shortest path finder (Section 4) and identifying the most relevant skill per ISCO group (Section 5), we focused on the analysis and interpretation of results, omitting a more formal evaluation methodology. For future research it would be interesting to benchmark the current approach against different career path prediction models. Validating whether, e.g., the discovered paths between occupations are indeed the shortest ones in practice requires additional data. Unfortunately, no such data was available at the time of writing. One way to acquire such data is, e.g., by collecting historic career paths. However, collecting and composing such data was determined to be out of scope for this work. The same issue arose for the method for quantifying the relevance of skills per ISCO group; these aggregated insights were difficult to validate. We could imagine involving human expert annotators to annotate which skills they deem (most) relevant to a certain ISCO (group). However, similarly to the above, collecting and analyzing such data did not fit in the scope of our present work. In summary, our paper revolves around studying algorithmic methods that aim to help both jobseekers and recruiters find a better match between individuals and occupations; we consider studies with actual end-users out of scope [13].

ACKNOWLEDGMENTS

A special thanks to the thesis supervisors for the project, Niels van Weeren and Prof. Aske Plaat, as well as everybody at Randstad involved in this project.

REFERENCES

[1] İ. Semih Akçomak, Lex Borghans, and Bas ter Weel. 2011. Measuring and Interpreting Trends in the Division of Labour in the Netherlands. De Economist 159, 4 (01 Dec 2011), 435–482. https://doi.org/10.1007/s10645-011-9168-3
[2] Pol Antràs, Luis Garicano, and Esteban Rossi-Hansberg. 2005. Offshoring in a Knowledge Economy. Working Paper 11094. National Bureau of Economic Research. https://doi.org/10.3386/w11094 http://www.nber.org/papers/w11094
[3] Antoine Bordes and Evgeniy Gabrilovich. 2014. Constructing and Mining Web-Scale Knowledge Graphs: KDD 2014 Tutorial. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, New York, USA) (KDD ’14). Association for Computing Machinery, New York, NY, USA, 1967. https://doi.org/10.1145/2623330.2630803
[4] Lex Borghans, Bas Ter Weel, and Bruce A. Weinberg. 2014. People Skills and the Labor-Market Outcomes of Underrepresented Groups. ILR Review 67, 2 (2014), 287–334. https://doi.org/10.1177/001979391406700202
[5] Nicole Bosch and Bas Weel. 2013. Labour-Market Outcomes of Older Workers in the Netherlands: Measuring Job Prospects Using the Occupational Age Structure. De Economist 161 (06 2013). https://doi.org/10.1007/s10645-013-9202-8
[6] Centraal Bureau voor de Statistiek. [n.d.]. De arbeidsmarkt in cijfers. https://www.cbs.nl/-/media/_pdf/2020/18/dearbeidsmarktincijfers2019.pdf
[7] Edsger W Dijkstra et al. 1959. A note on two problems in connexion with graphs. Numerische mathematik 1, 1 (1959), 269–271.
[8] European Commission. [n.d.]. ESCO handbook. https://ec.europa.eu/esco/portal/document/en/0a89839c-098d-4e34-846c-54cbd5684d24
[9] Eurostat. [n.d.]. Labour market transitions – annual data. https://ec.europa.eu/eurostat/web/lfs/data/database
[10] World Economic Forum. 2020. The Future of Jobs Report 2020. World Economic Forum, Geneva, Switzerland.
[11] Burning Glass. [n.d.]. Vacancy data. https://www.jobdigger.nl/
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
[13] Francisco Gutiérrez, Sven Charleer, Robin De Croon, Nyi Nyi Htun, Gerd Goetschalckx, and Katrien Verbert. 2019. Explaining and Exploring Job Recommendations: A User-Driven Approach for Interacting with Knowledge-Based Job Recommender Systems. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 60–68. https://doi.org/10.1145/3298689.3347001
[14] International Labour Office. [n.d.]. International Standard Classification of Occupations. https://www.ilo.org/public/english/bureau/stat/isco/
[15] David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American society for information science and technology 58, 7 (2007), 1019–1031.
[16] OECD. [n.d.]. Employment by job tenure intervals - average tenure. https://stats.oecd.org/Index.aspx?DataSetCode=TENURE_AVE
[17] OECD. [n.d.]. FTPT employment based on national definitions. https://stats.oecd.org/Index.aspx?DataSetCode=FTPTN_D
[18] O*NET. [n.d.]. O*NET OnLine. https://www.onetonline.org/
[19] Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web 8, 3 (2017), 489–508.
[20] Gang Peng. 2017. Do computer skills affect worker employment? An empirical study from CPS surveys. Computers in Human Behavior 74 (2017), 26–34. https://doi.org/10.1016/j.chb.2017.04.013 http://www.sciencedirect.com/science/article/pii/S0747563217302510
[21] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008. Introduction to information retrieval. Vol. 39. Cambridge University Press Cambridge.
[22] Textkernel. [n.d.]. Extract. https://www.textkernel.com/nl/solution/extract/