Modeling, Measuring and Exploiting Concept Drift in the Labour Market Domain

Panos Alexopoulos, Textkernel B.V., Nieuwendammerkade 26a5, 1022 AB, Amsterdam, The Netherlands, alexopoulos@textkernel.com
Spyretta Leivaditi, Kentivo B.V., Kerksteeg 1, 3582 CV, Utrecht, The Netherlands, spyretta.leivaditi@kentivo.com

ABSTRACT
The Labour Market domain is a relatively narrow domain in terms of the concept types that appear in it (as it typically consists of professions, skills and qualifications) but a very broad one in terms of actual concepts (as these professions and skills can belong to all kinds of domains, such as Technology, Education, Finance, etc.). More importantly, it is a quite volatile domain, in the sense that the meaning of many concepts changes (at different rates) over time. This phenomenon, known as semantic or concept drift, poses a challenge for the maintenance and evolution of knowledge graphs that represent such domains, and it requires dedicated approaches for tackling it so as to prevent such graphs from becoming irrelevant. With that in mind, in this paper we describe our experiences from dealing with concept drift in an in-house developed labour market knowledge graph, and provide insights on: i) how concept drift can be effectively defined and modeled for labour market concepts, and ii) how it can be detected, measured and effectively incorporated in the knowledge graph lifecycle.

KEYWORDS
Knowledge Graphs, Concept Drift, Labour Market

1 INTRODUCTION
A few years after Google announced that their knowledge graph allowed searching for "things, not strings" (googleblog.blogspot.com), knowledge graphs have been gaining momentum in the world's leading organisations as a means to integrate, share and exploit the data and knowledge they need in order to stay competitive [11]. Apart from Google, prominent examples of companies that develop knowledge graphs include Microsoft (https://arstechnica.com/information-technology/2012/06/inside-the-architecture-of-googles-knowledge-graph-and-microsofts-satori/), LinkedIn (https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph), BBC (http://www.bbc.co.uk/ontologies) and Thomson Reuters (https://www.scribd.com/document/288608104/Creating-the-Thomson-Reuters-knowledge-graph-and-open-permID-ODI-Summit-2015). At Textkernel, we have been developing and using a similar knowledge graph for the recruitment and labour market domain for the last couple of years, aiming to significantly improve the way our semantic software modules parse, retrieve and match CVs and job vacancies.

Our knowledge graph defines and interrelates concepts and entities about the labour market and recruiting domain, such as professions, skills and qualifications, for multiple languages and countries. Using the graph, an agent (human or computer system) can answer questions like "What are the most important skills for a certain profession?", "What professions are specializations of Profession X?" or "What qualifications do I need in order to acquire skill Y?". Moreover, we use the graph within our systems for a) performing entity recognition and disambiguation in CVs and vacancies, and b) determining the semantic similarity between these entities when searching or matching CVs and vacancies.

Constructing the knowledge graph in an efficient and cost-effective way is a quite challenging task, not only because the labour market domain is quite broad but also because it is very heterogeneous (different industries and business areas, languages, labour markets, educational systems, etc.). What is equally challenging, however, is dealing with the concept drift that the domain's concepts undergo as time goes by, changing their meaning [15]. In particular, drift in our graph is mainly observed in Professions, Skills and Qualifications. Take, for example, journalists. Before the proliferation of the Internet and social media, a reporter would have to research stories through contacts, speaking to people, door knocking and visiting the local library to consult past publications. She would also most likely not know how to do her own video production editing, but would rely on experts to do that for her. Nowadays, however, it is more likely to meet a reporter who can use Google, Twitter and other modern information channels effectively, and, to a still low yet increasing extent, data analysis and visualization tools [10]. Similar arguments can be made for other professions, but also for qualifications and skills. A contemporary degree in Finance, for example, has definitely different content, and even somewhat different learning objectives, than it had 30 years ago. Similarly, being an expert in Marketing nowadays is highly associated with being an expert in Search Engine Optimization and Social Media.

These changes can be bigger or smaller, faster or slower, and more or less profound, depending on the concept type and, of course, the real-world dynamics. In any case, such changes can affect the quality of a knowledge graph and, therefore, dedicated frameworks for modeling, measuring and exploiting semantic drift in the context of knowledge graph maintenance and evolution are needed [14]. In this short paper, we corroborate this argument and extend it with the following two arguments:

(1) The definition and modeling of semantic drift for a given knowledge graph should take into account the graph's content, domain and application context, and be adapted accordingly.
While generic formalizations of concept drift are very useful (like, for example, modeling drift in terms of label, intension and extension [15]), these are not necessarily directly or completely applicable to all domains and/or graphs, the reason being that not all aspects of a concept's meaning contribute to its drift in the same way and to the same extent.

(2) There is not a unique optimal way to measure concept drift for a given knowledge graph, but rather multiple ways whose outcomes can have different interpretations and usages. Indeed, the values one gets when measuring concept drift can be quite different, depending on the metrics, data sources and methods/algorithms used for the measurement. Therefore, it is important that a) for a given drift measurement approach, the drift values it produces can be clearly interpreted and used, and b) for a desired interpretation/usage, an appropriate drift measurement method can be selected.

In the rest of the paper we further explain and exemplify these arguments by describing how we model and measure concept drift in our Labour Market Knowledge Graph, as well as how we apply the measurement results, not only for improving the graph but also for gaining business benefits.

2 DRIFT MODELING FOR LABOUR MARKET CONCEPTS

2.1 Concept Representation
The Textkernel knowledge graph consists primarily of the following concept types:

• Professions: Concepts that represent groupings of jobs that involve similar tasks and require similar skills and competencies.
• Skills: Concepts that represent tools, techniques, methodologies, areas of knowledge, activities, and generally anything that a person can "have knowledge of", "be experienced in" or "be expert at" (e.g., Economics, Software Development, "doing sales in Africa", etc.). Also concepts that represent personality traits, including communication abilities, personal habits, cognitive or emotional empathy, time management, teamwork and leadership traits (usually referred to as soft skills).
• Qualifications: Concepts that represent "formal outcomes of assessment and validation processes which are obtained when a competent body determines that an individual has achieved learning outcomes to given standards" (European Qualifications Framework, http://ec.europa.eu/eqf/home_en.htm).
• Organizations: Concepts that represent organizations of different types, including public organizations and institutes, private companies and enterprises, educational institutes (of all educational levels) and others.
• Industries: Concepts that represent industrial groupings of companies based on similar products and services, technologies and processes, markets and other criteria.

The different ways a concept can be expressed in a text (its surface forms) are represented in the graph via the well-known SKOS relations prefLabel and altLabel [9]. Moreover, concepts can be taxonomically related to other concepts of the same type via the SKOS relations broader and narrower (e.g., "Software Developer" is broader than "Java Developer", and "Economics" is broader than "Microeconomics").

Additional relations are defined per concept type. In particular, professions are linked to the skills and activities they involve, as well as to the locations, organizations and industries where they are found. They are also linked to qualifications that are (formally or informally) required for their exercise (e.g., the bar exam for practicing law in the United States) and, of course, to other professions that are similar to them. Skills, in turn, are linked to similar skills and activities, to the professions and industries they are mostly demanded by, and to the qualifications that develop and verify them. Finally, qualifications are linked, apart from skills, to the organizations that provide them, as well as to the educational levels they cover.

Most of the above relations are extracted and incorporated into the knowledge graph in a semi-automatic way from a variety of structured and unstructured data sources, including CVs, job vacancies and Wikipedia [16] [17], as well as search query logs [3]. Moreover, many of these relations are vague, i.e., there are (or could be) pairs of concepts for which it is indeterminate whether they stand in the relation or not (e.g., the similarity between different skills, or the importance of a skill for a profession) [2]. The problem with vague relations is that their interpretation is highly subjective, context-dependent, and usually a matter of degree, thus making it hard to achieve a global consensus over their veracity. For this reason, in our graph, such relations have the following three properties:

• Strength: A number (typically from 0 to 1) indicating the strength/confidence of the relation.
• Applicability Context: The contexts (location, language, industry, etc.) in which the relation has been discovered and is considered to be true.
• Provenance: Information about how the relation has been added to the graph (source, method, process).

These properties do not, of course, remove vagueness, but they help towards making the relations better interpretable by both humans and systems, and towards reducing disagreements [1]. Moreover, as we show below, these properties play an important role in the measurement of concept drift.
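A graph edge carrying these three properties can be sketched as a simple record type. The following Python snippet is only an illustration: the field names, concept identifiers and values are invented for this example and do not reflect the graph's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class VagueRelation:
    """One (vague) relation between two concepts, carrying the three
    properties described in the text: strength, applicability context
    and provenance. All names here are illustrative."""
    source: str        # subject concept, e.g. a profession
    target: str        # object concept, e.g. a skill
    relation: str      # relation type, e.g. "involvesSkill"
    strength: float    # strength/confidence score, typically in [0, 1]
    context: dict = field(default_factory=dict)     # applicability context
    provenance: dict = field(default_factory=dict)  # source, method, process

# A hypothetical edge stating that journalism nowadays involves
# social-media skills, as discovered from vacancy data.
edge = VagueRelation(
    source="profession:journalist",
    target="skill:social-media",
    relation="involvesSkill",
    strength=0.82,
    context={"country": "NL", "language": "en"},
    provenance={"source": "vacancies", "method": "co-occurrence"},
)
```

Keeping context and provenance as structured fields, rather than free text, is what later allows drift to be measured per data source or per country, as discussed in Section 3.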
2.2 Concept Drift
Concept drift in the semantic knowledge representation literature is usually modeled (and measured) with respect to three aspects of a concept's meaning, namely its labels (i.e., the words used to express the concept), its intension (i.e., the concept's characteristics as expressed via its properties and relations), and its extension (i.e., the set of the concept's instances) [15] [13]. The extension's role in drift is disputed by [5], who suggest that it depends on the kind of concepts under consideration.

In our knowledge graph, we adopt this latter perspective by not considering extensions as part of our concepts' meaning and drift. One reason for that is that concepts like skills and professions are rather abstract and do not have straightforward instances (e.g., professions do not refer to specific persons or jobs). One could consider as profession instances the people that exercise them or the vacancies that are available for them, but then a change in the workforce size does not alter the profession's meaning. Instead, it is the qualitative characteristics of this workforce that signify a change, and that is exactly what we capture via the concepts' intension.

Nevertheless, we do not consider all properties and relations of our concepts to be part of their meaning and drift, nor all to the same extent. In particular:

• We do consider as drift changes in a concept's labels, yet only when these changes are not merely additions or removals of spelling and/or morphosyntactic variations of existing labels (e.g., part-of-speech or plural forms). Moreover, we consider changes in preferred labels as slightly more important than changes in alternative labels, as the former are typically more suggestive of the concept's meaning.
• We do consider as drift changes in a concept's broader and narrower relations, with broader changes suggesting, in general, a more fundamental drift in the concept's meaning than narrower ones.
• For profession concepts, meaning is primarily defined by the skills and activities they involve (see the example of the journalist above). Essential skills for a profession are more important than optional skills, though the two can be hard to distinguish. Profession meaning also changes, though to a lesser extent, when the industries the profession is found in change (e.g., journalists start working in the tech sector). On the other hand, a profession concept does not drift when the locations or companies it is most popular in change.
• For skill concepts, meaning is primarily defined by their similar skills and activities, as these describe for what tasks and in what contexts a skill is used. It also changes, though to a lesser extent, when the skill starts being applied in different professions and industries, as part of possessing a skill includes having experience in its application contexts.
• For qualification concepts, meaning is primarily defined by the skills they develop and/or verify, and secondarily by the professions they regulate and/or are useful for (especially since, in some countries, qualifications are the main criterion for entering a profession).

It is worth noting that we are aware of the distinction between concept drift and concept replacement (i.e., change in the concept's core meaning) [8], but we do not really tackle this issue in our graph, because a) it can be quite difficult to define the core meaning of a concept in a way that is easily detectable, and b) it is a phenomenon that is rather rare and has not caused any observable problems to our graph and its applications so far.

3 DRIFT MEASUREMENT FOR LABOUR MARKET CONCEPTS
Concept drift is typically detected and quantified by measuring the difference in meaning between two or more versions of the same concept at different points in time [13] [12] [7] [6]. The more dissimilar the two versions are to each other, the greater the drift is.

Measuring concept meaning similarity is obviously dependent on how meaning is modeled. Thus, for example, in [15] and [13], where the authors consider as meaning the concept's intension, extension and labeling, they define corresponding similarity functions for each of these aspects. In particular, they employ string similarity metrics for measuring labeling drift, and set similarity metrics for measuring intension and extension drift. For our graph, we follow a similar approach, but with some important differences.

First, for labeling we do not use string similarity to measure change, one reason being that we do not consider spelling or morphosyntactic change to be drift. Instead, we consider labels as part of the concept's intension and we use set similarity metrics to measure the difference between a concept's changing label sets.

Second, since many of the concept relations are vague, with their validity quantified by some strength score, when we calculate similarity based on them we use metrics that can take this strength into consideration. One approach that we use, for example, is as follows: given two versions of the same concept and a (vague) relation that influences drift, we derive the top-N related concepts for each version (based on the strength score), and we calculate their similarity using the generalized Kendall's tau [4], which can measure the distance between rankings. In that way, for example, if the "Data Scientist" profession continues having the same top 10 related skills but differently ranked, a drift will be detected.

Third, in order to be able to understand and interpret concept drift better, we need a versatile measurement framework that enables the dynamic and highly configurable measurement and presentation of drift. Such a framework should take as input a set of parameters specifying the scope, type and other characteristics of the drift we want to measure, and generate corresponding output. Examples of parameters we consider are:

• Target concept types (Professions, Skills, etc.).
• Time scope (either as a specific time period or as specific releases to be included).
• Relations and properties to be included.
• Relation applicability context and provenance.

The reason we need all these parameters is that different values for them can yield different drift, not only in terms of intensity but also in terms of interpretation. For example, if we calculate a concept's drift using only CVs as a data source, then the drift we measure will reflect the change in the way the workforce side of the labour market interprets and uses the concept. On the other hand, if we use only vacancies, we shall get an idea of how the same concept changes from the industry's perspective. Similarly, if we use news articles, we will measure the change in the general perception of the concept, while the usage of more encyclopedic and definitional data sources (e.g., Wikipedia or specialized dictionaries) may indicate changes in more core aspects of the concept's meaning.

Finally, as suggested in the previous section, different relations have a different influence on concept drift, and that difference needs to be considered when relation-specific drifts are aggregated. A similar argument can be made for other drift aspects like provenance or context (e.g., the change of a profession concept in a country with a more advanced economy may be more important/crucial than the change in a less developed country). For that reason, our drift framework supports the definition of drift aspect importance weights that are used for combining and aggregating partial drift scores.
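The measurement ingredients described in this section can be sketched in a few lines of Python. This is not the authors' actual implementation but a minimal illustration under stated assumptions: the Kendall's tau variant follows the top-k comparison of Fagin et al. [4], but the normalization by the number of compared pairs and the default penalty p = 0.5 for pairs seen together in only one list are our own simplifications, and the set-similarity and aggregation functions use invented names.

```python
from itertools import combinations

def kendall_top_n_distance(ranking_a, ranking_b, p=0.5):
    """Simplified generalized Kendall's tau distance between two top-N
    lists (after Fagin et al.), normalized by the number of item pairs.
    0.0 means identical rankings; larger values mean heavier drift.
    p penalizes pairs appearing together in only one of the lists."""
    pos_a = {c: r for r, c in enumerate(ranking_a)}
    pos_b = {c: r for r, c in enumerate(ranking_b)}
    penalty, pairs = 0.0, 0
    for i, j in combinations(sorted(set(pos_a) | set(pos_b)), 2):
        pairs += 1
        in_a, in_b = (i in pos_a, j in pos_a), (i in pos_b, j in pos_b)
        if all(in_a) and all(in_b):
            # Both concepts ranked in both versions: penalize discordance.
            if (pos_a[i] - pos_a[j]) * (pos_b[i] - pos_b[j]) < 0:
                penalty += 1
        elif all(in_a) and any(in_b):
            # Both in version A, only one in version B: the absent one is
            # implicitly ranked below, so A must agree with that order.
            present = i if i in pos_b else j
            absent = j if present == i else i
            if pos_a[absent] < pos_a[present]:
                penalty += 1
        elif all(in_b) and any(in_a):
            present = i if i in pos_a else j
            absent = j if present == i else i
            if pos_b[absent] < pos_b[present]:
                penalty += 1
        elif all(in_a) or all(in_b):
            # Both concepts appear in only one of the two lists.
            penalty += p
        else:
            # Each concept appears in a different list: maximal disagreement.
            penalty += 1
    return penalty / pairs if pairs else 0.0

def label_drift(old_labels, new_labels):
    """Set-based labeling drift: 1 - Jaccard similarity of the label sets."""
    union = old_labels | new_labels
    if not union:
        return 0.0
    return 1.0 - len(old_labels & new_labels) / len(union)

def aggregate_drift(partial_scores, weights):
    """Combine relation-specific drift scores into a single concept-level
    score via importance weights (a weighted average)."""
    total = sum(weights[k] for k in partial_scores)
    return sum(s * weights[k] for k, s in partial_scores.items()) / total

# Same top skills for a concept, differently ranked: a drift is detected.
v_old = ["python", "statistics", "sql"]
v_new = ["python", "sql", "statistics"]
ranking_drift = kendall_top_n_distance(v_old, v_new)
```

In this sketch, `ranking_drift` comes out greater than zero for the reordered skill lists even though the sets of skills are identical, which is exactly the behaviour the strength-aware metric is meant to provide over a plain set comparison.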
4 DRIFT EXPLOITATION
The modeling and measurement of concept drift in our knowledge graph serves mainly two purposes, one engineering-related and one of a business nature. On the engineering side, the measurement and monitoring of drift helps us quantify and better understand the dynamics of our domain and of our graph's content. This, in turn, enables us to plan and prioritize the maintenance and evolution of the knowledge graph much more effectively by, for example, identifying highly volatile graph aspects that need more frequent updates, and allocating more resources to them. This applies not only to computational resources (data storage capacity, data processing efficiency, etc.) but also to human ones (knowledge engineers, quality analysts, annotators, etc.).

On the business side, the drift in our knowledge graph indicates to a large extent the changes that take place in the labour market, especially the drift that we derive from CVs and vacancies. These changes we can then communicate to job seekers, candidate seekers, education and training providers, policy makers, and generally anyone who can gain an advantage from knowing the dynamics of the labour market.

For example, most job holders have a narrow perception of what their profession entails and of the extent and rate at which it evolves over time, as they usually operate in a narrow context. As a result, when these people become job seekers, they have to change this perception, otherwise they may fail to secure a new job that may have the same title but quite different content. The same applies to organizations that need to hire people but fail to do so, mainly because their job definitions are too restrictive and not in sync with the supply side of the market.

5 CONCLUSION
In this short paper we have described how we have been modeling, measuring and exploiting concept drift in a Knowledge Graph for the Labour Market domain, making the case for a more flexible, adaptable, and domain/application-dependent approach to tackling drift. We have shown that not all aspects of a concept's meaning contribute to its drift in the same way and to the same extent, thus requiring a careful analysis and selection of them for the domain and graph at hand. We have also shown how varied the outcome of measuring concept drift can be (depending on the metrics, data sources and methods/algorithms used for the measurement), suggesting nevertheless that this versatility can actually be useful and is, therefore, in need of proper management.

Our parameter-based drift management framework is still work in progress, requiring further research and development on how it can be properly operationalized within our enterprise. This includes full-fledged UI support, additional drift metrics, guidelines for interpreting and acting on the metrics, and a more formal user- and data-driven evaluation.

REFERENCES
[1] Panos Alexopoulos, Silvio Peroni, Boris Villazón-Terrazas, Jeff Z. Pan, and José Manuél Gómez-Pérez. 2014. A Metaontology for Annotating Ontology Entities with Vagueness Descriptions. In Uncertainty Reasoning for the Semantic Web III - ISWC International Workshops, URSW 2011-2013, Revised Selected Papers. 100–121. https://doi.org/10.1007/978-3-319-13413-0_6
[2] Panos Alexopoulos, Manolis Wallace, Konstantinos Kafentzis, and Dimitris Askounis. 2012. IKARUS-Onto: a methodology to develop fuzzy ontologies from crisp ones. Knowl. Inf. Syst. 32, 3 (2012), 667–695. https://doi.org/10.1007/s10115-011-0457-6
[3] Khalifeh AlJadda, Mohammed Korayem, and Trey Grainger. 2015. Improving the quality of semantic relationships extracted from massive user behavioral data. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 - November 1, 2015. 2951–2953. https://doi.org/10.1109/BigData.2015.7364133
[4] Ronald Fagin, Ravi Kumar, and D. Sivakumar. 2003. Comparing Top K Lists. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 28–36. http://dl.acm.org/citation.cfm?id=644108.644113
[5] Antske Fokkens, Serge Ter Braake, Isa Maks, and Davide Ceolin. 2016. On the Semantics of Concept Drift: Towards Formal Definitions of Concept Drift and Semantic Change. In Proceedings of the 1st Workshop on Detection, Representation and Management of Concept Drift in Linked Open Data, co-located with the 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2016), Bologna, Italy, November 2016. 10–17.
[6] Jon Atle Gulla, Geir Solskinnsbakk, Per Myrseth, Veronika Haderlein, and Olga Cerrato. 2010. Semantic Drift in Ontologies. In WEBIST 2010, Proceedings of the 6th International Conference on Web Information Systems and Technologies, Volume 2, Valencia, Spain, April 7-10, 2010. 13–20.
[7] Adam Jatowt and Kevin Duh. 2014. A Framework for Analyzing Semantic Change of Words Across Time. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '14). IEEE Press, Piscataway, NJ, USA, 229–238. http://dl.acm.org/citation.cfm?id=2740769.2740809
[8] Jouni-Matti Kuukkanen. 2008. Making Sense of Conceptual Change. History and Theory 47, 3 (2008), 351–372.
[9] Alistair Miles and Sean Bechhofer. 2009. SKOS Simple Knowledge Organization System Reference. W3C Recommendation 18 August 2009. http://www.w3.org/TR/2009/REC-skos-reference-20090818/
[10] Nic Newman. 2017. Journalism, Media, and Technology Trends and Predictions 2017. Technical Report. Reuters Institute for the Study of Journalism.
[11] Jeff Pan, Guido Vetere, Jose Manuel Gomez-Perez, and Honghan Wu. 2017. Exploiting Linked Data and Knowledge Graphs in Large Organisations. Springer International Publishing Switzerland. https://doi.org/10.1007/978-3-319-45654-6
[12] Gabriel Recchia, Ewan Jones, Paul Nulty, John Regan, and Peter de Bolla. 2016. Tracing Shifting Conceptual Vocabularies Through Time. In Drift-a-LOD@EKAW (CEUR Workshop Proceedings), Vol. 1799. CEUR-WS.org, 2–9.
[13] Thanos G. Stavropoulos, Stelios Andreadis, Efstratios Kontopoulos, Marina Riga, Panagiotis Mitzias, and Yiannis Kompatsiaris. 2016. SemaDrift: A Protégé Plugin for Measuring Semantic Drift in Ontologies. In 1st International Workshop on Detection, Representation and Management of Concept Drift in Linked Open Data (Drift-a-LOD), in conjunction with the 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW). CEUR Workshop Proceedings, Vol. 1799. Bologna, Italy, 34–41.
[14] Ljiljana Stojanovic, Alexander Maedche, Boris Motik, and Nenad Stojanovic. 2002. User-Driven Ontology Evolution Management. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web (EKAW '02). Springer-Verlag, London, UK, 285–300. http://dl.acm.org/citation.cfm?id=645362.650868
[15] Shenghui Wang, Stefan Schlobach, and Michel C. A. Klein. 2011. Concept drift and how to identify it. Journal of Web Semantics 9 (2011), 247–265.
[16] Meng Zhao, Faizan Javed, Ferosh Jacob, and Matt McNair. 2015. SKILL: A System for Skill Identification and Normalization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15). AAAI Press, 4012–4017. http://dl.acm.org/citation.cfm?id=2888116.2888273
[17] Wenjun Zhou, Yun Zhu, Faizan Javed, Mahmudur Rahman, Janani Balaji, and Matt McNair. 2016. Quantifying skill relevance to job titles. In 2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA, December 5-8, 2016. 1532–1541. https://doi.org/10.1109/BigData.2016.7840761