A Study of Intensional Concept Drift in Trending DBpedia Concepts

Albert Meroño-Peñuela
Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
albert.merono@vu.nl

Efstratios Kontopoulos
Information Technologies Institute, Thessaloniki, Greece
skontopo@iti.gr

Sándor Darányi
Swedish School of Library and Information Science, University of Borås, Borås, Sweden
sandor.daranyi@hb.se

Ioannis Kompatsiaris
Information Technologies Institute, Thessaloniki, Greece
ikom@iti.gr

© 2017 Copyright held by the author/owner(s). SEMANTiCS 2017 workshop proceedings: Drift-a-LOD, September 11-14, 2017, Amsterdam, Netherlands.
Drift-a-LOD2017, September 2017, Amsterdam, The Netherlands

ABSTRACT

Concept drift refers to the phenomenon that concepts change their intensional composition, and therefore meaning, over time. It is a manifestation of content dynamics, and an important problem with regard to access and scalability in the Web of Data. Such drifts go back to contextual influences due to social embedding, as suggested by e.g. topic analysis, news detection, and trends in social networks. Using DBpedia as a source of timestamped Linked Open Data, we analyze the interaction between a sample of popular keywords, as recorded by Google Trends, and their respective concept drifts in DBpedia. For the latter task, we deploy SemaDrift, an ontology evolution platform for detecting and measuring content dislocation dependent on context modification. Our hypothesis is that social embedding and awareness is an important trigger for concept drift in crowdsourced knowledge bases on the Web.

KEYWORDS

Concept Drift, Semantic Web, DBpedia, Wikipedia, Google Trends

1 INTRODUCTION

Rather than remaining stable, permanent, and fixed, the meaning of concepts changes over time. The Historical Thesaurus of the Oxford Dictionary of English (http://public.oed.com/historical-thesaurus-of-the-oed/) shows how the definitions attributed to words differ across periods of history. In the Dutch historical censuses (1795-1971) [15], the taxonomy of occupations shows an extraordinary variation every decade, in line with the major transformations of labor in the society of that time. We call the change of meaning of concepts over time concept drift. Concept drift can have drastic effects on the performance of a system, such as changing queries and inconsistent analyses.

What causes concept drift to occur in these systems? In the specific setting of the Semantic Web [3] (now also referred to as the Web of Data), concepts in ontologies and taxonomies are regularly updated by humans in order to “reflect changes in the real world, changes in user requirements, and drawbacks in the initial design” [23]. Hence, concept drift in semantic systems has a traceable and direct origin in humans. However, the more recent trend towards Linked Data [9] in the Semantic Web has automated the way in which semantic systems obtain their concepts, rather than relying on manually built ontologies and taxonomies. A canonical example is DBpedia [14], which relies largely on automated knowledge extraction methods to create Linked Data out of Wikipedia (https://www.wikipedia.org/). In this situation, the causes of concept drift become more difficult to trace.

There are various plausible explanations for the origin of concept drift in complex systems. One of them is the interaction of evolving context with evolving content. Social awareness (instigated by events or the media) triggers a process of knowledge sharing on the Web. This process often results in changes in knowledge bases, which may have an impact on the meaning of concepts. Wikipedia, the biggest collaboratively built knowledge base of the Web, has been criticized for “allegedly exhibiting systemic bias, presenting a mixture of truths, half truths, and some falsehoods, and, in controversial topics, being subject to manipulation and spin” [18]. It is then worth considering whether the controversy, novelty or burst of a topic has an impact on how reality is formally defined in knowledge bases derived from Wikipedia, such as DBpedia [14].

In this paper we propose a framework to measure the influence of user engagement on the Web on concept drift in crowd-sourced Web databases. We are interested in the process of public opinion influencing the feature composition of concepts as captured by automatic means. Hence, our research question is: what patterns of influence can we discern between trends in queries by Web users, and concept drift in crowd-sourced databases? To address this question, we propose a tool chain that quantifies the trendiness of Web queries, and confronts it with measures of concept drift for Linked Data. This tool chain consists of SemaDrift [20], a concept drift measuring platform; Google Trends (https://trends.google.com/), an index of the popularity of Web user queries over time; and the different versions of DBpedia accessible via Linked Data Fragments (LDF) [27].

Concretely, the contributions of this paper are:

• An automated and systematic way for retrieving time-specific concept intensions from Linked Data sources (Section 3.1);
• A framework for studying the relationship between the popularity of Web user queries and the drift of their associated concepts over time (Section 3);
• An experimental application of this framework to recent Web trending queries and the latest snapshots of DBpedia (Section 4).

2 RELATED WORK

The problems of semantic change and drift concern various research fields. In the areas of Semantic Web and knowledge representation, ontology evolution [13] addresses “the timely adaptation of an ontology and consistent propagation of changes to dependent artifacts” [1]. Features of evolution have been studied [22] and used for prediction using machine learning [17]. Gonçalves et al. [7] use Description Logics to calculate differences between ontologies (so-called semantic diffs). Wang et al. [28] define the semantics of concept change and drift, and how to identify them. General surveys of semantic change in other fields, including language, have recently appeared [20]. On the use of trends of Web user queries and changing semantics, the work by Tiddi et al. [26] illustrates the use of knowledge from the Semantic Web to explain patterns in data, in particular to find causes for trending queries in Google Trends. To the best of our knowledge, no previous work addresses the cause-effect relationship between trends and concept drift.

Standard means of observing changes in content include e.g. recognizing news in texts by topic detection and tracking [2], and new event or burst detection [16], which are in essence similar to time series analysis. Significant solutions range from extracting time-varying features from texts [24] to constructing timelines for event classification based on word usage statistics [25] and personalized newsfeeds based on information novelty [6]. In the latter, the inter- and intra-document dynamics of documents are considered in order to model how information evolves over time from article to article, as well as within individual articles. Such methods can be applied to the analysis of temporal dynamics in online text streams such as newsfeeds or e-mail [11, 12], or chronologically ordered documents [5]. These models are typically based on graph theory, vector space methods, or probability theory, and capture either the local or the global context of content as a basis for their results; our current models of content are therefore context-dependent. However, this dependency, although acknowledged, is typically not quantified, which is a precondition for improved models.

In general terms, another relevant track is research into time series of content. From a Natural Language Processing (NLP) perspective, a typical example is the study of diachronic collocations: a word’s company (its collocates) may change over time, reflecting changes in that word’s meaning and/or in the focus of the discourse in which it is embedded. However, traditional collocation extractors treat the underlying text corpus as a homogeneous whole, and thus cannot adequately account for such diachronic changes in a word’s collocation behavior, hence the need for a combination of diachrony and contextuality [10]. From an information science perspective, the study of conceptual dynamics [4] offers another comprehensive set of considerations. Through the mathematical models they exploit, both tracks preserve the underlying contextual dependency of word content or meaning, ultimately going back to Harris’ distributional hypothesis [8].

3 TRENDING CONCEPTS AND CONCEPT DRIFT

In this section we describe a workflow for studying the relationship between the popularity of Web user queries and the drift in the concepts contained therein:

(1) We use an extended LDF client to systematically retrieve time-specific concept intensions of a chosen concept C (see Section 3.2) from compatible Linked Data sources with the Linked Data Fragments backend (to the best of our knowledge, DBpedia is currently the only Linked Data source with such support);
(2) Using the concept intensions retrieved in the previous step, we use SemaDrift [19] to measure intensional concept drift over int(C). This represents how much the concept C has drifted in a certain time period;
(3) Finally, we confront values of Trend and Drift, and we observe the relationship between measurements of concept drift for the concept C, and measurements of popularity for a Web user query q(C) that matches C.

3.1 Temporal DBpedia Concepts with LDF

Wikipedia is “a free online encyclopedia with the aim to allow anyone to edit articles” (https://en.wikipedia.org/wiki/Wikipedia). Aligning with the mission of Linked Data and the Semantic Web, DBpedia [14] aims at extracting structured content from Wikipedia, providing a means for semantically querying relationships and properties of its content. We assume that this structured content of DBpedia resources formally represents the meaning of their associated concepts. In this first step, we select a concept of interest C, and we query DBpedia to get the intension of C, int(C) (i.e. its defining properties), at various points in time.

Querying massive Linked Data sources like DBpedia entails various challenges. One approach is to submit processing-intensive queries to SPARQL endpoints; another is to download and locally query massive data dumps that are possibly not up-to-date. Linked Data Fragments (LDF) provide a conceptual framework that delivers a uniform view on RDF interfaces, aiming to minimize server resource usage while still enabling clients to query data sources efficiently [27]. In this work, we have deployed an openly available Java LDF client (https://github.com/LinkedDataFragments/Client.Java), which we extended for measuring intensional drift via the SemaDrift API.

3.2 Concept Drift and SemaDrift

To measure concept change between two versions of an ontology, we use the concept drift framework proposed by Wang et al. [28], which quantifies the change of meaning of concepts over time. In this framework, the meaning of a concept C is defined as the combination of its intension, extension, and label. The intension of C, int(C), is the set of formal, explicit properties that axiomatically define C.
The extension of C, ext(C), is the set of its instances. The label of C, label(C), is a human-readable string representing C.

Over time, int(C), ext(C) and label(C) can change, and compromise the identity and traceability of C. To address this, the framework assumes that int(C) is the disjoint union of a rigid and a non-rigid set of properties, int(C) = int_r(C) ∪ int_nr(C). int_r(C) uniquely identifies C by some essential properties that do not change. This allows the comparison of two variants of a concept at different points in time, even if int_nr(C), ext(C) or label(C) change.

If two variants of C at two different times have identical int(C), ext(C) and label(C), then there is no concept drift. Otherwise, the framework defines intensional, extensional, and label similarity functions sim_int ↦ [0, 1], sim_ext ↦ [0, 1], sim_label ↦ [0, 1] to quantify meaning similarity. Then, there is extensional (intensional, label) concept change between two variants of C, C′ and C″, iff sim_ext(C′, C″) ≠ 1 (respectively sim_int(C′, C″) ≠ 1, sim_label(C′, C″) ≠ 1).

Using the above definitions as its foundation, SemaDrift [20] constitutes a cutting-edge suite of metrics and tools for measuring concept drift in different versions of an ontology, under an ontology evolution perspective. As demonstrated in [21], SemaDrift is domain-agnostic: its underlying metrics and methods can be applied to any ontology, whatever its domain of application. The platform consists of (a) an API for programmatically accessing the core drift measuring methods, (b) a Protégé plug-in [19], and (c) a standalone desktop application. The full suite is available at http://mklab.iti.gr/project/semadrift-measure-semantic-drift-ontologies.

In this work we deploy the core SemaDrift API, and we monitor only the intensional drifts of DBpedia concepts: each DBpedia entry is essentially a class instance with associated properties, so it makes no sense to measure drift in its extension (instances have no extension) or in its label (entries in DBpedia maintain their labels unaltered).

3.3 Confronting Trends with Drift

Google Trends (GT) is a Web service that shows how often a particular search-term is entered relative to the total search-volume of the Google Search engine. For example, it is possible to compare the relative volume of queries between the search terms Donald Trump and climate change in a certain time period. These relative volumes of search-terms are given on a scale from 0 (no volume) to 100 (maximum volume). In order to obtain them, a matching needs to be made between the chosen concept of interest C and its corresponding search-term query, q(C), which is not trivial. For example, the DBpedia concept Terrorism in the European Union only matches the search-term terrorism in europe in GT. In this workflow we align C and q(C) manually. Next, we normalize the GT scores by picking a comparatively popular and stable topic over time that sets the maximum score (e.g. iPhone); we do this by using GT’s Most searched feature over matching time periods. All subsequent trend scores for other concepts are relative to this reference concept. We define the GT score for a concept C at time t as GT(C, t). Finally, we define the two proxies of popularity and trendiness of a concept, p(C) and t(C), as the arithmetic mean and standard deviation over the GT scores, respectively, where n is the number of time points t in the period:

$$p(C) = \frac{1}{n} \sum_{t} GT(C, t), \qquad t(C) = \sqrt{\frac{1}{n} \sum_{t} \left( GT(C, t) - p(C) \right)^{2}}$$

4 PRELIMINARY EVALUATION

In order to evaluate our framework, we propose a preliminary experiment to measure the relationship between Web user queries and the intensional concept drift of DBpedia concepts between January 2014 and April 2016. With this we adapt Harris’ distributional hypothesis to RDF statements, i.e. we assume that intensional concept drifts go back to the social embedding of the detection environment; in other words, the feature composition of concepts is context-dependent. To do so, we sample a small (N = 11) set of DBpedia concepts C and the GT scores of their equivalent search-terms q(C) over that period.

The chosen concepts, together with their GT scores over time, are shown in Figure 1. We chose these concepts considering one interest group, with both trendy and popular concepts (iPhone, Donald Trump, Pokemon); and a control group, with concepts of scarce trendiness and popularity (Mona Lisa, Colonization of Mars, Battle of Stalingrad).

Figure 1: Chosen concepts and GT scores (2014-01 to 2016-04).

4.1 Results

We use SemaDrift to calculate the intensional concept drift values of int(C) for the chosen set of concepts of Figure 1 (a detailed table with all drifting values and relevant predicates can be found at https://goo.gl/yQ531r). Figure 2 confronts these intensional concept drift values with their popularity/trendiness p(C), t(C) scores derived from GT.

In Figure 2 we can observe an expected distribution over the x-axis of non-trendy vs. trendy concepts, to the left and the right, respectively. However, the patterns of intensional concept drift with respect to variations in trends are not as expected.
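The popularity and trendiness proxies p(C) and t(C), defined as the arithmetic mean and standard deviation over the GT scores, can be illustrated with a few lines of Python. This is a minimal sketch, assuming the scores have already been normalized against the reference concept and that the standard deviation is the population form (division by n); the function names and the two score series are hypothetical and are not real Google Trends data.

```python
from math import sqrt
from typing import Sequence


def popularity(gt_scores: Sequence[float]) -> float:
    """p(C): arithmetic mean of the GT scores of a concept."""
    return sum(gt_scores) / len(gt_scores)


def trendiness(gt_scores: Sequence[float]) -> float:
    """t(C): population standard deviation (1/n) of the GT scores."""
    p = popularity(gt_scores)
    return sqrt(sum((g - p) ** 2 for g in gt_scores) / len(gt_scores))


# Invented monthly scores for two hypothetical concepts (not real GT data).
steady = [4, 4, 4, 4, 4, 4, 4, 4]
bursty = [2, 4, 4, 4, 5, 5, 7, 9]

for name, scores in (("steady", steady), ("bursty", bursty)):
    print(name, popularity(scores), trendiness(scores))
```

A concept whose search volume never changes has zero trendiness regardless of its popularity, which is why the two proxies are kept separate when they are set against drift values.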
Contrary to expectation, the highest concept drift measurements correspond to concepts with the lowest popularity/trend scores. In particular, concepts like Mona Lisa, climate change and Battle of Stalingrad have very low t(C) scores (0.13, 0.33, 0.09) but very high concept drift (1.56, 1.73, 1.74). Conversely, concepts with the highest t(C) scores, such as Donald Trump (8.99), Pokemon (2.29) and iPhone (6.75), have increasing values of concept drift (1.53, 1.16, 1.36) but never reach that of the non-trendy concepts. Less popular but very trendy concepts such as Donald Trump change their relevance when observing p(C), but the tendency to score less concept drift prevails.

Figure 2: Trendiness vs. intensional concept drift.

These two unexpected patterns could be explained by the experts vs. crowds hypothesis. Under this hypothesis, the most significant edits in Wikipedia on a concept C (which result in high drift scores) are poorly explained by querying trends over C, and are instead related to a small number of Wikipedia curators (the “experts”) taking care of domain-expert content (e.g. Mona Lisa, Battle of Stalingrad). Experts would thus be responsible for concept drift in less trendy topics. However, the “crowds” seem to be able to influence concept drift approximately linearly (Pokemon, iPhone, Donald Trump) beyond a certain trendiness threshold. This would explain high-quantity/low-quality edits in Wikipedia derived from controversy and popularity, and relates to the popularity required for non-experts to produce some increasing concept drift. Despite this, the highest trend values do not seem to involve deep intensional changes in concepts, which only occur in expert-curated, low-trendiness concepts.

5 CONCLUSION AND FUTURE WORK

In this paper, we study the influence of trending Web queries on the fundamental properties of collaborative Web knowledge bases. For the period from January 2014 to April 2016 and a small sample of concepts with variable popularity, we find patterns that fit the possible explanation of two conflicting trends (“experts vs. crowds”) with competing influence on intensional concept drift. We plan to add scalability to our framework in order to confirm the above findings, and to investigate automatic mapping methods between concepts and their corresponding search-term queries.

REFERENCES

[1] Alexander Maedche, Boris Motik, and Ljiljana Stojanovic. 2003. Managing multiple and distributed ontologies in the Semantic Web. The VLDB Journal: The International Journal on Very Large Data Bases 12, 4 (2003), 286–300.
[2] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. 1998. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.
[3] Tim Berners-Lee, James Hendler, and Ora Lassila. 2001. The Semantic Web. Scientific American 284, 5 (2001), 34–43.
[4] S. Darányi and P. Wittek. 2013. Demonstrating Conceptual Dynamics in an Evolving Text Collection. Journal of the American Society for Information Science and Technology 64, 12 (2013), 2564–2572. DOI: http://dx.doi.org/10.1002/asi.22940
[5] G.P.C. Fung, J.X. Yu, P.S. Yu, and H. Lu. 2005. Parameter free bursty events detection in text streams. In Proceedings of VLDB-05, 31st International Conference on Very Large Data Bases. Trondheim, Norway, 181–192.
[6] E. Gabrilovich, S. Dumais, and E. Horvitz. 2004. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In Proceedings of WWW-04, 13th Int. Conf. on the World Wide Web. New York City, NY, USA, 482–490.
[7] R. S. Gonçalves, B. Parsia, and U. Sattler. 2011. Analysing Multiple Versions of an Ontology: A Study of the NCI Thesaurus. In Proceedings of the 24th Int. Workshop on Description Logics (DL 2011), Vol. 745. CEUR Workshop Proceedings.
[8] Z. Harris. 1970. Distributional structure. In Papers in Structural and Transformational Linguistics, Z. Harris (Ed.). Humanities Press, NY, USA, 775–794.
[9] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space (1st ed.). Morgan and Claypool. 1–136 pages.
[10] Bryan Jurish. 2016. Diachronic Collocations and Genre: a case for DiaCollo?. In Diachronic Corpora, Genre, and Language Change, Richard Jason Whitt (Ed.). 22–24. http://kaskade.dwds.de/~jurish/pubs/jurish2016genre.pdf
[11] J. Kleinberg. 2003. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7, 4 (2003), 373–397.
[12] J. Kleinberg. 2006. Temporal dynamics of on-line information streams. Data Stream Management: Processing High-Speed Data Streams (2006).
[13] P. De Leenheer and T. Mens. 2008. Ontology Evolution: State of the Art and Future Directions. In Ontology Management for the Semantic Web, Semantic Web Services, and Business Applications. Springer.
[14] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. 2014. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web - Interoperability, Usability, Applicability (2014). http://www.semantic-web-journal.net/system/files/swj558.pdf
[15] Albert Meroño-Peñuela, Christophe Guéret, Ashkan Ashkpour, and Stefan Schlobach. 2015. CEDAR: The Dutch Historical Censuses as Linked Open Data. Semantic Web - Interoperability, Usability, Applicability (2015). In press.
[16] R. Papka. 1999. On-line new event detection, clustering, and tracking. Ph.D. Dissertation. University of Massachusetts Amherst.
[17] Catia Pesquita and Francisco M. Couto. 2012. Predicting the Extension of Biomedical Ontologies. PLoS Computational Biology 8, 9 (2012), e1002630.
[18] Michael Petrilli. 2008. Wikipedia or Wickedpedia? Education Next 8, 2 (2008). http://educationnext.org/wikipedia-or-wickedpedia/
[19] T. G. Stavropoulos, S. Andreadis, E. Kontopoulos, M. Riga, P. Mitzias, and I. Kompatsiaris. 2017. The SemaDrift Protégé Plugin to Measure Semantic Drift in Ontologies: Lessons Learned. In Knowledge Engineering and Knowledge Management (EKAW 2016), Vol. 10180. Springer, Cham, 29–39.
[20] T. G. Stavropoulos, S. Andreadis, M. Riga, E. Kontopoulos, P. Mitzias, and I. Kompatsiaris. 2016. A Framework for Measuring Semantic Drift in Ontologies. In Proceedings of SuCCESS-16, 1st Int. Workshop on Semantic Change & Evolving Semantics, co-located with the 12th European Conference on Semantics Systems (SEMANTiCS-16).
[21] T. G. Stavropoulos, E. Kontopoulos, A. Meroño-Peñuela, S. Tachos, S. Andreadis, and I. Kompatsiaris. 2017. Cross-domain Semantic Drift Measurement in Ontologies Using the SemaDrift Tool and Metrics. In 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2017).
[22] Ljiljana Stojanovic. 2004. Methods and Tools for Ontology Evolution. Ph.D. Dissertation. University of Karlsruhe.
[23] Ljiljana Stojanovic and Boris Motik. 2002. Ontology Evolution within Ontology Editors. In Evaluation of Ontology-based Tools Workshop, 13th Int. Conf. on Knowledge Engineering and Knowledge Management (EKAW 2002), Vol. 62. CEUR-WS.
[24] R. Swan and J. Allan. 1999. Extracting significant time varying features from text. In Proceedings of CIKM-99, 8th International Conference on Information and Knowledge Management. Kansas City, MO, USA, 38–45.
[25] R. Swan and D. Jensen. 2000. Timemines: Constructing timelines with statistical models of word usage. In Proceedings of KDD-2000 Workshop on Text Mining. Boston, MA, USA, 73–80.
[26] Ilaria Tiddi. 2016. Explaining Data Patterns using Knowledge from the Web of Data. Ph.D. Dissertation. Knowledge Media Institute, The Open University.
[27] R. Verborgh, M. van der Sande, O. Hartig, J. Van Herwegen, L. De Vocht, B. De Meester, G. Haesendonck, and P. Colpaert. 2016. Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web. Journal of Web Semantics 37–38 (2016), 184–206. DOI: http://dx.doi.org/10.1016/j.websem.2016.03.003
[28] S. Wang, S. Schlobach, and M. C. A. Klein. 2010. What Is Concept Drift and How to Measure It?. In Knowledge Engineering and Management by the Masses - 17th Int. Conf., EKAW 2010. Proceedings. LNCS 6317, Springer, 241–256.