=Paper=
{{Paper
|id=Vol-2063/dal-paper4
|storemode=property
|title=Exploring Concept Representations for Concept Drift Detection
|pdfUrl=https://ceur-ws.org/Vol-2063/dal-paper4.pdf
|volume=Vol-2063
|authors=Oliver Becher,Laura Hollink,Desmond Elliott
|dblpUrl=https://dblp.org/rec/conf/i-semantics/BecherHE17
}}
==Exploring Concept Representations for Concept Drift Detection==
Oliver Becher (Centrum Wiskunde & Informatica, Amsterdam, The Netherlands) becher@cwi.nl
Laura Hollink (Centrum Wiskunde & Informatica, Amsterdam, The Netherlands) hollink@cwi.nl
Desmond Elliott (University of Edinburgh, United Kingdom) d.elliott@ed.ac.uk

ABSTRACT
We present an approach to estimating concept drift in online news. Our method is to construct temporal concept vectors from topic-annotated news articles, and to correlate the distance between the temporal concept vectors with edits to the Wikipedia entries of the concepts. We find improvements in the correlation when we split the news articles based on the number of articles mentioning a concept, instead of calendar-based units of time.

KEYWORDS
Concept drift, Vector representations, News, Wikipedia edits

© 2017 Copyright held by the author/owner(s). SEMANTiCS 2017 workshop proceedings: Drift-a-LOD, September 11-14, 2017, Amsterdam, Netherlands.

1 INTRODUCTION
Concepts in Knowledge Organisation Systems (KOSs) are used to provide structured annotations and background knowledge in a wide variety of applications. They enhance interoperability between datasets and enable structured access to annotated document collections. These benefits, however, are compromised when concept change (or drift) occurs. Wang et al. [8] define three types of concept drift: (1) change in the intension of the concept, defined as the definition or the properties of the concept; (2) change in the extension, or the instances, of a concept; and (3) change in the label of the concept. Each type of concept drift may lead to problems for applications working with KOSs. For example, an annotation of a document may become invalid if the intension of the concept changes. Correspondences between two concepts in different KOSs may become incorrect if the extension of one of them changes. A user's keyword query on a historic corpus may be interpreted incorrectly if the (prevalent) label used to refer to a concept has changed.

Significant progress has been made in the detection of meaning change of words (e.g., [3, 9]). These methods are based on distributional semantics, where the meaning of a word is defined as the context in which it appears. A change in context over time may then signify a change in meaning. In this paper, we study change in the meaning of concepts in a KOS. Drawing inspiration from work on word-change detection, we aim to explore whether the change of a concept can be measured from changes in how it appears in the context of a document collection. This is different from other work on concept change in KOSs in the sense that we ignore changes in the structure of the KOS.

This paper is an initial step towards understanding how the context of a concept can be represented to effectively capture concept change. Our representation is based on the co-occurrence between concepts that appear as annotations of documents in a diachronic collection: two concepts co-occur if they are annotations of the same document. Hence, a concept can be seen as a vector of co-occurrence counts with other concepts in the KOS. Concept change can then be measured by comparing vectors created for different time spans in the collection. We experiment with various versions of this basic idea, and apply it to detect change in an annotated document collection: the ION dataset of 300k online news articles, annotated with Wikipedia pages [4]. To evaluate our method, we use Wikipedia edit counts. This is based on the idea that a Wikipedia article is edited when a change to the page was needed; hence, a higher number of edits may signify a change in the underlying concept. Generally speaking, evaluation of concept drift detection methods is hampered by a lack of large-scale evaluation datasets. Wikipedia edits are not to be seen as a gold standard of concept drift. While some edits might be due to a change in the concept, others might be, for example, additions of missing information or corrections of previous mistakes. Our assumption is that even though Wikipedia edit counts are a noisy signal with respect to concept change, a correlation between our change scores and the edit counts does say something about the effectiveness of our method.

2 REPRESENTING CONCEPTS

2.1 Creating Concept Vectors
Given a concept vocabulary C with N concepts, we create vector representations of the concepts through their usage in a document collection.

We assume there is a collection of time-ordered documents D. A document d_i is annotated with M topic annotations t_1, ..., t_M, drawn from a total of T topics. Each document in the collection can be represented as a binary document topic vector, d_i ∈ R^{1×T}. An element in the document topic vector takes a value of 1 if the document has been annotated with that topic. We also assume a function f: T → C that maps between the topic annotations and the concept vocabulary.

We construct a concept vector c_j for each concept in our vocabulary c_1, ..., c_N from co-occurrence counts of the topic annotations in documents in the document collection. The set of concept vectors forms a sparse matrix C ∈ R^{N×N}, where each row defines a concept through co-occurrence with other concepts.

Our concept vectors are co-occurrence counts. We reduce the effect of frequently occurring concepts by re-weighting the vectors using a TF-IDF-like weighting scheme, so that tf-idf(c_i, c_j) = tf(c_i, c_j) · idf(c_i), where tf(c_i, c_j) is the number of times that concept c_i co-occurs with concept c_j, and idf(c_i) = log(N / (df(C, c_i) + 1)), with df(C, c_i) the count of c_i concept annotations in the entire concept vocabulary C.

2.2 Temporal Concept Vectors
Recall that we are interested in measuring the change in the meaning of a concept over time. We redefine C to include a temporal dimension, V ∈ R^{N×N×K}, where the third dimension represents K units of time, and Σ_k V_k = V ∈ R^{N×N}. There are many ways to define K: the document collection can be split into days, weeks, months, or any other valid approach to splitting the collection according to the sequential ordering of the documents. Note that the co-occurrence statistics over topic annotations need to be calculated such that only documents timestamped between consecutive units of time are used in the calculation, i.e. t=s_1 and t=s_2 are used to define a temporal concept vector v_{j,s_2} at t=s_2.

2.3 Temporal Vector Distance
We measure the change in the meaning of concepts by comparing the vectors in the temporal concept matrix between subsequent units of time. Specifically, we measure the change in a concept c_j between time s and s−1 using a similarity metric sim(·, ·):

    distance(v_j, s, s−1) = sim(v_{j,s}, v_{j,s−1})    (1)

We experiment with two similarity metrics: cosine similarity, previously used to detect concept drift [7], and KL-divergence (when the vectors represent distributions).

3 APPLICATION TO AN ANNOTATED NEWS COLLECTION

3.1 Dataset and Model Application
We explore our method for constructing concept representations and measuring concept change with a dataset of online news articles [4]. This dataset contains news articles together with topic annotations and images in their natural textual context. The richness of information and metadata in this dataset offers many ways to define and explore concepts, while its well-defined structure helps to use it reliably and consistently.

The dataset contains articles published online between August 2014 and August 2015. In total, it includes more than 300K articles from five publishers across British and US English sources: Daily Mail, The Independent, New York Times, Huffington Post, and the Washington Post. The articles are annotated with topics using TextRazor (http://www.textrazor.com). TextRazor uses Wikipedia as a topic vocabulary. This vocabulary ranges from narrowly defined concepts, e.g., The United States Women's Soccer Team or Electromagnetism, to broader concepts, e.g., Sport or Science. The average number of topic annotations per article is 25 broad 'Category pages' and 5 specific (non-category) pages, giving in total 122,000 distinct topic annotations on all articles.

We define our concept vocabulary C as a subset of TextRazor's topic vocabulary T: we retain only topics that are associated with at least 2 articles. In preliminary experiments, we found that concepts that are associated with too few articles have sparse representations, resulting in unrealistic change scores between the representations. This leaves us with N=70,000 concepts. The mapping function f: T → C is trivial in this case. However, the structured nature of Wikipedia, and the links that it provides to other concept vocabularies, provide starting points for other mapping functions, allowing us to explore other concept vocabularies in the future.

We construct concept vectors using the method outlined in Section 2. The vocabulary of the concept vectors is defined over the Wikipedia entries, therefore it is trivial to map the topic annotations to the concept vectors.

3.2 Visualization
To visualize the change that a concept c has undergone, we create a stream graph [1] of the temporal concept vectors of c. Figure 1, for example, plots the temporal vectors of the Wikipedia concept Police. Each 'stream' represents a concept that co-occurs with Police in the document collection. The thickness of the line represents the co-occurrence count at a certain time period. Since stream graphs are suited to convey changes over time of only a limited number of concepts, we select only those that occur most frequently. Specifically, we create 'streams' for only those concepts that are among the top 5 most frequently co-occurring concepts in any of the temporal concept vectors of concept c.

In Figures 1, 2, and 3 we plot two concepts for which the average change is low (measured as a high average cosine similarity between 12 temporal vectors) and one for which the average change score is high. Figure 1 shows that Police is a stable concept: the top five most frequently occurring concepts remain frequent throughout the year, and the volume of documents in which they co-occur hardly fluctuates. However, a concept might change on a larger time scale than given in the data. Nonetheless, Police seems to be more stable than other concepts in the time span.

The concept Labour_Party (Figure 2) is stable as well: although there is a burst in the volume of documents about this concept, there is hardly a change in which concepts co-occur in these documents. In other words, there is change in how much reporting there is about the Labour_Party, but not in how it is reported.

Figure 3 shows the streamgraph of New York University. We can see that the most co-occurring topics are constantly changing in the streamgraph, both in periods with a high volume of documents and in periods with a low volume of documents. This suggests changes in both how much and how New York University has been reported in the news.

4 TOWARDS A QUANTITATIVE EVALUATION

4.1 Measuring Concept Change
Concept change detection is hard to evaluate for a lack of gold standard datasets [5]. Kenter et al. [6] use a small set of 21 human-judged change scores. Frermann and Lapata [2] indirectly evaluate change detection by using it in an application for which a gold standard exists, namely the SemEval task for dating text.
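The pipeline of Sections 2.1-2.3 (counting topic co-occurrences per time bin, damping frequent concepts with the TF-IDF-like weight, and scoring change between consecutive bins) can be sketched as follows. This is a minimal, standard-library-only sketch; the function names are hypothetical and this is not the implementation used in the experiments.

```python
from collections import defaultdict
from itertools import combinations
import math

def temporal_cooccurrence(docs, num_bins):
    """Split a time-ordered list of topic-annotation sets into num_bins
    equal-sized bins of documents and count, per bin, how often each
    pair of topics annotates the same document (Sections 2.1-2.2)."""
    bins = [defaultdict(lambda: defaultdict(int)) for _ in range(num_bins)]
    per_bin = max(1, len(docs) // num_bins)
    for i, topics in enumerate(docs):
        k = min(i // per_bin, num_bins - 1)
        for a, b in combinations(sorted(topics), 2):
            bins[k][a][b] += 1
            bins[k][b][a] += 1
    return bins

def idf(df_count, num_concepts):
    """TF-IDF-like damping term: idf(c_i) = log(N / (df(C, c_i) + 1))."""
    return math.log(num_concepts / (df_count + 1))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def change_scores(bins, concept):
    """Per Eq. 1: similarity of a concept's vectors in consecutive bins.
    A low cosine similarity signals a large change."""
    return [cosine(bins[k - 1][concept], bins[k][concept])
            for k in range(1, len(bins))]
```

For example, a concept whose co-occurrence neighbours are identical in every bin scores a cosine similarity of 1.0 between consecutive bins, while a complete shift in neighbours scores 0.0.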
To the best of our knowledge, large-scale datasets to directly evaluate change detection do not exist.

For our application, we explore the use of Wikipedia edit rates to evaluate our method of concept change representation. We believe that the act of editing a Wikipedia page can signal a change in the information that is relevant to that entry.

Specifically, given a concept c and a pre-defined K units of time, we measure change scores as the consecutive temporal vector distances for concept c (Section 2.3); then, we count the number of Wikipedia edits to the aligned article during each of the K units of time. We evaluate our method by measuring the Spearman correlation between the change scores (i.e. the temporal vector distances) and the Wikipedia edit counts. The higher the correlation, the more accurately the temporal concept vectors can estimate the rate of change of the Wikipedia entries.

We perform an experiment on 964 concepts. Since it seems likely that the number of articles that a concept is related to plays a role, we draw a stratified random sample from our concept vocabulary to include both frequently and infrequently used concepts. We select three strata of even size. Group 1 contains concepts which are related to more than 500 articles. Group 2 contains concepts which are related to at least 200 articles but not more than 500. Group 3 contains concepts with at least 24 articles but less than 200. The sample includes only concepts that map to 'regular' Wikipedia pages and not Category pages.

Figure 4 plots the number of articles that a concept is related to against the average cosine similarity between the temporal vectors of that concept. This shows that the more frequently a concept is used as an annotation, the higher the average cosine similarity, i.e., the lower the change. This is analogous to the change of meaning of words [3], where the semantic change of words scales with inverse frequency, known as the law of conformity.

We compare four models, each with different settings regarding the way that time units are set, the use of TF-IDF, and the choice of similarity measure (either cosine similarity or KL-divergence).

Figure 1: WP:Police streamgraph shows a stable set of top-5 concepts in its temporal vectors. (See Section 3.2 for more details.)
Figure 2: WP:Labour_Party_(UK) streamgraph has a stable set of top-5 concepts but a lot of activity centered around a specific time.
Figure 3: WP:New_York_University streamgraph undergoes a constantly shifting concept representation in our dataset.
Figure 4: Scatter plot of average cosine similarities and annotation counts of all concepts.

4.2 Models

4.2.1 Fixed Time Bins (Cosine). Starting with the most basic setup of our method, we calculate temporal concept vectors for time frames (or bins) of a fixed duration. With n time frames, each frame covers an n-th of the year spanned by the dataset. For example, with 52 frames, each frame covers exactly one week. We use the cosine similarity to calculate change scores between each temporal concept vector.

4.2.2 Flexible Time Bins (Cosine). In this model, we calculate temporal concept vectors for time periods that each cover a fixed number of articles. Thus, time frames differ in length in days rather than in amount of data. The number of articles per bin depends on the total number of articles available per concept. Analogous to Fixed Time Bins, we create n bins, therefore we assign an n-th of the total number of articles to each bin. However, a concept may have a number of articles that does not split evenly into n bins; thus, it may be split into more than n bins. With these vectors, we use the cosine similarity to calculate change scores. We use the same time frames to bin the Wikipedia edits and estimate a correlation.

4.2.3 Flexible Time Bins (No TF-IDF, Cosine). Exactly the same as Flexible Time Bins (Cosine), except we do not re-weight the temporal concept vectors using TF-IDF.

4.2.4 Flexible Time Bins (KL-divergence). This model is identical to Flexible Time Bins, except we measure the distance between temporal concept vectors using Kullback-Leibler (KL) divergence instead of cosine similarity.

Run / n bins              100     52      24      12      6
Fixed Time Bins (Cos)     0.07    0.18    0.22    0.36    0.26
Flexible Time Bins (Cos)  -0.2    -0.19   -0.14   0.03    0.33
- TF-IDF (Cos)            -0.2    -0.2    -0.13   0.0     0.26
Flexible Time Bins (KL)   0.23    0.25    0.29    0.19    -0.3

Table 1: Average Spearman correlation between concept similarity scores and Wikipedia edits. Negative correlations are good for cosine similarity; positive correlations are good for KL-divergence.

4.3 Results
We collect Spearman correlation coefficients for 964 concepts using different numbers of time frames (6, 12, 24, 52, and 100). Table 1 shows the average correlation over concepts that are significantly correlated with Wikipedia edits. Note that the experiments with the cosine similarity measure between temporal concepts should return a negative correlation, while the experiments with the KL-divergence distance should return a positive correlation. The results in Table 1 show that the performance of the models decreases as we decrease the number of time bins.

The Fixed Time Bins (Cosine) model only returns positive correlations, indicating that fixed units of time (in this case, splitting the articles into months) do not act as a reliable proxy for concept change in our dataset. The Flexible Time Bin experiments (Cosine) and (-TF-IDF) are better correlated with Wikipedia edits than the Fixed Time Bin model. We do not find a difference from not re-weighting the concept vectors using TF-IDF. Finally, we find a small improvement from using KL-divergence as the temporal vector distance metric instead of cosine similarity. Throughout, we can see that the number of temporal bins n is a crucial parameter in our experiment.

We performed a follow-up analysis of the effect of the number of temporal bins. The histograms in Figures 5a and 5b show the distributions of the Spearman correlations for the Flexible Time Bins (KL) model with n=12 or n=100. We find that the ratio of positive correlations to negative correlations is substantially improved by having more time bins. More time bins clearly improve the quality of the concept vectors.

Figure 5: Distribution of Spearman correlations for Flexible Time Bins (KL) with (a) n=12 or (b) n=100 time bins. The ratio of positive/negative correlations is much improved by having more time bins.

5 CONCLUSION AND FUTURE WORK
We explored concept change using vector space concept representations. The concept vectors were constructed from topic co-occurrence in a large collection of online news articles. We introduced a temporal aspect to the vectors by requiring the co-occurrences to happen within pre-defined windows of time. We explored to what extent concept change can be evaluated by correlating the distance between a concept's temporal concept vectors and edits to the Wikipedia article corresponding to the concept.

We found that a flexible approach to defining a window of time was more successful than using calendar-based windows of time. We also found that having more windows of time resulted in better correlations between the temporal vector distances and Wikipedia article edits.

Future work includes an analysis of which types of concepts correlate with Wikipedia edit counts, to get more insight into the use of Wikipedia as an evaluation tool. Similarly, we could look into the types of edits made on Wikipedia to distinguish actual change from simple growth of an article.

REFERENCES
[1] Lee Byron and Martin Wattenberg. 2008. Stacked Graphs - Geometry and Aesthetics. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1245–1252.
[2] Lea Frermann and Mirella Lapata. 2016. A Bayesian Model of Diachronic Meaning Change. Transactions of the ACL 4 (2016), 31–45.
[3] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096 (2016).
[4] Laura Hollink, Adriatik Bedjeti, Martin van Harmelen, and Desmond Elliott. 2016. A Corpus of Images and Text in Online News. (2016).
[5] Laura Hollink, Sándor Darányi, Albert Meroño Peñuela, and Efstratios Kontopoulos. 2017. First Workshop on Detection, Representation and Management of Concept Drift in Linked Open Data: Report of the Drift-a-LOD2016 Workshop: Front Matter. In Knowledge Engineering and Knowledge Management. EKAW 2016 (Lecture Notes in Computer Science), Vol. 10180. 15–18.
[6] Tom Kenter, Melvin Wevers, Pim Huijnen, and Maarten de Rijke. 2015. Ad hoc monitoring of vocabulary shifts over time. In Proceedings of the 24th International Conference on Information and Knowledge Management. 1191–1200.
[7] Astrid van Aggelen, Laura Hollink, and Jacco van Ossenbruggen. 2016. Combining distributional semantics and structured data to study lexical change. In European Knowledge Acquisition Workshop. Springer, 40–49.
[8] Shenghui Wang, Stefan Schlobach, and Michel Klein. 2011. Concept drift and how to identify it. Web Semantics: Science, Services and Agents on the World Wide Web 9, 3 (2011), 247–265. https://doi.org/10.1016/j.websem.2011.05.003
[9] Yating Zhang, Adam Jatowt, and Katsumi Tanaka. 2016. Towards understanding word embeddings: Automatically explaining similarity of terms. In 2016 IEEE International Conference on Big Data (Big Data). 823–832.
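The flexible binning described above, where each bin holds a fixed number of a concept's articles rather than a fixed span of calendar days, could look roughly as follows. This is a sketch with a hypothetical helper name; the paper does not specify how a remainder is handled, so here it becomes an extra smaller bin, which matches the note that a concept may be split into more than n bins.

```python
def flexible_bins(article_ids, n):
    """Split one concept's time-ordered article list into bins that each
    hold an equal number of articles (rather than equal spans of
    calendar time). A remainder that does not divide evenly into
    n bins becomes an extra, smaller bin."""
    size = max(1, len(article_ids) // n)
    return [article_ids[i:i + size]
            for i in range(0, len(article_ids), size)]
```

For 10 articles and n=3, this yields bins of sizes 3, 3, 3, and 1, i.e. more than n bins.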