Tracing Shifting Conceptual Vocabularies Through Time Gabriel Recchia*, Ewan Jones*, Paul Nulty*, John Regan*, and Peter de Bolla* * The Concept Lab, CRASSH, University of Cambridge, Cambridge, United Kingdom {glr29,ejj25,pgn26,jjr35,pld20}@cam.ac.uk Abstract. This paper presents work in progress on an algorithm to track and identify changes in the vocabulary used to describe particular concepts over time, with emphasis on treating concepts as distinct from changes in word meaning. We apply the algorithm to word vectors generated from Google Books n-grams from 1800-1990 and evaluate the induced networks with respect to their flexibility (robustness to changes in vocabulary) and stability (they should not leap from topic to topic). Finally, we describe work in progress using the British National Biography Linked Open Data Serials to construct a “ground truth” evaluation dataset for algorithms which aim to detect shifts in the vocab- ulary used to describe concepts. Keywords: concepts · word embeddings · Linked Open Data 1 Introduction Some influential theories of conceptual structure, such as the so-called name priority view [1] and some interpretations of the classical theory of concepts [2], treat con- cepts 1 as essentially in one-to-one correspondence to word senses [3,4,5]. On this view, one word might have several different senses and thereby correspond to several different concepts, but it is nonetheless possible to identify concepts via a careful examination of word meanings. Some modern philosophers and psychologists have made convincing arguments that this view is overly simplistic or flat-out wrong [1,6]. Even if one does believe in a direct correspondence between word senses and con- cepts, however, it is clear that a change in word sense does not necessarily entail a change in the concept that was originally associated with it. For example, the word 1 Rather than thinking of concepts in a way that strongly links them to a particular lexeme (e.g., “the concept of justice”), we have argued elsewhere that it is preferable to think of concepts (at least insofar as they are expressed in discourse) in terms of their functions, one of which is to permit two interlocutors to sense that they have arrived at a common understanding of the matter under discussion. This is rather different and more abstract than the notion of a concept as being equivalent to a class in a classical ontology, and more specific than a theme or topic. However, for purposes of clarity and compatibility with the way related work speaks about “concepts,” our use of the word in this paper roughly conforms to the vague OED definition of “a general idea or notion.” We are explicitly not using it to refer to “the meaning that is realized by a word or expression.” adfa, p. 1, 2011. © Springer-Verlag Berlin Heidelberg 2011 broadcast started to change from having the meaning of “scattering [seed] abroad over the whole surface, instead of being sown in drills or rows” to being associated with the transmission of radio or television signals in the 1920s [7,8]. However, the fact that the primary sense of broadcast changed did not mean that the concept of sowing seeds over a wide area went away. Similarly, it seems clear that a culture could possess a particular concept even if no corresponding word or collocation exists in the primary language spoken by members of that culture. The distinction between word senses and concepts is an important one to draw be- cause, as pointed out by Wevers et al. [9], some computational approaches described as methods for detecting changes in “concepts” are often actually methods for detect- ing changes in the use of a single word or an unchanging group of words over time. Because word senses change over time, a change in the frequency or lexical associa- tions of a particular word does not necessarily entail a change in the concept of inter- est. Being able to track concepts over time in a way that is robust to shifting vocabu- laries is therefore essential. Methods for detecting conceptual change in time-varying textual sources are particu- larly relevant in the context of Linked Open Data (LOD). To assist in the maintenance of LOD ontologies, knowledge engineers may wish to use time-varying text corpora, such as academic journals or news sources, to monitor conceptual change over time. Consider someone who maintains an ontology intended to represent relationships between various concepts in the neuroscience literature, who notices that this year there has been a marked uptick in the frequency of particular words that previously occurred only rarely. Does this merit the addition of a new class to the ontology? Or is this simply novel language for describing an old idea? Ultimately, this must come down to human judgment, but automatic methods for assisting with the decision could highlight important related classes already represented in the knowledgebase. This paper presents work in progress toward an algorithm to track vocabulary associ- ated with particular concepts over time in a flexible and stable way. In the next sec- tion, we describe related work, particularly a promising model recently developed by [10]. In Section 3 we implement a model which avoids one of the weaknesses of pre- vious work while retaining the most important benefits. Finally, we describe work in progress using the British National Biography Linked Open Data Serials to construct a “ground truth” evaluation dataset for algorithms of this sort. 2 Related Work To address the problem of word sense change described in the Introduction, the Con- cepts Through Time model advocates an alternative approach to tracking concepts, using a set of Dutch newspapers from 1890-1990 as a corpus [9]. Rather than select- ing a static set of terms and monitoring its frequency over the entire century, they select an initial term or terms of interest and find a cluster of words that are highly similar, according to a word embedding model trained on a specific timeslice (e.g., articles from the years 1890-1900). The cluster is updated from timeslice to subse- quent timeslice in a manner which acknowledges that “the set of words used to dis- cuss a particular concept might not show any overlap at all between different periods of time” [9]. However, treating time-shifted collections of words with no overlap whatsoever as the ‘same’ has its own drawbacks. As [11] points out, “Imagine a sub- set of documents containing strong co-occurrence patterns across time: first between birds and aerodynamics, then aerodynamics and heat, then heat and quantum mechan- ics—this could lead to a single topic that follows this trajectory, and lead the user to inappropriately conclude that birds and quantum mechanics are time-shifted versions of the same topic.” Perhaps for this reason, subsequent work by the developers of Concepts Through Time notes that “a successful system… should strike a balance between an adaptive strategy that responds to changes in vocabulary, and a more con- servative approach that keeps the vocabulary stable” [10]. Their revised model re- quires a user to select an initial set of seed terms and an algorithm to construct vocab- ularies of related terms for each timeslice: adaptive, nonadaptive, or hybrid. The adaptive method is most relevant to the present work. This method starts with an input vocabulary (initially the user’s set of seed terms), expands that by adding words ex- ceeding some minimum similarity threshold to the set, constructs a network from this set such that all pairs of nodes (words) exceeding the threshold are assigned an edge, and then prunes nodes that are low in degree centrality. The resulting words are used as the input vocabulary for the next timeslice, and the process repeats until the final timeslice. Topic models have been another popular approach for monitoring groups of related words over time [11,12,13,14]. However, these often either do not explicitly model changes in vocabulary within a particular topic/concept, or do not pay explicit atten- tion whether the method allows topics to drift far afield from their original conceptual content. One contribution of the present work is that it does both, while resolving an important difficulty with the most similar approach we are aware of. Finally, much work has been done on automatically tracing changes in a given word’s meaning over time, e.g. [8,15,16]. Although this clearly differs from our aim of trac- ing changes in the vocabulary used to describe particular concepts, these methods are extremely useful for our purposes. For example, a word may need to be excluded from a core of tightly interrelated terms if its meaning drifts too far afield from the rest. We therefore here make extensive use of the HistWords vectors [8] developed by applying skip-grams with negative sampling (one of the algorithms available in word2vec) to n-grams distributed by Google Books. HistWords contains a separate vector for each of a very large number of terms for every decade from 1800 to 1990, such that words that appear in similar contexts within a given timeslice have similar vectors. Such vectors successfully capture shifts in word meaning over time [8], and we use the same approach and data to quantify semantic similarity. We describe how we use these vectors in more detail in the following section. 3 Time-Varying Relationships in Text Recall that the adaptive method of [10] involves an expansion step in which words related to any word in the input vocabulary are added to the network as nodes, and a pruning step in which nodes low in network centrality (in-degree or out-degree) are pruned. Although this is an excellent way to pull in novel vocabulary while also pre- venting the overall network from drifting too far afield, it has one unintended conse- quence. When the input vocabulary contains a word linked to two densely connected but unrelated clusters (e.g., a polysemous word), unrelated clusters of words will be added during the expansion step (Figure 1). Because nodes in each cluster have high degree, they will not be eliminated in the pruning step. The consequence is that unre- lated, weakly connected clusters can persist as part of the same “concept.” The exam- ple in Figure 1 makes this particularly clear by illustrating two clusters so unrelated that they would become disconnected if the node connecting them were pruned, but it is important to recognize that this phenomenon remains a problem even if a constraint were imposed requiring the graph to be fully connected. The other two methods de- scribed by [10] (nonadaptive and hybrid) suffer from the same difficulty. Our method addresses this by allowing two nodes to be treated as part of the same “conceptual network” only if all words in the network are highly related to all other words in the network. Because we first describe how this method can be used to track concepts in diachronic text corpora, we use “nodes” and “words” interchangeably, and “relatedness,” “similarity,” and “edge weight” as synonymous with “cosine simi- larity” (e.g., similarity between word vectors in the HistWords data). Like [10], we treat documents from every timeslice (in our case, decades from 1800-1990) as a separate subcorpus and build a separate vector space corresponding to each. Fig. 1. A graph representing two unrelated clusters of words connected by a single polysemous word. Even if the center node is eliminated in the pruning step described in [10], the nodes in each 8-clique may not be, due to their high degree. Both clusters thus erroneously continue to be interpreted as part of the same “concept.” 3.1 Algorithm Given a size k and a seed set of words W, the algorithm begins by finding the fully connected graph of size k containing all words in W such that the minimum edge weight (in the earliest timeslice) is as high as possible. This can be done efficiently by attempting to find a subgraph of size k containing all words in W such that every edge exceeds a very high threshold 2, but then gradually lowering the threshold until an appropriate subgraph is found. Afterwards, the vectors for the second timeslice are loaded, and the subgraph is updated by attempting to answer the question, “Is it pos- sible to increase the minimum edge weight by replacing one of these nodes with a node currently not in the subgraph? If so, which of all possible replacements would increase the minimum edge weight the most? 3” Because typically only one edge is equal to the current minimum edge weight, this can also be computed efficiently. This corresponds to a “drop one, add one” rule where, for any given timeslice, a single word from the network of the previous timeslice will be replaced if and only if doing so increases the minimum similarity between every word pair in the resulting net- work. The process repeats for every subsequent timeslice. Table 1 illustrates an ex- ample of an evolving network built using this method. Our primary concerns were that conceptual networks be traced in such a way that is flexible (words whose meanings shift away from the conceptual core should drop out) but also stable (a network initially about birds should not drift to quantum mechan- ics). We tested the model by initializing it with 500 words randomly selected from the 30,000 most frequent terms in HistWords, which were used as the lexicon. Of these, there were 212 words such that a fully connected network of size 9 existed with a minimum edge weight of 0.2 or greater could be constructed. 1900 affable,cheerful,courteous,gay,genial,humored,natured,sprightly,witty 1910 affable,cheerful,courteous,gay,genial,humored,natured,humoured,witty 1920 affable,cheerful,courteous,gay,genial,jovial,natured,humoured,witty 1930 affable,cheerful,courteous,gay,mannered,jovial,natured,humoured,witty 1940 affable,cheerful,courteous,amiable,mannered,jovial,natured,humoured,witty 1950 affable,cheerful,courteous,amiable,mannered,vivacious,natured,humoured,witty 1960 affable,cheerful,courteous,amiable,mannered,charming,natured,humoured,witty 1970 affable,cheerful,courteous,amiable,mannered,charming,natured,gentle,witty 1980 affable,cheerful,courteous,amiable,humored,charming,natured,gentle,witty 1990 affable,cheerful,courteous,amiable,clever,charming,natured,gentle,witty Table 1. Evolution over time of the network constructed from the seed word “gay.” A potential criticism of this model is that while it purports to be ‘flexible’ in the sense that it traces a group of conceptually related words (rather than merely words associ- 2 Because the threshold is initially set so high that no such subgraph can be found, this method ensures that the first subgraph discovered which meets these criteria is the one desired. 3 Note that every node in the subgraph must correspond to a unique word. ated with the seed term alone), the fact that it is initialized with words closely related to the seed term may mean that in practice the seed term always ends up as a perma- nent part of the network. Table 1 illustrates that in at least one case of radical seman- tic change (the word “gay”), the seed term does successfully drop out by the 1940s. However, it is possible that this virtually never happens. Flexibility was therefore evaluated by quantifying the proportion of the 212 initial networks (year 1800) in which the seed term did drop out by 1990. Another potential criticism is the reverse: Because every timeslice offers an opportunity to jettison one term and incorporate a new one, networks might drift to completely different topics. For example, if each iteration caused a random word to be replaced with a word not previously in the net- work, then after 19 timesteps a typical graph of size 9 would be expected to retain only (8/9)19 = 10.7% of its initial vocabulary. We therefore also computed the propor- tion of the vocabulary shared in the 1800 vs. 1990 clusters, with qualitative analysis of the clusters with the least shared vocabulary, to evaluate whether they exhibited less drift than this random baseline. 3.2 Results With respect to flexibility, the seed word used to generate the initial size-9 network in 1800 was no longer present in the 1990 network in 147 of 212 cases (69%). In 91% of these cases, the seed word never re-entered the network once it had dropped out, sug- gesting that the seed word can indeed be permanently ejected from the conceptual core if its meaning or associations drift in a different direction. With respect to stabil- ity, the average overlap in vocabulary between the initial 1800s network and the final 1990s network was 33%, with all 212 cases sharing at least one word (11%) in com- mon with the original 9-word network. Even when only one word was shared, the network typically did not drift too far afield, as in the case of the seed word “uneasy” (1800: anxieties, dejected, dejection, distraction, fits, insupportable, languishing, uneasy, weariness; 1990: anxieties, grief, despair, disappointment, misery, sorrow, anguish, sadness, loneliness). The full set of networks generated in this evaluation may be obtained from http://nowin2d.com/vocabularies.html. 3.3 Discussion The results suggest that even with such a rigid algorithm, the induced conceptual vocabularies are certainly flexible and reasonably resistant to drift. However, there were occasional clear cases in which vocabulary drifted significantly away from the original conceptual core, as in the case of the network generated from the seed word “logical” (1800: abstruse, definitions, disquisition, disquisitions, explanations, expli- cation, grammatical, illustrating, logical), which drifted towards different areas of academic study by the 1990s (abstruse, mathematical, philosophy, theory, metaphys- ics, metaphysical, empirical, theoretical, philosophical). One obvious direction for future work is the optimization of initialization parameters (network size, initial edge weight threshold), which were chosen arbitrarily. However, the fact that even arbitrar- ily chosen initialization parameters resulted in reasonably flexible and stable networks is promising. Other directions could include making use of the information about how a conceptual vocabulary has changed in the past to predict how it will change in the future, using techniques that parallel those applied in predicting ontology evolution, e.g. [17]. In addition, future work should consider a stronger evaluation metric. The final section describes work in progress towards constructing a “ground truth” evalua- tion dataset that could be used to do just that. 4 Constructing ground truth evaluation data from LOD To more fully evaluate an algorithm’s ability to track vocabulary change associated with arbitrary concepts, a “ground truth” dataset is necessary. The only such data of which we are aware are offered by [10]. However, this is limited to 21 concepts span- ning only four decades and is in Dutch, making it incompatible with most large, dia- chronic corpora. However, something very near to a much larger, English-language ground truth dataset already exists as LOD, in the form of the British National Bibli- ography Linked Open Data (BNBLOD) collections. Although we see ‘concepts’ as nonidentical to subjects as defined by the BNB, it is nonetheless likely that there is a high level of conceptual relatedness between all documents to which the British Li- brary has assigned the subject http://bnb.data.bl.uk/id/concept/lcsh/Engineering, even if the vocabulary of such documents differs markedly from year to year. Particularly useful are the BNBLOD Serials, which in addition to including the year of each jour- nal’s first publication, very commonly contain “Journal of X” in the title, where “X” corresponds to a short phrase describing a particular subject. The vocabulary of such ‘title phrases’ is often tied to a particular moment in time. Consider, for example, the phrases so extracted from the earliest “journal of X” journals in the BNBLOD Serials assigned the subject of Psychiatry (1876: ‘nervous and mental disease’), Engineering (1921: ‘applied mathematics and mechanics’), Entrepreneurship (1985: ‘business venturing’), and Tourism (1972: ‘travel research’). These phrases are no longer com- monly used to describe these subjects. As a first step in constructing an evaluation dataset, therefore, we are first simply extracting title phrases, publication dates, and subjects from the BNBLOD Serials and structuring them as follows: Given a start year y1, an end year y2, and a phrase extracted from the title of a serial having subject S first published in y1, the algorithm being evaluated must predict which words and phrases are most likely to appear in titles of other journals of subject S which were published in y2. A robust algorithm trained on an appropriate could ideally correctly identify that the cluster of words that contains, e.g., “business venturing” in 1985 ought to include “entrepreneurship” by 1995 (rather than, say, “business organiza- tion”), and that “travel research” in 1972 is closer to “tourism” in 1992 than to “edu- cational travel”. It should be noted that this is just a first step, and we hope to include other methods of evaluation with time. It is our hope that such a dataset will allow us not only to better evaluate our own research but move the field of representing dia- chronic conceptual change forward as a whole. 5 Acknowledgments This work was supported by a private donation to the Cambridge Centre for Digital Knowledge (CCDK) at the University of Cambridge. 6 References 1. Seiler, T.B., Wannenmacher, W. (eds.): Concept Development and the Development of Word Meaning (Vol. 12). Springer Science & Business Media, Berlin (2012) 2. Margolis, E., Laurence, S.: Concepts. In: E. N. Zalta (ed.), The Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/archives/spr2014/entries/concepts/ (2014) 3. Fodor, J.A.: The Language of Thought. Crowell, New York (1975) 4. Clark, E.V.: Meaning and Concepts. In: P. H. Mussen (ed.), Handbook of Child Psychology, vol. 3: Cognitive Development, pp. 787–840. Wiley, New York (1983) 5. Murphy, G.: The Big Book of Concepts. MIT Press, Cambridge (2002) 6. Glanzberg, M.: Meaning, Concepts, and the Lexicon. Croatian Journal of Philosophy 11(1), 1-29 (2011) 7. OED Online.: "Broadcast". Oxford University Press, http://www.oed.com 8. Hamilton, W. L., Leskovec, J., Jurafsky, D.: Diachronic Word Embeddings Reveal Statis- tical Laws of Semantic Change. arXiv preprint arXiv:1605.09096 (2016) 9. Wevers, M., Kenter, T., Huijnen, P.: Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse (1890-1990) Using Word Embeddings. In: Digital Humanities 2015, Sydney (2015) 10. Kenter, T., Wevers, M., Huijnen, P.: Ad Hoc Monitoring Of Vocabulary Shifts Over Time. In: Proc. 24th ACM International on Conference on Information and Knowledge Man- agement (pp. 1191-1200). ACM, New York (2015) 11. Wang, X., McCallum, A.: Topics Over Time: A Non-Markov Continuous-Time Model Of Topical Trends. In: Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 424-433). ACM, New York (2006) 12. Blei, D.M., Lafferty, J.D.: Dynamic Topic Models. In: Proc. 23rd International Conference on Machine Learning (pp. 113-120). (2006) 13. Hall, D., Jurafsky, D., Manning, C.D.: Studying The History of Ideas Using Topic Models. In: Proc. Conference on Empirical Methods on Natural Language Processing (EMNLP) (pp. 363-371). Association for Computational Linguistics, East Stroudsburg, Pennsylvania. (2008) 14. Sigrist, R., Rawat, V.: Topic Evolution In A Stream Of Documents. In: Proc. SIAM Inter- national Conference on Data Mining. SIAM, Philadelphia (2009). 15. Gulordava, K., Baroni, M.: A Distributional Similarity Approach To The Detection of Se- mantic Change in the Google Books N-gram Corpus. In: Proc. of the EMNLP 2011 Geo- metrical Models for Natural Language Semantics (GEMS) Workshop. Association for Computational Linguistics, East Stroudsburg, Pennsylvania. (2011) 16. Wijaya, D.T., Yeniterzi, R. Understanding Semantic Change Of Words Over Centuries. In: Proc. DETECT (International Workshop on DETecting and Exploiting Cultural diversiTy on the social web) (pp. 35-40). ACM, New York (2011) 17. C. Pesquita, F. M. Couto.: Predicting the extension of biomedical ontologies. In: PLoS Computational Biology, 8(9):e1002630. (2012)