Cross-lingual Trends Detection for Named Entities in News Texts with Dynamic Neural Embedding Models

Andrey Kutuzov
University of Oslo
Postboks 1080 Blindern, 0316 Oslo, Norway
andreku@ifi.uio.no

Elizaveta Kuzmenko
National Research University Higher School of Economics
Moscow, Russia
eakuzmenko_2@edu.hse.ru

Abstract

This paper presents an approach to detecting real-world events as manifested in news texts. We use vector space models, particularly neural embeddings (prediction-based distributional models). The models are trained on a large 'reference' corpus and then successively updated with new textual data from daily news. For given words or multi-word entities, calculating the difference between their vector representations in two or more models allows us to discover association shifts that happen to these words over time. The hypothesis is tested on country names, using news corpora for the English and Russian languages. We show that this approach successfully extracts meaningful temporal trends for named entities regardless of the language.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

1 Introduction

We propose an approach to track changes happening to real-world entities (in our case, countries) with the help of constantly updated distributional semantic models. We show how one can train such models on new textual data arriving daily and draw conclusions about events based on changes in word vectors induced by new contexts. In other words, the subtle semantic shifts which words undergo over time, influenced by real-world events, are detected by the presented method.

Detecting semantic shifts can be of use in a variety of linguistic applications. First, this method can help with the problem of automatically monitoring events through a stream of texts [AGK01]. Detected semantic shifts can potentially be used as additional features in algorithms aimed at extracting the course of events. Without unsupervised approaches, it is impossible to process all the continuously generated data; this is the primary motivating factor for our research. Second, the developed approach can be used to study language change and to compare temporal corpus slices. This language area is traditionally studied by linguists, who put a lot of effort into describing semantic shifts with the help of dictionaries, corpora and sociolinguistic research. At the same time, it is impossible to cover the whole vocabulary of a language and describe every lexical shift manually. Distributional semantic models facilitate this task.

The approaches to event detection and to modeling language shifts have a lot in common. The first techniques employed various frequency metrics [JS09] and shallow semantic modeling [KNR15], [HBB10]. With the emergence of distributional semantic models, the detection of semantic shifts acquired new potential, as it was shown that word embeddings significantly improve the performance of such algorithms [KARPS15].

The idea of employing changes in distributional semantic models to track semantic shifts is not in itself new. [KCH+14] proposed to detect language change with chronologically trained models. However, they used a rather simplified measure of 'distance' between word vectors at different time slices, namely raw cosine distance; we employ more sophisticated methods, as described further. [POL10] developed an approach to First Story Detection in Twitter posts. Their research is similar to ours in that it deals with streaming data: the authors explore the space of documents and compare new tweets to the existing ones. However, their algorithm is developed specifically for short texts like tweets, which differ radically from the news pieces analyzed in the present paper.

The rest of the paper is organized as follows. In Section 2 we introduce the basics of prediction-based vector models of semantics. Section 3 describes the principles of comparing such models trained on pieces of text which follow each other in time. The specifics of our datasets are covered in Section 4, followed by the description of the experimental setting in Section 5. Section 6 evaluates the results, and in Section 7 we conclude.

2 Distributed Semantic Models

Vector space models (VSMs) are well established in the field of computational linguistics and have been studied for decades (see [TP+10], [Reh11]). Essentially, a model is a set of words and corresponding vectors, which are produced from typical contexts for a given word. The most widespread type of context is other words co-occurring with the given one, which means that the set of all possible contexts generally equals the size of the vocabulary of the corpus. The dimensionality of the resulting count model can be reduced with well-known techniques like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). But in turn, this effectively forbids online training (continuously updating the model with new data), because after each update one has to perform computationally expensive dimensionality reduction over the whole co-occurrence matrix.
To overcome this, we employ a type of VSM called prediction-based models: particularly, the Continuous Bag-of-Words (CBOW) algorithm ([BDV03], [MSC+13]). (The well-known word2vec tool also implements SkipGram, another predictive algorithm; however, it is more computationally expensive, and we leave its usage for future work.) Predictive models approximate co-occurrence data instead of counting it directly, and show a promising set of properties. Using them, one directly learns dense lexical vectors (embeddings). The vectors are initialized randomly and then, as we move through the training corpus with a sliding window of a pre-defined width, gradually converge to values maximizing the likelihood of correctly predicting lexical neighbors. Such models as a rule use artificial neural networks for training; this is why they are sometimes called neural models.

For our task, it is important that predictive models can be updated with new co-occurrence data in a quite straightforward way. As already said, this is usually not the case with count models, which demand computationally expensive recalculations each time a new text is added.

3 Introducing Temporal Dimension to Vector Models

Detecting the semantic shifts which words undergo over time demands the ability to somehow compare the reference ('baseline') model and the updated models representing later periods of time.

Updating a neural model with new texts (in addition to the base training corpus used for initial training) is technically straightforward. After that, we have two models M1 and Mn, where the former is the 'baseline' reference model and the latter is the updated one (or a sequence of n updated models, each corresponding to the next time period), possibly bringing new semantic shifts. This dynamic model in a way tries to imitate a human brain learning new things, gradually 'updating' its state with new input data every day.

What are the possible ways to extract these changes? Suppose there is a set S of named entities (organizations, locations or persons we are interested in). Initially, in the model M1, each element of S can be thought of as possessing a number of topical 'associates' or 'nearest neighbors': words whose respective vectors are closest to this element's vector, ranked by their closeness or similarity. The exact number of nearest neighbors we consider is, in the simplest case, defined arbitrarily (for example, the 10 nearest words). As we update the model with new data, co-occurrence counts for the elements of S gradually grow (the model sees them in new contexts). It means that in each successive model Mn the learned vectors for the elements of S can be different.

If the contexts for these words remain pretty much the same throughout the training data, the list of associates (nearest neighbors) in Mn will also remain intact. However, if a word acquires new typical contexts or loses some previous ones, its neural embedding will change: a semantic shift happens. Accordingly, we will see a new list of associates. For example, the vector representation for the word president may change so that its nearest neighbor is the vector for the name of the actual president of a country, instead of the previous one.

In this way, lists of nearest neighbors can be compared across models trained on different corpora, or across one and the same model after an incremental update (as in the presented research). Substantial changes or bursts in such lists for the named entities we are interested in may signal that these entities have undergone or are undergoing semantic shifts, which in turn reflect real-world events. We dub this approach 'dynamic neural embedding models'.

Sets of neighbors in different models can be compared in many ways.
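The ranked neighbor lists described above are obtained by scoring every vocabulary word against the target's vector by cosine similarity. A minimal pure-Python sketch is given below; the toy 2-dimensional vectors and words are invented for illustration only and are not taken from the paper's models.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(word, embeddings, k=10):
    # Rank all other words by cosine similarity to the target word's vector.
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]

# Toy 'model': chile points in almost the same direction as peru.
toy = {
    "chile": [1.0, 0.1],
    "peru": [0.9, 0.2],
    "quake": [0.1, 1.0],
}
print(nearest_neighbors("chile", toy, k=2))  # ['peru', 'quake']
```

In a real model the vectors are the 300-dimensional CBOW embeddings, and the neighbor set of an entity is recomputed after every incremental update.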
Approaches to this range from the simple Jaccard index [Jac01] to complex graph-based algorithms. We test two methods:

1. Kendall's τ coefficient [Ken48], which measures the similarity of item rankings in two sets. Intuitively, it is important to pay attention not only to the raw appearance of some words in the nearest-neighbor set, but also to their rankings in it.

2. Relative Neighborhood Tree (RNT), introduced by [CGS15]. It essentially produces a tree graph with the target word as its root, nearest neighbors as vertices and similarities between them as weighted edges. We then select the immediate neighbors of the target word in this tree and rank them according to their cosine similarity to the target word. These rankings are then compared across models using the same Kendall's τ.

The reason behind the second method is that it theoretically allows a deeper analysis of the structure of nearest-neighbor sets. Obviously, the neighbors participate in similarity relations not only with the target word but also between themselves. These relations convey meaning as well, making it possible to find the most 'important' neighbors. Graph-based methods for analyzing relations between words in distributional models were also used in [KWHdR15]; note, however, that the problem they deal with is the inverse of ours: they attempt to trace changes in surface words for a stable set of concepts, while we attempt to trace semantic shifts (changes in underlying concepts for a stable set of words).

We hoped that this graph-supported 'pre-selection' would allow Kendall's τ to improve the performance of the model. However, these expectations failed, and simple ranking turned out to be more efficient than the graph-based method; see Section 6.

4 Data Description

We test our approach on lemmatized corpora of English and Russian news texts. The English corpus consists of The Signal Media Dataset (http://research.signalmedia.co/newsir16/signal-dataset.html), which contains 265,512 blog articles and 734,488 news articles from September 2015. The size of the corpus (after lemmatizing and removing stop words) is 222,928,287 words. We employ the Stanford POS tagger [TKMS03] to extract lemmas and to assign each lemma a part-of-speech tag.

In order to test whether the extracted semantic shifts are consistent across languages, we use a corpus of news articles in Russian published in September 2015 (unfortunately, not available publicly due to copyright restrictions). It contains about 500,000 texts extracted from about 1,000 Russian-language news sites. The size of the corpus (after lemmatizing and removing stop words) is 59,167,835 words. We employ Mystem [Seg03], a state-of-the-art tagger for Russian, to produce lemmas and part-of-speech tags.

5 Experimental setting

News texts from September 2015 alone do not seem to be a good training set, because such a corpus is inevitably limited in language coverage and lacks relations to events that happened earlier. Therefore, we first train a 'reference' or 'baseline' model which aims to mimic some background knowledge and is then exposed to daily updates. For English, we used the British National Corpus (http://www.natcorp.ox.ac.uk/; about 50 million words) to train this reference model, while for Russian it was the corpus of news articles published in the months preceding September 2015, namely June, July and August (taken from the same source as the September articles). This corpus contains about 250 million words.

We acknowledge that it is not quite correct to employ different types of corpora for the 'reference' models in English and Russian. However, in a way, we compensate for the quality and balance of the BNC with the larger size of the Russian reference corpus. In the future we plan to eliminate this inconsistency by using an analogous set of English news published in the summer months, or by employing Wikipedia dumps as reference corpora for both languages.

Both corpora were merged with same-language texts released in the first half of September 2015 (before the 14th of September), in order to seed the baseline models with some initial 'knowledge' of events and entities belonging to this month. Then, Continuous Bag-of-Words models were trained on both corpora, using negative sampling with 10 samples, vector size 300, symmetric window size 5 and 5 iterations. Words with frequency less than 10 were ignored during training.

After that, we successively updated these models with texts released in the following September time periods: 14th–15th, 16th–17th, 18th–20th, 21st–22nd, 23rd–24th, 25th–27th, and 28th–30th. A granularity of 2 or 3 days was chosen in order to enlarge the amount of data fed to the models: for example, some one-day Russian corpora corresponding to weekends contained only several thousand words. For this reason, we additionally tried to include weekends in the 3-day periods, to make the news stream more evenly distributed. As a result, the average time period size in tokens was 18,774,000 for English data and 5,332,000 for Russian data.

We once again emphasize that our baseline models were not re-trained from scratch with texts from the new corpora added. Instead, we continued training the same model, gradually updating word vectors with new contexts. All interim states were saved as separate models, and in the end we had 8 successive models for each language.

We extracted English and Russian country names from the Wikipedia list of all world countries (https://en.wikipedia.org/wiki/List_of_sovereign_states) and manually checked and normalized it, bringing all name variants to one lexeme.
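The first comparison method from Section 3, Kendall's τ over neighbor rankings, can be sketched in a few lines of stdlib Python. The paper does not specify how partially overlapping neighbor lists are aligned, so this sketch assumes the simpler case where both lists rank the same set of items; the rankings below are toy data.

```python
def kendall_tau(rank_a, rank_b):
    # Kendall's tau between two rankings of the same items (no ties):
    # (concordant pairs - discordant pairs) / (n * (n - 1) / 2).
    assert set(rank_a) == set(rank_b)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant if rank_b orders it the same way as rank_a.
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

neighbors = ["peru", "bolivia", "colombia", "argentina"]
print(kendall_tau(neighbors, neighbors))        # 1.0: neighbor set unchanged
print(kendall_tau(neighbors, neighbors[::-1]))  # -1.0: ranking fully reversed
```

A value near 1 means the entity's associates were stable between two model states; a low value signals a burst.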
Then we filtered out the entities with a frequency of less than 30 per million words in either of our two reference corpora (English and Russian), producing a set CS of 36 frequent country names. (Low-frequency country names bring in noise, because their vectors are susceptible to wild fluctuations when exposed to even a small amount of new contexts.)

Finally, for each of the successive models, we found the nearest-neighbor sets for each entity in CS and compared them to the sets from the model state at the previous time period. Kendall's τ and the Relative Neighborhood Tree (RNT) were used to compute similarity coefficients for each country within the given pair of models. This provided us with two lists of countries (one per language) ranked by their similarity to the same country in the 'previous' model. Supposedly, countries in which some major events happened during the last days should be positioned low in these lists, because their associations in news texts drifted towards the recent event or an opinion burst.

Let us illustrate how news texts and changes in the models reflect real-life events by comparing the 10 nearest associates for Chile in the English and Russian corpora. On the 16th of September 2015 there was an earthquake in Chile, and we can detect its 'echo' in the changes between our models for the 14th–15th and the 16th–17th of September (see Table 1).

Table 1: Change in Chile's neighbor set

14th–15th September                16th–17th September
English      Russian               English      Russian
peru         бачелет (bachelet)    quake        аргентина
bolivia      аргентина             earthquake   бачелет (bachelet)
colombia     коста-рика            santiago     никарагуа
argentina    перчик                chilean      мексика
honduras     никарагуа             tremor       бельгия
brazil       швейцария             tsunami      исландия
ecuador      бельгия               aftershock   тунис
nicaragua    исландия              chileans     магнитуда (magnitude)
paraguay     аргентин              temblor      землетрясение (earthquake)
enchiladas   гватемала             kyushu       коста-рика

Before the 16th of September, the associates for Chile in both models were mostly the neighboring countries. However, after the earthquake things completely changed: there was a strong bias towards this topic in news and blogs, and this is reflected in the vectors for the word. 60% of the English and 20% of the Russian associates are now related to the event.

The Kendall's τ coefficient between these two neighbor lists is as low as 0 (the neighbors are completely replaced) for English, and 0.56 for Russian. The average Kendall's τ over CS in the English models for the two days in question is 0.56, with a standard deviation of 0.12. Thus, in the case of English, the change to the neighbor set can be considered a significant burst, well above simple chance. In the case of Russian, Kendall's τ lies only 1 point below the average value of 0.57. It is obvious that the Russian mass media paid less attention to the earthquake (being more concerned with Michelle Bachelet, Chile's president), but the event is still reflected in the nearest-neighbor set.

The next section describes how we employed the cross-linguality of the data to evaluate the presented approach.

6 Cross-Lingual Evaluation of Events Detection

There is no 'gold standard' or ground truth which would allow us to evaluate the precision and recall of our event and association extraction, or to tune the hyperparameters of the algorithms. However, there is a way to indirectly estimate their performance in a kind of intrinsic evaluation.

We hypothesize that the better an algorithm detects semantic shifts, the closer its results should be on model sequences trained on corpora in different languages. Obviously, national media focus on different topics, but this mostly concerns domestic news. As for world news, the worst scenario would be that a news story is not covered at all in the national media of a particular country; such scenarios should be rare. In other cases, the perspective on a story can differ, but the 'burst' should remain the same. (Analyzing the degree to which the vision of events differs across national media is beyond the scope of the present research.)
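Comparing two such cross-language 'burstiness' rankings calls for a rank correlation coefficient; this paper uses Spearman's ρ. A stdlib sketch for the tie-free case, ρ = 1 - 6 Σd² / (n(n² - 1)), follows; the country orderings are illustrative toy data, not the paper's actual rankings.

```python
def spearman_rho(rank_a, rank_b):
    # Spearman's rank correlation between two rankings of the same items
    # (no ties), via rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    assert set(rank_a) == set(rank_b)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    d_squared = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

english = ["Italy", "Georgia", "Malaysia", "Japan", "China"]
shuffled = ["Japan", "China", "Italy", "Georgia", "Malaysia"]
print(spearman_rho(english, english))   # 1.0: identical rankings
print(spearman_rho(english, shuffled))  # -0.5: strongly disagreeing rankings
```

A ρ close to 1 between the English and Russian country rankings for one time period means both media streams flagged the same countries as 'bursty'.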
Thus, the English and Russian country lists ranked by their 'burstiness' can be compared using Spearman's ρ [Spe04] for each time period. As there are 7 shifts from one time period to another, we use the median of the ρ values for these 7 cases as a tentative measure of an algorithm's performance. Table 2 gives an example of such country rankings for the changes between the 18–20 and 21–22 of September. One can see that the top lists are highly similar, with 3 of the 5 countries appearing in both (the actual Spearman's ρ for the total lists of 36 countries between these periods is 0.5).

Table 2: 5 countries with the most changed neighbor sets (of 36 total) between September 18–20 and 21–22

Rank  English   Russian (translated)
1     Italy     Japan
2     Georgia   Brazil
3     Malaysia  China
4     Japan     Spain
5     China     Georgia

The overall results of applying this approach to the whole dataset using our two algorithms (with different sizes of nearest-neighbor sets to consider) are presented in Table 3. We also applied it to a simple baseline method, where the nearest neighbors are the words which most frequently occurred in a window of 5 tokens to the right and to the left of the target entity in the given corpus.
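The raw co-occurrence baseline just described can be sketched in a few lines of stdlib Python. Tokenization and stop-word removal are simplified here, and the example sentence is invented for illustration.

```python
from collections import Counter

def window_cooccurrences(tokens, target, window=5, k=10):
    # Baseline 'neighbors': the k words most frequently co-occurring with
    # the target within +/-window tokens, ranked by raw frequency.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != target)
    return [word for word, _ in counts.most_common(k)]

text = "strong earthquake hit chile today chile earthquake toll rose".split()
print(window_cooccurrences(text, "chile", window=2, k=3))
```

Unlike the embedding-based neighbor sets, these frequency rankings are dominated by the large reference corpus, which is why the baseline barely reacts to a few days of new texts.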
Table 3: Cross-lingual evaluation

Algorithm                      Neighbor set size   Median Spearman's ρ
Raw co-occurrences baseline    5                   0.26 (p = 0.12)
                               10                  0.15
                               100                 0.06
CBOW and Kendall's τ           5                   0.25
                               10                  0.25
                               100                 0.28 (p = 0.09)
CBOW and Relative              5                   0.20
Neighborhood Tree              10                  0.16
                               100                 0.14

Kendall's τ consistently renders better results without the additional selection of 'important' associates by a relative neighborhood tree (additionally, it is much faster). This once again raises questions about whether vector models can be efficiently processed with graph representations. Kendall's τ also outperforms the baseline approach: the margin is as small as two points, but it is supported by higher significance (p < 0.1).

Note that a qualitative analysis of the baseline results shows that they are mostly inappropriate for any practical task. For the time period described in Table 1, the baseline approach reveals almost no differences between neighbor sets: the average Kendall's τ is 0.92 for English and 0.99 for Russian. Thus, while in the case of English the earthquake event is at least detected (we observe the emergence of 4 new related neighbors), in the case of Russian the neighbor set remained strictly the same. It seems that the raw co-occurrences approach suffers from overestimating the influence of the reference corpora, which are much larger than the daily updates. Dynamic neural embedding models overcome this problem.

Interestingly, taking wider sets of neighbors into account results in better performance only for CBOW with Kendall's τ. For the baseline and for CBOW with RNT, increasing the size of the processed neighbor sets actually results in poorer performance. The reason for this behavior in RNT may be that the algorithm begins to 'roam' in the graph, attracting more far-away associates as immediate tree neighbors of the target word. In the baseline method it simply adds much language-dependent noise, which semantically aware models filter out at the training stage.

7 Conclusions

We presented a method for detecting semantic shifts for countries in news texts with the help of dynamic neural embedding models. We explored the difference between entities' vector representations in models from different temporal stages and discovered association shifts that happen to these words over time. This can be employed to trace trends and events in streaming news texts using a completely unsupervised approach.

We showed that distributional semantic models are rather efficient at detecting association shifts and are in most cases language-independent. In our test sets, there is a statistically significant correlation between the lists of 'semantically shifted' countries in the English and Russian model sequences for the same time period.

However, there is still room for improvement. First of all, ways to evaluate semantic shift extraction have to be developed (including the creation of ground-truth datasets). Additionally, we plan to test other ways of comparing neighbor sets and to tune the algorithms' hyperparameters. It would also be useful to improve the quality of the corpora (e.g. eliminate more noise and stop words).
Finally, we plan to experiment with using different algorithms or parameter sets for different languages: preliminary tests show promising results.

References

[AGK01] James Allan, Rahul Gupta, and Vikas Khandelwal. Temporal summaries of new topics. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 10–18, New York, USA, 2001.

[BDV03] Yoshua Bengio, Rejean Ducharme, and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[CGS15] Amaru Cuba Gyllensten and Magnus Sahlgren. Navigating the semantic horizon using relative neighborhood graphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2451–2460, Lisbon, Portugal, September 2015.

[HBB10] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems 23, pages 856–864, Vancouver, Canada, 2010.

[Jac01] Paul Jaccard. Distribution de la Flore Alpine: dans le Bassin des dranses et dans quelques régions voisines. Rouge, 1901.

[JS09] David Jurgens and Keith Stevens. Event detection in blogs using temporal random indexing. In Proceedings of the Workshop on Events in Emerging Text Types, pages 9–16, Borovets, Bulgaria, 2009.

[KARPS15] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635, Florence, Italy, 2015.

[KCH+14] Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis of language through neural language models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, page 61, Baltimore, USA, 2014.

[Ken48] Maurice George Kendall. Rank correlation methods. Griffin, 1948.

[KNR15] Manika Kar, Sérgio Nunes, and Cristina Ribeiro. Summarization of changes in dynamic text collections using Latent Dirichlet Allocation model. Information Processing & Management, 51(6):809–833, 2015.

[KWHdR15] Tom Kenter, Melvin Wevers, Pim Huijnen, and Maarten de Rijke. Ad hoc monitoring of vocabulary shifts over time. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, pages 1191–1200, New York, NY, USA, 2015. ACM.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.

[POL10] Saša Petrović, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 181–189. Association for Computational Linguistics, 2010.

[Reh11] Radim Rehurek. Scalability of semantic analysis in natural language processing. PhD thesis, Masaryk University, 2011.

[Seg03] Ilya Segalovich. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA, pages 273–280. Citeseer, 2003.

[Spe04] Charles Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.

[TKMS03] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 NAACL-HLT Conference, Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[TP+10] Peter Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, 2010.