Cross-lingual Trends Detection for Named Entities in News Texts with Dynamic Neural Embedding Models

Andrey Kutuzov
University of Oslo
Postboks 1080 Blindern, 0316 Oslo, Norway
andreku@ifi.uio.no

Elizaveta Kuzmenko
National Research University Higher School of Economics
Moscow, Russia
eakuzmenko_2@edu.hse.ru

Abstract

This paper presents an approach to detecting real-world events as manifested in news texts. We use vector space models, particularly neural embeddings (prediction-based distributional models). The models are trained on a large 'reference' corpus and then successively updated with new textual data from daily news. For given words or multi-word entities, calculating the difference between their vector representations in two or more models allows us to discover association shifts that happen to these words over time. The hypothesis is tested on country names, using news corpora for the English and Russian languages. We show that this approach successfully extracts meaningful temporal trends for named entities regardless of the language.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

1 Introduction

We propose an approach to track changes happening to real-world entities (in our case, countries) with the help of constantly updated distributional semantic models. We show how one can train such models on new textual data arriving daily and draw conclusions about events based on changes in word vectors induced by new contexts. In other words, the subtle semantic shifts which words undergo over time, influenced by real-world events, are detected by the presented method.

Detecting semantic shifts can be of use in a variety of linguistic applications. First, this method can help with the problem of automatically monitoring events through a stream of texts [AGK01]. Detected semantic shifts can potentially be used as additional features in algorithms aimed at extracting the course of events. Without unsupervised approaches, it is impossible to process all the continuously generated data; this is the primary motivating factor for our research. Second, the developed approach can be used to study language change and to compare temporal corpus slices. This language area is traditionally studied by linguists, who put a lot of effort into describing semantic shifts with the help of dictionaries, corpora and sociolinguistic research. At the same time, it is impossible to cover the whole vocabulary of a language and describe every lexical shift manually. Distributional semantic models facilitate this task.

The approaches to event detection and to modeling language shifts have a lot in common. The first techniques employed various frequency metrics [JS09] and shallow semantic modeling [KNR15], [HBB10]. With the emergence of distributional semantic models, the detection of semantic shifts acquired new potential, as it was shown that word embeddings significantly improve the performance of such algorithms [KARPS15].

The idea of employing changes in distributional semantic models to track semantic shifts is not in itself new. [KCH+14] proposed to detect language change with chronologically trained models. However, they used a rather simplified measure of 'distance' between word vectors at different time slices, namely raw cosine distance; we employ more sophisticated methods, as described further. [POL10] developed an approach to First Story Detection in Twitter posts. Their research is similar to ours in that it deals with streaming data: the authors explore the space of documents and compare new tweets to the existing ones. However, their algorithm is developed specifically for short texts like tweets, which differ radically from the news pieces analyzed in the present paper.

The rest of the paper is organized as follows. In Section 2 we introduce the basics of prediction-based vector models of semantics. Section 3 describes the principles of comparing such models trained on pieces of text which follow each other in time. The specifics of our datasets are covered in Section 4, followed by the description of the experimental setting in Section 5. Section 6 evaluates the results, and in Section 7 we conclude.

2 Distributed Semantic Models

Vector space models (VSMs) are well established in the field of computational linguistics and have been studied for decades (see [TP+10], [Reh11]). Essentially, a model is a set of words and corresponding vectors, which are produced from typical contexts for a given word. The most widespread type of context is other words co-occurring with the given one, which means that the set of all possible contexts generally equals the size of the vocabulary of the corpus. The dimensionality of the resulting count model can be reduced with well-known techniques like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). But in turn, this effectively forbids online training (continuously updating the model with new data), because after each update one has to perform computationally expensive dimensionality reduction over the whole co-occurrence matrix.
To overcome this, we employ a type of VSM called prediction-based models: particularly, the Continuous Bag-of-Words (CBOW) algorithm ([BDV03], [MSC+13]). (The well-known word2vec tool also implements SkipGram, another predictive algorithm; however, it is more computationally expensive, and we leave its usage for future work.) Predictive models approximate co-occurrence data instead of counting it directly, and show a promising set of properties. Using them, one directly learns dense lexical vectors (embeddings). The vectors are initialized randomly and then, as we move through the training corpus with a sliding window of a pre-defined width, gradually converge to values maximizing the likelihood of correctly predicting lexical neighbors. Such models as a rule use artificial neural networks for training; this is why they are sometimes called neural models.

For our task, it is important that predictive models can be updated with new co-occurrence data in a quite straightforward way. As already said, this is usually not the case with count models, which demand computationally expensive recalculations each time a new text is added.

3 Introducing Temporal Dimension to Vector Models

Detecting the semantic shifts which words undergo over time demands the ability to somehow compare the reference ('baseline') model and the updated models representing later periods of time.

Updating a neural model with new texts (in addition to the base training corpus used for initial training) is technically straightforward. After that, we have two models M1 and Mn, where the former is the 'baseline' reference model and the latter is the updated one (or a sequence of n updated models, each corresponding to the next time period), possibly bringing new semantic shifts. This dynamic model in a way tries to imitate a human brain learning new things, gradually 'updating' its state with new input data every day.

What are the possible ways to extract these changes? Suppose there is a set S of named entities (organizations, locations or persons we are interested in). Initially, in the model M1, each element of S can be thought of as possessing a number of topical 'associates' or 'nearest neighbors': words whose respective vectors are closest to this element's vector, ranked by their closeness or similarity. The exact number of nearest neighbors we consider is, in the simplest case, defined arbitrarily (for example, the 10 nearest words). As we update the model with new data, co-occurrence counts for the elements of S gradually grow (the model sees them in new contexts). It means that in each successive model Mn the learned vectors for the elements of S can be different.

If the contexts for these words remain pretty much the same throughout the training data, the list of associates (nearest neighbors) in Mn will also remain intact. However, if a word acquires new typical contexts or loses some previous ones, its neural embedding will change: a semantic shift happens. Accordingly, we will see a new list of associates. For example, the vector representation for the word president may change so that its nearest neighbor is the vector for the name of the actual president of a country, instead of the previous one.

In this way, lists of nearest neighbors can be compared across models trained on different corpora, or across one and the same model after an incremental update (as in the presented research). Substantial changes or bursts in such lists for the named entities we are interested in may signal that these entities have undergone or are undergoing semantic shifts, which in turn reflect real-world events. We dub this approach 'dynamic neural embedding models'.

Sets of neighbors in different models can be compared in many ways.
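The ranked neighbor lists described above are obtained by scoring every vocabulary word against the target's vector by cosine similarity. A minimal pure-Python sketch is given below; the toy 2-dimensional vectors and words are invented for illustration only and are not taken from the paper's models.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(word, embeddings, k=10):
    # Rank all other words by cosine similarity to the target word's vector.
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]

# Toy 'model': chile points in almost the same direction as peru.
toy = {
    "chile": [1.0, 0.1],
    "peru": [0.9, 0.2],
    "quake": [0.1, 1.0],
}
print(nearest_neighbors("chile", toy, k=2))  # ['peru', 'quake']
```

In a real model the vectors are the 300-dimensional CBOW embeddings, and the neighbor set of an entity is recomputed after every incremental update.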
Approaches to this range from the simple Jaccard index [Jac01] to complex graph-based algorithms. We test two methods:

1. Kendall's τ coefficient [Ken48], which measures the similarity of item rankings in two sets. Intuitively, it is important to pay attention not only to the raw appearance of some words in the nearest-neighbor set, but also to their rankings in it.

2. Relative Neighborhood Tree (RNT), introduced by [CGS15]. It essentially produces a tree graph with the target word as its root, nearest neighbors as vertices and similarities between them as weighted edges. We then select the immediate neighbors of the target word in this tree and rank them according to their cosine similarity to the target word. These rankings are then compared across models using the same Kendall's τ.

The reason behind the second method is that it theoretically allows a deeper analysis of the structure of nearest-neighbor sets. Obviously, the neighbors participate in similarity relations not only with the target word but also between themselves. These relations convey meaning as well, making it possible to find the most 'important' neighbors. Graph-based methods for analyzing relations between words in distributional models were also used in [KWHdR15]; note, however, that the problem they deal with is the inverse of ours: they attempt to trace changes in surface words for a stable set of concepts, while we attempt to trace semantic shifts (changes in underlying concepts for a stable set of words).

We hoped that this graph-supported 'pre-selection' would allow Kendall's τ to improve the performance of the model. However, these expectations failed, and simple ranking turned out to be more efficient than the graph-based method; see Section 6.

4 Data Description

We test our approach on lemmatized corpora of English and Russian news texts. The English corpus consists of The Signal Media Dataset (http://research.signalmedia.co/newsir16/signal-dataset.html), which contains 265,512 blog articles and 734,488 news articles from September 2015. The size of the corpus (after lemmatizing and removing stop words) is 222,928,287 words. We employ the Stanford POS tagger [TKMS03] to extract lemmas and to assign each lemma a part-of-speech tag.

In order to test whether the extracted semantic shifts are consistent across languages, we use a corpus of news articles in Russian published in September 2015 (unfortunately, not available publicly due to copyright restrictions). It contains about 500,000 texts extracted from about 1,000 Russian-language news sites. The size of the corpus (after lemmatizing and removing stop words) is 59,167,835 words. We employ Mystem [Seg03], a state-of-the-art tagger for Russian, to produce lemmas and part-of-speech tags.

5 Experimental setting

News texts from September 2015 alone do not seem to be a good training set, because such a corpus is inevitably limited in language coverage and lacks relations to events that happened earlier. Therefore, we first train a 'reference' or 'baseline' model which aims to mimic some background knowledge and is then exposed to daily updates. For English, we used the British National Corpus (http://www.natcorp.ox.ac.uk/; about 50 million words) to train this reference model, while for Russian it was the corpus of news articles published in the months preceding September 2015, namely June, July and August (taken from the same source as the September articles). This corpus contains about 250 million words.

We acknowledge that it is not quite correct to employ different types of corpora for the 'reference' models in English and Russian. However, in a way, we compensate for the quality and balance of the BNC with the larger size of the Russian reference corpus. In the future we plan to eliminate this inconsistency by using an analogous set of English news published in the summer months, or by employing Wikipedia dumps as reference corpora for both languages.

Both corpora were merged with same-language texts released in the first half of September 2015 (before the 14th of September), in order to seed the baseline models with some initial 'knowledge' of events and entities belonging to this month. Then, Continuous Bag-of-Words models were trained on both corpora, using negative sampling with 10 samples, vector size 300, symmetric window size 5 and 5 iterations. Words with frequency less than 10 were ignored during training.

After that, we successively updated these models with texts released in the following September time periods: 14th–15th, 16th–17th, 18th–20th, 21st–22nd, 23rd–24th, 25th–27th, and 28th–30th. A granularity of 2 or 3 days was chosen in order to enlarge the amount of data fed to the models: for example, some one-day Russian corpora corresponding to weekends contained only several thousand words. For this reason, we additionally tried to include weekends in the 3-day periods, to make the news stream more evenly distributed. As a result, the average time period size in tokens was 18,774,000 for English data and 5,332,000 for Russian data.

We once again emphasize that our baseline models were not re-trained from scratch with texts from the new corpora added. Instead, we continued training the same model, gradually updating word vectors with new contexts. All interim states were saved as separate models, and in the end we had 8 successive models for each language.

We extracted English and Russian country names from the Wikipedia list of all world countries (https://en.wikipedia.org/wiki/List_of_sovereign_states) and manually checked and normalized it, bringing all name variants to one lexeme.
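The first comparison method from Section 3, Kendall's τ over neighbor rankings, can be sketched in a few lines of stdlib Python. The paper does not specify how partially overlapping neighbor lists are aligned, so this sketch assumes the simpler case where both lists rank the same set of items; the rankings below are toy data.

```python
def kendall_tau(rank_a, rank_b):
    # Kendall's tau between two rankings of the same items (no ties):
    # (concordant pairs - discordant pairs) / (n * (n - 1) / 2).
    assert set(rank_a) == set(rank_b)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is concordant if rank_b orders it the same way as rank_a.
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

neighbors = ["peru", "bolivia", "colombia", "argentina"]
print(kendall_tau(neighbors, neighbors))        # 1.0: neighbor set unchanged
print(kendall_tau(neighbors, neighbors[::-1]))  # -1.0: ranking fully reversed
```

A value near 1 means the entity's associates were stable between two model states; a low value signals a burst.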
Then we filtered out the entities with a frequency of less than 30 per million words in either of our two reference corpora (English and Russian), producing a set CS of 36 frequent country names. (Low-frequency country names bring in noise, because their vectors are susceptible to wild fluctuations when exposed to even a small amount of new contexts.)

Finally, for each of the successive models, we found the nearest-neighbor sets for each entity in CS and compared them to the sets from the model state at the previous time period. Kendall's τ and the Relative Neighborhood Tree (RNT) were used to compute similarity coefficients for each country within the given pair of models. This provided us with two lists of countries (one per language) ranked by their similarity to the same country in the 'previous' model. Supposedly, countries in which some major events happened during the last days should be positioned low in these lists, because their associations in news texts drifted towards the recent event or an opinion burst.

Let us illustrate how news texts and changes in the models reflect real-life events by comparing the 10 nearest associates for Chile in the English and Russian corpora. On the 16th of September 2015 there was an earthquake in Chile, and we can detect its 'echo' in the changes between our models for the 14th–15th and the 16th–17th of September (see Table 1).

Table 1: Change in Chile's neighbor set

14th–15th September                16th–17th September
English      Russian               English      Russian
peru         бачелет (bachelet)    quake        аргентина
bolivia      аргентина             earthquake   бачелет (bachelet)
colombia     коста-рика            santiago     никарагуа
argentina    перчик                chilean      мексика
honduras     никарагуа             tremor       бельгия
brazil       швейцария             tsunami      исландия
ecuador      бельгия               aftershock   тунис
nicaragua    исландия              chileans     магнитуда (magnitude)
paraguay     аргентин              temblor      землетрясение (earthquake)
enchiladas   гватемала             kyushu       коста-рика

Before the 16th of September, the associates for Chile in both models were mostly the neighboring countries. However, after the earthquake things completely changed: there was a strong bias towards this topic in news and blogs, and this is reflected in the vectors for the word. 60% of the English and 20% of the Russian associates are now related to the event.

The Kendall's τ coefficient between these two neighbor lists is as low as 0 (the neighbors are completely replaced) for English, and 0.56 for Russian. The average Kendall's τ over CS in the English models for the two days in question is 0.56, with a standard deviation of 0.12. Thus, in the case of English, the change to the neighbor set can be considered a significant burst, well above simple chance. In the case of Russian, Kendall's τ lies only 1 point below the average value of 0.57. It is obvious that the Russian mass media paid less attention to the earthquake (being more concerned with Michelle Bachelet, Chile's president), but the event is still reflected in the nearest-neighbor set.

The next section describes how we employed the cross-linguality of the data to evaluate the presented approach.

6 Cross-Lingual Evaluation of Events Detection

There is no 'gold standard' or ground truth which would allow us to evaluate the precision and recall of our event and association extraction, or to tune the hyperparameters of the algorithms. However, there is a way to indirectly estimate their performance in a kind of intrinsic evaluation.

We hypothesize that the better an algorithm detects semantic shifts, the closer its results should be on model sequences trained on corpora in different languages. Obviously, national media focus on different topics, but this mostly concerns domestic news. As for world news, the worst scenario would be that a news story is not covered at all in the national media of a particular country; such scenarios should be rare. In other cases, the perspective on a story can differ, but the 'burst' should remain the same. (Analyzing the degree to which the vision of events differs across national media is beyond the scope of the present research.)
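Comparing two such cross-language 'burstiness' rankings calls for a rank correlation coefficient; this paper uses Spearman's ρ. A stdlib sketch for the tie-free case, ρ = 1 - 6 Σd² / (n(n² - 1)), follows; the country orderings are illustrative toy data, not the paper's actual rankings.

```python
def spearman_rho(rank_a, rank_b):
    # Spearman's rank correlation between two rankings of the same items
    # (no ties), via rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    assert set(rank_a) == set(rank_b)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    d_squared = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

english = ["Italy", "Georgia", "Malaysia", "Japan", "China"]
shuffled = ["Japan", "China", "Italy", "Georgia", "Malaysia"]
print(spearman_rho(english, english))   # 1.0: identical rankings
print(spearman_rho(english, shuffled))  # -0.5: strongly disagreeing rankings
```

A ρ close to 1 between the English and Russian country rankings for one time period means both media streams flagged the same countries as 'bursty'.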
Thus, the English and Russian country lists ranked by their 'burstiness' can be compared using Spearman's ρ [Spe04] for each time period. As there are 7 shifts from one time period to another, we use the median of the ρ values for these 7 cases as a tentative measure of an algorithm's performance. Table 2 gives an example of such country rankings for the changes between the 18–20 and 21–22 of September. One can see that the top lists are highly similar, with 3 of the 5 countries appearing in both (the actual Spearman's ρ for the total lists of 36 countries between these periods is 0.5).

Table 2: 5 countries with the most changed neighbor sets (of 36 total) between September 18–20 and 21–22

Rank  English   Russian (translated)
1     Italy     Japan
2     Georgia   Brazil
3     Malaysia  China
4     Japan     Spain
5     China     Georgia

The overall results of applying this approach to the whole dataset using our two algorithms (with different sizes of nearest-neighbor sets to consider) are presented in Table 3. We also applied it to a simple baseline method, where the nearest neighbors are the words which most frequently occurred in a window of 5 tokens to the right and to the left of the target entity in the given corpus.
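The raw co-occurrence baseline just described can be sketched in a few lines of stdlib Python. Tokenization and stop-word removal are simplified here, and the example sentence is invented for illustration.

```python
from collections import Counter

def window_cooccurrences(tokens, target, window=5, k=10):
    # Baseline 'neighbors': the k words most frequently co-occurring with
    # the target within +/-window tokens, ranked by raw frequency.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for t in tokens[lo:hi] if t != target)
    return [word for word, _ in counts.most_common(k)]

text = "strong earthquake hit chile today chile earthquake toll rose".split()
print(window_cooccurrences(text, "chile", window=2, k=3))
```

Unlike the embedding-based neighbor sets, these frequency rankings are dominated by the large reference corpus, which is why the baseline barely reacts to a few days of new texts.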
Table 3: Cross-lingual evaluation

Algorithm                      Neighbor set size   Median Spearman's ρ
Raw co-occurrences baseline    5                   0.26 (p = 0.12)
                               10                  0.15
                               100                 0.06
CBOW and Kendall's τ           5                   0.25
                               10                  0.25
                               100                 0.28 (p = 0.09)
CBOW and Relative              5                   0.20
Neighborhood Tree              10                  0.16
                               100                 0.14

Kendall's τ consistently renders better results without the additional selection of 'important' associates by a relative neighborhood tree (additionally, it is much faster). This once again raises questions about whether vector models can be efficiently processed with graph representations. Kendall's τ also outperforms the baseline approach: the margin is as small as two points, but it is supported by higher significance (p < 0.1).

Note that a qualitative analysis of the baseline results shows that they are mostly inappropriate for any practical task. For the time period described in Table 1, the baseline approach reveals almost no differences between neighbor sets: the average Kendall's τ is 0.92 for English and 0.99 for Russian. Thus, while in the case of English the earthquake event is at least detected (we observe the emergence of 4 new related neighbors), in the case of Russian the neighbor set remained strictly the same. It seems that the raw co-occurrences approach suffers from overestimating the influence of the reference corpora, which are much larger than the daily updates. Dynamic neural embedding models overcome this problem.

Interestingly, taking wider sets of neighbors into account results in better performance only for CBOW with Kendall's τ. For the baseline and for CBOW with RNT, increasing the size of the processed neighbor sets actually results in poorer performance. The reason for this behavior in RNT may be that the algorithm begins to 'roam' in the graph, attracting more far-away associates as immediate tree neighbors of the target word. In the baseline method it simply adds much language-dependent noise, which semantically aware models filter out at the training stage.

7 Conclusions

We presented a method for detecting semantic shifts for countries in news texts with the help of dynamic neural embedding models. We explored the difference between entities' vector representations in models from different temporal stages and discovered association shifts that happen to these words over time. This can be employed to trace trends and events in streaming news texts using a completely unsupervised approach.

We showed that distributional semantic models are rather efficient at detecting association shifts and are in most cases language-independent. In our test sets, there is a statistically significant correlation between the lists of 'semantically shifted' countries in the English and Russian model sequences for the same time period.

However, there is still room for improvement. First of all, ways to evaluate semantic shift extraction have to be developed (including the creation of ground-truth datasets). Additionally, we plan to test other ways of comparing neighbor sets and to tune the algorithms' hyperparameters. It would also be useful to improve the quality of the corpora (e.g. eliminate more noise and stop words).
Finally, we plan to experiment with using different algorithms or parameter sets for different languages: preliminary tests show promising results.

References

[AGK01] James Allan, Rahul Gupta, and Vikas Khandelwal. Temporal summaries of new topics. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 10–18, New York, USA, 2001.

[BDV03] Yoshua Bengio, Rejean Ducharme, and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[CGS15] Amaru Cuba Gyllensten and Magnus Sahlgren. Navigating the semantic horizon using relative neighborhood graphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2451–2460, Lisbon, Portugal, September 2015.

[HBB10] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems 23, pages 856–864, Vancouver, Canada, 2010.

[Jac01] Paul Jaccard. Distribution de la Flore Alpine: dans le Bassin des dranses et dans quelques régions voisines. Rouge, 1901.

[JS09] David Jurgens and Keith Stevens. Event detection in blogs using temporal random indexing. In Proceedings of the Workshop on Events in Emerging Text Types, pages 9–16, Borovets, Bulgaria, 2009.

[KARPS15] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635, Florence, Italy, 2015.

[KCH+14] Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis of language through neural language models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, page 61, Baltimore, USA, 2014.

[Ken48] Maurice George Kendall. Rank correlation methods. Griffin, 1948.

[KNR15] Manika Kar, Sérgio Nunes, and Cristina Ribeiro. Summarization of changes in dynamic text collections using Latent Dirichlet Allocation model. Information Processing & Management, 51(6):809–833, 2015.

[KWHdR15] Tom Kenter, Melvin Wevers, Pim Huijnen, and Maarten de Rijke. Ad hoc monitoring of vocabulary shifts over time. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, pages 1191–1200, New York, NY, USA, 2015. ACM.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.

[POL10] Saša Petrović, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 181–189. Association for Computational Linguistics, 2010.

[Reh11] Radim Rehurek. Scalability of semantic analysis in natural language processing. PhD thesis, Masaryk University, 2011.

[Seg03] Ilya Segalovich. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA, pages 273–280. Citeseer, 2003.

[Spe04] Charles Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.

[TKMS03] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 NAACL-HLT Conference, Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[TP+10] Peter Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, 2010.