=Paper=
{{Paper
|id=Vol-1568/paper5
|storemode=property
|title=Cross-Lingual Trends Detection for Named Entities in News Texts with Dynamic Neural Embedding Models
|pdfUrl=https://ceur-ws.org/Vol-1568/paper5.pdf
|volume=Vol-1568
|authors=Andrey Kutuzov,Elizaveta Kuzmenko
|dblpUrl=https://dblp.org/rec/conf/ecir/KutuzovK16
}}
==Cross-Lingual Trends Detection for Named Entities in News Texts with Dynamic Neural Embedding Models==
Andrey Kutuzov
University of Oslo
Postboks 1080 Blindern 0316, Oslo, Norway
andreku@ifi.uio.no
Elizaveta Kuzmenko
National Research University Higher School of Economics
Moscow, Russia
eakuzmenko_2@edu.hse.ru
Abstract

This paper presents an approach to detecting real-world events as manifested in news texts. We use vector space models, particularly neural embeddings (prediction-based distributional models). The models are trained on a large 'reference' corpus and then successively updated with new textual data from daily news. For given words or multi-word entities, computing the difference between their vector representations in two or more models reveals the association shifts these words undergo over time. The hypothesis is tested on country names, using news corpora for English and Russian. We show that this approach successfully extracts meaningful temporal trends for named entities regardless of the language.

1 Introduction

We propose an approach to track changes happening to real-world entities (in our case, countries) with the help of constantly updated distributional semantic models. We show how one can train such models on new textual data arriving daily and draw conclusions about events from the changes in word vectors induced by new contexts. In other words, the presented method detects the subtle semantic shifts that words undergo over time under the influence of real-world events.

Detecting semantic shifts can be useful in a variety of linguistic applications. First, this method can help in automatically monitoring events through a stream of texts [AGK01]: detected semantic shifts can potentially serve as additional features in algorithms aimed at extracting the course of events. Without unsupervised approaches, it is impossible to process all the continuously generated data; this is the primary motivation for our research. Second, the developed approach can be used to study language change and to compare temporal corpus slices. This area is traditionally studied by linguists, who put considerable effort into describing semantic shifts with the help of dictionaries, corpora and sociolinguistic research. At the same time, it is impossible to cover the entire vocabulary of a language and describe every lexical shift manually; distributional semantic models facilitate this task.

Approaches to event detection and to modeling language shifts have much in common. Early techniques employed various frequency metrics [JS09] and shallow semantic modeling [KNR15], [HBB10]. With the emergence of distributional semantic models, the detection of semantic shifts acquired new potential, as word embeddings were shown to significantly improve the performance of such algorithms [KARPS15].

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20 March 2016, published at http://ceur-ws.org
The rest of the paper is organized as follows. In Section 2 we introduce the basics of prediction-based vector models of semantics. Section 3 describes the principles of comparing such models when they are trained on pieces of text that follow each other in time. The specifics of our datasets are covered in Section 4, followed by the description of the experimental setting in Section 5. Section 6 evaluates the results, and in Section 7 we conclude.

2 Distributed Semantic Models

Vector space models (VSMs) are well established in the field of computational linguistics and have been studied for decades (see [TP+10], [Reh11]). Essentially, a model is a set of words and corresponding vectors, produced from the typical contexts of each word. The most widespread type of context is the other words co-occurring with a given one, which means that the set of all possible contexts generally equals the size of the corpus vocabulary. The dimensionality of the resulting count model can be reduced with well-known techniques like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). But this, in turn, effectively forbids online training (continuously updating the model with new data), because after each update one has to perform computationally expensive dimensionality reduction over the whole co-occurrence matrix.
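The count-model pipeline just described can be condensed into a short numpy sketch; the toy corpus, vocabulary and window size below are hypothetical, and real systems operate on matrices with millions of rows:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    """Symmetric word-word co-occurrence counts within a sliding window."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, t in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j and t in idx and tokens[j] in idx:
                M[idx[t], idx[tokens[j]]] += 1
    return M

# Toy corpus (hypothetical); every addition of new text changes the counts,
# which is why the reduction below would have to be redone from scratch.
vocab = ["earthquake", "chile", "tremor", "president", "election"]
tokens = "earthquake chile tremor chile earthquake president election".split()
M = cooccurrence_matrix(tokens, vocab, window=2)

# Reduce dimensionality with truncated SVD: rows of U * S are dense word vectors.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
vectors = U[:, :k] * S[:k]
print(vectors.shape)  # (5, 2): one k-dimensional vector per vocabulary word
```

After any corpus update the counts change and the SVD must be recomputed over the whole matrix, which is exactly the bottleneck that rules out cheap online training for count models.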
To overcome this, we employ a type of VSM called prediction-based models: particularly, the Continuous Bag-of-Words (CBOW) algorithm ([BDV03], [MSC+13]). (The well-known word2vec tool also implements SkipGram, another predictive algorithm; however, it is more computationally expensive, and we leave its usage for future work.) Predictive models approximate co-occurrence data instead of counting it directly, and show a promising set of properties. Using them, one directly learns dense lexical vectors (embeddings). Vectors are initialized randomly and then, as we move through the training corpus with a sliding window of a pre-defined width, gradually converge to values maximizing the likelihood of correctly predicting lexical neighbors. Such models as a rule use artificial neural networks for training; this is why they are sometimes called neural models.

For our task, it is important that predictive models can be updated with new co-occurrence data in a quite straightforward way. As already said, this is usually not the case with count models, which demand computationally expensive recalculations each time a new text is added.

3 Introducing Temporal Dimension to Vector Models

Detecting the semantic shifts that words undergo over time demands the ability to compare reference ('baseline') models with updated models representing later periods of time. The idea of employing changes in distributional semantic models to track semantic shifts is not in itself new. [KCH+14] proposed to detect language change with chronologically trained models; however, they used a rather simplistic measure of 'distance' between word vectors at different time slices, namely raw cosine distance, while we employ more sophisticated methods, as described further. [POL10] developed an approach to First Story Detection in Twitter posts. Their research is similar to ours in that it deals with streaming data: the authors explore the space of documents and compare new tweets to the existing ones. However, their algorithm is developed specifically for short texts like tweets, which differ radically from the news pieces analyzed in the presented paper.

Updating a neural model with new texts (in addition to the base training corpus used for initial training) is technically straightforward. After that, we have two models M1 and Mn, where the former is the 'baseline' reference model and the latter is the updated one (or a sequence of n updated models, each corresponding to the next time period), possibly bringing new semantic shifts. This dynamic model in a way imitates a human brain learning new things, gradually 'updating' its state with new input data every day.

What are the possible ways to extract these changes? Suppose there is a set S of named entities (organizations, locations or persons we are interested in). Initially, in the model M1, each element of S can be thought of as possessing a number of topical 'associates' or 'nearest neighbors': the words whose vectors are closest to this element's vector, ranked by similarity. The exact number of nearest neighbors to consider is, in the simplest case, defined arbitrarily (for example, the 10 nearest words). As we update the model with new data, the co-occurrence counts for the elements of S gradually grow (the model sees them in new contexts). This means that in each successive model Mn the learned vectors for the elements of S can be different.

If the contexts for these words remain pretty much the same throughout the training data, the list of associates (nearest neighbors) in Mn will also remain intact. However, if a word acquires new typical contexts or loses some previous ones, its neural embedding will change: a semantic shift happens. Accordingly, we will see a new list of associates. For example, the vector representation for the word president may change so that its nearest neighbor is the vector for the name of the current president of a country, instead of the previous one.
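The change of associate lists can be illustrated with a minimal cosine-similarity sketch; the two-dimensional vectors below are purely hypothetical stand-ins for real 300-dimensional embeddings:

```python
import numpy as np

def nearest_neighbors(word, embeddings, k=3):
    """Return the k words whose vectors are closest (by cosine) to `word`."""
    target = embeddings[word]
    sims = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        sims[other] = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Hypothetical 2-d vectors: m1 is the baseline model state, mn the updated one.
m1 = {"chile": np.array([1.0, 0.1]), "peru": np.array([0.9, 0.2]),
      "bolivia": np.array([0.8, 0.1]), "earthquake": np.array([0.1, 1.0])}
mn = dict(m1, chile=np.array([0.4, 0.9]))  # 'chile' drifted toward new contexts

print(nearest_neighbors("chile", m1))  # baseline associates: neighboring countries
print(nearest_neighbors("chile", mn))  # the top associate shifts to 'earthquake'
```

Comparing the two printed lists is the raw material for the shift-detection methods discussed below.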
In this way, lists of nearest neighbors can be compared across models trained on different corpora, or across one and the same model before and after an incremental update (as in the presented research). Substantial changes or bursts in such lists for the named entities we are interested in may signal that these entities have undergone or are undergoing semantic shifts, which in turn reflects real-world events. We dub this approach 'dynamic neural embedding models'.

Sets of neighbors in different models can be compared in many ways, ranging from the simple Jaccard index [Jac01] to complex graph-based algorithms. We test two methods:

1. Kendall's τ coefficient [Ken48], which measures the similarity of item rankings in two sets. Intuitively, it is important to pay attention not only to the raw appearance of some words in the nearest neighbor set, but also to their rankings in it.

2. Relative Neighborhood Tree (RNT), introduced by [CGS15]. It essentially produces a tree graph with the target word as its root, nearest neighbors as vertices and similarities between them as weighted edges. We then select the immediate neighbors of the target word in this tree and rank them according to their cosine similarity to the target word. These rankings are then compared across models using the same Kendall's τ.
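A minimal version of the first method might look as follows; note that the paper does not specify how items present in only one of the two lists are treated, so dropping them (and returning 0 for fully replaced sets) is our own simplifying assumption:

```python
from itertools import combinations

def kendall_tau(list_a, list_b):
    """Kendall's tau between two ranked lists.

    Computed over the items the two lists share; items present in only one
    list are dropped, and fully disjoint lists score 0 (both choices are
    illustrative assumptions, not taken from the paper).
    """
    shared = [w for w in list_a if w in list_b]
    if len(shared) < 2:
        return 0.0  # no ranking information left: treat as complete replacement
    rank_b = {w: list_b.index(w) for w in shared}
    concordant = discordant = 0
    for x, y in combinations(shared, 2):  # pairs taken in list_a order
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

before = ["peru", "bolivia", "colombia", "argentina"]
after = ["quake", "earthquake", "santiago", "chilean"]  # fully replaced set
print(kendall_tau(before, before))  # 1.0: identical rankings
print(kendall_tau(before, after))   # 0.0: no shared neighbors left
```

In practice a library routine with proper tie handling (e.g. scipy.stats.kendalltau over shared items) would be preferable; the sketch only shows the concordant/discordant pair counting at the heart of the coefficient.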
The reason behind the second method is that it theoretically allows a deeper analysis of the structure of nearest neighbor sets. Obviously, the neighbors participate in similarity relations not only with the target word but also among themselves. These relations convey meaning as well, making it possible to find the most 'important' neighbors. Graph-based methods for analyzing relations between words in distributional models were also used in [KWHdR15]; note, however, that the problem they deal with is the inverse of ours: they attempt to trace changes in surface words for a stable set of concepts, while we attempt to trace semantic shifts (changes in underlying concepts for a stable set of words).

We hoped that this graph-supported 'pre-selection' would allow Kendall's τ to improve the performance of the method. However, these expectations failed, and simple ranking turned out to be more efficient than graph-based methods; see Section 6.
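For the second method, the core relative-neighborhood pruning rule can be sketched as follows. This is our reading of the graph construction (an edge between two words survives only if no third word is closer to both endpoints), not the exact rooted-tree procedure of [CGS15], and the one-dimensional vectors are hypothetical:

```python
import numpy as np

def relative_neighborhood_edges(vectors):
    """Edges of the relative neighborhood graph over a dict of word vectors:
    (u, v) is kept only if no third word w is strictly closer to both u and v.
    A sketch of the idea behind [CGS15]; the paper's exact construction
    (a tree rooted at the target word) may differ in detail.
    """
    words = list(vectors)
    d = {(u, v): np.linalg.norm(vectors[u] - vectors[v])
         for u in words for v in words if u != v}
    edges = []
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if not any(d[(u, w)] < d[(u, v)] and d[(v, w)] < d[(u, v)]
                       for w in words if w not in (u, v)):
                edges.append((u, v))
    return edges

# Hypothetical 1-d embeddings: 'b' sits between 'a' and 'c',
# so the long edge (a, c) is pruned from the graph.
vecs = {"a": np.array([0.0]), "b": np.array([1.0]), "c": np.array([2.0])}
print(relative_neighborhood_edges(vecs))  # [('a', 'b'), ('b', 'c')]
```

In the paper's pipeline, the immediate graph neighbors of the target word would then be re-ranked by cosine similarity and compared across models with Kendall's τ.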
4 Data Description

We test our approach on lemmatized corpora of English and Russian news texts. The English corpus consists of The Signal Media Dataset (http://research.signalmedia.co/newsir16/signal-dataset.html), which contains 265,512 blog articles and 734,488 news articles from September 2015. The size of the corpus (after lemmatizing and removing stop words) is 222,928,287 words. We employ the Stanford POS tagger [TKMS03] to extract lemmas and to assign each lemma a part-of-speech tag.

In order to test whether the extracted semantic shifts are consistent across languages, we use a corpus of news articles in Russian published in September 2015 (unfortunately, not publicly available due to copyright restrictions). It contains about 500,000 texts extracted from about 1,000 Russian-language news sites. The size of the corpus (after lemmatizing and removing stop words) is 59,167,835 words. We employ Mystem [Seg03], a state-of-the-art tagger for Russian, to produce lemmas and part-of-speech tags.

5 Experimental Setting

News texts from September 2015 do not seem to be a good training set on their own: such a corpus is inevitably limited in language coverage and lacks relations to events that happened earlier. Therefore, we first train a 'reference' or 'baseline' model which aims to mimic some background knowledge and is then exposed to daily updates. For English, we used the British National Corpus (http://www.natcorp.ox.ac.uk/; about 50 million words) to train this reference model, while for Russian it was the corpus of news articles published in the months preceding September 2015, namely June, July and August (taken from the same source as the September articles). This corpus contains about 250 million words.

We acknowledge that it is not quite correct to employ different types of corpora for the 'reference' models in English and Russian. However, in a way, we compensate for the quality and balance of the BNC with the larger size of the Russian reference corpus. In the future we plan to eliminate this inconsistency by using an analogous set of English news published in the summer months or by employing Wikipedia dumps as reference corpora for both languages.

Both corpora were merged with same-language texts released in the first half of September 2015 (before the 14th of September), in order to seed the baseline models with some initial 'knowledge' of events and entities belonging to this month. Then, Continuous Bag-of-Words models were trained for both corpora, using negative sampling with 10 samples, vector size 300, a symmetric window of size 5, and 5 iterations. Words with frequency less than 10 were ignored during training.

After that, we successively updated these models with texts released in the following September time periods: 14th–15th, 16th–17th, 18th–20th, 21st–22nd, 23rd–24th, 25th–27th, and 28th–30th. A granularity of 2 or 3 days was chosen in order to enlarge the amount of data fed to the models: for example, some one-day Russian corpora corresponding to weekends contained only
several thousand words. For this reason, we additionally tried to include weekends in the 3-day periods, to make the news stream more evenly distributed. As a result, the average time period size in tokens was 18,774,000 for the English data and 5,332,000 for the Russian data.

We once again emphasize that our baseline models were not re-trained from scratch when texts from new corpora were added. Instead, we continued training the same model, gradually updating the word vectors with new contexts. All interim states were saved as separate models, and in the end we had 8 successive models for each language.

We extracted English and Russian country names from the Wikipedia list of all world countries (https://en.wikipedia.org/wiki/List_of_sovereign_states) and manually checked and normalized it, bringing all name variants to one lexeme. We then filtered out the entities with a frequency of less than 30 per million words in either of our two reference corpora (English and Russian), producing a set CS of 36 frequent country names. (Low-frequency country names bring in noise, because their vectors are susceptible to wild fluctuations when exposed to even a small amount of new contexts.)
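To make the train-then-update regime concrete, here is a deliberately tiny CBOW-with-negative-sampling loop in numpy. It is not the implementation we actually used (a full word2vec-style CBOW with the hyperparameters listed above); it only demonstrates that an 'update' amounts to continuing gradient steps from saved weight matrices. The corpus, vocabulary and hyperparameters are toy values:

```python
import numpy as np

def cbow_update(tokens, vocab, W_in, W_out, window=2, lr=0.05, negative=3, rng=None):
    """One pass of CBOW with negative sampling over `tokens`,
    continuing from the weight matrices passed in (the 'update' step)."""
    if rng is None:
        rng = np.random.default_rng(0)
    ids = [vocab[t] for t in tokens if t in vocab]
    for i, target in enumerate(ids):
        context = ids[max(0, i - window):i] + ids[i + 1:i + 1 + window]
        if not context:
            continue
        h = W_in[context].mean(axis=0)            # averaged context vector
        pairs = [(target, 1.0)] + [(int(rng.integers(len(vocab))), 0.0)
                                   for _ in range(negative)]
        grad_h = np.zeros_like(h)
        for word, label in pairs:                 # logistic loss per sample
            score = 1.0 / (1.0 + np.exp(-h @ W_out[word]))
            g = lr * (label - score)
            grad_h += g * W_out[word]
            W_out[word] += g * h
        W_in[context] += grad_h / len(context)    # propagate to context words
    return W_in, W_out

rng = np.random.default_rng(42)
vocab = {w: i for i, w in enumerate("chile peru earthquake tremor president".split())}
W_in = (rng.random((len(vocab), 10)) - 0.5) / 10  # random init, as in Section 2
W_out = np.zeros((len(vocab), 10))

base = "chile peru chile peru president".split()           # toy 'reference' corpus
news = "chile earthquake tremor chile earthquake".split()  # toy daily update
W_in, W_out = cbow_update(base, vocab, W_in, W_out, rng=rng)
snapshot = W_in.copy()                                     # saved interim state
W_in, W_out = cbow_update(news, vocab, W_in, W_out, rng=rng)
print(np.allclose(snapshot, W_in))  # False: the update shifted the vectors
```

Saving a copy of the weights after each period, as done with `snapshot` here, is how the 8 successive per-language model states can be kept around for later comparison.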
Finally, for each of the successive models, we found the nearest neighbor sets for each entity in CS and compared them to the sets from the model state at the previous time period. Kendall's τ and the Relative Neighborhood Tree (RNT) were used to compute a similarity coefficient for each country within the given pair of models. This provided us with two lists of countries (one per language) ranked by their similarity to the same country in the 'previous' model. Supposedly, countries in which some major event happened during the last days should be positioned low in these lists, because their associations in news texts drifted towards the recent event or an opinion burst.

Let us illustrate how news texts and changes in the models reflect real-life events by comparing the 10 nearest associates for Chile in the English and Russian corpora. On the 16th of September 2015 there was an earthquake in Chile, and we can detect its 'echo' in the changes between our models for the 14th–15th and the 16th–17th of September (see Table 1).

Table 1: Change in Chile's neighbor set

14th–15th September: English | Russian || 16th–17th September: English | Russian
peru       | бачелет              || quake      | аргентина
bolivia    | аргентина            || earthquake | бачелет (bachelet)
colombia   | коста-рика           || santiago   | никарагуа
argentina  | перчик (pepper)      || chilean    | мексика
honduras   | никарагуа            || tremor     | бельгия
brazil     | швейцария            || tsunami    | исландия
ecuador    | бельгия              || aftershock | тунис
nicaragua  | исландия             || chileans   | магнитуда (magnitude)
paraguay   | аргентин             || temblor    | землетрясение (earthquake)
enchiladas | гватемала            || kyushu     | коста-рика

Before the 16th of September, the associates for Chile in both models were mostly the neighboring countries. After the earthquake, however, things changed completely: there was a strong bias towards this topic in news and blogs, and this is reflected in the vectors for the word. 60% of the English and 20% of the Russian associates are now related to the event.

The Kendall's τ coefficient between these two neighbor lists is as low as 0 (the neighbors are completely replaced) for English and 0.56 for Russian. The average Kendall's τ over CS in the English models for the two days in question is 0.56, with a standard deviation of 0.12. Thus, in the case of English, the change to the neighbor set can be considered a significant burst, well above simple chance. In the case of Russian, Kendall's τ lies only 1 point below the average value of 0.57. It is obvious that the Russian mass media paid less attention to the earthquake (being more concerned with Michelle Bachelet, Chile's president), but the event is still reflected in the nearest neighbor set.

The next section describes how we employed the cross-linguality of the data to evaluate the presented approach.

6 Cross-Lingual Evaluation of Events Detection

There is no 'gold standard' or ground truth which would allow us to evaluate the precision and recall of our event and association extraction, or to tune the hyperparameters of the algorithms. However, there is a way to indirectly estimate their performance in a kind of intrinsic evaluation.

We hypothesize that the better an algorithm detects semantic shifts, the closer its results should be on model sequences trained on corpora in different languages. Obviously, national media focus on different topics, but this mostly concerns domestic news. As for world news, the worst-case scenario would be a news story that is not covered at all in the national media of a particular country; such scenarios should be rare. In other cases, the perspective on a story can differ, but the 'burst' should remain the same. (Analyzing the degree to which the vision of events differs across national media is beyond the scope of the present research.)
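This consistency hypothesis can be tested as a rank correlation between the two per-language country lists. Below is a minimal Spearman's ρ for tie-free rankings (in practice a library routine such as scipy.stats.spearmanr would be used); the two five-country lists are hypothetical, not our experimental rankings:

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman's rho between two rankings of the same items (no ties)."""
    assert set(ranking_a) == set(ranking_b)
    n = len(ranking_a)
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d_squared = sum((i - pos_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical 'most shifted countries' lists for one time period.
english = ["Italy", "Georgia", "Malaysia", "Japan", "China"]
russian = ["Japan", "Georgia", "Malaysia", "Italy", "China"]
print(spearman_rho(english, english))           # 1.0: identical rankings
print(round(spearman_rho(english, russian), 2)) # 0.1
```

In the actual evaluation, this coefficient is computed over the full 36-country lists for each of the 7 period-to-period shifts, and the median of the 7 values is reported.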
Thus, the English and Russian country lists ranked by their 'burstiness' can be compared using Spearman's ρ [Spe04] for each time period. As there are 7 shifts from one time period to the next, we use the median of the ρ values over these 7 cases as a tentative measure of an algorithm's performance. Table 2 gives an example of such country rankings for the changes between the 18th–20th and 21st–22nd of September. One can see that the top lists are highly similar, with 3 of 5 countries appearing in both (the actual Spearman's ρ for the full lists of 36 countries between these periods is 0.5).

Table 2: 5 countries (of 36 in total) with the most changed neighbor sets between September 18–20 and 21–22

Rank | English  | Russian (translated)
1    | Italy    | Japan
2    | Georgia  | Brazil
3    | Malaysia | China
4    | Japan    | Spain
5    | China    | Georgia

The overall results of applying this approach to the whole dataset with our two algorithms (and with different sizes of the nearest neighbor sets to consider) are presented in Table 3. We also applied it to a simple baseline method, in which the nearest neighbors are the words that most frequently occurred within a window of 5 tokens to the right and to the left of the target entity in the given corpus.

Table 3: Cross-lingual evaluation

Algorithm                           | Neighbor set size | Median Spearman's ρ
Raw co-occurrences baseline         | 5                 | 0.26 (p = 0.12)
Raw co-occurrences baseline         | 10                | 0.15
Raw co-occurrences baseline         | 100               | 0.06
CBOW and Kendall's τ                | 5                 | 0.25
CBOW and Kendall's τ                | 10                | 0.25
CBOW and Kendall's τ                | 100               | 0.28 (p = 0.09)
CBOW and Relative Neighborhood Tree | 5                 | 0.20
CBOW and Relative Neighborhood Tree | 10                | 0.16
CBOW and Relative Neighborhood Tree | 100               | 0.14

Kendall's τ consistently renders better results without the additional selection of 'important' associates by a relative neighborhood tree (and it is also much faster). This once again raises questions about whether vector models can be efficiently processed with graph representations. Kendall's τ also outperforms the baseline approach: the margin is as small as two points, but it is supported by higher significance (p < 0.1).

Note that qualitative analysis of the baseline results shows that they are mostly inappropriate for any practical task. For the time period described in Table 1, the baseline approach reveals almost no differences between neighbor sets: the average Kendall's τ is 0.92 for English and 0.99 for Russian. Thus, while in the case of English the earthquake event is at least detected (we observe the emergence of 4 new related neighbors), in the case of Russian the neighbor set remained strictly the same. It seems that the raw co-occurrence approach suffers from overestimating the influence of the reference corpora, which are much larger than the daily updates. Dynamic neural embedding models overcome this problem.

Interestingly, taking wider sets of neighbors into account results in better performance only for CBOW with Kendall's τ. For the baseline and for CBOW with RNT, increasing the size of the processed neighbor sets actually results in poorer performance. The reason for this behavior in the case of RNT may be that the algorithm begins to 'roam' the graph, attracting more far-away associates as immediate tree neighbors of the target word. In the baseline method, it simply leads to much language-dependent noise, which semantically aware models filter out at the training stage.

7 Conclusions

We presented a method of detecting semantic shifts for countries in news texts with the help of dynamic neural embedding models. We explored the difference between entities' vector representations in models from different temporal stages and discovered association shifts that happen to these words over time. This can be employed to trace trends and events in streaming news texts in a completely unsupervised way.

We showed that distributional semantic models are rather efficient at detecting association shifts and are in most cases language-independent. In our test sets, there is a statistically significant correlation between the lists of 'semantically shifted' countries in the English and Russian sequences of models for the same time period.

However, there is still room for improvement. First of all, proper ways to evaluate semantic shift extraction have to be developed (including the creation of ground truth datasets). Additionally, we plan to test other ways of comparing neighbor sets and to tune the algorithms' hyperparameters. It would also be useful to improve the quality of the corpora (e.g. eliminate more noise and stop words). Finally, we plan to experiment with using different algorithms or parameter sets for different languages: preliminary tests show promising results.

References
[AGK01] James Allan, Rahul Gupta, and Vikas Khandelwal. Temporal summaries of new topics. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 10–18, New York, USA, 2001.

[BDV03] Yoshua Bengio, Rejean Ducharme, and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[CGS15] Amaru Cuba Gyllensten and Magnus Sahlgren. Navigating the semantic horizon using relative neighborhood graphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2451–2460, Lisbon, Portugal, September 2015.

[HBB10] Matthew Hoffman, Francis R. Bach, and David M. Blei. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems 23, pages 856–864, Vancouver, Canada, 2010.

[Jac01] Paul Jaccard. Distribution de la Flore Alpine: dans le Bassin des dranses et dans quelques régions voisines. Rouge, 1901.

[JS09] David Jurgens and Keith Stevens. Event detection in blogs using temporal random indexing. In Proceedings of the Workshop on Events in Emerging Text Types, pages 9–16, Borovets, Bulgaria, 2009.

[KARPS15] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625–635, Florence, Italy, 2015.

[KCH+14] Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis of language through neural language models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, page 61, Baltimore, USA, 2014.

[Ken48] Maurice George Kendall. Rank correlation methods. Griffin, 1948.

[KNR15] Manika Kar, Sérgio Nunes, and Cristina Ribeiro. Summarization of changes in dynamic text collections using Latent Dirichlet Allocation model. Information Processing & Management, 51(6):809–833, 2015.

[KWHdR15] Tom Kenter, Melvin Wevers, Pim Huijnen, and Maarten de Rijke. Ad hoc monitoring of vocabulary shifts over time. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM '15, pages 1191–1200, New York, NY, USA, 2015. ACM.

[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.

[POL10] Saša Petrović, Miles Osborne, and Victor Lavrenko. Streaming first story detection with application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 181–189. Association for Computational Linguistics, 2010.

[Reh11] Radim Rehurek. Scalability of semantic analysis in natural language processing. PhD thesis, Masaryk University, 2011.

[Seg03] Ilya Segalovich. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In MLMTA, pages 273–280. CiteSeer, 2003.

[Spe04] Charles Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.

[TKMS03] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 NAACL-HLT Conference, Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[TP+10] Peter Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, 2010.