=Paper=
{{Paper
|id=Vol-1568/paper6
|storemode=property
|title=Using News Articles for Real-time Cross-Lingual Event Detection and Filtering
|pdfUrl=https://ceur-ws.org/Vol-1568/paper6.pdf
|volume=Vol-1568
|authors=Gregor Leban,Blaž Fortuna,Marko Grobelnik
|dblpUrl=https://dblp.org/rec/conf/ecir/LebanFG16
}}
==Using News Articles for Real-time Cross-Lingual Event Detection and Filtering==
<pdf width="1500px">https://ceur-ws.org/Vol-1568/paper6.pdf</pdf>
<pre>
      Using news articles for real-time cross-lingual event
                    detection and filtering

               Gregor Leban                          Blaž Fortuna                   Marko Grobelnik
           Jožef Stefan Institute              Jožef Stefan Institute            Jožef Stefan Institute
            Ljubljana, Slovenia                  Ljubljana, Slovenia                Ljubljana, Slovenia
            gregor.leban@ijs.si                  blaz.fortuna@ijs.si               marko.grobelnik@ijs.si


Copyright © 2016 for the individual papers by the paper’s authors. Copying permitted for
private and academic purposes. This volume is published and copyrighted by its editors.
In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and
D. Albakour (eds.): Proceedings of the NewsIR’16 Workshop at ECIR, Padua, Italy,
20-March-2016, published at http://ceur-ws.org

Abstract

    The written medium through which we commonly learn about relevant news is the news
    article. Since an abundance of news articles is written daily, readers commonly face
    the problem of discovering the content of interest without being overwhelmed by the
    amount of it. In this paper we present a system called Event Registry which is able
    to group articles about an event across languages and extract from the articles core
    event information in a structured form. In this way, the amount of content that the
    reader has to check is significantly reduced, while the reader is additionally
    provided with global coverage of each event. Since all event information is
    structured, this also provides extensive and fine-grained options for information
    searching and filtering that are not available with current news aggregators.

1    Introduction

News publishers daily produce large numbers of news articles. Most of these articles
describe happenings that are currently occurring in the world, such as natural disasters,
meetings of important politicians, crime, business and sport events. Not all reported
information is equally important – some events get higher media coverage, while other
events get reported only by a small set of publishers.
   In order to learn about current events, people nowadays usually either go to their
favorite news publisher’s web site and browse through the frontpage articles or they use
some type of aggregator, such as Flipboard or Bloomberg Terminal. Neither of the two
approaches is optimal. By browsing a publisher’s web site you typically learn about a
small subset of current events (usually constrained to the geographic location of the
news source) that are not necessarily unbiased and objective but instead implicitly
promote political, social and religious views of the publisher/author. Using a news
aggregator on the other hand can provide the readers with coverage of the same events
from multiple news sources, but unfortunately also overwhelms the reader with huge
amounts of news articles (Bloomberg Terminal daily provides over 1 million articles).
Using a news aggregator is also helpful since it usually allows one to specify a
particular topic to follow, such as Business, Technology, Apple or Android. The list of
topics is however quite narrow and does not allow one to specify long-tail interests.
   In this paper we will describe a system called Event Registry [4] that tries to
alleviate the aforementioned issues with news consumption and is freely available at
http://eventregistry.org/. Just like news aggregators, it collects news articles
published globally from more than 100,000 news sources in over 10 different languages.
However, unlike the aggregators, Event Registry identifies from the articles the actual
events that are being described in the articles. For Event Registry, an event is defined
as any significant happening in the world that was reported in at least a few articles.
Two examples of events are the death of David Bowie on Jan 11, 2016 that was reported in
over 4,000 news articles as well as the news reported in 13 articles on Jan 23, 2016,
that in Smithsonian’s National Zoo, the Giant Panda was really enjoying the snow.
   Grouping of news articles into events has several advantages. First, given an event,
the reader can choose to read articles from various news sources that reported about the
event. Providing the complete and global coverage of the event allows the reader to
construct an unbiased view of the event and all related details. Secondly, when browsing
through the current events, the reader does not have to go through hundreds of news
articles, where several articles report about the same event. Instead, all articles about
the same event are grouped together and shown only once, which easily reduces the amount
of content by one or two orders of magnitude. Lastly, for each event in Event Registry
there is also abundant semantic information that is extracted from the articles, such as
the location of the event, date, who and what the event is about, etc. This semantic
information allows the reader to determine very specifically what his interests are and
get a custom-tailored feed of events and news.
   The rest of the paper is organized as follows. We will first describe the process in
which Event Registry identifies events from news articles. We will also describe in more
detail the process in which the articles about the same event can be linked even though
they are written in different languages. Additionally we will also describe the concept
of a topic page which can be used by readers to very specifically determine the news
articles and events of interest. We end the paper with a conclusion and some ideas for
future work.

2    Event Registry

Event Registry consists of a pipeline of services that collect, process and analyze news
articles collected globally in different languages. We will now briefly describe the
major components in the pipeline.

2.1   Collecting news

In order to collect the news we developed a service called Newsfeed [5] that monitors RSS
feeds of over 100,000 news publishers. Whenever a new article is detected in a feed, we
crawl the web page and extract from it the news article and the available metadata
information. In this way we collect daily between 200,000 and 300,000 news articles in
various languages.

2.2   Semantic enrichment

The collected news articles provide information in unstructured form which requires a
human to interpret it.
   One way in which we extract structured/semantic information from the articles is by
identifying and disambiguating relevant entities (people, locations and organizations)
and non-entities mentioned in the articles. Examples of relevant non-entities would be
things such as Zika virus, murder, movie, automobile, etc. Identification of concepts
(entities + non-entities) is done by wikification, which is a process of entity linking
that uses Wikipedia as the knowledge base. As a result, each mentioned concept is
annotated with a URI that is the link to the corresponding Wikipedia page. Since
Wikipedia provides pages for the same concept in several languages (Barack Obama has a
Wikipedia page in 225 languages), the question is which URL to take as the concept URI.
We use the link to the English Wikipedia, when it is available, and the link to the
Wikipedia in the original (article) language otherwise. "Normalizing" the concepts to the
same URI is very important since it allows the readers to find content regardless of the
language in which it is written. The URI for the concept of the Sun, for example, would
be the same, regardless of whether it is found in an English, Slovene (as ’Sonce’),
Italian (as ’Sole’) or any other language article. Along with the URI, we also compute
the relevance of the concept for the article. The relevance is computed depending on the
number of times the concept is mentioned as well as its locations in the text and can be
in the range between 1 and 5.
   Another type of semantic enrichment we perform is categorization of the news articles
based on the article’s content. Currently we categorize news articles into the DMOZ [1]
taxonomy. This taxonomy contains over a million categories, but we only consider the top
3 levels, which amounts to 5,000 categories. The taxonomy was built for organizing web
pages so it is not the optimal fit for categorizing news content. A more appropriate
categorization would be to the IPTC’s Media Topics taxonomy [2], which contains about
1,400 topics structured into 3 levels. Unfortunately we have not yet been able to obtain
an annotated corpus of articles that we could use to train the models for this taxonomy.
   Additionally we also extract from news articles all mentions of dates. Extracting
dates is relevant for the following steps when we want to determine when the event
described in the text occurred. In order to extract the dates we created an extensive set
of regular expressions for individual languages that can detect date mentions in various
forms.

2.3   Clustering of news articles

In order to group all articles that describe the same event we use an online clustering
algorithm. The clustering is applied on each language separately and in short works as
follows. Each collected article is first represented as a bag-of-words – a representation
in which we only keep an unordered list of words from the article and the number of times
they occurred in the article. After applying TF-IDF weighting we compute
the similarity of the article with centroids of existing clusters. The criteria used when
computing similarity between the article and the cluster centroid are the cosine
similarity of the text, similarity of the mentioned concepts and the date difference. If
the computed similarity of the most similar cluster is above the threshold, the article
is put into the cluster, otherwise a new (micro) cluster is created, containing only the
single article. Micro-clusters are not considered to be events until they reach a certain
number of articles. The threshold value for becoming an event depends on the language and
was empirically determined to be between 3 and 6 articles.
   News about an event is typically reported only for a limited amount of time. For this
reason we also want to remove clusters after they reach a certain age. Currently, when a
cluster becomes 5 days old we remove it, which means that new articles cannot be assigned
to it anymore. In this way we can maintain high performance of the system as well as
prevent incorrect assignments of new events to old clusters.

2.4   Construction of events

Each time a micro-cluster of articles reaches a certain size, we form in Event Registry
an event and associate it with the cluster of articles. Clustering has to be done for
each language separately so each event is initially mono-lingual. Most relevant world
events are however covered by various publishers globally that report in various
languages. To represent such clusters as a single event we use a machine learning
approach that will be described in more detail in the next section.
   Each created event is represented in Event Registry with a unique identifier that can
be used to reference it. For each event we also want to extract its core information –
what occurred, where, who was involved, etc. To determine these details we use the
available semantic and meta information provided by the articles assigned to the event.
   To determine the date of the event, we can analyze the publishing date of the articles
in the clusters. The naive approach would be to use the date of the first article as the
date of the event. In practice this approach generates erroneous results for events that
are reported in advance (such as various meetings of politicians, product announcements,
etc.) as well as when the collected publishing dates of the articles are inaccurate. A
less error-prone approach that we use is to analyze the density of reporting and use the
time point where the reporting intensified as the date of the event. Additional input can
be provided by the mentioned date references – a particular date that is consistently
mentioned across the articles is most likely the correct date of the event.
   In order to determine who is involved in the event we can analyze and aggregate the
entities mentioned in the articles. A list of entities and their associated relevance can
be obtained by analyzing the frequency of their occurrence in the articles as well as
their assigned scores. Entities can be scored and ranked according to this criterion,
which provides an accurate aggregated view on what and who the event is about.
   Location of the event is another important property. Since the event location is
commonly mentioned in the articles, we can identify it by analyzing the frequently
mentioned entities that are of type location. An additional signal for determining the
event location can be obtained by inspecting the datelines of the articles. A dateline is
a brief piece of text at the beginning of the news article that describes where and when
the described story happened. The datelines are unfortunately not present in all news
articles and even when they are, they sometimes represent the location where the story
was written and not the actual location of the event. To determine which location, if
any, is the event location, we apply an SVM classifier. Each mentioned city is considered
to be a candidate for the event location and we generate for it a set of learning
features. The features we use are based on the number of times the city is mentioned in
the articles and the number of times it is mentioned in the dateline. The SVM model that
we use was trained on 200 events for which the location was manually determined. Using
5-fold cross validation on this training data we found that the achieved classification
accuracy of the model is 98%.

3    Cross-lingual linking of clusters

Since the same events can be reported in multiple languages we need a way of identifying
clusters in different languages that are discussing the same event so that they can be
merged and represented as a single event. In short, we need an approach that given two
clusters of articles determines if they describe the same event or not.
   To perform the task we again represent it as a learning problem. From the two tested
clusters we extract a set of learning features that can be used for training a
classification model. There are three groups of learning features that we use:
   Cross-lingual article similarity. Using an approach based on CCA [3] we can compute
an estimated similarity between articles in different languages. Given this measure we
can compute how similar individual articles in one cluster are to the individual articles
in the other. From these results we can generate a number of learning features such as
the maximum similarity, the average similarity, standard deviation, etc.
   Concept-related features. Articles in Event Registry are annotated with concepts that
have language independent URIs. For each cluster, we can analyze the associated articles
and determine the top concepts based on how frequently they appear in these articles and
what their assigned scores are. Using two such weighted vectors, one for each cluster, we
can compute a list of informative features. Examples of these features include cosine and
Jaccard similarities of the two vectors. Additional features can also be computed
separately for the entities and non-entities in the vectors.
   Miscellaneous features. An additional set of features can be computed, reporting (a)
whether the event locations found for the two clusters are the same or not, (b) the
absolute difference in hours between the events in the two clusters and (c) the
similarity of the dates that are being mentioned in the articles in the two clusters.
   To evaluate how accurately we can, given these features, predict whether two clusters
are about the same event or not, we performed the following experiment. Using two human
experts we have manually annotated 808 pairs of clusters in the English, Spanish and
German languages. The dataset contained 402 examples of cluster pairs that report about
the same event and 406 examples where they do not. By training a linear SVM model and by
using a 10-fold cross validation scheme we were able to achieve 89.2% classification
accuracy.

4    Topic pages

Whenever an event is identified or updated, the information is stored in the Event
Registry. Currently, Event Registry holds information about 3.6 million events that it
identified from 88 million news articles, which were collected since January 2014. The
users can use the web interface to search for events based on various criteria, such as
relevant concepts, news sources that reported about it, location of the event, category,
date, size and others. The users can also simply observe the stream of new/updated events
as they are shown on the Event Registry home page.
   An even more useful functionality than observing the whole feed of events is the
option for the users to create their own feed of articles and events based on their own
interests. We call this functionality a topic page, where a topic can be defined using a
set of relevant concepts, keywords, news sources and/or categories. The user can define
the topic page using an interface shown in the top part of Figure 1. To each specified
concept, keyword, news source and category, the user also assigns a weight of relevance
for the topic. Each article and event that is processed by Event Registry is then scored
according to the specified criteria and only those that achieve a high enough score (a
parameter specified by the user) are then shown to the user in the feed of the topic
page.
   More specifically, the scoring is done as follows. Let’s assume that the user defines
a topic T using a set of conditions c_i, i = 1..n and their associated weights w_i, where
conditions consist of one or more concepts, keywords, news sources and/or categories. For
each new event e, a score S_T(e) is computed as

    S_T(e) = \sum_{i=1}^{n} w_i \cdot \mathrm{in}(c_i, e) \cdot \mathrm{val}(c_i, e)

    \mathrm{in}(c_i, e) = \begin{cases} 1 & c_i \in e \\ 0 & \text{otherwise} \end{cases}

    \mathrm{val}(c_i, e) = \begin{cases} e_{c_i}/100 & c_i \text{ is a concept} \\ 1 & \text{otherwise} \end{cases}

   The score S_T(e) is therefore a simple sum over all conditions, where for each
condition c_i we multiply the associated weight w_i with a Boolean function in(c_i, e)
and a scoring function val(c_i, e). Function in(c_i, e) simply determines if the
condition c_i matches the event e or not. In case the condition is a concept or a
category, the function is true when the event is annotated with it. In case the condition
is a news source, the function is true if the event contains an article written by the
news source. Lastly, in case the condition is a keyword, the function is true if the
keyword appears in any of the articles assigned to the event. The scoring function
val(c_i, e) is trivial, except in the cases when c_i is a concept. When concepts c_j are
associated with an event e, they are assigned a score e_{c_j} that is in the range
between 1 and 100, which represents how important the concept is to the event. The
function val(c_i, e) therefore simply ensures that for all conditions, the returned value
is in the range between 0 and 1. The scoring function for scoring articles is almost the
same, except that the normalization constant in function val() is 5, since each concept
in an article is assigned a score between 1 and 5. The events and articles that match the
topic page can then be visualized on a map or displayed in a feed. An example topic page
for USA presidential elections is shown in Figure 1.

Figure 1: The interface for defining the topic page (top) and the feed of current events
that match the criteria (bottom). The feed can be displayed on a map or as a list of
matching articles and events.

5    Conclusion

In this paper we have presented a system called Event Registry which fixes several
shortcomings in the ways news content is currently being consumed. Firstly, it is able to
aggregate large amounts of news articles into actual events. Instead of flipping through
tens or hundreds of articles about the same event in your news aggregator, a single item
can be shown, together with the structured information about the event (who, what, when,
where,...). If interested in the event, the user can then open the details of it and read
individual articles (even in different languages) about it. By reading multiple articles,
the user can form a more complete and unbiased view of the event than he would be able to
by just reading about it from a single news publisher. Having extensive structured
information about the events allows the users of Event Registry to also create custom
feeds based on a combination of general or long-tail topics of interest.

6    Acknowledgments

This work was supported by the Slovenian Research Agency as well as the X-Like
(ICT-288342-STREP) and xLime (ICT-611346-STREP) projects.

References

[1] DMoz, open directory project, http://www.dmoz.org/.

[2] Media topics, https://iptc.org/standards/media-topics/.

[3] S. T. Dumais, T. A. Letsche, M. L. Littman, and T. K. Landauer. Automatic
    cross-language retrieval using latent semantic indexing. In AAAI spring symposium on
    cross-language text and speech retrieval, volume 15, page 21, 1997.

[4] G. Leban et al. Event registry – learning about world events from news. In
    Proceedings of 23rd International World Wide Web Conference, 2014.

[5] M. Trampus and B. Novak. Internals of an aggre-
    gated web news feed. In Proceedings of 15th Multi-
    conference on Information Society 2012 (IS-2012),
    2012.
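
The topic-page scoring of Section 4 can be illustrated with a minimal sketch. It is not
the Event Registry implementation: the Condition and Event classes, their field names,
the example URI and source names, and the case-insensitive substring test used for
keywords are all assumptions made for illustration. Only the formula
S_T(e) = sum_i w_i * in(c_i, e) * val(c_i, e) and the normalization constants (100 for
events, 5 for articles) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    kind: str      # "concept", "category", "source" or "keyword" (hypothetical labels)
    value: str     # concept URI, category name, source name or keyword
    weight: float  # w_i, chosen by the user


@dataclass
class Event:
    concepts: dict = field(default_factory=dict)      # concept URI -> score in 1..100
    categories: set = field(default_factory=set)
    sources: set = field(default_factory=set)
    article_texts: list = field(default_factory=list)


def matches(cond, event):
    """in(c_i, e): 1 if the condition matches the event, 0 otherwise."""
    if cond.kind == "concept":
        return int(cond.value in event.concepts)
    if cond.kind == "category":
        return int(cond.value in event.categories)
    if cond.kind == "source":
        return int(cond.value in event.sources)
    if cond.kind == "keyword":
        # Assumption: a keyword "appears" if it is a case-insensitive substring.
        needle = cond.value.lower()
        return int(any(needle in text.lower() for text in event.article_texts))
    raise ValueError(f"unknown condition kind: {cond.kind}")


def value(cond, event, norm=100.0):
    """val(c_i, e): concept score divided by norm; 1 for all other condition kinds."""
    if cond.kind == "concept" and cond.value in event.concepts:
        return event.concepts[cond.value] / norm
    return 1.0


def topic_score(conditions, event, norm=100.0):
    """S_T(e) = sum over i of w_i * in(c_i, e) * val(c_i, e)."""
    return sum(c.weight * matches(c, event) * value(c, event, norm) for c in conditions)


# Usage: a topic with one concept, one keyword and one news source.
obama = "http://en.wikipedia.org/wiki/Barack_Obama"
topic = [
    Condition("concept", obama, 50.0),
    Condition("keyword", "election", 20.0),
    Condition("source", "example-news.com", 10.0),
]
event = Event(
    concepts={obama: 80},
    sources={"other-news.com"},
    article_texts=["The election campaign continues."],
)
score = topic_score(topic, event)  # concept and keyword match, the source does not
print(score)
```

For articles, the same function would be called with norm=5.0, since article-level
concept scores lie between 1 and 5. An event or article is shown in the feed only when
its score reaches the user-specified minimum.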

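The online clustering step of Section 2.3 can be sketched in a similarly simplified way.
The sketch below assumes that each article already arrives as a TF-IDF weighted
bag-of-words (a dict from term to weight) and uses only the cosine similarity of the
text; the concept similarity and date difference that the paper also uses are left out.
The Cluster class, the function names and the similarity threshold of 0.5 are
hypothetical; the 5-day cluster lifetime and the minimum of 3 articles for an event
follow the text.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    def __init__(self, vector, day):
        self.total = Counter(vector)  # running sum of the member vectors
        self.size = 1
        self.created = day

    def centroid(self):
        return {t: w / self.size for t, w in self.total.items()}

    def add(self, vector):
        self.total.update(vector)
        self.size += 1


def assign(article_vec, day, clusters, sim_threshold=0.5, max_age_days=5):
    """Put the article into the most similar live cluster or start a new micro-cluster."""
    # Clusters that have reached the maximum age no longer accept articles.
    clusters[:] = [c for c in clusters if day - c.created < max_age_days]
    best, best_sim = None, 0.0
    for c in clusters:
        sim = cosine(article_vec, c.centroid())
        if sim > best_sim:
            best, best_sim = c, sim
    if best is not None and best_sim > sim_threshold:
        best.add(article_vec)
        return best
    new_cluster = Cluster(article_vec, day)
    clusters.append(new_cluster)
    return new_cluster


def is_event(cluster, min_articles=3):
    """A micro-cluster becomes an event once it holds enough articles."""
    return cluster.size >= min_articles


# Usage: two related articles share a cluster, an unrelated one starts its own.
clusters = []
first = assign({"bowie": 2.0, "death": 1.0}, 0, clusters)
second = assign({"bowie": 1.0, "death": 1.0, "music": 0.5}, 1, clusters)
third = assign({"panda": 1.0, "snow": 1.0}, 1, clusters)
print(len(clusters), first.size, is_event(first))
```

Keeping a running sum of member vectors makes the centroid update cheap, which matters
for an online setting in which every incoming article is compared against all live
clusters.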
</pre>