=Paper=
{{Paper
|id=Vol-1568/paper6
|storemode=property
|title=Using News Articles for Real-time Cross-Lingual Event Detection and Filtering
|pdfUrl=https://ceur-ws.org/Vol-1568/paper6.pdf
|volume=Vol-1568
|authors=Gregor Leban,Blaž Fortuna,Marko Grobelnik
|dblpUrl=https://dblp.org/rec/conf/ecir/LebanFG16
}}
==Using News Articles for Real-time Cross-Lingual Event Detection and Filtering==
Gregor Leban, Blaž Fortuna, Marko Grobelnik

Jožef Stefan Institute, Ljubljana, Slovenia

gregor.leban@ijs.si, blaz.fortuna@ijs.si, marko.grobelnik@ijs.si

Copyright © 2016 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

===Abstract===
News articles are the written medium through which we most commonly learn about relevant news. Since an abundance of news articles is written daily, readers share a common problem: discovering the content of interest without being overwhelmed by its volume. In this paper we present a system called Event Registry, which is able to group articles about an event across languages and extract core event information from the articles in a structured form. In this way, the amount of content that the reader has to check is significantly reduced, while the reader is additionally provided with global coverage of each event. Since all event information is structured, this also provides extensive and fine-grained options for information searching and filtering that are not available with current news aggregators.

===1 Introduction===
News publishers produce large numbers of news articles every day. Most of these articles describe happenings that are currently occurring in the world, such as natural disasters, meetings of important politicians, crime, business and sport events. Not all reported information is equally important: some events get higher media coverage, while other events are reported only by a small set of publishers.

In order to learn about current events, people nowadays usually either go to their favorite news publisher's web site and browse through the front-page articles, or they use some type of aggregator, such as Flipboard or Bloomberg Terminal. Neither of the two approaches is optimal. By browsing a publisher's web site you typically learn about a small subset of current events (usually constrained to the geographic location of the news source), and the reports are not necessarily unbiased and objective but may instead implicitly promote political, social and religious views of the publisher or author. Using a news aggregator, on the other hand, can provide readers with coverage of the same events from multiple news sources, but unfortunately also overwhelms the reader with huge amounts of news articles (Bloomberg Terminal provides over 1 million articles daily). A news aggregator is also helpful because it usually allows one to specify a particular topic to follow, such as Business, Technology, Apple or Android. The list of topics is, however, quite narrow and does not allow one to specify long-tail interests.

In this paper we describe a system called Event Registry [4] that tries to alleviate the aforementioned issues with news consumption and is freely available at http://eventregistry.org/. Just like news aggregators, it collects news articles published globally, from more than 100,000 news sources in over 10 different languages. However, unlike the aggregators, Event Registry identifies the actual events that are being described in the articles. For Event Registry, an event is defined as any significant happening in the world that was reported in at least a few articles. Two examples of events are the death of David Bowie on Jan 11, 2016, which was reported in over 4,000 news articles, and the news, reported in 13 articles on Jan 23, 2016, that the Giant Panda in Smithsonian's National Zoo was really enjoying the snow.

Grouping news articles into events has several advantages. First, given an event, the reader can choose to read articles from various news sources that reported about the event. Providing complete and global coverage of the event allows the reader to construct an unbiased view of the event and all related details. Secondly, when browsing through the current events, the reader does not have to go through hundreds of news articles, many of which report about the same event. Instead, all articles about the same event are grouped together and shown only once, which easily reduces the amount of content by one or two orders of magnitude. Lastly, for each event in Event Registry there is also abundant semantic information that is extracted from the articles, such as the location of the event, the date, who and what the event is about, etc. This semantic information allows readers to specify their interests very precisely and get a custom-tailored feed of events and news.

The rest of the paper is organized as follows. We first describe the process by which Event Registry identifies events from news articles. We also describe in more detail how articles about the same event can be linked even though they are written in different languages. Additionally, we describe the concept of a topic page, which readers can use to specify very precisely the news articles and events of interest. We end the paper with a conclusion.

===2 Event Registry===
Event Registry consists of a pipeline of services that collect, process and analyze news articles published globally in different languages. We now briefly describe the major components of the pipeline.

====2.1 Collecting news====
In order to collect the news we developed a service called Newsfeed [5] that monitors RSS feeds of over 100,000 news publishers. Whenever a new article is detected in a feed, we crawl the web page and extract from it the news article and the available metadata. In this way we collect between 200,000 and 300,000 news articles daily in various languages.

====2.2 Semantic enrichment====
The collected news articles provide information in unstructured form, which requires a human to interpret it.

One way in which we extract structured/semantic information from the articles is by identifying and disambiguating relevant entities (people, locations and organizations) and non-entities mentioned in the articles. Examples of relevant non-entities would be things such as Zika virus, murder, movie, automobile, etc. Identification of concepts (entities + non-entities) is done by wikification, which is a process of entity linking that uses Wikipedia as the knowledge base. As a result, each mentioned concept is annotated with a URI that is the link to the corresponding Wikipedia page. Since Wikipedia provides pages for the same concept in several languages (Barack Obama has a Wikipedia page in 225 languages), the question is which URL to take as the concept URI. We use the link to the English Wikipedia when it is available, and the link to the original (article) language otherwise. "Normalizing" the concepts to the same URI is very important since it allows readers to find content regardless of the language in which it is written. The URI for the concept of the Sun, for example, would be the same regardless of whether it is found in an English, Slovene (as 'Sonce'), Italian (as 'Sole') or any other language article. Along with the URI, we also compute the relevance of the concept for the article. The relevance is computed from the number of times the concept is mentioned as well as its locations in the text, and is in the range between 1 and 5.

Another type of semantic enrichment we perform is categorization of the news articles based on their content. Currently we categorize news articles into the DMOZ [1] taxonomy. This taxonomy contains over a million categories, but we only consider the top 3 levels, which amounts to 5,000 categories. The taxonomy was built for organizing web pages, so it is not an optimal fit for categorizing news content. A more appropriate choice would be the IPTC's Media Topics taxonomy [2], which contains about 1,400 topics structured into 3 levels. Unfortunately we have not yet been able to obtain an annotated corpus of articles that we could use to train the models for this taxonomy.

Additionally, we extract all mentions of dates from news articles. Extracting dates is relevant for the following steps, when we want to determine when the event described in the text occurred. In order to extract the dates we created an extensive set of regular expressions for individual languages that can detect date mentions in various forms.

====2.3 Clustering of news articles====
In order to group all articles that describe the same event we use an online clustering algorithm. The clustering is applied to each language separately and, in short, works as follows. Each collected article is first represented as a bag-of-words, a representation in which we only keep an unordered list of words from the article and the number of times they occurred in the article. After applying TF-IDF weighting we compute the similarity of the article to the centroids of existing clusters. The criteria used when computing the similarity between the article and a cluster centroid are the cosine similarity of the text, the similarity of the mentioned concepts and the date difference. If the computed similarity to the most similar cluster is above the threshold, the article is put into that cluster; otherwise a new (micro) cluster is created, containing only the single article. Micro clusters are not considered to be events until they reach a certain number of articles. The threshold value for becoming an event depends on the language and was empirically determined to be between 3 and 6 articles.

News about an event is typically reported only for a limited amount of time. For this reason we also want to remove clusters after they reach a certain age. Currently, when a cluster becomes 5 days old we remove it, which means that new articles can no longer be assigned to it. In this way we can maintain high performance of the system as well as prevent incorrect assignments of articles about new events to old clusters.

====2.4 Construction of events====
Each time a micro-cluster of articles reaches a certain size, we form an event in Event Registry and associate it with the cluster of articles. Clustering has to be done for each language separately, so each event is initially mono-lingual. Most relevant world events are, however, covered by various publishers globally that report in various languages. To represent such clusters as a single event we use a machine learning approach that is described in more detail in the next section.

Each created event is represented in Event Registry with a unique identifier that can be used to reference it. For each event we also want to extract its core information: what occurred, where, who was involved, etc. To determine these details we use the available semantic and meta information provided by the articles assigned to the event.

To determine the date of the event, we can analyze the publishing dates of the articles in the cluster. The naive approach would be to use the date of the first article as the date of the event. In practice this approach generates erroneous results for events that are reported in advance (such as various meetings of politicians, product announcements, etc.) as well as when the collected publishing dates of the articles are inaccurate. A less error-prone approach, which we use, is to analyze the density of reporting and use the time point where the reporting intensified as the date of the event. Additional input can be provided by the mentioned date references: a particular date that is consistently mentioned across the articles is most likely the correct date of the event.

In order to determine who is involved in the event we can analyze and aggregate the entities mentioned in the articles. A list of entities and their associated relevance can be obtained by analyzing the frequency of their occurrence in the articles as well as their assigned scores. Entities can be scored and ranked according to this criterion, which provides an accurate aggregated view of what and who the event is about.

The location of the event is another important property. Since the event location is commonly mentioned in the articles, we can identify it by analyzing the frequently mentioned entities that are of type location. An additional signal for determining the event location can be obtained by inspecting the datelines of the articles. A dateline is a brief piece of text at the beginning of a news article that describes where and when the described story happened. Datelines are unfortunately not present in all news articles, and even when they are, they sometimes represent the location where the story was written and not the actual location of the event. To determine which location, if any, is the event location, we apply an SVM classifier. Each mentioned city is considered to be a candidate for the event location and we generate a set of learning features for it. The features we use are based on the number of times the city is mentioned in the articles and the number of times it is mentioned in the dateline. The SVM model that we use was trained on 200 events for which the location was manually determined. Using 5-fold cross validation on this training data we found that the classification accuracy of the model is 98%.

===3 Cross-lingual linking of clusters===
Since the same event can be reported in multiple languages, we need a way of identifying clusters in different languages that discuss the same event so that they can be merged and represented as a single event. In short, we need an approach that, given two clusters of articles, determines whether they describe the same event or not.

We again represent the task as a learning problem. From the two tested clusters we extract a set of learning features that can be used for training a classification model. There are three groups of learning features that we use:

'''Cross-lingual article similarity.''' Using an approach based on CCA [3] we can compute an estimated similarity between articles in different languages. Given this measure we can compute how similar individual articles in one cluster are to the individual articles in the other. From these results we can generate a number of learning features such as the maximum similarity, the average similarity, the standard deviation, etc.

'''Concept-related features.''' Articles in Event Registry are annotated with concepts that have language-independent URIs. For each cluster, we can analyze the associated articles and determine the top concepts based on how frequently they appear in these articles and what their assigned scores are. Using two such weighted vectors, one for each cluster, we can compute a list of informative features. Examples of these features include the cosine and Jaccard similarities of the two vectors. Additional features can also be computed separately for the entities and non-entities in the vectors.

'''Miscellaneous features.''' An additional set of features can be computed reporting (a) whether the event locations found for the two clusters are the same or not, (b) the absolute difference in hours between the events in the two clusters and (c) the similarity of the dates that are mentioned in the articles in the two clusters.

To evaluate how accurately we can, given these features, predict whether two clusters are about the same event or not, we performed the following experiment. Two human experts manually annotated 808 pairs of clusters in English, Spanish and German. The dataset contained 402 examples of cluster pairs that report about the same event and 406 examples where they do not. By training a linear SVM model and using a 10-fold cross validation schema we were able to achieve 89.2% classification accuracy.

===4 Topic pages===
Whenever an event is identified or updated, the information is stored in Event Registry. Currently, Event Registry holds information about 3.6 million events that it identified from 88 million news articles collected since January 2014. Users can use the web interface to search for events based on various criteria, such as relevant concepts, news sources that reported about the event, location of the event, category, date, size and others. Users can also simply observe the stream of new/updated events as they are shown on the Event Registry home page.

An even more useful functionality than observing the whole feed of events is the option for users to create their own feed of articles and events based on their own interests. We call this functionality a topic page, where a topic can be defined using a set of relevant concepts, keywords, news sources and/or categories. The user can define the topic page using the interface shown in the top part of Figure 1. To each specified concept, keyword, news source and category, the user also assigns a weight of relevance for the topic. Each article and event that is processed by Event Registry is then scored according to the specified criteria, and only those that achieve a high enough score (a parameter specified by the user) are shown to the user in the feed of the topic page.

More specifically, the scoring is done as follows. Let us assume that the user defines a topic <math>T</math> using a set of conditions <math>c_i</math>, <math>i = 1..n</math>, and their associated weights <math>w_i</math>, where the conditions consist of one or more concepts, keywords, news sources and/or categories. For each new event <math>e</math>, a score <math>S_T(e)</math> is computed as

<math>S_T(e) = \sum_{i=1}^{n} w_i \cdot \mathrm{in}(c_i, e) \cdot \mathrm{val}(c_i, e)</math>

<math>\mathrm{in}(c_i, e) = \begin{cases} 1 & c_i \in e \\ 0 & \text{otherwise} \end{cases}</math>

<math>\mathrm{val}(c_i, e) = \begin{cases} e_{c_i}/100 & c_i \text{ is a concept} \\ 1 & \text{otherwise} \end{cases}</math>

The score <math>S_T(e)</math> is therefore a simple sum over all conditions, where for each condition <math>c_i</math> we multiply the associated weight <math>w_i</math> by a Boolean function <math>\mathrm{in}(c_i, e)</math> and a scoring function <math>\mathrm{val}(c_i, e)</math>. The function <math>\mathrm{in}(c_i, e)</math> simply determines whether the condition <math>c_i</math> matches the event <math>e</math> or not. If the condition is a concept or a category, the function is true when the event is annotated with it. If the condition is a news source, the function is true if the event contains an article written by the news source. Lastly, if the condition is a keyword, the function is true if the keyword appears in any of the articles assigned to the event. The scoring function <math>\mathrm{val}(c_i, e)</math> is trivial, except when <math>c_i</math> is a concept. When concepts <math>c_j</math> are associated with an event <math>e</math>, they are assigned a score <math>e_{c_j}</math> in the range between 1 and 100, which represents how important the concept is to the event. The function <math>\mathrm{val}(c_i, e)</math> therefore simply ensures that for all conditions the returned value is in the range between 0 and 1. The scoring function for articles is almost the same, except that the normalization constant in the function val() is 5, since each concept in an article is assigned a score between 1 and 5. The events and articles that match the topic page can then be visualized on a map or displayed in a feed. An example topic page for the USA presidential elections is shown in Figure 1.

Figure 1: The interface for defining the topic page (top) and the feed of current events that match the criteria (bottom). The feed can be displayed on a map or as a list of matching articles and events.

===5 Conclusion===
In this paper we have presented a system called Event Registry, which addresses several shortcomings in the way news content is currently being consumed. Firstly, it is able to aggregate large amounts of news articles into actual events. Instead of flipping through tens or hundreds of articles about the same event in a news aggregator, a single item can be shown, together with the structured information about the event (who, what, when, where, ...). If interested in the event, the user can then open its details and read individual articles about it, even in different languages. By reading multiple articles, the user can form a more complete and unbiased view of the event than would be possible by reading about it from a single news publisher. Having extensive structured information about the events also allows the users of Event Registry to create custom feeds based on a combination of general or long-tail topics of interest.

===6 Acknowledgments===
This work was supported by the Slovenian Research Agency as well as the X-Like (ICT-288342-STREP) and xLime (ICT-611346-STREP) projects.

===References===
[1] DMoz, open directory project, http://www.dmoz.org/.

[2] Media topics, https://iptc.org/standards/media-topics/.

[3] S. T. Dumais, T. A. Letsche, M. L. Littman, and T. K. Landauer. Automatic cross-language retrieval using latent semantic indexing. In AAAI spring symposium on cross-language text and speech retrieval, volume 15, page 21, 1997.

[4] G. Leban et al. Event registry – learning about world events from news. In Proceedings of 23rd International World Wide Web Conference, 2014.

[5] M. Trampus and B. Novak. Internals of an aggregated web news feed. In Proceedings of 15th Multiconference on Information Society 2012 (IS-2012), 2012.
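The online clustering described in Section 2.3 can be illustrated with a minimal sketch. It is not the Event Registry implementation: it assumes articles arrive as precomputed TF-IDF vectors (plain dictionaries), uses only the cosine similarity of the text (the paper also combines concept similarity and date difference), and the class name, the similarity threshold of 0.5 and the representation of time as a day number are all hypothetical choices made for illustration. The 5-day expiry and the minimum event size of 3 follow the values mentioned in the text.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts term -> weight."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class OnlineClusterer:
    """Hypothetical single-language online clusterer (illustrative only)."""

    def __init__(self, threshold=0.5, min_event_size=3, max_age_days=5):
        self.threshold = threshold            # assumed value, not from the paper
        self.min_event_size = min_event_size  # paper: between 3 and 6 articles
        self.max_age_days = max_age_days      # paper: clusters removed after 5 days
        self.clusters = []

    def add(self, vector, day):
        """Assign one article (TF-IDF dict) seen at time `day` to a cluster."""
        # Drop clusters that are too old to receive new articles.
        self.clusters = [c for c in self.clusters
                         if day - c["created"] < self.max_age_days]
        best = max(self.clusters,
                   key=lambda c: cosine(vector, c["centroid"]),
                   default=None)
        if best is not None and cosine(vector, best["centroid"]) > self.threshold:
            # The centroid is kept as a running sum; cosine ignores its scale.
            best["centroid"].update(vector)
            best["size"] += 1
        else:
            best = {"centroid": Counter(vector), "size": 1, "created": day}
            self.clusters.append(best)
        return best

    def events(self):
        """Micro clusters that have grown large enough to count as events."""
        return [c for c in self.clusters if c["size"] >= self.min_event_size]
```

Keeping the centroid as a running sum rather than a mean is a design shortcut: cosine similarity is invariant to scaling, so the division by cluster size can be skipped.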
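The concept-related features of Section 3 (cosine and Jaccard similarity between the weighted concept vectors of two clusters) can be sketched as follows. This is an illustration under stated assumptions, not the authors' feature extractor: the aggregation by summing per-article relevance scores, the `top_k` cut-off of 20 and all function names are hypothetical. Each article is assumed to be a dictionary mapping a language-independent concept URI to its relevance in the range 1 to 5.

```python
import math
from collections import Counter


def cluster_concept_vector(articles, top_k=20):
    """Sum concept relevance over a cluster's articles and keep the top concepts."""
    totals = Counter()
    for concepts in articles:
        totals.update(concepts)  # adds the relevance scores per URI
    return dict(totals.most_common(top_k))


def cosine(a, b):
    """Cosine similarity of two weighted concept vectors."""
    dot = sum(w * b.get(uri, 0.0) for uri, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Jaccard similarity of the concept sets, ignoring the weights."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def concept_features(cluster_a, cluster_b, top_k=20):
    """Two of the concept-related learning features for a pair of clusters."""
    va = cluster_concept_vector(cluster_a, top_k)
    vb = cluster_concept_vector(cluster_b, top_k)
    return {"cosine": cosine(va, vb), "jaccard": jaccard(va, vb)}
```

Because the concept URIs are shared across languages, the same functions apply unchanged to, for example, an English and a German cluster.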
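The topic-page score defined in Section 4 is a weighted sum over the conditions that match an event, and can be written down directly. The sketch below follows the definitions of in() and val() given in the text; the data layout (`Condition` and `Event` dataclasses, field names, case-insensitive substring matching for keywords) is an assumption made for illustration and is not taken from Event Registry.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    kind: str      # "concept", "category", "source" or "keyword"
    value: str     # concept URI, category name, source name or keyword
    weight: float  # w_i, assigned by the user


@dataclass
class Event:
    concepts: dict = field(default_factory=dict)  # concept URI -> score e_c
    categories: set = field(default_factory=set)
    sources: set = field(default_factory=set)
    texts: list = field(default_factory=list)     # texts of assigned articles


def matches(c, e):
    """Boolean function in(c_i, e): does condition c match event e?"""
    if c.kind == "concept":
        return c.value in e.concepts
    if c.kind == "category":
        return c.value in e.categories
    if c.kind == "source":
        return c.value in e.sources
    if c.kind == "keyword":
        return any(c.value.lower() in t.lower() for t in e.texts)
    raise ValueError(f"unknown condition kind: {c.kind}")


def val(c, e, norm=100.0):
    """Scoring function val(c_i, e): normalized concept score, else 1."""
    if c.kind == "concept":
        return e.concepts.get(c.value, 0) / norm
    return 1.0


def topic_score(conditions, e, norm=100.0):
    """S_T(e) = sum of w_i * in(c_i, e) * val(c_i, e) over all conditions."""
    return sum(c.weight * val(c, e, norm) for c in conditions if matches(c, e))
```

For articles, the same function can be called with `norm=5.0`, matching the 1 to 5 relevance range of article concepts. An item would be shown in the feed when its score exceeds the user-specified threshold.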