Cross-lingual detection of world events from news articles

Gregor Leban, Blaž Fortuna, Janez Brank, and Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia
{firstname.surname}@ijs.si

Abstract. In this demo we describe a system called Event Registry (http://eventregistry.org) that can identify world events from news articles. Events can be detected in different languages. For each event, the system extracts the core event information and stores it in a structured form that allows advanced search options. Numerous visualizations are provided for presenting the search results.

Keywords: information extraction; cross-linguality; events; named entity detection; clustering; news

1 Introduction

Each day thousands of small, medium and large events happen around the world. They range from insignificant meetings, conferences and sport events to important natural disasters and decisions made by world leaders. We learn about these events from different media channels. The unstructured nature of the content provided on these channels is suitable for humans, but hard for computers to understand.

Identifying events described in the news and extracting the main information about them is the main goal of the system presented in this paper. Event Registry [2] is able to process news articles published worldwide in different languages. By analyzing the articles it can identify the mentioned events and extract the main event information. The extracted information is stored in a structured way that provides unique features such as searching for events by date, location, entity, topic and other event properties. Besides finding events of interest, Event Registry also provides a user interface with numerous visualizations showing date, location, concept and topic aggregates of the events that match the search criteria.

The rest of the paper is organized as follows. We start by describing the high-level architecture of the system. Next, we describe in more detail how events are identified from individual news articles. We continue with a description of how event information is extracted from a group of articles mentioning the event. Finally, we describe the main features of the frontend interface of Event Registry.

2 System architecture

Event Registry consists of a pipeline of components that each provide unique and relevant features for the system. In order to identify events we first need data from which the information can be extracted. Event Registry uses news articles as the data source. The articles are collected using the NewsFeed [5] service, which gathers news articles from more than 75,000 news sources worldwide. The collected articles are in more than 40 languages, with English, Spanish, German and Chinese articles amounting to about 70% of the total. These four languages are also the only ones used in Event Registry.

Each collected article in the mentioned languages is then analyzed in order to extract relevant information. One of the key tasks is the identification and disambiguation of named entities and topics mentioned in the article. We perform this task using a semantic enrichment service developed in the XLike project. We also detect date mentions in the text, since they frequently specify the date when the event occurred.

By analyzing the articles we noticed that different news sources frequently publish almost identical news articles. These duplicated articles do not bring any new information, which is why we identify them and ignore them in the rest of the pipeline.
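The paper does not specify how near-duplicate articles are detected, so the following is only a minimal illustrative sketch of one common approach: dropping an article whose TF-IDF vector is almost identical to one already kept. The threshold and vectorizer settings are assumptions, not the ones used in Event Registry.

```python
# Illustrative near-duplicate filter (not the Event Registry implementation):
# an article is dropped if its TF-IDF vector has cosine similarity above a
# high threshold with an article that was already kept.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_duplicates(articles, threshold=0.95):
    """Return the subset of `articles` (list of strings) with near-duplicates removed."""
    vectors = TfidfVectorizer().fit_transform(articles)
    kept_texts, kept_idx = [], []
    for i, text in enumerate(articles):
        if kept_idx:
            sims = cosine_similarity(vectors[i], vectors[kept_idx])
            if sims.max() >= threshold:
                continue  # nearly identical to an article we already kept
        kept_texts.append(text)
        kept_idx.append(i)
    return kept_texts
```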
An important feature of Event Registry is cross-linguality. To support it, we identify for each article a set of the most similar articles in other languages. These articles are identified using canonical correlation analysis [4], which maps articles from different languages into a common semantic space. The common space can then be used to find the most similar articles in all other languages. The mapping to the common space was trained on an aligned corpus of Wikipedia articles.

After extracting the relevant features from each individual article we proceed to the main task of identifying events from groups of articles. Since this is the main contribution of the paper, we describe the details of the process in the next three sections.

3 Identification of events

An assumption we make when identifying events is that any relevant event is reported by at least a few different news publishers. To identify events we therefore apply an online clustering algorithm based on [1] to the articles as they are added into the system. Each article is first transformed into a TF-IDF weighted feature vector. The features in the vector are the words in the document as well as the identified named entities and topics. If the cosine similarity to the closest cluster centroid is above a threshold, the article is added to that cluster and the cluster properties are updated; otherwise, a new cluster is formed that contains only the new article.

Each identified cluster of articles is considered to describe an event if it contains at least a minimum number of articles (the minimum value used in our system is 5 articles). Once a cluster reaches this limit, we treat it as an event and the information about its articles is sent to the next components in the pipeline. Those components are responsible for extracting event information from the articles and are described in the next sections.

Since clusters are constantly being updated with new articles, we re-evaluate each cluster after a few updates in order to determine whether it should be split into two clusters or merged with another cluster. To decide whether a cluster should be split, we apply a bisecting k-means algorithm (with k = 2) to the cluster and then use a variant of the Bayesian Information Criterion to decide whether to accept the new split. Periodically we also identify pairs of clusters with high similarity and decide whether they should be merged, using Lughofer's ellipsoid criterion [3]. Clusters that have not been modified for a few days are assumed to describe past events and are removed from the list of maintained clusters.
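A minimal sketch of the online clustering step described above is given below. The similarity threshold, the construction of the article vectors and the 5-article event limit are simplified placeholders; the sketch only illustrates the assign-to-closest-centroid-or-start-a-new-cluster logic.

```python
# Sketch of the online clustering step: each new article vector either joins
# the closest existing cluster (if similar enough) or starts a new cluster.
import numpy as np

class OnlineClusterer:
    def __init__(self, similarity_threshold=0.4, min_event_size=5):
        self.threshold = similarity_threshold
        self.min_event_size = min_event_size
        self.centroids = []   # running mean vector of each cluster
        self.members = []     # article ids assigned to each cluster

    def add(self, article_id, vector):
        """Assign an L2-normalized TF-IDF article vector (1-D numpy array) to a cluster."""
        if self.centroids:
            sims = [float(vector @ c) / (np.linalg.norm(c) + 1e-12) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = len(self.members[best])
                # update the centroid as the running mean of member vectors
                self.centroids[best] = (self.centroids[best] * n + vector) / (n + 1)
                self.members[best].append(article_id)
                return best
        self.centroids.append(vector.copy())
        self.members.append([article_id])
        return len(self.centroids) - 1

    def events(self):
        """Return clusters large enough to be treated as events."""
        return [m for m in self.members if len(m) >= self.min_event_size]
```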
3.1 Cross-lingual merging of clusters

Each identified cluster contains articles in a single language only. Since articles in different languages can describe the same events, we want to identify clusters describing the same event in different languages and represent them as a single event. We cast the decision of which cluster pairs to merge as a binary classification problem: given a cluster pair c1 and c2 in languages l1 and l2, we want to extract a set of features that discriminate between cluster pairs that describe the same event and those that do not.

A very important learning feature for each cluster pair is computed by inspecting the individual articles in each cluster. Using canonical correlation analysis we obtain, for each article in c1, a list of the 10 most similar articles in language l2, and check how many of them are in c2. We repeat the same computation for the articles in c2. By normalizing the results by the sizes of the clusters we obtain a score that should intuitively correlate with the similarity of the two clusters across the two languages. Other features that we extract include the time difference between the average article dates of the two clusters, the cosine similarity of the annotated concepts and topics, and category similarity.

To build a classification model we collected 85 learning examples. Some cluster pairs were selected randomly and some were selected by users based on their similarity. The features for the selected learning examples were extracted automatically, while the correct class was assigned manually. A linear SVM trained on this data achieved 87% accuracy using 10-fold cross-validation.

4 Event information extraction

Once we obtain one or more new clusters that are believed to describe a single event, we assign them a new id in Event Registry. From the associated articles we then try to extract the relevant information about the event. We determine the event date by checking whether a common date reference is frequently mentioned in the articles; if no date is mentioned frequently enough, we use the average published date of the articles as the event date. The location of the event is determined by locating frequently mentioned named entities that represent locations. To determine what the event is about we aggregate the named entities and topics identified in the articles. Each event is also categorized using the DMoz taxonomy. All extracted information is stored in Event Registry in a structured form that provides rich search and visualization capabilities.

5 Event search, visualization options and data accessibility

Event Registry is available at http://eventregistry.org and currently contains 15 million articles from which we have identified about 1 million events. Available search options include search by relevant named entities, keywords, publishers, event location, date and category. The events that match the criteria can then be viewed as a list or using one of numerous visualizations. The main visualizations of search results include location and time aggregates, lists of top named entities and topics, a graph of related entities, concept trends, a concept matrix, date mentions, clusters of events and event categories. For each individual event we provide the list of articles describing it as well as visualizations of its concepts, article timeline, date mentions, article sources and other similar events. Examples of these visualizations are (due to space limits) available at http://eventregistry.org/screens. All Event Registry data is also stored using the Storyline ontology and is available through a SPARQL endpoint at http://eventregistry.org/sparql.
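Since the paper does not describe the Storyline ontology schema exposed at the endpoint, the sketch below uses a deliberately schema-agnostic query to illustrate how the SPARQL endpoint mentioned above could be accessed; the endpoint URL is taken from the paper, while the query and result handling are purely illustrative and assume the endpoint is reachable.

```python
# Illustrative access to the SPARQL endpoint mentioned in Section 5.
# The query is schema-agnostic; a real query would use the Storyline
# ontology's predicates. Requires the SPARQLWrapper package.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://eventregistry.org/sparql")
endpoint.setQuery("""
    SELECT ?subject ?predicate ?object
    WHERE { ?subject ?predicate ?object }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["subject"]["value"],
          binding["predicate"]["value"],
          binding["object"]["value"])
```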
6 Acknowledgments

This work was supported by the Slovenian Research Agency and X-Like (ICT-288342-STREP).

References

1. C. C. Aggarwal and P. Yu. A framework for clustering massive text and categorical data streams. In Proceedings of the Sixth SIAM International Conference on Data Mining, volume 124, pages 479–483, 2006.
2. G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event Registry – learning about world events from news. In WWW 2014 Proceedings, pages 107–110. ACM, 2014.
3. E. Lughofer. A dynamic split-and-merge approach for evolving cluster models. Evolving Systems, 3(3):135–151, 2012.
4. J. Rupnik, A. Muhic, and P. Skraba. Cross-lingual document retrieval through hub languages. In NIPS, 2012.
5. M. Trampus and B. Novak. Internals of an aggregated web news feed. In Proceedings of the 15th Multiconference on Information Society (IS-2012), 2012.