Event Related Document Retrieval with Multilingual Real World Event Representation Guillaume Bernard1[0000−0001−5945−4865] , Cyrille Suire1 , Cyril Faucher1 , and Antoine Doucet1[0000−0001−6160−3356] Université de La Rochelle, Laboratoire L3i, 17000 La Rochelle, France {guillaume.bernard,cyrille.suire,cyril.faucher,antoine.doucet}@univ-lr.fr https://l3i.univ-larochelle.fr/ Abstract. This demonstration paper introduces a tool to analyse his- torical digital libraries with the benefit of publicly available knowledge bases, such as Wikidata and Wikipedia. In this paper, we focus on real- world-events of the past, such as festivals or assassinations. We introduce our method which merges knowledge from Wikidata and Wikipedia ar- ticle summaries to gather entities involved in events, dates, types and la- bels. We hereby present the Web tool we designed and the implemented method to characterise events. It integrates easily into any event oriented pipeline: in our demonstration, we use it to find event related documents in the NewsEye corpus 1 . Keywords: Event · Event Representation · Information Retrieval · Linked and Open Data 1 Introduction We propose a software library 2 able to extract event characteristics from at least two knowledge bases (KB), Wikidata and Wikipedia. Used together, they contain sufficient data to characterise real world historical events. In this paper, we refer to events as happenings in the real world which have spatio-temporal anchors and additional entities involved in it. We focus on participating entities which are dates, places and participants [8]. In this paper, we also release a Web demonstrator that is able to build a language-independent, representation of events and to use it in order to query event related documents from a large his- torical news corpora. For a given language, it provides all the available spellings for the event type and the associated entities. This tool can therefore be useful for improving event-based search engines performance for digital libraries. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 The NewsEye corpus is available for anyone at https://platform.newseye.eu 2 The package is a Python 3 library called wikivents on Pypi.org and available on the Software Heritage repository at https://archive.softwareheritage.org/swh: 1:dir:ef325a054ba6f7eb1121807da7b1c92b9ecde8f8 Guillaume Bernard, Cyrille Suire, Cyril Faucher, and Antoine Doucet With such a tool, we propose to analyse a large corpora of historical docu- ments. The method we propose in this paper will help filtering documents on whether they report a specific historical event. To the best of our knowledge, such a tool does not exist. We propose a method to describe events from knowledge bases, and the interface to use it and find event related documents. In this paper, we will take the example of the Assassination of Rasputin event, a murder that occurred in December 1916 in the Russian Empire. 2 Build a Language-independent Event Representation Our methodology uses an ontology (i.e. Wikidata) as a primary source of in- formation to obtain fundamental event characteristics: type and dates or times. It also processes semi-structured databases (i.e. Wikipedia and others) to iden- tify involved entities [8]. We first focused on Wikidata for the links it has with the Wikimedia ecosystem, especially the links between Wikidata entities and Wikipedia articles in all languages. This software supplies a solution when it is either too long or complicated to aggregate event characteristics manually. It enables to add more data sources (e.g. EventKG [3] or DBPedia [2]) to refine the event representation. It can easily integrate a dedicated natural language processing pipeline focused on real-world event analysis. We take the example of researchers wishing to gather all the characteristics about political assassinations provided by a KB. There is a total of 78 entities of this type in Wikidata (this means only 78 Wikidata entries are “instances of” political assassinations), over 952.351 events. Other event types are available, but we focus on what we qualify of indisputable an event that is considered as such for people with various backgrounds (history scholars [7] or NLP reseachers [1, 6], for instance). Political assassinations are an example of an indisputable event type. For each of them, researchers collect the event dates, types and labels from a public API. The added value of this tool is that it is able to gather these information from the ontology and all the entities mentioned in the associated semi-structured databases. EventKG or Wikipedia are not exhaustive databases and some properties, as well as named entities may miss. 2.1 Ontologies: the Extraction of Elementary Event Information Ontologies such has EventKG [3], YAGO [4] or Wikidata provide different event identifiers which reveal subtleties. In this paper, we conform to the Wikidata event type (WET) definition [5], which includes both Wikidata event types Q1656682 for breaking events and Q1190554 for events without premisses. Our software takes Wikidata entities identifiers as input. Q2882749 is the input value for our example, the assassination of Rasputin. The entity is checked to confirm it is an instance of a WET. If so, the dates (in the Gregorian calendar), the event locations and labels are saved. These generic properties discriminate two similar events: it is unlikely that two distinct events have the same labels, Document Retrieval with Multilingual Real World Event Representation types, and occurred at the same place at the same time. Other properties (the target for a political assassination for instance) do depend on the WET and are often missing. To overcome this limitation, we propose to analyse Wikipedia lead sections as well, looking for relevant named entities. In our example, we get the entity labels as strings for every language, the date and the linked entities, identified by their URIs. 2.2 Semi-structured data sources: the Extraction of Entities Involved in the Event The event representation is supplemented with entities extracted from semi- structured data-sources. In this work, we analyse the lead sections of Wikipedia articles. We focus on those written in five well represented languages that are English, French, German, Spanish and Italian. Languages are chosen arbitrarily from the top-10 list of biggest Wikipedia versions, excluding ones written by bots. For each lead section, involved entities [8] represented as Wikipedia internal links are kept and the corresponding identifiers in the ontology saved. To show the relevance of entities in relation to the event, the number of occur- rences found in lead sections is counted. Entities found in multiple lead sections are important in the event description, synthesise historical knowledge and give an unbiased information about their implication in the event. In our example, there only exists Wikipedia articles written in Spanish and French. From them, we extract, for instance, these triples: (PER, Q312997 [Felix Yusupov, perpetra- tor], 3), (PER, Q43989 [Grigori Rasputin, target], 2), (GPE, Q34266 [Russian Empire], 1). The weights, respectively 3, 2 and 1 show that knowing who mur- dered the victim is more pertinent than where it happened. Beside of those entities, the system found 7 people involved, 5 organisations and 4 geo-political entities. There are respectively only 3, 0 and 3 where analysing Wikidata only. 2.3 Event Description in Understandable Languages The event representation then consists of an association of absolute properties such as dates and links to knowledge bases: it is language independent. In most cases, ontologies provide multiple names in different languages (i.e. with different writings) for each entity. In our example, in French the entity Q312997 is written Félix Youssoupoff or Felix Youssoupov. A final step transforms an abstract event representation, based on KB iden- tifiers, into a language-dependent description. It extracts all the alternative spellings for every entity involved in the event. The software makes it possi- ble to get the event description in Italian even if, in this example, only French and Spanish Wikipedias were analysed. 3 Limitations and Opportunities It may happen that for some entities, there does not exist any language-specific spelling in any ontology (e.g. the entity Q3129997, Felix Yussopov has no spelling Guillaume Bernard, Cyrille Suire, Cyril Faucher, and Antoine Doucet given in some languages). This case is rare but may be encountered when pro- cessing events where local entities are involved, less known to speakers of other languages. We only took into account a list of arbitrarily chosen languages, with- out capitalising on the languages spoken where the event happened, excluding de facto Asian and African languages. The process has drawbacks: SPARQL queries are used to retrieve event data and the procedure can be slow due to the multiple API calls. Our caching imple- mentation accelerates the process by a factor of almost 20 (from one minute to five seconds with the running example). We previously mentioned the possibility to extend the tool capabilities. This way, it is easy to locally process any triple store, such as a Wikidata dump, to overcome this limitation. Evaluation of such a tool is hard, due to the lack of annotated data. We propose, to evaluate this tool and the efficiency of the event representation to use datasets such as those from the Topic Detection and Tracking (TDT) program [1] or the event-centered dataset from Event Registry [6] which is a collection of news articles, in multiple languages with event annotations. 4 Conclusions and Future Work Fig. 1. Web Event exporter, displaying the Assassination of Rasputine in French Document Retrieval with Multilingual Real World Event Representation In this paper, we release a set of tools, including a Web front-end to the library, able to gather into one specific language, all the event related knowledge that exists. We provide an easy way to analyse events described on the Inter- net. The library can be easily extended and integrating more data sources (i.e. ontologies) is made easy. Extensive technical information may be found on the Pypi project page, as well as real world examples and outputs. It is already in use as the entry point of a pipeline in a multilingual historical event based search engine which indexes millions of historical documents. For this demonstration, we used a private access to the NewsEye corpus to build queries with information collected from Wikidata and Wikipedia. Similarly to a search engine query, we weighted the entities according to their importance related to the event. The query is a conjunction of all weighted terms found by the wikivents library. 5 Acknowledgments This work has been supported by the European Union’s Horizon 2020 research and innovation program under grants 770299 (NewsEye) and 825153 (Embed- dia). References 1. Allan, J.: Introduction to Topic Detection and Tracking. In: Topic Detection And Tracking: Event-Based Information Organization, pp. 1–16 (2002) 2. Färber, M., Bartscherer, F., Menne, C., Rettinger, A.: Linked data quality of DB- pedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web 9(1), 77–129 (Nov 2017). https://doi.org/10.3233/SW-170275 3. Gottschalk, S., Demidova, E.: EventKG: A Multilingual Event-Centric Tempo- ral Knowledge Graph. Lecture Notes in Computer Science 10843 (Jun 2018). https://doi.org/10.1007/978-3-319-93417-41 8 4. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194, 28–61 (Jan 2013). https://doi.org/10.1016/j.artint.2012.06.001 5. Rudnik, C., Ehrhart, T., Ferret, O., Teyssou, D., Troncy, R., Tannier, X.: Search- ing News Articles Using an Event Knowledge Graph Leveraged by Wikidata. In: Companion Proceedings of The 2019 World Wide Web Conference on - WWW ’19 (2019). https://doi.org/10.1145/3308560.3316761 6. Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B., Grobelnik, M.: News Across Languages - Cross-Lingual Document Similarity and Event Tracking. Journal of Artificial Intelligence Research 55, 283–316 (Jan 2016). https://doi.org/10.1613/jair.4780 7. Shaw, R.: A Semantic Tool for Historical Events. In: Proceedings of the The 1st Workshop on EVENTS: Definition, Detection, Coreference, and Representation. pp. 38–46. Atlanta, Georgia, USA (Jun 2013) 8. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of the 27th International Conference on Computational Linguistics (Aug 2018), https://www.aclweb.org/anthology/ C18-1182