<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Event Related Document Retrieval with Multilingual Real World Event Representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guillaume Bernard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cyrille Suire</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cyril Faucher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoine Doucet</string-name>
          <email>antoine.doucet@univ-lr.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de La Rochelle, Laboratoire L3i, 17000 La Rochelle, France, https://l3i.univ-larochelle.fr/</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This demonstration paper introduces a tool to analyse historical digital libraries with the benefit of publicly available knowledge bases, such as Wikidata and Wikipedia. We focus on real-world events of the past, such as festivals or assassinations. We introduce our method, which merges knowledge from Wikidata and Wikipedia article summaries to gather the entities involved in events, as well as their dates, types and labels. We present the Web tool we designed and the method implemented to characterise events. It integrates easily into any event-oriented pipeline: in our demonstration, we use it to find event-related documents in the NewsEye corpus<sup>1</sup>.</p>
      </abstract>
      <kwd-group>
        <kwd>Event</kwd>
        <kwd>Event Representation</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Linked and Open Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        We propose a software library<sup>2</sup> able to extract event characteristics from at
least two knowledge bases (KB), Wikidata and Wikipedia. Used together, they
contain sufficient data to characterise real-world historical events. In this paper,
we refer to events as happenings in the real world which have spatio-temporal
anchors and additional entities involved in them. We focus on participating entities
which are dates, places and participants [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We also release a Web
demonstrator that is able to build a language-independent representation of
events and to use it to query event-related documents from a large
historical news corpus. For a given language, it provides all the available spellings
for the event type and the associated entities. This tool can therefore be useful
for improving the performance of event-based search engines for digital libraries.
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>1</sup> The NewsEye corpus is available to anyone at https://platform.newseye.eu</p>
      <p><sup>2</sup> The package is a Python 3 library called wikivents, published on PyPI and available on
the Software Heritage repository at https://archive.softwareheritage.org/swh:1:dir:ef325a054ba6f7eb1121807da7b1c92b9ecde8f8</p>
      <p>With such a tool, we propose to analyse a large corpus of historical
documents. The method we propose in this paper helps filter documents according to
whether they report a specific historical event. To the best of our knowledge, no such
tool exists. We propose a method to describe events from knowledge
bases, and an interface to use it and find event-related documents.</p>
      <p>Throughout this paper, we use the assassination of Rasputin, a murder
that occurred in December 1916 in the Russian Empire, as a running example.</p>
    </sec>
    <sec id="sec-2">
      <title>Build a Language-independent Event Representation</title>
      <p>
        Our methodology uses an ontology (i.e. Wikidata) as a primary source of
information to obtain fundamental event characteristics: type and dates or times.
It also processes semi-structured databases (i.e. Wikipedia and others) to
identify involved entities [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We first focused on Wikidata because of its links with
the Wikimedia ecosystem, especially those between Wikidata entities and
Wikipedia articles in all languages.
      </p>
      <p>
        This software offers a solution when aggregating event characteristics manually
is too time-consuming or complicated. It allows adding more data sources
(e.g. EventKG [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) to refine the event representation. It can easily
be integrated into a dedicated natural language processing pipeline focused on real-world
event analysis. We take the example of researchers wishing to gather all the
characteristics of political assassinations provided by a KB. There is a total
of 78 entities of this type in Wikidata (this means only 78 Wikidata entries are
"instances of" political assassinations), out of 952,351 events. Other event types
are available, but we focus on what we qualify as indisputable: an event that
is considered as such by people with various backgrounds (history scholars [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
or NLP researchers [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ], for instance). Political assassinations are an example
of an indisputable event type. For each of them, researchers collect the event
dates, types and labels from a public API. The added value of this tool is that
it is able to gather this information from the ontology, together with all the entities
mentioned in the associated semi-structured databases. EventKG and Wikipedia
are not exhaustive databases, and some properties, as well as named entities, may
be missing.
      </p>
      <sec id="sec-2-0">
        <title>Ontologies: the Extraction of Elementary Event Information</title>
        <p>
          Ontologies such as EventKG [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], YAGO [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] or Wikidata provide different event
identifiers which reveal subtleties. In this paper, we conform to the Wikidata
event type (WET) definition [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which includes both Wikidata event types
Q1656682 for breaking events and Q1190554 for events without premisses.
        </p>
        <p>Our software takes Wikidata entity identifiers as input. Q2882749 is the
input value for our example, the assassination of Rasputin. The entity is checked
to confirm it is an instance of a WET. If so, the dates (in the Gregorian calendar),
the event locations and labels are saved. These generic properties discriminate
between similar events: it is unlikely that two distinct events have the same labels
and types, and occurred at the same place at the same time. Other properties (the
target of a political assassination, for instance) depend on the WET and are
often missing. To overcome this limitation, we propose to analyse Wikipedia lead
sections as well, looking for relevant named entities.</p>
        <p>In our example, we get the entity labels as strings for every language, the
date and the linked entities, identified by their URIs.</p>
      </sec>
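      <p>The type check described above can be sketched as a SPARQL ASK query. This is a minimal illustration, not the wikivents API: the function names are hypothetical, following subclass links (P279) in addition to "instance of" (P31) is an assumption, and the wd: and wdt: prefixes are those predefined by the Wikidata Query Service. The function that actually sends the query is supplied by the caller.</p>
      <preformat><![CDATA[
```python
import re
from typing import Callable

# Wikidata event types (WET) named in the text above.
WET_ROOTS = ("Q1656682", "Q1190554")

_QID = re.compile(r"Q[1-9][0-9]*")


def build_wet_ask_query(qid: str, roots=WET_ROOTS) -> str:
    """Return a SPARQL ASK query testing whether `qid` is an instance of a WET.

    P31 is Wikidata's "instance of" and P279 its "subclass of". Following
    P279* (an assumption of this sketch) also accepts subtypes of the roots.
    """
    if not _QID.fullmatch(qid):
        raise ValueError(f"not a Wikidata identifier: {qid!r}")
    values = " ".join(f"wd:{root}" for root in roots)
    return (
        "ASK { "
        f"VALUES ?root {{ {values} }} "
        f"wd:{qid} wdt:P31/wdt:P279* ?root . "
        "}"
    )


def is_wet_instance(qid: str, run_ask: Callable[[str], bool]) -> bool:
    """Check `qid` using a caller-supplied function that executes ASK queries."""
    return run_ask(build_wet_ask_query(qid))
```
]]></preformat>
      <p>Passing the executor as a parameter keeps the query construction testable offline and lets the same check run against a public endpoint or a local triple store.</p>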
      <sec id="sec-2-1">
        <title>Semi-structured data sources: the Extraction of Entities Involved in the Event</title>
        <p>
          The event representation is supplemented with entities extracted from
semi-structured data sources. In this work, we analyse the lead sections of Wikipedia
articles. We focus on those written in five well-represented languages:
English, French, German, Spanish and Italian. These languages were chosen arbitrarily
from the list of the ten largest Wikipedia versions, excluding those written by bots.
For each lead section, involved entities [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] represented as Wikipedia internal links
are kept and the corresponding identifiers in the ontology are saved.
        </p>
        <p>To show the relevance of entities in relation to the event, the number of
occurrences found in lead sections is counted. Entities found in multiple lead sections
are important in the event description: they synthesise historical knowledge and give
unbiased information about their involvement in the event. In our example,
Wikipedia articles exist only in Spanish and French. From them,
we extract, for instance, these triples: (PER, Q312997 [Felix Yusupov,
perpetrator], 3), (PER, Q43989 [Grigori Rasputin, target], 2), (GPE, Q34266 [Russian
Empire], 1). The weights, respectively 3, 2 and 1, show that knowing who
murdered the victim is more pertinent than where it happened. Besides those
entities, the system found 7 people involved, 5 organisations and 4 geo-political
entities. When analysing Wikidata alone, there are only 3, 0 and 3, respectively.</p>
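        <p>The counting step above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name and the input layout (one list of linked entities per language) are hypothetical, every link occurrence is assumed to count once, and the sample link lists are invented so as to reproduce the weights quoted above.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (entity type, Wikidata identifier)


def weigh_entities(lead_links: Dict[str, List[Entity]]) -> List[Tuple[str, str, int]]:
    """Count how often each entity is linked across all lead sections."""
    counts = Counter()
    for entities in lead_links.values():
        counts.update(entities)
    # Most frequent first; ties are broken by identifier for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0][1]))
    return [(kind, qid, weight) for (kind, qid), weight in ranked]


# Hypothetical link lists, arranged to reproduce the weights quoted above.
lead_links = {
    "fr": [("PER", "Q312997"), ("PER", "Q43989"), ("PER", "Q312997"), ("GPE", "Q34266")],
    "es": [("PER", "Q312997"), ("PER", "Q43989")],
}
triples = weigh_entities(lead_links)
```
]]></preformat>
        <p>Keeping the entity type in the counted key means the resulting triples have the same shape as those listed in the text.</p>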
      </sec>
      <sec id="sec-2-2">
        <title>Event Description in Understandable Languages</title>
        <p>The event representation then consists of an association of absolute properties
such as dates and links to knowledge bases: it is language-independent. In most
cases, ontologies provide multiple names in different languages (i.e. with different
spellings) for each entity. In our example, in French the entity Q312997 is written
Felix Youssoupoff or Felix Youssoupov.</p>
        <p>A final step transforms the abstract event representation, based on KB
identifiers, into a language-dependent description. It extracts all the alternative
spellings for every entity involved in the event. The software makes it
possible to get the event description in Italian even if, in this example, only the French
and Spanish Wikipedias were analysed.</p>
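        <p>This final step can be illustrated with a small sketch. It is not the wikivents API: the function name and the nested dictionary layout are hypothetical, and the sample spellings are illustrative values only. Entities with no spelling in the requested language are kept with an empty list so that gaps remain visible.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List

# names[qid][language] lists every known spelling (label first, then aliases).
Names = Dict[str, Dict[str, List[str]]]


def describe_event(entity_ids: List[str], names: Names, language: str) -> Dict[str, List[str]]:
    """Map each entity identifier to its spellings in `language`.

    Entities without any spelling in that language map to an empty list,
    so the caller can see which ones are missing.
    """
    return {qid: list(names.get(qid, {}).get(language, [])) for qid in entity_ids}


# Illustrative sample data, not an extract of Wikidata.
names = {
    "Q312997": {"fr": ["Felix Youssoupov"]},
    "Q34266": {"fr": ["Empire russe"], "it": ["Impero russo"]},
}
description = describe_event(["Q312997", "Q34266"], names, "it")
```
]]></preformat>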
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Limitations and Opportunities</title>
      <p>For some entities, no language-specific
spelling may exist in any ontology (e.g. the entity Q312997, Felix Yusupov, has no spelling
given in some languages). This case is rare but may be encountered when
processing events involving local entities that are less known to speakers of other
languages. We only took into account a list of arbitrarily chosen languages,
without capitalising on the languages spoken where the event happened, which de facto
excludes Asian and African languages.</p>
      <p>The process has drawbacks: SPARQL queries are used to retrieve event data,
and the procedure can be slow due to the multiple API calls. Our caching
implementation accelerates the process by a factor of almost 20 (from one minute to
five seconds with the running example). As mentioned previously, the tool's
capabilities can be extended: it is thus easy to process any triple
store locally, such as a Wikidata dump, to overcome this limitation.</p>
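      <p>A minimal sketch of query caching, assuming a simple in-memory memoisation keyed by the query text. The class and attribute names are hypothetical, and the caching implemented in the library may work differently. The lambda stands in for a real endpoint call so that the example runs offline.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Dict


class CachedClient:
    """Wrap a query function so that identical queries are only sent once."""

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send
        self._cache: Dict[str, str] = {}
        self.remote_calls = 0  # number of queries actually forwarded

    def query(self, sparql: str) -> str:
        if sparql not in self._cache:
            self.remote_calls += 1
            self._cache[sparql] = self._send(sparql)
        return self._cache[sparql]


# Stand-in for a real endpoint call, used only to keep the sketch offline.
client = CachedClient(lambda sparql: f"result of {sparql}")
first = client.query("ASK { wd:Q2882749 wdt:P31 wd:Q1656682 }")
second = client.query("ASK { wd:Q2882749 wdt:P31 wd:Q1656682 }")
```
]]></preformat>
      <p>Because the sender is injected, the same wrapper could sit in front of a remote endpoint or a local triple store.</p>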
      <p>
        Evaluation of such a tool is hard, due to the lack of annotated data. To
evaluate this tool and the efficiency of the event representation, we propose to use
datasets such as those from the Topic Detection and Tracking (TDT) program [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
or the event-centered dataset from Event Registry [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is a collection of news
articles in multiple languages with event annotations.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we release a set of tools, including a Web front-end to the
library, able to gather, in one specific language, all the existing knowledge
related to an event. We provide an easy way to analyse events described on the
Internet. The library is designed so that more data sources (i.e.
ontologies) can easily be integrated. Extensive technical information, as well as
real-world examples and outputs, can be found on the PyPI project page.</p>
      <p>The library is already in use as the entry point of a pipeline in a multilingual,
historical, event-based search engine which indexes millions of historical documents. For
this demonstration, we used private access to the NewsEye corpus to build
queries with information collected from Wikidata and Wikipedia. As in
a search engine query, we weighted the entities according to their importance
to the event. The query is a conjunction of all weighted terms found by
the wikivents library.</p>
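      <p>The weighted conjunction can be sketched as below. The query syntax is an assumption: the example uses Lucene-style boosts (a caret followed by the weight), which may differ from the syntax accepted by the NewsEye platform. Treating the spellings of one entity as alternatives joined by OR is also an assumption, and the function name and sample spellings are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

WeightedEntity = Tuple[List[str], int]  # (spellings in the target language, weight)


def build_query(entities: List[WeightedEntity]) -> str:
    """Join entities with AND; spellings of one entity are alternatives (OR)."""
    clauses = []
    for spellings, weight in entities:
        if not spellings:
            continue  # nothing to search for in this language
        quoted = " OR ".join('"' + s.replace('"', '\\"') + '"' for s in spellings)
        clauses.append(f"({quoted})^{weight}")
    return " AND ".join(clauses)


# Illustrative spellings and the weights quoted in the running example.
query = build_query([
    (["Felix Yusupov", "Felix Youssoupov"], 3),
    (["Grigori Rasputin"], 2),
    (["Russian Empire"], 1),
])
```
]]></preformat>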
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been supported by the European Union's Horizon 2020 research
and innovation program under grants 770299 (NewsEye) and 825153
(Embeddia).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Allan</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Introduction to Topic Detection and Tracking</article-title>.
          In:
          <source>Topic Detection and Tracking: Event-Based Information Organization</source>,
          pp.
          <fpage>1</fpage>–<lpage>16</lpage>
          (<year>2002</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Farber,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Bartscherer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Menne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rettinger</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Linked data quality of DBpedia, Freebase</article-title>
          , OpenCyc, Wikidata, and YAGO.
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <issue>1</issue>
          ),
          <volume>77</volume>
          {129 (Nov
          <year>2017</year>
          ). https://doi.org/10.3233/SW-170275
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Gottschalk</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Demidova</surname>, <given-names>E.</given-names></string-name>:
          <article-title>EventKG: A Multilingual Event-Centric Temporal Knowledge Graph</article-title>.
          <source>Lecture Notes in Computer Science</source>
          <volume>10843</volume>
          (Jun <year>2018</year>). https://doi.org/10.1007/978-3-319-93417-4_18
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Hoffart</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Suchanek</surname>, <given-names>F.M.</given-names></string-name>,
          <string-name><surname>Berberich</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Weikum</surname>, <given-names>G.</given-names></string-name>:
          <article-title>YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia</article-title>.
          <source>Artificial Intelligence</source>
          <volume>194</volume>,
          <fpage>28</fpage>–<lpage>61</lpage>
          (Jan <year>2013</year>). https://doi.org/10.1016/j.artint.2012.06.001
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rudnik</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrhart</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferret</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teyssou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Searching News Articles Using an Event Knowledge Graph Leveraged by Wikidata</article-title>
          .
          <source>In: Companion Proceedings of The 2019 World Wide Web Conference on - WWW '19</source>
          (
          <year>2019</year>
          ). https://doi.org/10.1145/3308560.3316761
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rupnik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muhic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leban</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skraba</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fortuna</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>News Across Languages - Cross-Lingual Document Similarity and Event Tracking</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>55</volume>
          ,
          <fpage>283</fpage>–<lpage>316</lpage>
          (Jan
          <year>2016</year>
          ). https://doi.org/10.1613/jair.4780
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Shaw</surname>, <given-names>R.</given-names></string-name>:
          <article-title>A Semantic Tool for Historical Events</article-title>.
          In:
          <source>Proceedings of the 1st Workshop on EVENTS: Definition, Detection, Coreference, and Representation</source>,
          pp.
          <fpage>38</fpage>–<lpage>46</lpage>.
          Atlanta, Georgia, USA (Jun <year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yadav</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A survey on recent advances in named entity recognition from deep learning models</article-title>
          .
          In:
          <source>Proceedings of the 27th International Conference on Computational Linguistics</source>
          (Aug
          <year>2018</year>
          ), https://www.aclweb.org/anthology/C18-1182
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>