Giveme5W1H: A Universal System for Extracting Main Events from News Articles

Felix Hamborg¹, Corinna Breitinger¹, Bela Gipp²
¹ University of Konstanz, Germany
{firstname.lastname}@uni-konstanz.de
² University of Wuppertal, Germany
gipp@uni-wuppertal.de

ABSTRACT

Event extraction from news articles is a commonly required prerequisite for various tasks, such as article summarization, article clustering, and news aggregation. Due to the lack of universally applicable and publicly available methods tailored to news datasets, many researchers redundantly implement event extraction methods for their own projects. The journalistic 5W1H questions are capable of describing the main event of an article, i.e., by answering who did what, when, where, why, and how. We provide an in-depth description of an improved version of Giveme5W1H, a system that uses syntactic and domain-specific rules to automatically extract the relevant phrases from English news articles to provide answers to these 5W1H questions. Given the answers to these questions, the system determines an article’s main event. In an expert evaluation with three assessors and 120 articles, we determined an overall precision of p=0.73, and p=0.82 for answering the first four W questions, which alone can sufficiently summarize the main event reported on in a news article. We recently made our system publicly available, and it remains the only universal open-source 5W1H extractor capable of being applied to a wide range of use cases in news analysis.

CCS CONCEPTS

• Computing methodologies → Information extraction • Information systems → Content analysis and feature selection • Information systems → Summarization

KEYWORDS

News Event Detection, 5W1H Extraction, 5W1H Question Answering, Reporter’s Questions, Journalist’s Questions, 5W QA

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

The extraction of a news article’s main event is an automated analysis task at the core of a range of use cases, including news aggregation, clustering of articles reporting on the same event, and news summarization [4, 15]. Beyond computer science, other disciplines also analyze news events; for example, researchers from the social sciences analyze how news outlets report on events in what is known as frame analyses [13, 14].

Despite main event extraction being a fundamental task in news analysis, no publicly available method exists that can be applied to the diverse use cases mentioned to capably extract explicit event descriptors from a given article [17]. Explicit event descriptors are properties that occur in a text to describe an event, e.g., the phrases in an article that enable a reader to understand what the article is reporting on. The reliable extraction of event-describing phrases also allows later analysis tasks to use common natural language processing (NLP) methods, such as TF-IDF and cosine similarity, as well as named entity recognition (NER) [10] and named entity disambiguation (NERD) [19], to assess the similarity of two events. State-of-the-art methods for extracting events from articles suffer from three main shortcomings [17]. First, most approaches only detect events implicitly, e.g., by employing topic modeling [2, 42]. Second, they are specialized for the extraction of task-specific properties, e.g., extracting only the number of injured people in an attack [32, 42]. Lastly, some methods extract explicit descriptors, but are not publicly available, or are described in insufficient detail to allow researchers to reimplement the approaches [34, 45, 47, 48].

Last year, we introduced Giveme5W1H in the form of a poster abstract [16], which was at that time still an in-progress prototype capable of extracting universally usable phrases that answer the journalistic 5W1H questions, i.e., who did what, when, where, why, and how (see Figure 1). This poster, however, did not disclose or discuss the scoring mechanisms used for determining the best candidate phrases during main event extraction. In this paper, we describe in detail how the improved version of Giveme5W1H extracts 5W1H phrases and we describe the results of our evaluation of these improvements. We also introduce an annotated data set, which we created to
train our system’s model to improve extraction performance. The training data set is available in the online repository (see Section 6) and can be used by other researchers to train their own 5W1H approaches. This paper is relevant to researchers and developers from various disciplines with the shared aim of extracting and analyzing the main events that are being reported on in articles.

Taliban attacks German consulate in northern Afghan city of Mazar-i-Sharif with truck bomb

The death toll from a powerful Taliban truck bombing at the German consulate in Afghanistan’s Mazar-i-Sharif city rose to at least six Friday, with more than 100 others wounded in a major militant assault.

The Taliban said the bombing late Thursday, which tore a massive crater in the road and overturned cars, was a “revenge attack” for US air strikes this month in the volatile province of Kunduz that left 32 civilians dead. […] The suicide attacker rammed his explosives-laden car into the wall […].

Figure 1: News article [1] consisting of title (bold), lead paragraph (italic), and the first of the remaining paragraphs. Highlighted phrases represent the 5W1H event properties (who did what, when, where, why, and how).

Our objective is to devise an automated method for extracting the main event being reported on by a given news article. For this purpose, we exclude non-event-reporting articles, such as commentaries or press reviews. First, we define the extracted main event descriptors to be concise (requirement R1). This means they must be as short as possible and contain only the information describing the event, while also being as long as necessary to contain all information of the event. Second, the descriptors must be of high accuracy (R2). For this reason, we give higher priority to extraction accuracy than execution speed [17]. We also defined that the developed system must achieve a higher extraction accuracy than Giveme5W [17]. Compared to Giveme5W, the system proposed in this paper not only additionally extracts the ‘how’ answer, but its analysis workflow is also more semantics-oriented to address the issues of the previous statistics- and syntax-based extraction. We also publish the first annotated 5W1H dataset, which we use to learn the optimal parameters. In the Giveme5W implementation, the values were based on expert judgement.

The presented system especially benefits: (1) social scientists with limited programming knowledge, who would benefit from ready-to-use main event extraction methods, and (2) computer scientists who are welcome to modify or build on any of the modular components of our system and use our test collection and results as a benchmark for their implementations.

2 RELATED WORK

The extraction of 5W1H phrases from news articles is related to closed-domain question answering, which is why some authors call their approaches 5W1H question answering (QA) systems. Hamborg et al. [17] gave an in-depth overview of 5W1H extraction systems. Thus, we only provide a brief summary of the current state of the art and focus this section on the extraction of ‘how’ phrases. Most systems focus only on the extraction of 5W phrases without ‘how’ phrases (cf. [9, 34, 47, 48]). The authors of prior work do not justify this, but we suspect two reasons. First, the ‘how’ question is particularly difficult to extract due to its ambiguity, as we will explain later in this section. Second, ‘how’ (and ‘why’) phrases are considered less important in many use cases when compared to the other phrases, particularly those answering the ‘who’, ‘what’, ‘when’, and ‘where’ (4W) questions (cf. [21, 40, 49]). For the sake of readability in this section, we will also include approaches that only extract the 5Ws when referring to 5W1H extraction. Aside from the ‘how’ extraction, the analysis of approaches for 5W1H or 5W extraction is generally the same.

Systems for 5W1H QA on news texts typically perform three tasks to determine the article’s main event [45, 47]: (1) preprocessing, (2) phrase extraction [10, 25, 36, 47, 48], where, for instance, linguistic rules are used to extract phrase candidates, and (3) candidate scoring, which selects the best answer for each question by employing heuristics, such as the position of a phrase within the document. The input data to QA systems is usually text, such as a full article including the headline, lead paragraph, and main text [36], or a single sentence, e.g., in news ticker format [48]. Other systems use automatic speech recognition (ASR) to convert broadcasts into text [47]. The outcomes of the process are six textual phrases, one for each of the 5W1H questions, which together describe the main event of a given news text, as highlighted in Figure 1. Thus far, no systems have been described in sufficient detail to allow for a reimplementation by other researchers.

Both the ‘why’ and ‘how’ questions pose a particular challenge in comparison to the other questions. As discussed by Hamborg et al. [17], determining the reason or cause (i.e., ‘why’) can even be difficult for humans. Often the reason is unknown, or it is only described implicitly, if at all [11]. Extracting the ‘how’ answer is also difficult, because this question can be answered in many ways. To find ‘how’ candidates, the system by Sharma et al. extracts the adverb or adverbial phrase within the ‘what’ phrase [36]. The tokens extracted with this simplistic approach detail the verb, e.g., “He drove quickly”, but do not describe the method by which the action was performed (cf. [37]), e.g., by ramming an explosives-laden car into the consulate (in the example in Figure 1), which is a prepositional phrase. Other approaches employ ML [24], but have not been devised for the English language. In summary, few approaches exist that extract ‘how’ phrases. The reviewed approaches provide no details on their extraction method and achieve poor results, e.g., they extract adverbs rather than the tool or the method by which an action was performed (cf. [22, 24, 36]).

None of the reviewed approaches output canonical or normalized data. Canonical output is more concise and also less ambiguous than its original textual form (cf. [46]), e.g., polysemes, such as crane (animal or machine), have multiple meanings. Hence, canonical data is often more useful for subsequent analysis tasks (see Section 1). Phrases containing temporal information or location information may be canonicalized, e.g., by converting the phrases to dates or timespans [7, 38] or to precise geographic positions [29]. Phrases answering the other questions could be canonicalized by employing NERD on the contained NEs, and then linking the NEs to concepts defined in a knowledge graph, such as YAGO [19], or WordNet [31].
[Figure 2 diagram: a three-phase pipeline — Preprocessing (via CoreNLP: sentence splitting, tokenization, POS tagging and parsing, NER, coreference resolution; enrichment and canonicalization via SUTime, Nominatim, and AIDA), Phrase Extraction (action, environment, cause, method), and Candidate Scoring (who & what, when, where, why, how; combined scoring), producing the 5W1H phrases as output. Accessible via a Python interface and a RESTful API.]
Figure 2: The three-phase analysis pipeline preprocesses a news text, finds candidate phrases for each of the 5W1H questions, and scores these candidates. Giveme5W1H can easily be accessed via Python and via a RESTful API.
While the evaluations of the reviewed papers generally indicate sufficient quality to be usable for news event extraction, e.g., the system by Yaman et al. achieved F1 = 0.85 on the DARPA corpus from 2009 [48], the evaluations lack comparability for two reasons. First, no gold standard exists for journalistic 5W1H question answering on news articles. A few datasets exist for automated question answering, specifically for the purpose of disaster tracking [28, 41]; however, these datasets are so specialized to their own use cases that they cannot be applied to the use case of automated journalistic question answering. Another challenge to the evaluation of news event extraction is that the evaluation datasets of previous papers are no longer publicly available [34, 47, 48]. Second, previous papers each used different quality measures, such as precision and recall [9] or error rates [47].

3 GIVEME5W1H: DESCRIPTION OF METHODS AND SYSTEM

Giveme5W1H is an open-source main event retrieval system for news articles that addresses the objectives we defined in Section 1. The system extracts 5W1H phrases that describe the most defining characteristics of a news event, i.e., who did what, when, where, why, and how. This section describes the analysis workflow of Giveme5W1H, as shown in Figure 2. Giveme5W1H can be accessed by other software as a Python library and via a RESTful API. Due to its modularity, researchers can efficiently adapt or replace components. For example, researchers can integrate a custom parser or adapt the scoring functions to the characteristics of their data. The system builds on Giveme5W [17], but improves extraction performance by addressing the planned future work directions: Giveme5W1H uses coreference resolution, question-specific semantic distance measures, and combined scoring of candidates, and it extracts phrases for the ‘how’ question. The values of the parameters introduced in this section result from a semi-automated search for the optimal configuration of Giveme5W1H using an annotated learning dataset, including a manual, qualitative revision (see Section 3.5).
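As a usage illustration, the following minimal sketch passes an article to the system from Python and reads back the top answer for each question. The import paths and method names follow the project’s public README at the time of writing and may differ between versions; the article text and the date string are abbreviated placeholders.

# Minimal usage sketch of the Giveme5W1H Python library. Import paths and
# method names are taken from the project's README and may change between
# versions; the article text is shortened for brevity.
from Giveme5W1H.extractor.document import Document
from Giveme5W1H.extractor.extractor import MasterExtractor

extractor = MasterExtractor()

title = "Taliban attacks German consulate in northern Afghan city of Mazar-i-Sharif with truck bomb"
lead = "The death toll from a powerful Taliban truck bombing at the German consulate ..."
text = "The Taliban said the bombing late Thursday, which tore a massive crater in the road ..."

# The publishing date is optional but helps resolve relative dates such as
# "yesterday at 1 pm" (see Section 3.1).
doc = Document.from_text(title + ' ' + lead + ' ' + text, '2016-11-10 07:44:00')
doc = extractor.parse(doc)

for question in ('who', 'what', 'when', 'where', 'why', 'how'):
    answer = doc.get_top_answer(question).get_parts_as_text()
    print(question, '->', answer)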
3.1 Preprocessing of News Articles

Giveme5W1H accepts as input the full text of a news article, including headline, lead paragraph, and body text. The user can specify these three components as one or separately. Optionally, the article’s publishing date can be provided, which helps Giveme5W1H parse relative dates, such as “yesterday at 1 pm”.

During preprocessing, we use Stanford CoreNLP for sentence splitting, tokenization, lemmatization, POS tagging, full parsing, NER (with Stanford NER’s seven-class model), and pronominal and nominal coreference resolution. Since our main goal is high 5W1H extraction accuracy (rather than fast execution speed), we use the best-performing model for each of the CoreNLP annotators, i.e., the ‘neural’ model if available. We use the default settings for English in all libraries.

After the initial preprocessing, we bring all NEs in the text into their canonical form. Following from requirement R1, canonical information is the preferred output of Giveme5W1H, since it is the most concise form. Because Giveme5W1H uses the canonical information to extract and score ‘when’ and ‘where’ candidates, we implement the canonicalization task during preprocessing.

We parse dates written in natural language into canonical dates using SUTime [7]. SUTime looks for NEs of the type date or time and merges adjacent tokens into phrases. SUTime also handles heterogeneous phrases, such as “yesterday at 1 pm”, which consist not only of temporal NEs but also of other tokens, such as function words. Subsequently, SUTime converts each temporal phrase into a standardized TIMEX3 [44] instance. TIMEX3 defines various types, including repetitive periods. Since events according to our definition occur at a single point in time, we only retrieve datetimes indicating an exact time, e.g., “yesterday at 6 pm”, or a duration, e.g., “yesterday”, which spans the whole day.
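To make the preprocessing step concrete, the sketch below sends a text to a locally running Stanford CoreNLP server with the annotators named above. It is not code from Giveme5W1H itself; it assumes a CoreNLP server has been started on port 9000 and uses the server’s standard JSON interface.

# Sketch of the preprocessing step against a local Stanford CoreNLP server
# (assumed to be running on http://localhost:9000). Not part of Giveme5W1H;
# it merely reproduces the annotators listed in Section 3.1.
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumption: locally started CoreNLP server

def preprocess(text):
    properties = {
        # sentence splitting, tokenization, lemmatization, POS tagging,
        # full parsing, NER (which normalizes dates via SUTime), coreference resolution
        "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
        "outputFormat": "json",
    }
    response = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(properties)},
        data=text.encode("utf-8"),
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

annotated = preprocess("The Taliban attacked the German consulate in Mazar-i-Sharif on Thursday.")
tokens = annotated["sentences"][0]["tokens"]
print([t["word"] for t in tokens])
print([t["ner"] for t in tokens])  # e.g. ORGANIZATION, LOCATION, DATE tags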
Geocoding is the process of parsing places and addresses written in natural language into canonical geocodes, i.e., one or more coordinates referring to a point or area on earth. We look for tokens classified as NEs of the type location (cf. [48]). We merge adjacent tokens of the same NE type within the same sentence constituent, e.g., within the same NP or VP. Similar to temporal phrases, locality phrases are often heterogeneous, i.e., they do not only contain location NEs but also function words. Hence, we introduce a locality phrase merge range r_where = 1 to merge phrases where up to r_where arbitrary tokens are allowed between two location NEs. Lastly, we geocode the merged phrases with Nominatim (v3.0.0, https://github.com/openstreetmap/Nominatim), which uses free data from OpenStreetMap.
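For illustration, the following sketch geocodes a merged location phrase via the standard Nominatim search API. The paper uses a self-hosted Nominatim v3.0.0 instance; the public endpoint and the User-Agent header below are assumptions made for this example, not part of Giveme5W1H.

# Sketch of geocoding a location phrase with the Nominatim search API.
# Assumes the public endpoint; the paper runs its own Nominatim instance.
import requests

def geocode(location_phrase):
    response = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": location_phrase, "format": "jsonv2", "limit": 1},
        headers={"User-Agent": "giveme5w1h-example"},  # required by Nominatim's usage policy
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()
    return results[0] if results else None

place = geocode("Mazar-i-Sharif, Afghanistan")
if place:
    # place_id, coordinates, and bounding box are reused later by the
    # 'where' scoring factors (containment and specificity, Section 3.3).
    print(place["place_id"], place["lat"], place["lon"], place["boundingbox"])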
We canonicalize NEs of the remaining types, e.g., persons and organizations, by linking NEs to concepts in the YAGO graph [30] using AIDA [19]. The YAGO graph is a state-of-the-art knowledge base, where nodes in the graph represent semantic concepts that are connected to other nodes through attributes and relations. The data is derived from other well-established knowledge bases, such as Wikipedia, WordNet, WikiData, and GeoNames [39].

3.2 Phrase Extraction

Giveme5W1H performs four independent extraction chains to retrieve the article’s main event: (1) the action chain extracts phrases for the ‘who’ and ‘what’ questions, (2) environment for ‘when’ and ‘where’, (3) cause for ‘why’, and (4) method for ‘how’.
The action extractor identifies who did what in the article’s main event. The main idea for retrieving ‘who’ candidates is to collect the subject of each sentence in the news article. Therefore, we extract the first NP that is a direct child of the sentence in the parse tree and that has a VP as its next right sibling (cf. [5]). We discard all NPs that contain a child VP, since such NPs yield lengthy ‘who’ phrases. Take, for instance, this sentence: “((NP) Mr. Trump, ((VP) who stormed to a shock election victory on Wednesday)), ((VP) said it was […])”, where “who stormed […]” is the child VP of the NP. We then put the NPs into the list of ‘who’ candidates. For each ‘who’ candidate, we take the VP that is its next right sibling as the corresponding ‘what’ candidate (cf. [5]). To avoid long ‘what’ phrases, we cut VPs after their first child NP, which long VPs usually contain. However, we do not cut the ‘what’ candidate if the VP contains at most l_what,min = 3 tokens and the right sibling of the VP’s child NP is a prepositional phrase (PP). This way, we avoid short, undescriptive ‘what’ phrases. For instance, in the simplified example “((NP) The microchip) ((VP) is ((NP) part) ((PP) of a wider range of the company’s products)).”, the truncated VP “is part” contains no descriptive information; hence, our rules prevent this truncation.
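The sketch below illustrates these parse-tree rules on an nltk.Tree representation of a constituency parse. Giveme5W1H itself operates on CoreNLP parse trees, and the cut-off condition reflects our reading of the l_what,min rule described above; the example parse and function names are for illustration only.

# Illustrative sketch of the 'who'/'what' rules on a constituency parse,
# represented as an nltk.Tree. Giveme5W1H itself works on CoreNLP parse
# trees; the VP cut-off condition reflects our reading of the rule above.
from nltk import Tree

L_WHAT_MIN = 3  # do not cut if the truncated VP would have at most 3 tokens

def extract_who_what(parse):
    """Return (who_tokens, what_tokens) for the first matching NP-VP pair."""
    for s_node in parse.subtrees(lambda t: t.label() == "S"):
        children = [c for c in s_node if isinstance(c, Tree)]
        for np, vp in zip(children, children[1:]):
            if np.label() != "NP" or vp.label() != "VP":
                continue
            if any(sub.label() == "VP" for sub in np.subtrees()):
                continue  # NPs containing a VP yield lengthy 'who' phrases
            return np.leaves(), truncate_what(vp)
    return None

def truncate_what(vp):
    """Cut the VP after its first child NP, unless the truncated phrase is
    short and the NP is followed by a PP (keeps e.g. 'is part of ...')."""
    children = [c for c in vp if isinstance(c, Tree)]
    for i, child in enumerate(children):
        if child.label() != "NP":
            continue
        kept = [tok for c in children[: i + 1] for tok in c.leaves()]
        followed_by_pp = i + 1 < len(children) and children[i + 1].label() == "PP"
        if len(kept) <= L_WHAT_MIN and followed_by_pp:
            return vp.leaves()  # keep the full VP
        return kept
    return vp.leaves()

parse = Tree.fromstring(
    "(S (NP (NNP Mr.) (NNP Trump)) (VP (VBD said) (SBAR (S (NP (PRP it)) (VP (VBD was) (NP (DT a) (NN mistake)))))))"
)
print(extract_who_what(parse))  # (['Mr.', 'Trump'], ['said', 'it', 'was', 'a', 'mistake'])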
The environment extractor retrieves phrases describing the temporal and locality context of the event. To determine ‘when’ candidates, we take the TIMEX3 instances from preprocessing. Similarly, we take the geocodes as ‘where’ candidates.

The cause extractor looks for linguistic features indicating a causal relation within a sentence’s constituents. We look for three types of cause-effect indicators (cf. [25, 26]): causal conjunctions, causative adverbs, and causative verbs. Causal conjunctions, e.g., “due to”, “result of”, and “effect of”, connect two clauses, where the second clause yields the ‘why’ candidate. For causative adverbs, e.g., “therefore”, “hence”, and “thus”, the first clause yields the ‘why’ candidate. If we find that one or more subsequent tokens of a sentence match one of the tokens adapted from Khoo et al. [25], we take all tokens on the right (causal conjunction) or left side (causative adverb) as the ‘why’ candidate.

Causative verbs, e.g., “activate” and “implicate”, are contained in the middle VP of the causative NP-VP-NP pattern, where the last NP yields the ‘why’ candidate [11, 26]. For each NP-VP-NP pattern we find in the parse tree, we determine whether the VP is causative. To do this, we extract the VP’s verb, retrieve the verb’s synonyms from WordNet [31], and compare the verb and its synonyms with the list of causative verbs from Girju [11], which we also extended by their synonyms (cf. [11]). If there is at least one match, we take the last NP of the causative pattern as the ‘why’ candidate. To reduce false positives, we check the NP and VP against the causal constraints for verbs proposed by Girju [11].
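A token-level sketch of the conjunction- and adverb-based rules follows. The marker lists contain only an illustrative subset of examples (the full lists used by Giveme5W1H are adapted from Khoo et al. [25]), and the causative-verb pattern with WordNet synonyms is omitted for brevity.

# Token-level sketch of the causal-conjunction and causative-adverb rules.
# The marker lists contain only illustrative examples; Giveme5W1H uses
# fuller lists adapted from Khoo et al. [25]. Causative verbs (NP-VP-NP
# pattern with WordNet synonyms) are omitted here.
CAUSAL_CONJUNCTIONS = [("due", "to"), ("result", "of"), ("effect", "of"), ("because", "of")]
CAUSATIVE_ADVERBS = [("therefore",), ("hence",), ("thus",)]

def find_why_candidate(tokens):
    lowered = [t.lower() for t in tokens]
    for start in range(len(lowered)):
        for marker in CAUSAL_CONJUNCTIONS:
            if tuple(lowered[start:start + len(marker)]) == marker:
                return tokens[start + len(marker):]   # clause to the right
        for marker in CAUSATIVE_ADVERBS:
            if tuple(lowered[start:start + len(marker)]) == marker:
                return tokens[:start]                 # clause to the left
    return None

sentence = ("The bombing was a revenge attack because of US air strikes "
            "in the province of Kunduz").split()
print(find_why_candidate(sentence))  # ['US', 'air', 'strikes', 'in', 'the', 'province', 'of', 'Kunduz']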
The method extractor retrieves ‘how’ phrases, i.e., the method by which an action was performed. The combined method consists of two subtasks, one analyzing copulative conjunctions, the other looking for adjectives and adverbs. Often, sentences with a copulative conjunction contain a method phrase in the clause that follows the copulative conjunction, e.g., “after [the train came off the tracks]”. Therefore, we look for copulative conjunctions compiled from [33]. If a token matches, we take the right clause as the ‘how’ candidate. To avoid long phrases, we cut off phrases longer than l_how,max = 10 tokens. The second subtask extracts phrases that consist purely of adjectives or adverbs (cf. [36]), since these often represent how an action was performed. We use this extraction method as a fallback, since we found the copulative conjunction-based extraction too restrictive in many cases.
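The sketch below mirrors the two subtasks: a clause following a copulative conjunction, cut to at most l_how,max tokens, and a fallback that collects runs of adjectives and adverbs from the POS tags produced during preprocessing. The conjunction set holds only the example from the text; the full list is compiled from [33].

# Sketch of the two 'how' subtasks. The conjunction set holds only the
# example from the text ("after"); Giveme5W1H compiles its full list
# from [33]. POS tags are Penn Treebank tags from preprocessing.
L_HOW_MAX = 10
COPULATIVE_CONJUNCTIONS = {"after"}  # illustrative subset

def extract_how(tokens, pos_tags):
    # Subtask 1: take the clause following a copulative conjunction,
    # cut off after L_HOW_MAX tokens to avoid overly long phrases.
    for i, token in enumerate(tokens):
        if token.lower() in COPULATIVE_CONJUNCTIONS:
            return tokens[i + 1 : i + 1 + L_HOW_MAX]
    # Subtask 2 (fallback): longest run of pure adjectives/adverbs.
    best, current = [], []
    for token, tag in zip(tokens, pos_tags):
        if tag.startswith("JJ") or tag.startswith("RB"):
            current.append(token)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []
    return best or None

tokens = "Several carriages derailed after the train came off the tracks".split()
tags = ["JJ", "NNS", "VBD", "IN", "DT", "NN", "VBD", "RP", "DT", "NNS"]
print(extract_how(tokens, tags))  # ['the', 'train', 'came', 'off', 'the', 'tracks']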

3.3 Candidate Scoring

The last task is to determine the best candidate for each 5W1H question. The scoring consists of two sub-tasks. First, we score candidates independently for each of the 5W1H questions. Second, we perform a combined scoring, where we adjust the scores of candidates of one question depending on properties, e.g., position, of candidates of other questions. For each question q, we use a scoring function that is composed as a weighted sum of n scoring factors: s_q = Σ_{i=0}^{n−1} w_q,i · s_q,i, where w_q,i is the weight of scoring factor s_q,i.
To score ‘who’ candidates, we define three scoring factors: the candidate shall occur in the article (1) early and (2) often, and (3) contain a named entity. The first scoring factor targets the concept of the inverted pyramid [8]: news articles mention the most important information, i.e., the main event, early in the article, e.g., in the headline and lead paragraph, while later paragraphs contain details. However, journalists often use so-called hooks to get the reader’s attention without revealing all content of the article [35]. Hence, for each candidate, we also consider the frequency of similar phrases in the article, since the primary actor involved in the main event is likely to be mentioned frequently in the article. Furthermore, if a candidate contains an NE, we score it higher, since in news, the actors involved in events are often NEs, e.g., politicians. Table 1 shows the weights and scoring factors.

Table 1: Weights and scoring factors for ‘who’ phrases

i              w_who,i   s_who,i
0 (position)   .9        pos(c) = 1 − n_pos(c) / d_len
1 (frequency)  .095      f(c) = n_f(c) / max_{c'∈C} n_f(c')
2 (type)       .005      NE(c)

n_pos(c) is the position, measured in sentences, of candidate c within the document, d_len the document length in sentences, n_f(c) the frequency of phrases similar to c in the document, and NE(c) = 1 if c contains an NE, else 0 (cf. [10]). To measure n_f(c) of the actor in candidate c, we use the number of the actor’s coreferences, which we extracted during coreference resolution (see Section 3.1). This allows Giveme5W1H to recognize and count name variations as well as pronouns. Due to the strong relation between agent and action, we rank VPs according to their NPs’ scores. Hence, the most likely VP is the sibling in the parse tree of the most likely NP: s_what = s_who.
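To make the weighted-sum scoring concrete, the sketch below instantiates it for ‘who’ candidates with the weights from Table 1. The Candidate fields (sentence index, coreference count, NE flag) are hypothetical names standing in for n_pos(c), n_f(c), and NE(c).

# Sketch of the weighted-sum scoring (Section 3.3) for 'who' candidates with
# the weights from Table 1. The Candidate fields are hypothetical names for
# n_pos(c), n_f(c), and NE(c); they are filled during preprocessing.
from dataclasses import dataclass

@dataclass
class Candidate:
    tokens: list
    sentence_index: int   # n_pos(c): position in sentences within the document
    mention_count: int    # n_f(c): size of the actor's coreference chain
    contains_ne: bool     # NE(c)

WHO_WEIGHTS = (0.9, 0.095, 0.005)  # position, frequency, type (Table 1)

def score_who(candidate, candidates, doc_len_sentences):
    max_mentions = max(c.mention_count for c in candidates) or 1
    factors = (
        1.0 - candidate.sentence_index / doc_len_sentences,  # pos(c)
        candidate.mention_count / max_mentions,              # f(c)
        1.0 if candidate.contains_ne else 0.0,               # NE(c)
    )
    return sum(w * s for w, s in zip(WHO_WEIGHTS, factors))

candidates = [
    Candidate(["The", "Taliban"], sentence_index=0, mention_count=4, contains_ne=True),
    Candidate(["The", "death", "toll"], sentence_index=1, mention_count=1, contains_ne=False),
]
ranked = sorted(candidates, key=lambda c: score_who(c, candidates, doc_len_sentences=12), reverse=True)
print([c.tokens for c in ranked])  # the Taliban candidate scores highest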
We score temporal candidates according to four scoring factors: the candidate shall occur in the article (1) early and (2) often. It should also be (3) close to the publishing date of the article and (4) of a relatively short duration. The first two scoring factors have the same motivation as in the scoring of ‘who’ candidates. The idea for the third scoring factor, closeness to the publishing date, is that events reported on by news articles often occurred on the same day or on the day before the article was published. For example, if a candidate represents a date one or more years before the publishing date of the article, the candidate will achieve the lowest possible score in the third scoring factor. The fourth scoring factor prefers temporal candidates that have a short duration, since events according to our definition happen during a specific point in time with a short duration. We logarithmically normalize the duration factor between one minute and one month (cf. [49]). The resulting scoring formula for a temporal candidate c is the sum of the weighted scoring factors shown in Table 2.

Table 2: Weights and scoring factors for ‘when’ phrases

i              w_when,i   s_when,i
0 (position)   .24        pos(c)
1 (frequency)  .16        f(c)
2 (closeness)  .4         1 − min(1, Δs(c, d_pub) / e_max)
3 (duration)   .2         1 − min(1, (log s(c) − log s_min) / (log s_max − log s_min))

To count n_f(c), we consider two TIMEX3 instances similar if their start and end dates are at most 24 h apart. Δs(c, d_pub) is the difference in seconds between candidate c and the publication date of the news article d_pub, s(c) the duration in seconds of c, and the normalization constants are e_max ≈ 2.5 Ms (one month in seconds), s_min = 60 s, and s_max ≈ 31 Ms (one year).
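A sketch of the closeness and duration factors with the normalization constants given above follows. A candidate is represented simply as a (start, end) datetime pair, which is an assumption for the example rather than the system’s internal TIMEX3 representation.

# Sketch of the 'when' closeness and duration factors (Table 2). A candidate
# is represented here as a (start, end) datetime pair; this is a stand-in for
# the TIMEX3 instances used internally.
import math
from datetime import datetime

E_MAX = 2.5e6   # ~ one month in seconds
S_MIN = 60.0    # one minute
S_MAX = 31e6    # ~ one year in seconds

def closeness(start, end, publication_date):
    delta = abs((start - publication_date).total_seconds())
    return 1.0 - min(1.0, delta / E_MAX)

def duration(start, end):
    seconds = max((end - start).total_seconds(), S_MIN)
    return 1.0 - min(1.0, (math.log(seconds) - math.log(S_MIN)) / (math.log(S_MAX) - math.log(S_MIN)))

pub = datetime(2016, 11, 11)
whole_day = (datetime(2016, 11, 10), datetime(2016, 11, 11))   # e.g. "yesterday"
last_year = (datetime(2015, 11, 10), datetime(2015, 11, 11))
print(closeness(*whole_day, pub), duration(*whole_day))   # closeness near 1.0
print(closeness(*last_year, pub), duration(*last_year))   # closeness 0.0 (more than a month away)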
The scoring of location candidates follows four scoring factors: the candidate shall occur (1) early and (2) often in the article. It should also be (3) often geographically contained in other location candidates and (4) specific. The first two scoring factors have the same motivation as in the scoring of ‘who’ and ‘when’ candidates. The second and third scoring factors aim to find locations that occur often, either by being similar to other location candidates or by being geographically contained in them. The fourth scoring factor favors specific locations, e.g., Berlin, over broader mentions of location, e.g., Germany or Europe. We logarithmically normalize the location specificity between a_min = 225 m² (a small property’s size) and a_max = 530,000 km² (approximately the mean area of all countries [43]). We discuss other scoring options in Section 5. The used weights and scoring factors are shown in Table 3. We measure n_f(c), the number of similar mentions of candidate c, by counting how many other candidates have the same Nominatim place ID. We measure n_e(c) by counting how many other candidates are geographically contained within the bounding box of c, where a(c) is the area of the bounding box of c in square meters.

Table 3: Weights and scoring factors for ‘where’ phrases

i                w_where,i   s_where,i
0 (position)     .37         pos(c)
1 (frequency)    .3          f(c)
2 (containment)  .3          n_e(c) / max_{c'∈C} n_e(c')
3 (specificity)  .03         1 − min(1, (log a(c) − log a_min) / (log a_max − log a_min))
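The containment and specificity factors can be sketched directly from Nominatim-style bounding boxes, as shown below. The bounding-box ordering and the equirectangular area approximation are assumptions made for the example.

# Sketch of the 'where' containment and specificity factors (Table 3), using
# bounding boxes (south, north, west, east) in degrees. The equirectangular
# area approximation is an assumption for the example.
import math

A_MIN = 225.0    # m², a small property
A_MAX = 5.3e11   # m², ~ mean country area (530,000 km²)

def contains(outer, inner):
    s1, n1, w1, e1 = outer
    s2, n2, w2, e2 = inner
    return s1 <= s2 and n1 >= n2 and w1 <= w2 and e1 >= e2

def area_m2(box):
    south, north, west, east = box
    mid_lat = math.radians((south + north) / 2.0)
    height = (north - south) * 110_540.0
    width = (east - west) * 111_320.0 * math.cos(mid_lat)
    return max(height * width, A_MIN)

def n_e(candidate_box, all_boxes):
    # number of other candidates geographically contained in candidate_box
    return sum(1 for other in all_boxes if other is not candidate_box and contains(candidate_box, other))

def containment_factor(candidate_box, all_boxes):
    max_ne = max(n_e(b, all_boxes) for b in all_boxes) or 1
    return n_e(candidate_box, all_boxes) / max_ne

def specificity_factor(candidate_box):
    a = area_m2(candidate_box)
    return 1.0 - min(1.0, (math.log(a) - math.log(A_MIN)) / (math.log(A_MAX) - math.log(A_MIN)))

germany = (47.27, 55.06, 5.87, 15.04)
berlin = (52.33, 52.68, 13.09, 13.76)
boxes = [germany, berlin]
print(specificity_factor(berlin) > specificity_factor(germany))         # True: Berlin is more specific
print(containment_factor(germany, boxes), containment_factor(berlin, boxes))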
Scoring causal candidates was challenging, since it often requires semantic interpretation of the text, and simple heuristics may fail [11]. We define two objectives: candidates shall (1) occur early in the document, and (2) their causal type shall be reliable [26]. The second scoring factor rewards causal types with low ambiguity (cf. [3, 11]); e.g., “because” has a very high likelihood that the subsequent phrase contains a cause [11]. The weighted scoring factors are shown in Table 4. The causal type CT(c) = 1 if c is extracted due to a causal conjunction, 0.62 if it starts with a causative RB, and 0.06 if it contains a causative VB (cf. [25, 26]).

Table 4: Weights and scoring factors for ‘why’ phrases

i              w_why,i   s_why,i
0 (position)   .56       pos(c)
1 (type)       .44       CT(c)

The scoring of method candidates uses three simple scoring factors: the candidate shall occur (1) early and (2) often in the news article, and (3) its method type shall be reliable. The weighted scoring factors for method candidates are shown in Table 5.

Table 5: Weights and scoring factors for ‘how’ phrases

i              w_how,i   s_how,i
0 (position)   .23       pos(c)
1 (frequency)  .14       f(c)
2 (type)       .63       TM(c)

The method type TM(c) = 1 if c is extracted because of a copulative conjunction, else 0.41. We determine the number of mentions of a method phrase, n_f(c), by the term frequency (including inflected forms) of its most frequent token (cf. [45]).

The final sub-task in candidate scoring is combined scoring, which adjusts the scores of candidates of a single 5W1H question depending on the candidates of other questions. To improve the scoring of method candidates, we devise a combined sentence-distance scorer. The assumption is that the method of performing an action should be close to the mention of the action. The resulting equation for a method candidate c given an action candidate a is:

s_how,new(c, a) = s_how(c) − w_0 · |n_pos(c) − n_pos(a)| / d_len     (1)

where w_0 = 1. Section 5 describes additional scoring approaches.
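A sketch of this sentence-distance adjustment is shown below; the candidate dictionaries with their phrase, sentence index, and score fields are hypothetical stand-ins for the system’s internal candidate objects.

# Sketch of the combined sentence-distance scoring of Eq. (1): the score of a
# 'how' candidate is reduced by its sentence distance to the chosen action
# candidate. The candidate dictionaries are hypothetical stand-ins.
W0 = 1.0

def rescore_how(how_candidates, action_candidate, doc_len_sentences):
    """Return 'how' candidates with scores adjusted by Eq. (1), best first."""
    rescored = []
    for c in how_candidates:
        distance = abs(c["sentence_index"] - action_candidate["sentence_index"])
        new_score = c["score"] - W0 * distance / doc_len_sentences
        rescored.append({**c, "score": new_score})
    return sorted(rescored, key=lambda c: c["score"], reverse=True)

action = {"phrase": "attacked the consulate", "sentence_index": 0}
how_candidates = [
    {"phrase": "with a truck bomb", "sentence_index": 1, "score": 0.55},
    {"phrase": "quickly", "sentence_index": 9, "score": 0.60},
]
print(rescore_how(how_candidates, action, doc_len_sentences=12)[0]["phrase"])
# -> "with a truck bomb": closer to the action, despite the lower initial score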
3.4 Output

The highlighted phrases in Figure 1 are candidates extracted by Giveme5W1H for each of the 5W1H event properties of the shown article. Giveme5W1H enriches the returned phrases with additional information that the system extracted for its own analysis or during custom enrichment, with which users can integrate their own preprocessing. The additional information for each token is its POS tag, parse-tree context, and NE type if applicable. Enriching the tokens with this information increases the efficiency of the overall analysis workflow in which Giveme5W1H may be embedded, since later analysis tasks can reuse the information.

For the temporal and locality phrases, Giveme5W1H also provides their canonical forms, i.e., TIMEX3 instances and geocodes. For the news article shown in Figure 1, the canonical form of the ‘when’ phrase represents the entire day of November 10, 2016. The canonical geocode for the ‘where’ phrase represents the coordinates of the center of the city of Mazar-i-Sharif (36°42'30.8"N 67°07'09.7"E), where the bounding box represents the area of the city, along with further information from OSM, such as a canonical name and a place ID, which uniquely identifies the place. Lastly, Giveme5W1H provides linked YAGO concepts [30] for other NEs.

3.5 Parameter Learning

Determining the best values for the parameters introduced in Section 3, e.g., the weights of the scoring factors, is a supervised ML problem [6]. Since there is no gold standard for journalistic 5W1H extraction on news (see Section 2), we created an annotated dataset.

The dataset is available in the open-source repository (see Section 6). To facilitate diversity in both content and writing style, we selected 13 major news outlets from the U.S. and the UK. We sampled 100 articles from the news categories politics, disaster, entertainment, business, and sports for November 6–14, 2016. We crawled the articles [18] and manually revised the extracted information to ensure that it was free of extraction errors.

We asked three annotators (graduate IT students, aged between 22 and 26) to read each of the 100 news articles and to annotate the single most suitable phrase for each 5W1H question. Finally, for each article and question, we combined the annotations using a set of combination rules, e.g., if all phrases were semantically equal, we selected the most concise phrase, or, if there was no agreement between the annotators, we selected each annotator’s first phrase, resulting in three semantically diverging but valid phrases. We also manually added a TIMEX3 instance to each ‘when’ annotation, which was used by the error function for ‘when’. The intercoder reliability was ICR_ann = 0.81, measured using average pairwise percentage agreement.

We divided the dataset into two subsets for training (80% randomly sampled articles) and testing (20%). To find the optimal parameter values for our extraction method, we first computed for each parameter configuration the mean error (ME) on the training set. To measure the ME of a configuration, we devised three error functions to measure the semantic distance between candidate phrases and annotated phrases. For the textual candidates, i.e., who, what, why, and how, we used the Word Mover’s Distance (WMD) [27]. WMD is a state-of-the-art generic measure for the semantic similarity of two phrases. For ‘when’ candidates, we computed the difference in seconds between candidate and annotation.
For ‘where’ candidates, we computed the distance in meters between both coordinates. We linearly normalized all measures.

We then validated the 5% best-performing configurations on the test set and discarded all configurations that yielded a significantly different ME. Finally, we selected the best-performing parameter configuration for each question.
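The parameter search can be sketched as a plain grid search over weight configurations, as below. The extraction and error callables are left abstract and are hypothetical stand-ins for the actual pipeline; this is one possible realization of the procedure described above, not the exact implementation.

# Sketch of the parameter learning as a grid search over scoring weights.
# The error function is left abstract: in the paper it is WMD for the textual
# questions, time difference in seconds for 'when', and distance in meters
# for 'where', each linearly normalized.
import itertools
import statistics

def mean_error(weights, training_set, extract, error):
    """ME of one configuration: average error over the training articles.
    `extract(article, weights)` and `error(candidate, annotation)` are
    hypothetical callables standing in for the actual pipeline."""
    return statistics.mean(error(extract(article, weights), article["annotation"])
                           for article in training_set)

def grid_search(training_set, extract, error, steps=(0.0, 0.25, 0.5, 0.75, 1.0), n_factors=3):
    # enumerate weight vectors that sum to 1, matching the weighted-sum scoring
    grid = [w for w in itertools.product(steps, repeat=n_factors) if abs(sum(w) - 1.0) < 1e-9]
    scored = [(mean_error(w, training_set, extract, error), w) for w in grid]
    scored.sort(key=lambda pair: pair[0])
    # the best 5% of configurations are then validated on the held-out test set
    return scored[: max(1, len(scored) // 20)]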
4 EVALUATION

We used the same evaluation rules and procedure as described by Hamborg et al. [17] but employed a larger dataset of 120 news articles, which we sampled from the BBC dataset [12] in order to conduct a survey with three assessors. The dataset contains 24 news articles in each of the following categories: business (Bus), entertainment (Ent), politics (Pol), sport (Spo), and tech (Tec). We asked the assessors to read one article at a time. After reading each article, we showed the assessors the 5W1H phrases that had been extracted by the system and asked them to judge the relevance of each answer on a 3-point scale: non-relevant (if an answer contained no relevant information, score s = 0), partially relevant (if only part of the answer was relevant or if information was missing, s = 0.5), and relevant (if the answer was completely relevant without missing information, s = 1).

Table 6 shows the mean average generalized precision (MAgP), a score suitable for multi-graded relevance assessments [17, 23]. MAgP was 0.73 over all categories and questions. If only considering the first 4Ws, which the literature considers sufficient to represent an event (cf. [21, 40, 49]), the overall MAgP was 0.82.

Table 6: ICR and MAgP performance of Giveme5W1H

Question   ICR   Bus   Ent   Pol   Spo   Tec   Avg.
Who        .93   .98   .88   .89   .97   .90   .92
What       .88   .85   .69   .89   .84   .66   .79
When       .89   .55   .91   .79   .81   .82   .78
Where      .95   .82   .63   .85   .79   .80   .78
Why        .96   .48   .62   .42   .45   .42   .48
How        .87   .63   .58   .68   .51   .65   .61
Avg. all   .91   .72   .72   .75   .73   .71   .73
Avg. 4W    .91   .80   .78   .86   .85   .80   .82

Of the few existing approaches capable of extracting phrases that answer all six 5W1H questions (see Section 2), only one publication reported the results of an evaluation: the approach developed by Khodra achieved a precision of 0.74 on Indonesian articles [24]. Others did not conduct any evaluation [36] or only evaluated the extracted ‘who’ and ‘what’ phrases of Japanese news articles [22].

We also investigated the performance of systems that are only capable of extracting 5W phrases. Our system achieves MAgP_5W = 0.75, which is 0.05 higher than the MAgP of Giveme5W [17]. We also compared the performance with other systems, despite the difficulties mentioned by Hamborg et al. [17]: other systems were tested on non-disclosed datasets [34, 47, 48], they were translated from other languages [34], they were devised for different languages [22, 24, 45], or they used different evaluation measures, such as error rates [47] or binary relevance assessments [48], neither of which is optimal because of the non-binary relevance of 5W1H answers (cf. [23]). Finally, none of the related systems have been made publicly available or have been described in sufficient detail to enable a re-implementation, which was the primary motivation for our research (see Section 1).

Therefore, a direct comparison of the results with related work was not possible. Compared to the fraction of correct 5W answers achieved by the best system by Parton et al. [34], Giveme5W1H achieves a 0.12 higher MAgP_5W. The best system by Yaman et al. achieved a precision P_5W = 0.89 [48], which is 0.14 higher than our MAgP_5W and – as a rough approximation of the best achievable precision [20] – surprisingly almost identical to the ICR of our assessors.

We found that the different forms of journalistic presentation in the five news categories of the dataset led to different extraction performance. Politics articles, which yielded the best performance, mostly reported on single events. The performance on sports articles was unexpectedly high, even though they not only report on single events but are also background reports or announcements, for which event detection is more difficult. Determining the ‘how’ in sports articles was difficult (MAgP_how = 0.51), since articles often implicitly described the method of an event, e.g., how one team won a match, by reporting on multiple key events during the match. Some categories, such as entertainment and tech, achieved lower extraction performance, mainly because they often contained much background information on earlier events and the actors involved.

5 DISCUSSION AND FUTURE WORK

Most importantly, we plan to improve the extraction quality for the ‘what’ question, one of the important 4W questions. We aim to achieve an extraction performance similar to that of the ‘who’ extraction (MAgP_who = 0.91), since both are strongly related. In our evaluation, we identified two main issues: (1) joint extraction of optimal ‘who’ candidates with non-optimal ‘what’ candidates and (2) cut-off ‘what’ candidates. In some cases (1), the headline contained a concise ‘who’ phrase but the ‘what’ phrase did not contain all information, e.g., because it only aimed to catch the reader’s interest, a journalistic hook (Section 2). We plan to devise separate extraction methods for both questions. Thereby, we need to ensure that the top candidates of both questions fit each other, e.g., by verifying that the semantic concepts of the answers of both questions, e.g., represented by the nouns in the ‘who’ phrase or the verbs in the ‘what’ phrase, co-occur in at least one sentence of the article. In other cases (2), our strategy to avoid too detailed ‘what’ candidates (Section 3.2) cut off the relevant information, e.g., “widespread corruption in the finance ministry has cost it $2m”, in which the underlined text was cut off.
information, e.g., “widespread corruption in the finance minis-        mentions in the text to standardized TIMEX3 instances, loca-
try has cost it $2m”, in which the underlined text was cut off.        tions to geocoordinates, and other NEs, e.g., persons and organ-
We will investigate dependency parsing and further syntax              izations, to unique concepts in a knowledge graph. The system
rules, e.g., to always include the direct object of a transitive       uses syntactic and domain-specific rules to extract and score
verb.                                                                  phrases for each 5W1H question. Giveme5W1H achieved a
   For ‘when’ and ‘where’ questions, we found that in some cases an article does not explicitly mention the main event’s date or location. The date of an event may be implicitly defined by the reported event, e.g., “in the final of the Canberra Classic”. The location may be implicitly defined by the main actor, e.g., “Apple Postpones Release of […]”, which likely happened at the Apple headquarters in Cupertino. Similarly, the proper noun “Stanford University” also defines a location. We plan to investigate how we can use the YAGO concepts, which are linked to NEs, to gather further information regarding the date and location of the main event. If no date can be identified, the publishing date of the article or the day before it might sometimes be a suitable fallback date.
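Such a fallback could be as simple as the following sketch; the function and its parameters are hypothetical, and whether the publishing date or the previous day is more plausible would still have to be decided heuristically.

from datetime import date, timedelta
from typing import Optional

def when_with_fallback(extracted: Optional[date], publishing_date: date,
                       prefer_previous_day: bool = False) -> date:
    """Return the extracted event date if available; otherwise fall back to
    the article's publishing date or, optionally, the day before it."""
    if extracted is not None:
        return extracted
    return publishing_date - timedelta(days=1) if prefer_previous_day else publishing_date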
   Using the standardized TIMEX3 instances from SUTime is an improvement (MAgP_when = 0.78) over a first version, where we used dates without a duration (MAgP_when = 0.72).
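To illustrate why durations matter, a ‘when’ candidate can be represented as the interval implied by its TIMEX3 value instead of a single day. The week value and dates below are hypothetical examples of SUTime-style output, not taken from our dataset.

from dataclasses import dataclass
from datetime import date

@dataclass
class WhenCandidate:
    """A 'when' answer as an interval, as implied by a TIMEX3 value such as
    a calendar week ("2016-W45"), rather than a single day."""
    start: date
    end: date

    def covers(self, day: date) -> bool:
        return self.start <= day <= self.end

# Hypothetical candidate for the TIMEX3 value "2016-W45" (ISO week 45 of 2016).
week_45 = WhenCandidate(start=date(2016, 11, 7), end=date(2016, 11, 13))
print(week_45.covers(date(2016, 11, 10)))  # True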
   The extraction of ‘why’ and ‘how’ phrases was most challenging, which manifests in lower extraction performance compared to the other questions. One reason is that articles often do not explicitly state a single cause or method of an event, but implicitly describe this throughout the article, particularly in sports articles (see Section 5). In such cases, NLP methods are currently not advanced enough to find and abstract or summarize the cause or method (see Section 3.3). However, we plan to improve the extraction accuracy by preventing the system from returning false positives. For instance, in cases where no cause or method could be determined, we plan to introduce a score threshold to prevent the system from outputting candidates with a low score, which are presumably wrong. Currently, the system always outputs a candidate if at least one cause or method was found.
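A minimal sketch of such a cut-off is shown below; the threshold of 0.5 and the candidate records are placeholders, and a suitable value would have to be tuned on held-out data.

def filter_low_confidence(candidates, min_score=0.5):
    """Return only 'why'/'how' candidates whose score reaches the threshold,
    so that no answer is output when even the best candidate is presumably
    wrong. The threshold of 0.5 is a placeholder value."""
    return [c for c in candidates if c["score"] >= min_score]

# Hypothetical scored candidates.
why_candidates = [
    {"phrase": "because of widespread corruption", "score": 0.81},
    {"phrase": "to catch the reader's interest", "score": 0.12},
]
print(filter_low_confidence(why_candidates))  # keeps only the first candidate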
   To improve the performance of all textual questions, i.e., who, what, why, and how, we will investigate two approaches. First, we want to improve measuring a candidate’s frequency, an important scoring factor in multiple questions (Section 3.3). We currently use the number of coreferences, which does not include synonymous mentions. We plan to count the number of YAGO concepts that are semantically related to the current candidate. Second, we found that a few top candidates of the four textual questions were semantically correct but only contained a pronoun referring to the more meaningful noun. We plan to add the coreference’s original mention to extracted answers.
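Both ideas could be prototyped roughly as follows; the coreference chains and linked concept identifiers are hypothetical stand-ins for the output of a coreference resolver and an entity linker such as the one used for YAGO.

def add_original_mention(answer_phrase, pronoun_to_mention):
    """If the extracted answer is only a pronoun, append the coreference
    chain's original mention, e.g., "they" -> "they [the Taliban]"."""
    mention = pronoun_to_mention.get(answer_phrase.strip().lower())
    if mention:
        return f"{answer_phrase} [{mention}]"
    return answer_phrase

def concept_frequency(candidate_concepts, linked_mentions):
    """Count all mentions linked to a concept related to the candidate, so
    that synonymous mentions also contribute to the frequency score."""
    return sum(1 for concept in linked_mentions if concept in candidate_concepts)

# Hypothetical resolver and linker output.
chains = {"they": "the Taliban"}
mentions = ["Taliban", "Taliban", "militant_group"]
print(add_original_mention("they", chains))                    # they [the Taliban]
print(concept_frequency({"Taliban", "militant_group"}, mentions))  # 3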

6    CONCLUSION
In this paper, we proposed Giveme5W1H, the first open-source system that extracts answers to the journalistic 5W1H questions, i.e., who did what, when, where, why, and how, to describe a news article’s main event. The system canonicalizes temporal mentions in the text to standardized TIMEX3 instances, locations to geocoordinates, and other NEs, e.g., persons and organizations, to unique concepts in a knowledge graph. The system uses syntactic and domain-specific rules to extract and score phrases for each 5W1H question. Giveme5W1H achieved a mean average generalized precision (MAgP) of 0.73 on all questions, and an MAgP of 0.82 on the first four W questions (who, what, when, and where), which alone can represent an event. Extracting the answers to ‘why’ and ‘how’ performed more poorly, since articles often only imply causes and methods. Answering the 5W1H questions is at the core of understanding any article, and thus an essential task in many research efforts that analyze articles. We hope that redundant implementations and non-reproducible evaluations can be avoided with Giveme5W1H as the first universally applicable, modular, and open-source 5W1H extraction system. In addition to benefiting developers and computer scientists, our system especially benefits researchers from the social sciences, for whom automated 5W1H extraction was previously not made accessible.
   Giveme5W1H and the datasets for training and evaluation are available at: https://github.com/fhamborg/Giveme5W1H

REFERENCES
[1] Agence France-Presse 2016. Taliban attacks German consulate in northern Afghan city of Mazar-i-Sharif with truck bomb. The Telegraph.
[2] Allan, J. et al. 1998. Topic detection and tracking pilot study: Final report. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop (1998), 194–218.
[3] Asghar, N. 2016. Automatic Extraction of Causal Relations from Natural Language Texts: A Comprehensive Survey. arXiv preprint arXiv:1605.07895. (2016).
[4] Best, C. et al. 2005. Europe media monitor.
[5] Bird, S. et al. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
[6] Burnham, K.P. and Anderson, D.R. 2002. Model selection and multimodel inference: a practical information-theoretic approach.
[7] Chang, A.X. and Manning, C.D. 2012. SUTime: A library for recognizing and normalizing time expressions. LREC. iii (2012), 3735–3740. DOI:https://doi.org/10.1017/CBO9781107415324.004.
[8] Christian, D. et al. 2014. The Associated Press stylebook and briefing on media law. The Associated Press.
[9] Das, A. et al. 2012. The 5W structure for sentiment summarization-visualization-tracking. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012), 540–555.
[10] Finkel, J.R. et al. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (2005), 363–370.
[11] Girju, R. 2003. Automatic detection of causal relations for question answering. Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering - Volume 12 (2003), 76–83.
[12] Greene, D. and Cunningham, P. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. Proceedings of the 23rd International Conference on Machine Learning (2006), 377–384.
[13] Hamborg, F. et al. 2019. Automated Identification of Media Bias by Word Choice and Labeling in News Articles. Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) (Urbana-Champaign, IL, USA, 2019), 1–10.
[14] Hamborg, F. et al. 2018. Automated identification of media bias in news articles: an interdisciplinary literature review. International Journal on Digital Libraries. (2018), 1–25. DOI:https://doi.org/10.1007/s00799-018-0261-y.
[15] Hamborg, F. et al. 2018. Bias-aware news analysis using matrix-based news aggregation. International Journal on Digital Libraries. (2018). DOI:https://doi.org/10.1007/s00799-018-0239-9.
[16] Hamborg, F. et al. 2018. Extraction of Main Event Descriptors from News Articles by Answering the Journalistic Five W and One H Questions. Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) (Fort Worth, Texas, USA, 2018), 339–340.
[17] Hamborg, F. et al. 2018. Giveme5W: Main Event Retrieval from News Articles by Extraction of the Five Journalistic W Questions. Proceedings of the iConference 2018 (Sheffield, UK, 2018), 355–356.
[18] Hamborg, F. et al. 2017. news-please: A Generic News Crawler and Extractor. Proceedings of the 15th International Symposium of Information Science (2017), 218–223.
[19] Hoffart, J. et al. 2011. Robust Disambiguation of Named Entities in Text. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (2011), 782–792.
[20] Hripcsak, G. and Rothschild, A.S. 2005. Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association. 12, 3 (2005), 296–298. DOI:https://doi.org/10.1197/jamia.M1733.
[21] Ide, I. et al. 2005. TrackThem: Exploring a large-scale news video archive by tracking human relations. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2005), 510–515.
[22] Ikeda, T. et al. 1998. Information Classification and Navigation Based on 5W1H of the Target Information. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1 (1998), 571–577.
[23] Kekäläinen, J. and Järvelin, K. 2002. Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology. 53, 13 (2002), 1120–1129.
[24] Khodra, M.L. 2015. Event extraction on Indonesian news article using multiclass categorization. ICAICTA 2015 - 2015 International Conference on Advanced Informatics: Concepts, Theory and Applications (2015).
[25] Khoo, C.S.G. et al. 1998. Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary and Linguistic Computing. 13, 4 (1998), 177–186.
[26] Khoo, C.S.G. 1995. Automatic identification of causal relations in text and their use for improving precision in information retrieval.
[27] Kusner, M.J. et al. 2015. From Word Embeddings To Document Distances. Proceedings of The 32nd International Conference on Machine Learning. 37, (2015), 957–966.
[28] Lejeune, G. et al. 2015. Multilingual event extraction for epidemic detection. Artificial Intelligence in Medicine. (2015). DOI:https://doi.org/10.1016/j.artmed.2015.06.005.
[29] Li, H. et al. 2003. InfoXtract location normalization: a hybrid approach to geographic references in information extraction. Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References. 1, (2003), 39–44. DOI:https://doi.org/10.3115/1119394.1119400.
[30] Mahdisoltani, F. et al. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. Proceedings of CIDR. (2015), 1–11.
[31] Miller, G.A. 1995. WordNet: a lexical database for English. Communications of the ACM. 38, 11 (1995), 39–41. DOI:https://doi.org/10.1145/219717.219748.
[32] Oliver, P.E. and Maney, G.M. 2000. Political Processes and Local Newspaper Coverage of Protest Events: From Selection Bias to Triadic Interactions. American Journal of Sociology. 106, 2 (2000), 463–505.
[33] Oxford English 2009. Oxford English Dictionary. Oxford University Press.
[34] Parton, K. et al. 2009. Who, what, when, where, why?: comparing multiple approaches to the cross-lingual 5W task. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1 (2009), 423–431.
[35] Peters, C. et al. 2012. Improving the Hook in Case Writing. Journal of Case Studies. 30, (2012), 1–6.
[36] Sharma, S. et al. 2013. News Event Extraction Using 5W1H Approach & Its Analysis. International Journal of Scientific & Engineering Research - IJSER. 4, 5 (2013), 2064–2067.
[37] Sharp, D. 2002. Kipling’s guide to writing a scientific paper. Croatian Medical Journal. 43, 3 (2002), 262–267.
[38] Strötgen, J. and Gertz, M. 2013. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation. 47, 2 (2013), 269–298. DOI:https://doi.org/10.1007/s10579-012-9179-y.
[39] Suchanek, F.M. et al. 2007. YAGO: a core of semantic knowledge. Proceedings of the 16th International Conference on World Wide Web. (2007), 697–706. DOI:https://doi.org/10.1145/1242572.1242667.
[40] Sundberg, R. and Melander, E. 2013. Introducing the UCDP Georeferenced Event Dataset. Journal of Peace Research. 50, 4 (2013), 523–532. DOI:https://doi.org/10.1177/0022343313484347.
[41] Sundheim, B. 1992. Overview of the fourth message understanding evaluation and conference. Proceedings of the 4th Conference on Message Understanding (1992), 3–21.
[42] Tanev, H. et al. 2008. Real-time news event extraction for global crisis monitoring. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2008), 207–218.
[43] The CIA World Factbook: 2016. https://www.cia.gov/library/publications/the-world-factbook/geos/.
[44] TimeML Working Group 2009. Guidelines for Temporal Expression Annotation for English for TempEval 2010. (2009), 1–14.
[45] Wang, W. et al. 2010. Chinese news event 5W1H elements extraction using semantic role labeling. Information Processing (ISIP), 2010 Third International Symposium on (2010), 484–489.
[46] Wick, M.L. et al. 2008. A unified approach for schema matching, coreference and canonicalization. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’08 (2008), 722.
[47] Yaman, S. et al. 2009. Classification-based strategies for combining multiple 5-W question answering systems. INTERSPEECH (2009), 2703–2706.
[48] Yaman, S. et al. 2009. Combining semantic and syntactic information sources for 5-W question answering. INTERSPEECH (2009), 2707–2710.
[49] Yang, Y. et al. 1998. A study on retrospective and on-line event detection. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’98 (1998), 28–36.