=Paper=
{{Paper
|id=Vol-2554/paper6
|storemode=property
|title=Giveme5W1H: A Universal System for Extracting Main Events from News Articles
|pdfUrl=https://ceur-ws.org/Vol-2554/paper_06.pdf
|volume=Vol-2554
|authors=Felix Hamborg,Corinna Breitinger,Bela Gipp
|dblpUrl=https://dblp.org/rec/conf/recsys/HamborgBG19
}}
==Giveme5W1H: A Universal System for Extracting Main Events from News Articles==
Felix Hamborg¹, Corinna Breitinger¹, Bela Gipp²
¹ University of Konstanz, Germany, {firstname.lastname}@uni-konstanz.de
² University of Wuppertal, Germany, gipp@uni-wuppertal.de

ABSTRACT
Event extraction from news articles is a commonly required prerequisite for various tasks, such as article summarization, article clustering, and news aggregation. Due to the lack of universally applicable and publicly available methods tailored to news datasets, many researchers redundantly implement event extraction methods for their own projects. The journalistic 5W1H questions are capable of describing the main event of an article, i.e., by answering who did what, when, where, why, and how. We provide an in-depth description of an improved version of Giveme5W1H, a system that uses syntactic and domain-specific rules to automatically extract the relevant phrases from English news articles to provide answers to these 5W1H questions. Given the answers to these questions, the system determines an article's main event. In an expert evaluation with three assessors and 120 articles, we determined an overall precision of p=0.73, and p=0.82 for answering the first four W questions, which alone can sufficiently summarize the main event reported on in a news article. We recently made our system publicly available, and it remains the only universal open-source 5W1H extractor capable of being applied to a wide range of use cases in news analysis.

CCS CONCEPTS
• Computing methodologies → Information extraction • Information systems → Content analysis and feature selection • Information systems → Summarization

1 INTRODUCTION
The extraction of a news article's main event is an automated analysis task at the core of a range of use cases, including news aggregation, clustering of articles reporting on the same event, and news summarization [4, 15]. Beyond computer science, other disciplines also analyze news events, for example, researchers from the social sciences analyze how news outlets report on events in what is known as frame analyses [13, 14].
Despite main event extraction being a fundamental task in news analysis, no publicly available method exists that can be applied to the diverse use cases mentioned to capably extract explicit event descriptors from a given article [17]. Explicit event descriptors are properties that occur in a text to describe an event, e.g., the phrases in an article that enable a reader to understand what the article is reporting on. The reliable extraction of event-describing phrases also allows later analysis tasks to use common natural language processing (NLP) methods, such as TF-IDF and cosine similarity, including named entity recognition (NER) [10] and named entity disambiguation (NERD) [19] to assess the similarity of two events. State-of-the-art methods for extracting events from articles suffer from three main shortcomings [17]. First, most approaches only detect events implicitly, e.g., by employing topic modeling [2, 42]. Second, they are specialized for the extraction of task-specific properties, e.g., extracting only the number of injured people in an attack [32, 42]. Lastly, some methods extract explicit descriptors, but are not publicly available, or are described in insufficient detail to allow researchers to reimplement the approaches [34, 45, 47, 48].

KEYWORDS
News Event Detection, 5W1H Extraction, 5W1H Question An- Last year, we introduced Giveme5W1H in the form of a swering, Reporter’s Questions, Journalist’s Questions, 5W QA. poster abstract [16], which was at that time still an in-progress prototype capable of extracting universally usable phrases that answer the journalistic 5W1H questions, i.e., who did what, when, where, why, and how (see Figure 1). This poster, how- ever, did not disclose or discuss the scoring mechanisms used for determining the best candidate phrases during main event extraction. In this paper, we describe in detail how the im- proved version of Giveme5W1H extracts 5W1H phrases and we Copyright © 2019 for this paper by its authors. Use permitted under Creative describe the results of our evaluation of these improvements. Commons License Attribution 4.0 International (CC BY 4.0). We also introduce an annotated data set, which we created to INRA 2019, September 2019, Copenhagen, Denmark Hamborg et al. train our system’s model to improve extraction performance. the current state-of-the-art and focus this section on the extrac- The training data set is available in the online repository (see tion of the ‘how’ phrases. Most systems focus only on the ex- Section 6) and can be used by other researchers to train their traction of 5W phrases without ‘how’ phrases (cf. [9, 34, 47, own 5W1H approaches. This paper is relevant to researchers 48]). The authors of prior work do not justify this, but we sus- and developers from various disciplines with the shared aim of pect two reasons. extracting and analyzing the main events that are being re- First, the ‘how’ question is particularly difficult to extract due ported on in articles. to its ambiguity, as we will explain later in this section. Second, Taliban attacks German consulate in northern Afghan city of ‘how’ (and ‘why’) phrases are considered less important in Mazar-i-Sharif with truck bomb many use cases when compared to the other phrases, particu- The death toll from a powerful Taliban truck bombing at the German larly those answering the ‘who’, ‘what’, ‘when’, and ‘where’ consulate in Afghanistan's Mazar-i-Sharif city rose to at least six Fri- (4W) questions (cf. [21, 40, 49]). For the sake of readability in day, with more than 100 others wounded in a major militant assault. this section, we will also include approaches that only extract The Taliban said the bombing late Thursday, which tore a massive the 5Ws when referring to 5W1H extraction. Aside for the ‘how’ crater in the road and overturned cars, was a "revenge attack" for US extraction, the analysis of approaches for 5W1H or 5W-extrac- air strikes this month in the volatile province of Kunduz that left 32 civilians dead. […] The suicide attacker rammed his explosives-laden tion is generally the same. car into the wall […]. Systems for 5W1H QA on news texts typically perform three tasks to determine the article’s main event [45, 47]: (1) prepro- Figure 1: News article [1] consisting of title (bold), lead para- cessing, (2) phrase extraction [10, 25, 36, 47, 48], where for in- graph (italic), and first of remaining paragraphs. Highlighted stance linguistic rules are used to extract phrases candidates, phrases represent the 5W1H event properties (who did what, and (3) candidate scoring, which selects the best answer for when, where, why, and how). each question by employing heuristics, such as the position of a phrase within the document. 
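The three tasks just described (preprocessing, phrase extraction, and candidate scoring) can be pictured as a small pipeline. The Python sketch below is purely illustrative: the class and function names are ours and do not correspond to the API of Giveme5W1H or of any of the cited systems, and the per-task logic is reduced to toy heuristics so that the overall flow stays visible.

<syntaxhighlight lang="python">
from dataclasses import dataclass

QUESTIONS = ("who", "what", "when", "where", "why", "how")

@dataclass
class Candidate:
    text: str
    sentence_index: int  # position of the containing sentence in the article
    score: float = 0.0

def preprocess(raw_text):
    """Task (1): split the article into sentences; real systems also run
    tokenization, POS tagging, parsing, NER, and coreference resolution."""
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def extract_candidates(sentences):
    """Task (2): collect candidate phrases per question. For brevity, every
    sentence is a candidate for every question; real systems apply linguistic
    rules (e.g., parse-tree patterns) to extract phrases."""
    return {q: [Candidate(s, i) for i, s in enumerate(sentences)]
            for q in QUESTIONS}

def score_candidates(candidates_per_question):
    """Task (3): score candidates and keep the best one per question; the toy
    heuristic prefers phrases that appear early in the article."""
    best = {}
    for question, candidates in candidates_per_question.items():
        for c in candidates:
            c.score = 1.0 - c.sentence_index / len(candidates)
        best[question] = max(candidates, key=lambda c: c.score)
    return best

article = ("Taliban attacks German consulate in northern Afghan city of "
           "Mazar-i-Sharif with truck bomb. The death toll rose to at least "
           "six Friday, with more than 100 others wounded")
answers = score_candidates(extract_candidates(preprocess(article)))
print({q: c.text for q, c in answers.items()})
</syntaxhighlight>

In a full system, extract_candidates would apply the linguistic rules of Section 3.2 and score_candidates the weighted scoring factors of Section 3.3.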
The input data to QA systems is Our objective is to devise an automated method for extracting usually text, such as a full article including the headline, lead the main event being reported on by a given news article. For paragraph, and main text [36], or a single sentence, e.g., in news this purpose, we exclude non-event-reporting articles, such as ticker format [48]. Other systems use automatic speech recog- commentaries or press reviews. First, we define the extracted nition (ASR) to convert broadcasts into text [47]. The outcomes main event descriptors to be concise (requirement R1). This of the process are six textual phrases, one for each of the 5W1H means they must be as short as possible and contain only the questions, which together describe the main event of a given information describing the event, while also being as long as news text, as highlighted in Figure 1. Thus far, no systems have necessary to contain all information of the event. Second, the been described in sufficient detail to allow for a reimplementa- descriptors must be of high accuracy (R2). For this reason, we tion by other researchers. give higher priority to extraction accuracy than execution Both the ‘why’ and ‘how’ question pose a particular chal- speed [17]. We also defined that the developed system must lenge in comparison to the other questions. As discussed by achieve a higher extraction accuracy than Giveme5W [17]. Hamborg et al. [17], determining the reason or cause (i.e. ‘why’) Compared to Giveme5W, the system proposed in this paper not can even be difficult for humans. Often the reason is unknown, only additionally extracts the ‘how’ answer, but its analysis or it is only described implicitly, if at all [11]. Extracting the workflow is more semantics-oriented to address the issues of ‘how’ answer is also difficult, because this question can be an- the previous statistics- and syntax-based extraction. We also swered in many ways. To find ‘how’ candidates, the system by publish the first annotated 5W1H dataset, which we use to Sharma et al. extracts the adverb or adverbial phrase within the learn the optimal parameters. In the Giveme5W implementa- ‘what’ phrase [36]. The tokens extracted with this simplistic ap- tion, the values were based on expert judgement. proach detail the verb, e.g., “He drove quickly”, but do not an- The presented system especially benefits: (1) social scien- swer the method how the action was performed (cf. [37]), e.g., tists with limited programming knowledge, who would benefit by ramming an explosive-laden car into the consulate (in the from ready-to-use main event extraction methods, and (2) example in Figure 1), which is a prepositional phrase. Other ap- computer scientists who are welcome to modify or build on any proaches employ ML [24], but have not been devised for the of the modular components of our system and use our test col- English language. In summary, few approaches exist that ex- lection and results as a benchmark for their implementations. tract ‘how’ phrases. The reviewed approaches provide no de- tails on their extraction method, and achieve poor results, e.g., 2 RELATED WORK they extract adverbs rather than the tool or the method by The extraction of 5W1H phrases from news articles is related which an action was performed (cf. [22, 24, 36]). to closed-domain question answering, which is why some au- None of the reviewed approaches output canonical or nor- thors call their approaches 5W1H question answering (QA) sys- malized data. 
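To make the contrast with purely textual answers concrete, the snippet below shows one plausible shape for canonicalized 5W1H output for the article in Figure 1: the 'when' answer carries a TIMEX3-style value covering the whole day of November 10, 2016, and the 'where' answer carries a geocode of Mazar-i-Sharif (both values follow Section 3.4). The field names and exact structure are illustrative assumptions, not Giveme5W1H's output schema.

<syntaxhighlight lang="python">
# Hypothetical, simplified shape of canonicalized 5W1H output for the article
# in Figure 1; the field names are illustrative, the values follow Section 3.4.
main_event = {
    "who":   {"text": "The Taliban"},
    "what":  {"text": "attacks German consulate"},
    "when":  {"text": "late Thursday",
              # TIMEX3-style canonical form: the entire day of Nov 10, 2016.
              "timex3": {"type": "DATE", "value": "2016-11-10"}},
    "where": {"text": "Mazar-i-Sharif",
              # Geocode of the city center (approx. 36°42'30.8"N 67°07'09.7"E).
              "geocode": {"lat": 36.7086, "lon": 67.1194}},
    "why":   {"text": 'a "revenge attack" for US air strikes'},
    "how":   {"text": "with truck bomb"},
}

# Canonical values can be compared directly, e.g., whether two events happened
# on the same day or within a few kilometers of each other, without relying on
# string matching of the original phrases.
print(main_event["when"]["timex3"]["value"], main_event["where"]["geocode"])
</syntaxhighlight>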
Canonical output is more concise and also less tems. Hamborg et al. [17] gave an in-depth overview of 5W1H ambiguous than its original textual form (cf. [46]), e.g., poly- extraction systems. Thus, we only provide a brief summary of Giveme5W1H: A Universal System for Extracting INRA 2019, September 2019, Copenhagen, Denmark Main Events from News Articles Preprocessing Phrase Extraction Candidate Scoring 5W1H Phrases Python Who & what Raw Article Sentence Split. Canonicalization Action Enrichment Combined Interface CoreNLP Tokenization When Custom Scoring SUTime Environment RESTful POS & Parsing Where Nominatim Cause API NER Why AIDA Coref. Res. Method How Cache input / output analysis process 3rd party libraries Figure 2: The three-phases analysis pipeline preprocesses a news text, finds candidate phrases for each of the 5W1H questions, and scores these. Giveme5W1H can easily be accessed via Python and via a RESTful API. semes, such as crane (animal or machine), have multiple mean- uses coreference resolution, question-specific semantic dis- ings. Hence, canonical data is often more useful for subsequent tance measures, combined scoring of candidates, and extracts analysis tasks (see Section 1). Phrases containing temporal in- phrases for the ‘how’ question. The values of the parameters formation or location information may be canonicalized, e.g., by introduced in this section result from a semi-automated search converting the phrases to dates or timespans [7, 38] or to pre- for the optimal configuration of Giveme5W1H using an anno- cise geographic positions [29]. Phrases answering the other tated learning dataset including a manual, qualitative revision questions could be canonicalized by employing NERD on the (see Section 3.5). contained NEs, and then linking the NEs to concepts defined in a knowledge graph, such as YAGO [19], or WordNet [31]. 3.1 Preprocessing of News Articles While the evaluations of reviewed papers generally indicate Giveme5W1H accepts as input the full text of a news article, in- sufficient quality to be usable for news event extraction, e.g., cluding headline, lead paragraph, and body text. The user can the system by Yaman et al. achieved 𝐹1 = 0.85 on the Darpa specify these three components as one or separately. Option- corpus from 2009 [48], the evaluations lack comparability for ally, the article’s publishing date can be provided, which helps two reasons. First, no gold standard exists for journalistic Giveme5W1H parse relative dates, such as “yesterday at 1 pm”. 5W1H question answering on news articles. A few datasets ex- During preprocessing, we use Stanford CoreNLP for sentence ist for automated question answering, specifically for the pur- splitting, tokenization, lemmatization, POS-tagging, full pars- pose of disaster tracking [28, 41]; However, these datasets are ing, NER (with Stanford NER’s seven-class model), and pro- so specialized to their own use cases that they cannot be ap- nominal and nominal coreference resolution. Since our main plied to the use case of automated journalistic question an- goal is high 5W1H extraction accuracy (rather than fast execu- swering. Another challenge to the evaluation of news event ex- tion speed), we use the best-performing model for each of the traction is that the evaluation data sets of previous papers are CoreNLP annotators, i.e., the ‘neural’ model if available. We use no longer publicly available [34, 47, 48]. Second, previous pa- the default settings for English in all libraries. 
pers each used different quality measures, such as precision After the initial preprocessing, we bring all NEs in the text and recall [9] or error rates [47]. into their canonical form. Following from requirement R1, ca- nonical information is the preferred output of Giveme5W1H, 3 GIVEME5W1H: DESCRIPTION OF since it is the most concise form. Because Giveme5W1H uses METHODS AND SYSTEM the canonical information to extract and score ‘when’ and Giveme5W1H is an open-source main event retrieval system for ‘where’ candidates, we implement the canonicalization task news articles that addresses the objectives we defined in Sec- during preprocessing. tion 1. The system extracts 5W1H phrases that describe the We parse dates written in natural language into canonical most defining characteristics of a news event, i.e., who did what, dates using SUTime [7]. SUTime looks for NEs of the type date when, where, why, and how. This section describes the analysis or time and merges adjacent tokens to phrases. SUTime also workflow of Giveme5W1H, as shown in Figure 1. Giveme5W1H handles heterogeneous phrases, such as “yesterday at 1 pm”, can be accessed by other software as a Python library and via a which consist not only of temporal NEs but also other tokens, RESTful API. Due to its modularity, researchers can efficiently such as function words. Subsequently, SUTime converts each adapt or replace components. For example, researchers can in- temporal phrase into a standardized TIMEX3 [44] instance. tegrate a custom parser or adapt the scoring functions tailored TIMEX3 defines various types, also including repetitive peri- to the characteristics of their data. The system builds on ods. Since events according to our definition occur at a single Giveme5W [17], but improves extraction performance by ad- point in time, we only retrieve datetimes indicating an exact dressing the planned future work directions: Giveme5W1H time, e.g., “yesterday at 6pm”, or a duration, e.g., “yesterday”, which spans the whole day. INRA 2019, September 2019, Copenhagen, Denmark Hamborg et al. Geocoding is the process of parsing places and addresses candidates, we take TIMEX3 instances from preprocessing. written in natural language into canonical geocodes, i.e., one or Similarly, we take the geocodes as ‘where’ candidates. more coordinates referring to a point or area on earth. We look The cause extractor looks for linguistic features indicating a for tokens classified as NEs of the type location (cf. [48]). We causal relation within a sentence’s constituents. We look for merge adjacent tokens of the same NE type within the same three types of cause-effect indicators (cf. [25, 26]): causal con- sentence constituent, e.g., within the same NP or VP. Similar to junctions, causative adverbs, and causative verbs. Causal con- temporal phrases, locality phrases are often heterogeneous, i.e., junctions, e.g. “due to”, “result of”, and “effect of”, connect two they do not only contain temporal NEs but also function words. clauses, whereas the second clause yields the ‘why’ candidate. Hence, we introduce a locality phrase merge range 𝑟where = 1, For causative adverbs, e.g., “therefore”, “hence”, and “thus”, the to merge phrases where up to 𝑟where arbitrary NE tokens are first clause yields the ‘why’ candidate. If we find that one or allowed between two location NEs. Lastly, we geocode the more subsequent tokens of a sentence match with one of the merged phrases with Nominatim1, which uses free data from tokens adapted from Khoo et al. 
[25], we take all tokens on the OpenStreetMap. right (causal conjunction) or left side (causative adverb) as the We canonicalize NEs of the remaining types, e.g., persons ‘why’ candidate. and organizations, by linking NEs to concepts in the YAGO Causative verbs, e.g. “activate” and “implicate”, are con- graph [30] using AIDA [19]. The YAGO graph is a state-of-the- tained in the middle VP of the causative NP-VP-NP pattern, art knowledge base, where nodes in the graph represent se- whereas the last NP yields the ‘why’ candidate [11, 26]. For mantic concepts that are connected to other nodes through at- each NP-VP-NP pattern we find in the parse-tree, we determine tributes and relations. The data is derived from other well-es- whether the VP is causative. To do this, we extract the VP’s tablished knowledge bases, such as Wikipedia, WordNet, Wiki- verb, retrieve the verb’s synonyms from WordNet [31] and Data, and GeoNames [39]. compare the verb and its synonyms with the list of causative verbs from Girju [11], which we also extended by their syno- 3.2 Phrase Extraction nyms (cf. [11]). If there is at least one match, we take the last Giveme5W1H performs four independent extraction chains to NP of the causative pattern as the ‘why’ candidate. To reduce retrieve the article’s main event: (1) the action chain extracts false positives, we check the NP and VP for the causal con- phrases for the ‘who’ and ‘what’ questions, (2) environment for straints for verbs proposed by Girju [11]. ‘when’ and ‘where’, (3) cause for ‘why’, and (4) method for The method extractor retrieves ‘how’ phrases, i.e., the ‘how’. method by which an action was performed. The combined The action extractor identifies who did what in the article’s method consists of two subtasks, one analyzing copulative con- main event. The main idea for retrieving ‘who’ candidates is to junctions, the other looking for adjectives and adverbs. Often, collect the subject of each sentence in the news article. There- sentences with a copulative conjunction contain a method fore, we extract the first NP that is a direct child to the sentence phrase in the clause that follows the copulative conjunction, in the parse tree, and that has a VP as its next right sibling (cf. e.g., “after [the train came off the tracks]”. Therefore, we look [5]). We discard all NPs that contain a child VP, since such NPs for copulative conjunctions compiled from [33]. If a token yield lengthy ‘who’ phrases. Take, for instance, this sentence: matches, we take the right clause as the ‘how’ candidate. To “((NP) Mr. Trump, ((VP) who stormed to a shock election vic- avoid long phrases, we cut off phrases longer than 𝑙how,max = tory on Wednesday)), ((VP) said it was […])”, where “who 10 tokens. The second subtask extracts phrases that consist stormed […]” is the child VP of the NP. We then put the NPs into purely of adjectives or adverbs (cf. [36]), since these often rep- the list of ‘who’ candidates. For each ‘who’ candidate, we take resent how an action was performed. We use this extraction the VP that is the next right sibling as the corresponding ‘what’ method as a fallback, since we found the copulative conjunc- candidate (cf. [5]). To avoid long ‘what’ phrases, we cut VPs af- tion-based extraction too restrictive in many cases. ter their first child NP, which long VPs usually contain. 
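The parse-tree rule described above, i.e., take the first NP that is a direct child of the sentence and whose next right sibling is a VP, and skip NPs that themselves contain a VP, can be sketched with NLTK's Tree class. The snippet assumes a constituency parse (e.g., from CoreNLP) is already available as a bracketed string; it is a simplified illustration, not Giveme5W1H's implementation, and it returns the full VP, so the 'what' truncation discussed in the surrounding text is omitted.

<syntaxhighlight lang="python">
from nltk import Tree

def extract_who_what(parse_str):
    """Illustrative 'who'/'what' rule (cf. Section 3.2): take the first NP that
    is a direct child of S and whose next right sibling is a VP; skip NPs that
    themselves contain a VP, since these yield lengthy 'who' phrases."""
    tree = Tree.fromstring(parse_str)
    sentence = tree[0] if tree.label() == "ROOT" else tree
    children = list(sentence)
    for i in range(len(children) - 1):
        node, right = children[i], children[i + 1]
        if not (isinstance(node, Tree) and isinstance(right, Tree)):
            continue
        if node.label() == "NP" and right.label() == "VP":
            # Discard NPs containing a VP, e.g., "Mr. Trump, who stormed ...".
            if any(sub.label() == "VP" for sub in node.subtrees()):
                continue
            return " ".join(node.leaves()), " ".join(right.leaves())
    return None

parse = ("(ROOT (S (NP (DT The) (NN suicide) (NN attacker)) "
         "(VP (VBD rammed) (NP (PRP$ his) (JJ explosives-laden) (NN car)) "
         "(PP (IN into) (NP (DT the) (NN wall)))) (. .)))")
print(extract_who_what(parse))
# -> ('The suicide attacker', 'rammed his explosives-laden car into the wall')
</syntaxhighlight>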
However, we do not cut the 'what' candidate if the VP contains at most $l_{\mathrm{what,min}} = 3$ tokens and the right sibling of the VP's child NP is a prepositional phrase (PP). This way, we avoid short, undescriptive 'what' phrases. For instance, in the simplified example "((NP) The microchip) ((VP) is ((NP) part) ((PP) of a wider range of the company's products)).", the truncated VP "is part" contains no descriptive information; hence, the presented rules prevent this truncation.
The environment extractor retrieves phrases describing the temporal and locality context of the event. To determine 'when'
¹ https://github.com/openstreetmap/Nominatim, v3.0.0

3.3 Candidate Scoring
The last task is to determine the best candidate for each 5W1H question. The scoring consists of two sub-tasks. First, we score candidates independently for each of the 5W1H questions. Second, we perform a combined scoring in which we adjust the scores of candidates of one question depending on properties, e.g., position, of candidates of other questions. For each question $q$, we use a scoring function that is composed as a weighted sum of $n$ scoring factors: $s_q = \sum_{i=0}^{n-1} w_{q,i}\, s_{q,i}$, where $w_{q,i}$ is the weight of the scoring factor $s_{q,i}$.
To score 'who' candidates, we define three scoring factors: the candidate shall occur in the article (1) early and (2) often, and (3) contain a named entity. The first scoring factor targets the concept of the inverted pyramid [8]: news articles mention the most important information, i.e., the main event, early in the article, e.g., in the headline and lead paragraph, while later paragraphs contain details. However, journalists often use so-called hooks to get the reader's attention without revealing all content of the article [35]. Hence, for each candidate, we also consider the frequency of similar phrases in the article, since the primary actor involved in the main event is likely to be mentioned frequently in the article. Furthermore, if a candidate contains an NE, we score it higher, since in news, the actors involved in events are often NEs, e.g., politicians. Table 1 shows the weights and scoring factors.

Table 1: Weights and scoring factors for 'who' phrases
i (factor)       w_who,i   s_who,i
0 (position)     .9        $\mathrm{pos}(c) = 1 - n_{pos}(c) / d_{len}$
1 (frequency)    .095      $\mathrm{f}(c) = n_f(c) / \max_{c' \in C} n_f(c')$
2 (type)         .005      $\mathrm{NE}(c)$

$n_{pos}(c)$ is the position, measured in sentences, of candidate $c$ within the document, $n_f(c)$ the frequency of phrases similar to $c$ in the document, and $\mathrm{NE}(c) = 1$ if $c$ contains an NE, else 0 (cf. [10]). To measure $n_f(c)$ of the actor in candidate $c$, we use the number of the actor's coreferences, which we extracted during coreference resolution (see Section 3.1). This allows Giveme5W1H to recognize and count name variations as well as pronouns. Due to the strong relation between agent and action, we rank VPs according to their NPs' scores. Hence, the most likely VP is the sibling in the parse tree of the most likely NP: $s_{what} = s_{who}$.
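As a worked illustration of the weighted sum $s_q = \sum_i w_{q,i} s_{q,i}$ and the 'who' factors in Table 1, the sketch below scores a few candidates with the published weights (.9, .095, .005). It is a simplified stand-in rather than the reference implementation: in particular, the frequency factor counts exact string matches here, whereas Giveme5W1H counts the actor's coreferences.

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str            # the candidate phrase
    sentence_index: int  # 0-based position of its sentence in the article
    contains_ne: bool    # whether the phrase contains a named entity

# Weights from Table 1 ('who'): position, frequency, named-entity type.
WHO_WEIGHTS = (0.9, 0.095, 0.005)

def score_who(candidates, num_sentences):
    """Score 'who' candidates with s_q = sum_i w_{q,i} * s_{q,i}."""
    # Simplified frequency: exact-match counts instead of coreference chains.
    freq = {c.text: sum(1 for o in candidates if o.text == c.text)
            for c in candidates}
    max_freq = max(freq.values())
    scored = []
    for c in candidates:
        s_pos = 1.0 - c.sentence_index / num_sentences   # occurs early
        s_freq = freq[c.text] / max_freq                 # occurs often
        s_ne = 1.0 if c.contains_ne else 0.0             # contains an NE
        score = sum(w * s for w, s in zip(WHO_WEIGHTS, (s_pos, s_freq, s_ne)))
        scored.append((score, c))
    return sorted(scored, key=lambda t: t[0], reverse=True)

cands = [Candidate("The Taliban", 0, True),
         Candidate("The death toll", 1, False),
         Candidate("The Taliban", 2, True),
         Candidate("The suicide attacker", 3, False)]
best_score, best = score_who(cands, num_sentences=10)[0]
print(round(best_score, 3), best.text)  # highest-ranked 'who' candidate
</syntaxhighlight>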
We score temporal candidates according to four scoring factors: the candidate shall occur in the article (1) early and (2) often. It should also be (3) close to the publishing date of the article, and (4) of a relatively short duration. The first two scoring factors have the same motivation as in the scoring of 'who' candidates. The idea for the third scoring factor, close to the publishing date, is that events reported on by news articles often occurred on the same day or on the day before the article was published. For example, if a candidate represents a date one or more years in the past before the publishing date of the article, the candidate will achieve the lowest possible score in the third scoring factor. The fourth scoring factor prefers temporal candidates that have a short duration, since events according to our definition happen during a specific point in time with a short duration. We logarithmically normalize the duration factor between one minute and one month (cf. [49]). The resulting scoring formula for a temporal candidate $c$ is the sum of the weighted scoring factors shown in Table 2.

Table 2: Weights and scoring factors for 'when' phrases
i (factor)       w_when,i   s_when,i
0 (position)     .24        $\mathrm{pos}(c)$
1 (frequency)    .16        $\mathrm{f}(c)$
2 (closeness)    .4         $1 - \min\left(1, \Delta_s(c, d_{pub}) / e_{max}\right)$
3 (duration)     .2         $1 - \min\left(1, \frac{\log s(c) - \log s_{min}}{\log s_{max} - \log s_{min}}\right)$

To count $n_f(c)$, we determine two TIMEX3 instances as similar if their start and end dates are at most 24h apart. $\Delta_s(c, d_{pub})$ is the difference in seconds of candidate $c$ and the publication date of the news article $d_{pub}$, $s(c)$ the duration in seconds of $c$, and the normalization constants are $e_{max} \approx 2.5\,\mathrm{Ms}$ (one month in seconds), $s_{min} = 60\,\mathrm{s}$, and $s_{max} \approx 31\,\mathrm{Ms}$ (one year).
The scoring of location candidates follows four scoring factors: the candidate shall occur (1) early and (2) often in the article. It should also be (3) often geographically contained in other location candidates and be (4) specific. The first two scoring factors have the same motivation as in the scoring of 'who' and 'when' candidates. The second and third scoring factors aim to (1) find locations that occur often, either by being similar to others, or (2) by being contained in other location candidates. The fourth scoring factor favors specific locations, e.g., Berlin, over broader mentions of location, e.g., Germany or Europe. We logarithmically normalize the location specificity between $a_{min} = 225\,\mathrm{m}^2$ (a small property's size) and $a_{max} = 530{,}000\,\mathrm{km}^2$ (approx. the mean area of all countries [43]). We discuss other scoring options in Section 5. The used weights and scoring factors are shown in Table 3. We measure $n_f(c)$, the number of similar mentions of candidate $c$, by counting how many other candidates have the same Nominatim place ID. We measure $n_e(c)$ by counting how many other candidates are geographically contained within the bounding box of $c$, where $a(c)$ is the area of the bounding box of $c$ in square meters.

Table 3: Weights and scoring factors for 'where' phrases
i (factor)        w_where,i   s_where,i
0 (position)      .37         $\mathrm{pos}(c)$
1 (frequency)     .3          $\mathrm{f}(c)$
2 (containment)   .3          $n_e(c) / \max_{c' \in C} n_e(c')$
3 (specificity)   .03         $1 - \min\left(1, \frac{\log a(c) - \log a_{min}}{\log a_{max} - \log a_{min}}\right)$

Scoring causal candidates was challenging, since it often requires semantic interpretation of the text and simple heuristics may fail [11]. We define two objectives: candidates shall (1) occur early in the document, and (2) their causal type shall be reliable [26]. The second scoring factor rewards causal types with low ambiguity (cf. [3, 11]), e.g., "because" has a very high likeli-
increases the efficiency of the overall analysis workflow in which Giveme5W1H may be embedded, since later analysis tasks can reuse the information.
hood that the subsequent phrase contains a cause [11]. The For the temporal phrases and locality phrases, weighted scoring factors are shown in Table 4. The causal type Giveme5W1H also provides their canonical forms, i.e., TIMEX3 TC(𝑐) = 1 if 𝑐 is extracted due to a causal conjunction, 0.62 if it instances and geocodes. For the news article shown in Figure 1, starts with a causative RB, and 0.06 if it contains a causative VB the canonical form of the ‘when’ phrase represents the entire (cf. [25, 26]). day of November 10, 2016. The canonical geocode for the ‘where’ phrase represents the coordinates of the center of the Table 4: Weights and scoring factors for ‘why’ phrases city Mazar-i-Sharif (36°42'30.8"N 67°07'09.7"E), where the 𝑖 𝑤why,𝑖 𝑠why,𝑖 bounding box represents the area of the city, and further infor- 0 (position) .56 pos(𝑐) mation from OSM, such as a canonical name and place ID, which 1 (type) .44 CT(𝑐) uniquely identifies the place. Lastly, Giveme5W1H provides linked YAGO concepts [30] for other NEs. The scoring of method candidates uses three simple scoring fac- tors: the candidate shall occur (1) early and (2) often in the 3.5 Parameter Learning news article, and (3) their method type shall be reliable. The Determining the best values for the parameters introduced in weighted scoring factors for method candidates are shown in Section 3, e.g., weights of scoring factors, is a supervised ML Table 5. problem [6]. Since there is no gold standard for journalistic 5W1H extraction on news (see Section 2), we created an anno- Table 5: Weights and scoring factors for ‘how’ phrases tated dataset. 𝑖 𝑤how,𝑖 𝑠how,𝑖 The dataset is available in the open-source repository (see 0 (position) .23 pos(𝑐) Section 6). To facilitate diversity in both content and writing style, we selected 13 major news outlets from the U.S. and the 1 (frequency) .14 f(𝑐) UK. We sampled 100 articles from the news categories politics, 2 (type) .63 TM(𝑐) disaster, entertainment, business and sports for November 6th – 14th, 2016. We crawled the articles [18] and manually revised The method type TM(𝑐) = 1 if 𝑐 is extracted because of a copu- the extracted information to ensure that it was free of extrac- lative conjunction, else 0.41. We determine the number of men- tion errors. tions of a method phrase 𝑛f (𝑐) by the term frequency (includ- We asked three annotators (graduate IT students, aged be- ing inflected forms) of its most frequent token (cf. [45]). tween 22 and 26) to read each of the 100 news articles and to The final sub-task in candidate scoring is combined scoring, annotate the single most suitable phrase for each 5W1H ques- which adjusts scores of candidates of a single 5W1H question tion. Finally, for each article and question, we combined the an- depending on the candidates of other questions. To improve notations using a set of combination rules, e.g., if all phrases the scoring of method candidates, we devise a combined sen- were semantically equal, we selected the most concise phrase, tence-distance scorer. The assumption is that the method of or if there was no agreement between the annotators, we se- performing an action should be close to the mention of the ac- lected each annotator’s first phrase, resulting in three semanti- tion. The resulting equation for a method candidate 𝑐 given an cally diverging but valid phrases. We also manually added a action candidate 𝑎 is: TIMEX3 instance to each ‘when’ annotation, which was used by |𝑛pos (𝑐) − 𝑛pos (𝑎)| (1 the error function for ‘when’. 
The intercoder reliability was 𝑠how,new (𝑐, 𝑎) = 𝑠how (𝑐) − 𝑤0 𝑑len ) ICR ann = 0.81, measured using average pairwise percentage agreement. where 𝑤0 = 1 . Section 5 describes additional scoring ap- We divided the dataset into two subsets for training (80% proaches. randomly sampled articles) and testing (20%). To find the op- timal parameter values for our extraction method, we first 3.4 Output computed for each parameter configuration the mean error The highlighted phrases in Figure 1 are candidates extracted by (ME) on the training set. To measure the ME of a configuration, Giveme5W1H for each of the 5W1H event properties of the we devised three error functions to measure the semantic dis- shown article. Giveme5W1H enriches the returned phrases tance between candidate phrases and annotated phrases. For with additional information that the system extracted for its the textual candidates, i.e., who, what, why, and how, we used own analysis or during custom enrichment, with which users the Word Mover’s Distance (WMD) [27]. WMD is a state-of-the- can integrate their own preprocessing. The additional infor- art generic measure for semantic similarity of two phrases. For mation for each token is its POS-tag, parse-tree context, and NE ‘when’ candidates, we computed the difference in seconds be- type if applicable. Enriching the tokens with this information tween candidate and annotation. For ‘where’ candidates, we Giveme5W1H: A Universal System for Extracting INRA 2019, September 2019, Copenhagen, Denmark Main Events from News Articles computed the distance in meters between both coordinates. [17]: other systems were tested on non-disclosed datasets [34, We linearly normalized all measures. 47, 48], they were translated from other languages [34], they We then validated the 5% best performing configurations were devised for different languages [22, 24, 45], or they used on the test set and discarded all configurations that yielded a different evaluation measures, such as error rates [47] or bi- significantly different ME. Finally, we selected the best per- nary relevance assessments [48], which are both not optimal forming parameter configuration for each question. because of the non-binary relevance of 5W1H answers (cf. [23]). Finally, none of the related systems have been made pub- 4 EVALUATION licly available or have been described in sufficient detail to en- able a re-implementation, which was the primary motivation We use the same evaluation rules and procedure as described for our research (see Section 1). by Hamborg et al. [17] but employed a larger dataset of 120 Therefore, a direct comparison of the results and related news articles, which we sampled from the BBC dataset [12] in work was not possible. Compared to the fraction of correct 5W order to conduct a survey with three assessors. The dataset answers by the best system by Parton et al. [34], Giveme5W1H contains 24 news articles in each of the following categories: achieves a 0.12 higher MAgP5W . The best system by Yaman et business (Bus), entertainment (Ent), politics (Pol), sport (Spo), al. achieved a precision 𝑃5𝑊 = 0.89 [48], which is 0.14 higher and tech (Tec)). We asked the assessors to read one article at a than our MAgP5W and – as a rough approximation of the best time. After reading each article, we showed the assessors the achievable precision [20] – surprisingly almost identical to the 5W1H phrases that had been extracted by the system and ICR of our assessors. 
asked them to judge the relevance of each answer on a 3-point scale: non-relevant (if an answer contained no relevant information, score s = 0), partially relevant (if only part of the answer was relevant or if information was missing, s = 0.5), and relevant (if the answer was completely relevant without missing information, s = 1).
Table 6 shows the mean average generalized precision (MAgP), a score suitable for multi-graded relevance assessments [17, 23]. MAgP was 0.73 over all categories and questions. If only considering the first 4Ws, which the literature considers as sufficient to represent an event (cf. [21, 40, 49]), overall MAgP was 0.82.

Table 6: ICR and MAgP performance of Giveme5W1H
Question   ICR   Bus   Ent   Pol   Spo   Tec   Avg.
Who        .93   .98   .88   .89   .97   .90   .92
What       .88   .85   .69   .89   .84   .66   .79
When       .89   .55   .91   .79   .81   .82   .78
Where      .95   .82   .63   .85   .79   .80   .78
Why        .96   .48   .62   .42   .45   .42   .48
How        .87   .63   .58   .68   .51   .65   .61
Avg. all   .91   .72   .72   .75   .73   .71   .73
Avg. 4W    .91   .80   .78   .86   .85   .80   .82

Of the few existing approaches capable of extracting phrases that answer all six 5W1H questions (see Section 2), only one publication reported the results of an evaluation: the approach developed by Khodra achieved a precision of 0.74 on Indonesian articles [24]. Others did not conduct any evaluation [36] or only evaluated the extracted 'who' and 'what' phrases of Japanese news articles [22]. We also investigated the performance of systems that are only capable of extracting 5W phrases. Our system achieves a MAgP_5W = 0.75, which is 0.05 higher than the MAgP of
We found that different forms of journalistic presentation in the five news categories of the dataset led to different extraction performance. Politics articles, which yielded the best performance, mostly reported on single events. The performance on sports articles was unexpectedly high, even though such articles not only report on single events but are also background reports or announcements, for which event detection is more difficult. Determining the 'how' in sports articles was difficult (MAgP_how = 0.51), since articles often implicitly described the method of an event, e.g., how one team won a match, by reporting on multiple key events during the match. Some categories, such as entertainment and tech, achieved lower extraction performances, mainly because they often contained much background information on earlier events and the actors involved.

5 DISCUSSION AND FUTURE WORK
Most importantly, we plan to improve the extraction quality of the 'what' question, being one of the important 4W questions. We aim to achieve an extraction performance similar to the performance of the 'who' extraction (MAgP_who = 0.91), since both are strongly related. In our evaluation, we identified two main issues: (1) joint extraction of optimal 'who' candidates with non-optimal 'what' candidates and (2) cut-off 'what' candidates. In some cases (1), the headline contained a concise 'who' phrase but the 'what' phrase did not contain all information, e.g., because it only aimed to catch the reader's interest, a journalistic hook (Section 2). We plan to devise separate extraction methods for both questions. Thereby, we need to ensure that the top candidates of both questions fit to each other, e.g., by verifying that the semantic concept of the answer of each question, e.g., represented by the nouns in the 'who' phrase, or verbs in the 'what' phrase, co-occur in at least one sentence of the article.
In other cases (2), our strategy to avoid Giveme5W [17]. We also compared the performance with other too detailed ‘what’ candidates (Section 3.2) cut off the relevant systems, despite the difficulties mentioned by Hamborg et al. INRA 2019, September 2019, Copenhagen, Denmark Hamborg et al. information, e.g., “widespread corruption in the finance minis- mentions in the text to standardized TIMEX3 instances, loca- try has cost it $2m”, in which the underlined text was cut off. tions to geocoordinates, and other NEs, e.g., persons and organ- We will investigate dependency parsing and further syntax izations, to unique concepts in a knowledge graph. The system rules, e.g., to always include the direct object of a transitive uses syntactic and domain-specific rules to extract and score verb. phrases for each 5W1H question. Giveme5W1H achieved a For ‘when’ and ‘where’ questions, we found that in some mean average generalized precision (MAgP) of 0.73 on all ques- cases an article does not explicitly mention the main event’s tions, and an MAgP of 0.82 on the first four W questions (who, date or location. The date of an event may be implicitly defined what, when, and where), which alone can represent an event. by the reported event, e.g., “in the final of the Canberra Classic”. Extracting the answers to ‘why’ and ‘how’ performed more The location may be implicitly defined by the main actor, e.g., poorly, since articles often only imply causes and methods. An- “Apple Postpones Release of […]”, which likely happened at the swering the 5W1H questions is at the core of understanding Apple headquarters in Cupertino. Similarly, the proper noun any article, and thus an essential task in many research efforts “Stanford University” also defines a location. We plan to inves- that analyze articles. We hope that redundant implementations tigate how we can use the YAGO concepts, which are linked to and non-reproducible evaluations can be avoided with NEs, to gather further information regarding the date and loca- Giveme5W1H as the first universally applicable, modular, and tion of the main event. If no date can be identified, the publish- open-source 5W1H extraction system. In addition to benefiting ing date of the article or the day before it might sometimes be a developers and computer scientists, our system especially ben- suitable fallback date. efits researchers from the social sciences, for whom automated Using the standardized TIMEX3 instances from SUTime is an 5W1H extraction was previously not made accessible. improvement (MAgPwhen = 0.78) over a first version, where Giveme5W1H and the datasets for training and evaluation we used dates without a duration (MAgPwhen = 0.72). are available at: https://github.com/fham- The extraction of ‘why’ and ‘how’ phrases was most chal- borg/Giveme5W1H lenging, which manifests in lower extraction performances compared to the other questions. One reason is that articles of- REFERENCES ten do not explicitly state a single cause or method of an event, [1] Agence France-Presse 2016. Taliban attacks German consulate in northern Afghan city of Mazar-i-Sharif with truck bomb. The Telegraph. but implicitly describe this throughout the article, particularly [2] Allan, J. et al. 1998. Topic detection and tracking pilot study: Final report. in sports articles (see Section 5). In such cases, NLP methods Proceedings of the DARPA Broadcast News Transcription and Understanding are currently not advanced enough to find and abstract or sum- Workshop (1998), 194–218. [3] Asghar, N. 
2016. Automatic Extraction of Causal Relations from Natural marize the cause or method (see Section 3.3). However, we Language Texts: A Comprehensive Survey. arXiv preprint arXiv:1605.07895. plan to improve the extraction accuracy by preventing the sys- (2016). [4] Best, C. et al. 2005. Europe media monitor. tem from returning false positives. For instance, in cases where [5] Bird, S. et al. 2009. Natural language processing with Python: analyzing text no cause or method could be determined, we plan to introduce with the natural language toolkit. O’Reilly Media, Inc. a score threshold to prevent the system from outputting candi- [6] Burnham, K.P. and Anderson, D.R. 2002. Model selection and multimodel inference: a practical information-theoretic approach. dates with a low score, which are presumably wrong. Currently, [7] Chang, A.X. and Manning, C.D. 2012. SUTime: A library for recognizing and the system always outputs a candidate if at least one cause or normalizing time expressions. LREC. iii (2012), 3735–3740. DOI:https://doi.org/10.1017/CBO9781107415324.004. method was found. [8] Christian, D. et al. 2014. The Associated Press stylebook and briefing on media To improve the performance of all textual questions, i.e., law. The Associated Press. who, what, why, and how, we will investigate two approaches. [9] Das, A. et al. 2012. The 5W structure for sentiment summarization- visualization-tracking. Lecture Notes in Computer Science (including subseries First, we want to improve measuring a candidate’s frequency, Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) an important scoring factor in multiple questions (Section 3.3). (2012), 540–555. [10] Finkel, J.R. et al. 2005. Incorporating non-local information into information We currently use the number of coreferences, which does not extraction systems by gibbs sampling. Proceedings of the 43rd annual include synonymous mentions. We plan to count the number of meeting on association for computational linguistics (2005), 363–370. [11] Girju, R. 2003. Automatic detection of causal relations for question YAGO concepts that are semantically related to the current can- answering. Proceedings of the ACL 2003 workshop on Multilingual didate. Second, we found that a few top candidates of the four summarization and question answering-Volume 12 (2003), 76–83. textual questions were semantically correct but only contained [12] Greene, D. and Cunningham, P. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. Proceedings of the 23rd a pronoun referring to the more meaningful noun. We plan to international conference on Machine learning (2006), 377–384. add the coreference’s original mention to extracted answers. [13] Hamborg, F. et al. 2019. Automated Identification of Media Bias by Word Choice and Labeling in News Articles. Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) (Urbana-Champaign, IL, USA, 2019), 6 CONCLUSION 1–10. [14] Hamborg, F. et al. 2018. Automated identification of media bias in news In this paper, we proposed Giveme5W1H, the first open-source articles: an interdisciplinary literature review. International Journal on system that extracts answers to the journalistic 5W1H ques- Digital Libraries. (2018), 1–25. DOI:https://doi.org/10.1007/s00799-018- 0261-y. tions, i.e., who did what, when, where, why, and how, to describe [15] Hamborg, F. et al. 2018. Bias-aware news analysis using matrix-based news a news article’s main event. 
The system canonicalizes temporal aggregation. International Journal on Digital Libraries. (2018). DOI:https://doi.org/10.1007/s00799-018-0239-9. [16] Hamborg, F. et al. 2018. Extraction of Main Event Descriptors from News Giveme5W1H: A Universal System for Extracting INRA 2019, September 2019, Copenhagen, Denmark Main Events from News Articles Articles by Answering the Journalistic Five W and One H Questions. [32] Oliver, P.E. and Maney, G.M. 2000. Political Processes and Local Newspaper Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) Coverage of Protest Events: From Selection Bias to Triadic Interactions. (Fort Worth, Texas, USA, 2018), 339–340. American Journal of Sociology. 106, 2 (2000), 463–505. [17] Hamborg, F. et al. 2018. Giveme5W: Main Event Retrieval from News Articles [33] Oxford English 2009. Oxford English Dictionary. Oxford University Press. by Extraction of the Five Journalistic W Questions. Proceedings of the [34] Parton, K. et al. 2009. Who, what, when, where, why?: comparing multiple iConference 2018 (Sheffield, UK, 2018), 355–356. approaches to the cross-lingual 5W task. Proceedings of the Joint Conference [18] Hamborg, F. et al. 2017. news-please: A Generic News Crawler and Extractor. of the 47th Annual Meeting of the ACL and the 4th International Joint Proceedings of the 15th International Symposium of Information Science Conference on Natural Language Processing of the AFNLP: Volume 1-Volume (2017), 218–223. 1 (2009), 423–431. [19] Hoffart, J. et al. 2011. Robust Disambiguation of Named Entities in Text. [35] Peters, C. et al. 2012. Improving the Hook in Case Writing. Journal of Case Proceedings of the 2011 Conference on Empirical Methods in Natural Studies. 30, (2012), 1–6. Language Processing (2011), 782–792. [36] Sharma, S. et al. 2013. News Event Extraction Using 5W1H Approach & Its [20] Hripcsak, G. and Rothschild, A.S. 2005. Agreement, the F-measure, and Analysis. International Journal of Scientific & Engineering Research - IJSER. 4, reliability in information retrieval. Journal of the American Medical 5 (2013), 2064–2067. Informatics Association. 12, 3 (2005), 296–298. [37] Sharp, D. 2002. Kipling’s guide to writing a scientific paper. Croatian medical DOI:https://doi.org/10.1197/jamia.M1733. journal. 43, 3 (2002), 262–7. [21] Ide, I. et al. 2005. TrackThem: Exploring a large-scale news video archive by [38] Strötgen, J. and Gertz, M. 2013. Multilingual and cross-domain temporal tracking human relations. Lecture Notes in Computer Science (including tagging. Language Resources and Evaluation. 47, 2 (2013), 269–298. subseries Lecture Notes in Artificial Intelligence and Lecture Notes in DOI:https://doi.org/10.1007/s10579-012-9179-y. Bioinformatics) (2005), 510–515. [39] Suchanek, F.M. et al. 2007. YAGO: a core of semantic knowledge. Proceedings [22] Ikeda, T. et al. 1998. Information Classification and Navigation Based on 5W1 of the 16th international conference on World Wide Web. (2007), 697–706. H of the Target Information. Proceedings of the 36th Annual Meeting of the DOI:https://doi.org/10.1145/1242572.1242667. Association for Computational Linguistics and 17th International Conference [40] Sundberg, R. and Melander, E. 2013. Introducing the UCDP Georeferenced on Computational Linguistics-Volume 1 (1998), 571–577. Event Dataset. Journal of Peace Research. 50, 4 (2013), 523–532. [23] Kekäläinen, J. and Järvelin, K. 2002. Using graded relevance assessments in DOI:https://doi.org/10.1177/0022343313484347. IR evaluation. 
Journal of the American Society for Information Science and [41] Sundheim, B. 1992. Overview of the fourth message understanding Technology. 53, 13 (2002), 1120–1129. evaluation and conference. Proceedings of the 4th conference on Message [24] Khodra, M.L. 2015. Event extraction on Indonesian news article using understanding (1992), 3–21. multiclass categorization. ICAICTA 2015 - 2015 International Conference on [42] Tanev, H. et al. 2008. Real-time news event extraction for global crisis Advanced Informatics: Concepts, Theory and Applications (2015). monitoring. Lecture Notes in Computer Science (including subseries Lecture [25] Khoo, C.S.G. et al. 1998. Automatic extraction of cause-effect information Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2008), from newspaper text without knowledge-based inferencing. Literary and 207–218. Linguistic Computing. 13, 4 (1998), 177–186. [43] The CIA World Factbook: 2016. [26] Khoo, C.S.G. 1995. Automatic identification of causal relations in text and their https://www.cia.gov/library/publications/the-world-factbook/geos/. use for improving precision in information retrieval. [44] TimeML Working Group 2009. Guidelines for Temporal Expression [27] Kusner, M.J. et al. 2015. From Word Embeddings To Document Distances. Annotation for English for TempEval 2010. English. (2009), 1–14. Proceedings of The 32nd International Conference on Machine Learning. 37, [45] Wang, W. et al. 2010. Chinese news event 5w1h elements extraction using (2015), 957–966. semantic role labeling. Information Processing (ISIP), 2010 Third [28] Lejeune, G. et al. 2015. Multilingual event extraction for epidemic detection. International Symposium on (2010), 484–489. Artificial Intelligence in Medicine. (2015). [46] Wick, M.L. et al. 2008. A unified approach for schema matching, coreference DOI:https://doi.org/10.1016/j.artmed.2015.06.005. and canonicalization. Proceeding of the 14th ACM SIGKDD international [29] Li, H. et al. 2003. InfoXtract location normalization: a hybrid approach to conference on Knowledge discovery and data mining - KDD 08 (2008), 722. geographic references in information extraction. Proceedings of the {HLT- [47] Yaman, S. et al. 2009. Classification-based strategies for combining multiple NAACL} 2003 Workshop on Analysis of Geographic References. 1, (2003), 39– 5-w question answering systems. INTERSPEECH (2009), 2703–2706. 44. DOI:https://doi.org/10.3115/1119394.1119400. [48] Yaman, S. et al. 2009. Combining semantic and syntactic information sources [30] Mahdisoltani, F. et al. 2015. YAGO3: A Knowledge Base from Multilingual for 5-w question answering. INTERSPEECH (2009), 2707–2710. Wikipedias. Proceedings of CIDR. (2015), 1–11. [49] Yang, Y. et al. 1998. A study on retrospective and on-line event detection. DOI:https://doi.org/10.1016/j.jbi.2013.09.007. Proceedings of the 21st annual international ACM SIGIR conference on [31] Miller, G.A. 1995. WordNet: a lexical database for English. Communications Research and development in information retrieval - SIGIR ’98 (1998), 28–36. of the ACM. 38, 11 (1995), 39–41. DOI:https://doi.org/10.1145/219717.219748.