Annotating Content Zones in News Articles

Daniela Baiamonte, University of Pavia, daniela.baiamonte01@universitadipavia.it
Tommaso Caselli, VU University Amsterdam, t.caselli@vu.nl
Irina Prodanof, University of Pavia, irina.prodanof@gmail.com

Abstract

This paper presents a methodology for the annotation of the semantic and functional components of news articles (Content Zones, henceforth CZs). We distinguish between narrative and descriptive zones and, within them, among finer-grained units contributing to the overall communicative purpose of the text. Furthermore, we show that the segmentation in CZs could provide valuable cues for the recognition of time relations between events.

1 Introduction

The logical structure of a document, i.e. its hierarchical arrangement in sections, paragraphs, sentences and the like, reflects a functional organization of the information flow and creates expectations on where the desired information may be located. As is often the case, however, breakups into sections and paragraphs are motivated by style or even arbitrary choices.

The segmentation of the text into Content Zones (CZs, henceforth), i.e. functional categories contributing to the overall message or purpose, as induced by the genre of the text¹, provides more reliable and fine-grained cues to access the structure of its types of functional content. Previous attempts to annotate CZs have mainly focused on highly standardized texts like scientific articles (Teufel et al., 2009; Liakata et al., 2010; Liakata et al., 2012) and scheduling dialogues (Taboada and Lavid, 2003), or on semi-structured texts like film reviews (Bieler et al., 2007; Taboada et al., 2009). Other work (Palmer and Friedrich, 2014; Mavridou et al., 2015) adopts the theory of discourse modes (Smith, 2003) to distinguish between the different types of text passages in a document.

To the best of our knowledge, no efforts have been undertaken to devise an annotation scheme targeting the functional structure of news articles in terms of their content: the inverted pyramid structure, i.e. the gathering of key details at the beginning, followed by supporting information in order of diminishing importance, is too coarse-grained to be effectively used for information extraction purposes. Our hypothesis is that modeling the document's content via CZs could yield benefits for high-level NLP tasks such as Temporal Processing, Summarization and Question Answering, among others. In addition, CZs qualify as a higher-level analysis of a text/discourse which captures different information with respect to Discourse Relations.

The remainder of the paper is structured as follows: in Section 2 the motivations of this work are presented, together with related studies. Section 3 reports on our inventory of CZs, used to annotate a corpus of English news articles. Details on the corpus are provided in Section 4. In Section 5, we describe a case study on the correlation between CZs and temporal relations to show that the segmentation in CZs can provide cues in recognizing temporal relations between events. Finally, Section 6 draws conclusions and suggests directions for future work.

¹ We adopt Systemic Functional Linguistics' view of genre as "a staged, goal oriented, purposeful activity in which speakers engage as members of our culture" (Martin, 1984:25).

2 Motivations and related work

The bulk of the work on discourse structures has focused on low-level structures corresponding to Discourse Relations holding between pairs of textual segments. CZs take a different view on texts, as they perform a function towards the text as a whole. As an instance of a particular genre, every text is meant to accomplish a culturally established communicative purpose, e.g. a news article reports on events happening in the world. This goal is not accomplished all at once: separate functional stages (i.e. CZs) convey fragments of its overall meaning (Eggins and Martin, 1997). Therefore, knowledge about the typical functional structure of genres can be exploited to predict the internal organization of a text. This kind of information can help to produce balanced summaries or to select the passages most likely to contain the answer to a question.

The works of Teufel et al. (2009) and Liakata et al. (2010) present two complementary perspectives on scientific papers: the former models their argumentative/rhetorical structure (following the knowledge claims made by the authors); the latter treats them as the humanly readable representations of scientific investigations. In the works of Bieler et al. (2007) and Taboada et al. (2009), two different kinds of zones are recognized in film reviews: formal zones (required by the genre, e.g. credits and cast) and functional zones (reflecting the abstract functions of describing and commenting).

In the elaboration of news articles' CZs, we were mostly inspired by Labov (2013)'s study of oral narratives of personal experiences and by Bell (1991)'s analysis of the structure of news stories.

3 Annotation Schema

The opposition between dynamicity and staticity, mainly realized by grammatical and lexical aspect, is adopted as the basic parameter for differentiating between two macro CZs: NARRATION and DESCRIPTION. The former is aimed at reporting temporally interrelated (dynamic) events; the latter is used to comment by focusing on selected entities, properties, and states of affairs. Each of these two macro CZs is further divided into more fine-grained categories.

The class NARRATION (NARR) includes the following zones:

• Foreground (FGR): text span containing the most salient events, i.e. those in the focus of attention (as intended by Boguraev and Kennedy, 1999). The information it conveys is both referentially and relationally new (Gundel and Fretheim, 2005), as it is usually mentioned at the beginning of the article.
• Background (BGR): ancillary, referentially and relationally old information performing an explanatory function (through causal and temporal precedence relations) towards FGRs.
• Follow-up (FUP): reactions and consequences to FGR events (to which they are related through cause-effect and temporal succession relations), i.e. relationally new information moving the discourse forward.
• Expectation (EXP): assumptions and probable or possible outcomes, i.e. non-factual information (e.g. conditionals, modality).

The class DESCRIPTION (DSCR) includes the following zones:

• Description (DES): characteristics of a person or an object, customary circumstances, or states of affairs.
• Evaluation (EVL): subjective descriptions, explicit judgements showing the author's or some other agent's attitude towards a target.

In addition, a third macro-class is posited, OTHER (OTHR), containing categories performing auxiliary functions towards the other CZs:

• Attribution (ATT): text span containing the source and, if present, the cue of an attribution (as intended by Pareti and Prodanof, 2010), while the content is assigned the relevant CZ(s).
• Metatext (MTX): text span guiding the reader's attention towards metatextual elements like figures or tables.
• Interrogative (INT): questions directly addressed to the reader, e.g. to introduce a new topic or to prompt a reaction.

Major approaches to functional discourse structuring adopt the sentence or the paragraph as the unit of annotation. We have instead opted for a clause-level annotation, as this allows us to better deal with news articles' high level of information density. Although CZs are conceptually non-overlapping, empirical analysis indicates that an annotation unit may fit into more than one category, that is, a clause may represent complex contents. In such cases the more informative content should be preferred. In the example below, the tag ATT is assigned, even though a descriptive content may as well be recognized.

1. [On an office wall of the Senate intelligence committee hangs a quote from Chairman David Boren,]ATT {PDTB², wsj_0771}

The annotation of CZs is further complicated by the fact that the distribution of the zones does not follow the linear order of the text. In most cases, CZs are discontinuous, that is, either their contiguity may be "broken" by the presence of other CZs or the same CZ may appear in different sentences along the entire document (see example 2 for the FGR zone).

2. [South Korea registered a trade deficit of $101 million in October,]FGR [reflecting the country's economic sluggishness,]EVL [according to government figures released Wednesday.]ATT [Preliminary tallies by the Trade and Industry Ministry showed another trade deficit in October, the fifth monthly setback this year,]FGR [casting a cloud on South Korea's export-oriented economy.]EVL {PDTB, wsj_0011}

In other cases, due to the use of the clause as the minimal annotation span of a CZ, nested CZs may occur (see example 3).

3. [South Korea's economic boom, [which began in 1986,]BGR stopped this year because of prolonged labor disputes, trade conflicts and sluggish exports.]BGR {PDTB, wsj_0011}

² Penn Discourse TreeBank (Prasad et al., 2008).

4 Description of the corpus

We used the CZs annotation schema and the annotation tool CAT (Bartalesi Lenzi et al., 2012) to construct a small corpus of 57 news articles (20 from the test section of TempEval-3 (UzZaman et al., 2013), 20 shared between the PDTB (Prasad et al., 2008) and the training section of the TimeBank (Pustejovsky et al., 2003), 17 from the PDTB). The corpus contains 2059 annotation units and it is dominated by narrative sections (57%). Within them, the most frequent CZ is BGR (26.5%), followed by FGR (12.4%), EXP (9.6%) and FUP (8.4%). These figures show that news articles mostly consist of redundant information, only mentioned in order to help the reader anchor the new data to prior knowledge. Descriptive sections constitute 25.5% of the corpus: EVLs are slightly more frequent than DESs (14.8% vs. 8.9%), contradicting the alleged objectivity expected in news reports (note, however, that EVLs tend to occur in association with ATTs). As to the OTHER macro CZ, it makes up 17.4% of the corpus: this percentage almost entirely refers to ATTs, since MTXs and INTs are only marginal zones (0.19% and 0.33%, respectively).

To test our hypotheses about some formal properties of CZs, we carried out a corpus study. The results are reported below.

Position in the text. 71.7% of FGRs are located in the opening sections of the articles and their occurrence decreases towards the central (18.4%) and closing sections (9.8%). BGRs show a fairly complementary distribution to FGRs, as they mostly occupy the central (31.6%) and closing sections (27.3%) of the articles. As expected, ATTs are quite evenly distributed among the three sections. The remaining CZs do not show any clear-cut tendencies.

target: FGR BGR FUP EXP DES EVL ATT MTX INT
source
FGR  46 13 1 - 2 12 8 1 1 - - - 2 2 9 - - - - - 1 2 - - - - - - - - - - - - - - - - - 1 1 3 38 7 - - - 1 - - - - - - - - - - - - - - -
BGR  16 1 - 1 - 1 8 91 45 1 3 2 41 8 6 - - - - - - 7 - - - - - - 3 - - - - - - 5 - - - - - - 81 5 - - - 5 2 - - - - - - - - - - - - - -
FUP  1 4 - - - 1 4 - - - - - - - 33 6 - 1 6 10 - 1 - - - - - - - - - - - 1 1 2 1 1 - - - - - - - - - 3 1 - - - - - - - - - - - - - -
EXP  - 1 - - - - 1 - - - - - - 1 - - - - - - - 24 10 - 1 1 7 - - - - - - - - 2 2 - - - - - 2 1 - - - 1 - - - - - - - - - - - - - - -
DES  2 1 - - - - 1 - 1 - - - - 2 2 1 - - - 1 - 5 - - - - - 1 4 - 2 - - 2 - 1 1 - - - 1 - 5 3 - - - 1 - - - - - - - - - - - - - - -
EVL  1 1 - - - 1 3 2 2 - - - 1 1 2 2 - - - 1 - 4 - - - - - - - - - - - 1 - 15 - - - - 5 - 7 3 2 - - 5 - - - - - - - - - - - - - - -
ATT  16 1 - - - 1 1 18 2 - - - 4 - 18 1 - - - 1 - 21 1 - - - - - 3 - - - - - - 2 - - - - 1 - 14 5 - - - 44 21 - - - - - - - - - - - - - -
MTX  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
INT  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Relation types: BEFORE, INCLUDES, DURING, BEGINS, ENDS, SIMULTANEOUS, IDENTITY

Table 2: Distribution of time relations among CZs.

Verbal tenses. Table 1 shows the distribution of verbal tenses, as annotated in the TimeBank corpus, among CZs.

Tense       FGR  BGR  FUP  EXP  DES  EVL  ATT  MTX  INT
PRESENT      46   46   26   21   35   58   34    0    0
PAST         66  204   29    3    2   10  172    0    0
FUTURE       21    1   25   22    0    2    0    0    0
INFINITIVE   32   51   26   18    2   12    1    0    0
PRESPART      6   19   10    5    2    3    9    0    0
PASTPART      2   11    1    1    0    2    1    0    0
NONE          6    8    6   21    0    2    2    0    0

Table 1: Distribution of tenses among CZs.
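The counts in Table 1 can be turned into per-zone tense shares, which makes the dominant tense of each CZ explicit. The following is a minimal sketch, assuming that the columns of Table 1 are ordered FGR, BGR, FUP, EXP, DES, EVL, ATT, MTX, INT; the function name `dominant_tense` is hypothetical and is not part of the annotation tooling.

```python
# Counts copied from Table 1 (rows: tenses, columns: CZs).
# Assumption: column order is FGR, BGR, FUP, EXP, DES, EVL, ATT, MTX, INT.
CZS = ["FGR", "BGR", "FUP", "EXP", "DES", "EVL", "ATT", "MTX", "INT"]
TENSE_COUNTS = {
    "PRESENT":    [46, 46, 26, 21, 35, 58, 34, 0, 0],
    "PAST":       [66, 204, 29, 3, 2, 10, 172, 0, 0],
    "FUTURE":     [21, 1, 25, 22, 0, 2, 0, 0, 0],
    "INFINITIVE": [32, 51, 26, 18, 2, 12, 1, 0, 0],
    "PRESPART":   [6, 19, 10, 5, 2, 3, 9, 0, 0],
    "PASTPART":   [2, 11, 1, 1, 0, 2, 1, 0, 0],
    "NONE":       [6, 8, 6, 21, 0, 2, 2, 0, 0],
}


def dominant_tense(counts, czs):
    """Map each CZ to (most frequent tense, its share in %), or None if empty."""
    result = {}
    for i, cz in enumerate(czs):
        # Read one column of the table: tense -> count for this CZ.
        column = {tense: row[i] for tense, row in counts.items()}
        total = sum(column.values())
        if total == 0:
            # No tensed events were annotated in this zone.
            result[cz] = None
            continue
        tense = max(column, key=column.get)
        result[cz] = (tense, round(100 * column[tense] / total, 1))
    return result


DOMINANT = dominant_tense(TENSE_COUNTS, CZS)
for cz in CZS:
    print(cz, DOMINANT[cz])
```

Working with shares rather than raw counts keeps zones of very different sizes comparable, since BGR and ATT contain far more tensed events than DES or EVL.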
BGRs and ATTs are dominated by the past tense. This is in accordance with our expectations, as the former are characterized by temporal precedence relations to FGR events and the latter mostly contain events of saying. CZs belonging to the DSCR class are significantly dominated by the present tense, usually associated with imperfective aspect and staticity. The high frequency of present tenses in FGRs and BGRs does not necessarily defy our expectations, since FGRs contain both dynamic and static events and the tag PRESENT is also used to refer to instances of present perfect.

Modality markers. The majority of modality markers are located in EXPs and, more broadly, in the narrative CZs, as shown in Figure 1. In the TimeBank corpus, the MODALITY tag is mostly assigned to modal auxiliaries; we believe that the annotation of modal adverbs would further raise the percentages observed in EXPs and in the NARR class.

Pronouns. Figure 1 shows that almost 50% of all pronouns are located in BGRs. The percentages are consistent with our expectations as BGRs convey referentially old information and, although FUPs and EXPs elaborate on FGR events, they often introduce new referents. Note that the distribution of pronouns is not, alone, a sufficient indicator of referential oldness, since lexical and zero anaphoras should also be taken into account.

CZ    Modality markers (%)   Pronouns (%)
FGR   12.12                  9.05
BGR   19.69                  47.32
FUP   13.63                  4.8
EXP   43.93                  7.95
DES   3.03                   6.44
EVL   3.03                   15.22
ATT   4.54                   9.05
MTX   0                      0
INT   0                      0.13

Figure 1: Distribution of modality markers and pronouns among CZs.

5 Interactions between CZs and time relations

In news articles events are not iconically presented in the linear order of their real succession. This poses a challenge to systems aimed at uncovering their temporal event structure. Therefore, we used the annotations available for the TimeBank section of the corpus to check whether some connections between CZs and temporal relations between event pairs exist. The full set of temporal relations specified in TimeML contains 14 types of relations: BEFORE, AFTER, IBEFORE, IAFTER, BEGINS, BEGUN_BY, ENDS, ENDED_BY, DURING, DURING_INV, INCLUDES, IS_INCLUDED, SIMULTANEOUS and IDENTITY. We simplified the set as follows: the relation types that invert each other were collapsed into a single type; given the low frequency of the relation type IBEFORE, it was mapped to the corresponding more coarse-grained type BEFORE.

Given the narrative shape of news articles, the corpus is considerably dominated by precedence (BEFORE) and succession (AFTER) relations. Table 2 shows that the majority of temporal relations hold between events belonging to the same CZ types: events tend to precede, include, occur during, begin, end, be simultaneous to and anaphorically evoke (through TimeML IDENTITY temporal relations) other events belonging to the same zone.

FGR events precede rather than follow ATT, FUP and EXP events. BGR events, the most involved in BEFORE relations, tend to precede other events, especially if located in ATTs and FGRs. Unexpected outcomes mostly occur in cases like the following, where the FGR event precedes the BGR one. This is because conflicting contents may be expressed in the same unit (in this case a reaction to the FGR event and the list of its premises):

4. [Delta Air Lines earnings soared 33% to a record in the fiscal first quarter,]FGR [bucking the industry trend toward declining profits.]FUP [The Atlanta-based airline, the third largest in the U.S., attributed the increase to higher passenger traffic, new international routes and reduced service by rival Eastern Airlines...]BGR {PDTB, wsj_1011}

As highlighted in Table 3, NARR events begin or end other NARR or DSCR events (more specifically, these relations hold between events belonging to instances of the same CZ) and DSCR events include (rather than being included in) other events. IDENTITY relations mostly involve FGRs: as a result of their textual salience, FGR events can be mentioned in other FGRs or further clarified in narrative or descriptive sections.

source - target   BFR  INCL  DUR  BEG  ENDS  SMLT  IDNT
NARR - NARR       218    77    1    6    11    71    26
NARR - DSCR        17     1    1    0     1     2     7
NARR - OTHR       119     9    0    0     0    10     3
DSCR - DSCR        30     5    3    0     0    12     6
DSCR - NARR        20     8    0    0     0     4     6
DSCR - OTHR        16    10    2    0     0     6     0
OTHR - OTHR        14     5    0    0     0    44    21
OTHR - NARR        72     5    0    0     0     6     0
OTHR - DSCR         6     0    0    0     0     1     1

Table 3: Distribution of time relations among the macro-classes.

6 Conclusions and future work

We have developed an inventory of zone labels for the genre news article and shown that the resulting content structure could help narrow down the range of time relations connecting events.

Future work will involve testing the stability and reproducibility of the annotation scheme through the measurement of inter-annotator agreement and elaborating a separate annotation scheme for editorials, whose argumentative style reflects different structuring principles than those acting in news reports. Finally, we would like to automate the process of annotation and test the effectiveness of the approach in texts belonging to different genres, e.g. novels (Ouyang and McKeown, 2014) and historical essays. Even the basic distinction between narrative and descriptive zones could facilitate the performance of more complex NLP tasks by targeting the relevant informational zones. The corpus and the annotation guidelines are publicly available³.

³ https://github.com/cltl/ContentZones.git

Acknowledgment

This work has been partially supported by the Erasmus+ Traineeship Program 2015/2016 from University of Pavia and the NWO Spinoza Prize project Understanding Language by Machines (sub-track 3).

References

Baiamonte, D. 2016. Annotazione di Zone di Contenuto: una strutturazione funzionale del testo giornalistico. Thesis of the Master in Theoretical and Applied Linguistics. University of Pavia, Pavia.

Bärenfänger, M., Hilbert, M., Lobin, H., Lüngen, H., Puskás, C. 2006. Cues and constraints for the relational discourse analysis of complex text types - the role of logical and generic document structure. Sidner C.L., Harpur J., Benz A., Kühnlein P. (eds.), Proceedings of the Workshop on Constraints in Discourse. Maynooth, Ireland. 27-34.

Bartalesi Lenzi, V., Moretti, G., Sprugnoli, R. 2012. CAT: the CELCT Annotation Tool. Proceedings of LREC 2012. Istanbul.

Bell, A. 1991. The Language of News Media. Blackwell Publishers, Oxford.

Bieler, H., Dipper, S., Stede, M. 2007. Identifying Formal and Functional Zones in Film Reviews. Keizer S., Bunt H., Paek T. (eds.), Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue. Antwerp, Belgium. 75-78.

Boguraev, B. and Kennedy, C. 1999. Salience-based content characterisation of text documents. Mani I. and Maybury M. T. (eds.), Advances in Automatic Text Summarization. MIT Press, Cambridge, MA.

Eggins, S. and Martin, J. R. 1997. Genres and registers of discourse. van Dijk T. (ed.), Discourse Studies. Discourse as structure and process, volume 1. Sage, London (UK) and Thousand Oaks (CA). 230-257.

Gundel, J. K. and Fretheim, T. 2005. Topic and Focus. Horn L. and Ward G. (eds.), The Handbook of Pragmatics. Blackwell Publishing, 175-196.

Labov, W. 2013. The Language of Life and Death. Cambridge University Press, Cambridge, UK.

Liakata, M., Saha, S., Dobnik, S., Batchelor, C., Rebholz-Schuhmann, D. 2012. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics 2012, volume 28. 991-1000.

Liakata, M., Teufel, S., Siddharthan, A., Batchelor, C. 2010. Corpora for conceptualisation and zoning of scientific papers. Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC10).

Martin, J. R. 1984. Language, register and genre. Christie F. (ed.), Language studies: Children writing. Reader. Deakin University Press, Geelong, Australia. 21-30.

Mavridou, K., Friedrich, A., Peate Sørensen, M., Palmer, A., and Pinkal, M. 2015. Linking discourse modes and situation entity types in a cross-linguistic corpus study. Proceedings of Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem). Lisbon, Portugal.

Ouyang, J. and McKeown, K. 2014. Towards automatic detection of narrative structure. Proceedings of LREC14, Reykjavik, Iceland.

Palmer, A. and Friedrich, A. 2014. Genre distinctions and discourse modes: Text types differ in their situation type distributions. Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing. Forlì-Cesena, Italy.

Pareti, S. and Prodanof, I. 2010. Annotating Attribution Relations: Towards an Italian Discourse Treebank. Proceedings of LREC10.

Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B. 2008. The Penn Discourse TreeBank 2.0. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco.

Pustejovsky, J., Hanks, P., Sauri, R., See, A., Day, D., Ferro, L., Gaizauskas, R., Lazo, M., Setzer, A., Sundheim, B. 2003. The TimeBank Corpus. Corpus Linguistics. 647-56.

Smith, C. S. 2003. Modes of discourse: The local structure of texts. Cambridge University Press.

Stede, M. 2011. Discourse Processing. Morgan & Claypool Publishers. 7-38.

Taboada, M., Brooke, J., Stede, M. 2009. Genre-based paragraph classification for sentiment analysis. Proceedings of SIGDIAL 2009: the 10th Annual Meeting of the Special Interest Group in Discourse and Dialogue. Queen Mary University of London. 62-70.

Taboada, M. and Lavid, J. 2003. Rhetorical and Thematic Patterns in Scheduling Dialogues: A Generic Characterization. Functions of Language, 10(2). 147-179.

Teufel, S., Siddharthan, A., Batchelor, C. 2009. Towards discipline-independent argumentative zoning: Evidence from chemistry and computational linguistics. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009. Singapore.

UzZaman, N., Llorens, H., Derczynski, L., Allen, J., Verhagen, M., Pustejovsky, J. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).
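The simplification of the TimeML relation set described in Section 5 (collapsing inverse relation types and mapping IBEFORE to BEFORE) can be sketched as below. This is an illustration under stated assumptions and not the procedure actually used to build the corpus: it assumes that an inverse relation is normalised by swapping source and target, that the member kept from each pair is the one listed with Table 2, and that IAFTER ends up as BEFORE through its inverse IBEFORE. The function name `simplify` and the event identifiers are hypothetical.

```python
# The 14 TimeML relation types listed in Section 5.
TIMEML_RELATIONS = [
    "BEFORE", "AFTER", "IBEFORE", "IAFTER", "BEGINS", "BEGUN_BY",
    "ENDS", "ENDED_BY", "DURING", "DURING_INV", "INCLUDES",
    "IS_INCLUDED", "SIMULTANEOUS", "IDENTITY",
]

# The 7 relation types reported with Table 2.
SIMPLIFIED = ["BEFORE", "INCLUDES", "DURING", "BEGINS", "ENDS",
              "SIMULTANEOUS", "IDENTITY"]

# Assumption: each inverse type is rewritten as its counterpart,
# with the two events swapped so that the meaning is preserved.
INVERSE_OF = {
    "AFTER": "BEFORE",
    "IAFTER": "IBEFORE",
    "BEGUN_BY": "BEGINS",
    "ENDED_BY": "ENDS",
    "DURING_INV": "DURING",
    "IS_INCLUDED": "INCLUDES",
}

# IBEFORE is mapped to the more coarse-grained BEFORE (Section 5).
COARSEN = {"IBEFORE": "BEFORE"}


def simplify(source, target, relation):
    """Return (source, target, relation) expressed in the simplified set."""
    if relation in INVERSE_OF:
        source, target = target, source
        relation = INVERSE_OF[relation]
    relation = COARSEN.get(relation, relation)
    return source, target, relation


# Hypothetical event identifiers, for illustration only.
print(simplify("e1", "e2", "AFTER"))
print(simplify("e1", "e2", "IS_INCLUDED"))
```

Swapping the arguments rather than only renaming the label matters for counts such as those in Tables 2 and 3, where the direction between the source zone and the target zone is part of what is being measured.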