A Pilot Study on Annotating Scenes in Narrative Text using SceneML

Tarfah Alrashid (1,2)   Robert Gaizauskas (1)
(1) Department of Computer Science, University of Sheffield, Sheffield, UK
{ttalrashid1,r.gaizauskas}@sheffield.ac.uk
(2) University of Jeddah, Jeddah, Saudi Arabia

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Finlayson (eds.): Proceedings of the Text2Story'21 Workshop, Online, 1-April-2021, published at http://ceur-ws.org

Abstract

SceneML is a framework for annotating scenes in narratives, along with their attributes and relations [GA19]. It adopts the widely held view of scenes as narrative elements that exhibit continuity of time, location and character. Broadly, SceneML posits scenes as abstract discourse elements that comprise one or more scene description segments – contiguous sequences of sentences that are the textual realisation of the scene – and have associated with them a location, a time and a set of characters. A change in any of these three elements signals a change of scene. Additionally, scenes stand in narrative progression relations with other scenes, relations that indicate the temporal relations between scenes. In this paper we describe a first small-scale, multi-annotator pilot study on annotating selected SceneML elements in real narrative texts. Results show reasonable agreement on some but not all aspects of the annotation. Quantitative and qualitative analysis of the results suggests how the task definition and guidelines should be improved.

1 Introduction

We all have an informal idea of what constitutes a scene in a narrative, such as in literature or in film: the story moves to a different location; or one set of characters exits the story and another set enters; or we are taken forwards or backwards in time. Scenes are the fundamental building blocks of extended narrative, the chunks into which narrative naturally divides. Despite the ubiquity of the notion in literary studies, in particular in narrative theory and in drama studies, there has been little work on formalising a notion of scene or on developing an annotation framework for scenes such that automated approaches to scene segmentation might be developed.

Why might one want to do this? One reason is that, as in any area of literary or linguistic studies, operationalising a theoretical model in a computer program and applying it to data allows one to verify or, if necessary, refine the theory and so give it empirical support. Another reason is that there are several potential applications for an automated scene segmentation capability. These include: (1) automatic text illustration [JWL04, AGKK11, FL10], since scenes are the likely discourse units with which to associate illustrations; (2) aligning books with movies [ZKZ+15], since scenes are the high-level units to be aligned; (3) automatic generation of image descriptions [KPD+11, DFUL17, YTDIA11], where scene-segmented narratives could provide background knowledge about what sorts of descriptive elements should be mentioned in descriptions of particular scene types; and (4) automatic narrative generation [CL02], which could benefit from a corpus of scene-segmented narratives on which to train models.

To address the lack of a formal model of scene annotation in narrative text, Gaizauskas and Alrashid [GA19] proposed an annotation framework called SceneML.
That paper is an initial, theoretical proposal for an annotation framework comprising annotations for entities, including scenes, scene description segments, times, locations and characters, and for relations between scenes, such as narrative progression relations. While that paper laid the foundation for SceneML as a framework for scene annotation, it did not report any empirical work on annotating a collection of texts, nor did it discuss the levels of agreement obtainable by annotators. This paper addresses these considerations, reporting the first pilot study on applying SceneML to real narrative texts, in which multiple annotators annotated selected SceneML elements in several chapters from a children's story.

The rest of the paper is organised as follows. Section 2 briefly reviews the SceneML annotation framework so that this paper can be read stand-alone. Section 3 describes the pilot study carried out and the results, both quantitative and qualitative, of an analysis of inter-annotator agreement. It also discusses the implications of the pilot study for SceneML and what steps need to be taken to improve inter-annotator agreement. Section 4 summarises our findings and discusses future work.

2 SceneML

This section provides a compressed overview of the SceneML framework, as introduced in [GA19]. All references to SceneML in the following are to that paper.

2.1 The Annotation Framework

Adopting the most widely accepted definition of scene in the literature, we treat a scene as a unit of a story in which the elements of time, location, and main characters are constant. Any change in these elements indicates a change of scene. A scene is an abstract discourse element, not a specific span of text. It consists of a location or setting, a time and the characters who are involved in the events that take place in the scene. These elements exist in the real or fictive world, i.e. the storyworld in the sense of narrative theory [Sch10], within which the narrative unfolds. "The scene itself is an abstraction away from the potentially infinitely complex detail of that real or fictive world, a cognitive construct that makes finite, focussed narrative possible" (SceneML, §3.1).

A scene in a textual narrative is realised through one or more scene description segments (SDSs). An SDS is "a contiguous span of text that, possibly together with other SDSs, expresses a scene" (SceneML, §3.1). Typically a scene consists of a single SDS. However, a scene may reference other scenes, e.g. scenes involving past or future events. The SDSs describing these events may be embedded within the description of the embedding scene, causing its textual realisation to be split into multiple, non-textually adjacent SDSs. A scene may also be realised through multiple SDSs if "the author is employing the narrative device of rotating between multiple concurrent scenes each of which is advancing a distinct storyline (a common technique in action movies)" (SceneML, §3.1).

It is important to define what each element of a scene (characters, time, location) is, in order to facilitate the annotation process and make it easier to detect scene changes when any of these elements change. We propose to adopt the definitions and annotation standards for the elements time, location, and spatial entities from ISO-TimeML and ISO-Space (www.iso.org/standard/37331.html and www.iso.org/obp/ui/#iso:std:60779). As for characters, we will adopt the definition and annotation standards for named entities of type person from the ACE program, recently used in the TAC 2018 entity discovery and linking task (see http://nlp.cs.rpi.edu/kbp/2018/ and https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v5.6.6.pdf).
These previous standards facilitate the annotation process for all mentions of times, locations/spatial entities and persons represented in the text. However, we are interested in just the specific characters, time and location that define a specific scene.

2.2 SceneML Elements

SceneML elements fall into two main categories:

1. Entities: scenes, SDSs, characters, times, and locations;
2. Relations: scene–scene narrative progression links, and relational links connecting times, characters and locations to the scenes in which they are participant elements and connecting SDSs to the scenes they compose. At present these relational links are all represented via attributes on entities.

Scenes
Scenes are the main element in SceneML. A scene has the attributes id, time and location. It also includes a list of character sub-elements, as there may be more than one character in each scene.

SDSs
Scene description segments (SDSs) are the contiguous sequences of words from the narrative text that constitute the textual realisation of a scene. One SDS cannot belong to more than one scene, but a scene can be composed of multiple SDSs. SDS attributes include id and scene id, i.e. the id of the scene that the SDS belongs to.

Time
Time elements used here are the ones developed within ISO-TimeML. Each time annotation includes an id attribute and a text segment. A time can also be the time of the storyworld, signalled by the attribute base.

Location
We use the location element from ISO-Space. This also includes an id attribute that is unique for each location mention, and a text span.

Character
Here we use the named entity type person from the ACE English Annotation Guidelines for Entities, the only difference being that we allow animals and other non-humans as characters if they play the role of characters in the narrative. Person annotations have a unique id attribute for each character mention and a text segment.

Narrative Progression Links
Narrative progression links (nplinks) link two scenes whose SDSs are textually adjacent. There are different types of nplink depending on the type of temporal relation between the two scenes. SceneML (§3.2) recognises four types of link. Sequence links are assigned when the scene change happens because of a change in location or characters, e.g. a character moves to another place, but where the events follow in time directly after those in the linked scene. Analepsis links are used when there is a flashback in one scene to a scene in the past. Prolepsis links are used when we are taken forward in time (i.e. a flashforward). Concurrent links are assigned between two scenes when the transition happens because another thread of the story is unfolding at the same time, so the transition takes us to different characters and a different place but not to a different time.

An extended example of the XML realisation of SceneML annotation can be found in SceneML §4.
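To make the inventory of elements and attributes above more concrete, the following is a minimal, hypothetical sketch of how a SceneML-style annotation might be serialised as XML, built here with Python's standard library. The element and attribute names (scene, sds, nplink, sceneID, relType) and the character offsets are illustrative only; the authoritative XML realisation is the extended example in SceneML §4.

```python
# A minimal, hypothetical sketch of a SceneML-style XML record.
# Element and attribute names mirror the description above but are
# illustrative only; see SceneML §4 for the authoritative realisation.
import xml.etree.ElementTree as ET

root = ET.Element("sceneML")

# Two scenes, each with a time, a location and participating characters.
s1 = ET.SubElement(root, "scene", id="s1", time="t1", location="l1")
ET.SubElement(s1, "character", id="c1")

s2 = ET.SubElement(root, "scene", id="s2", time="t2", location="l2")
ET.SubElement(s2, "character", id="c1")
ET.SubElement(s2, "character", id="c2")

# SDSs point back to the scene they textually realise.
ET.SubElement(root, "sds", id="sds1", sceneID="s1")
ET.SubElement(root, "sds", id="sds2", sceneID="s2")

# Entity mentions anchor ids to text spans (offsets are invented here).
ET.SubElement(root, "time", id="t1", start="12", end="24").text = "that evening"
ET.SubElement(root, "location", id="l1", start="40", end="51").text = "the kitchen"

# A narrative progression link between two textually adjacent scenes.
ET.SubElement(root, "nplink", fromScene="s1", toScene="s2", relType="SEQUENCE")

print(ET.tostring(root, encoding="unicode"))
```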
3 Methods and Results

3.1 Methods

A small-scale pilot study was carried out to investigate how well defined our definitions and annotation framework are with respect to scene boundary identification. Two chapters (chapters 3 and 4) of "Bunnies from the Future" [Cor16], a children's story for reading ages 10–13, were selected to be annotated.

Three annotators (postgraduate students and non-native speakers of English), distinct from the authors, took part, in addition to one of the authors. They were given annotation guidelines based on the framework explained in Section 2 and were instructed to use the Brat annotation tool (http://brat.nlplab.org/) to annotate the two chapters following the guidelines, with two simplifying exceptions: (1) annotators were asked not to annotate the scene abstract discourse element; (2) they were asked not to annotate any relations explicitly. That is, they were instructed to annotate SDSs, and for each SDS they were asked to annotate the first mention of the time and location of, and the characters participating in, the events described in the SDS.

The first of these simplifications was imposed because the Brat annotation tool, which is a very easy-to-use tool for annotating text spans, does not support the creation of zero-span annotations. That functionality is necessary for creating abstract discourse elements, which can then be linked to spans in the text. We could not find any tool that would allow this and was equally easy to use, and we did not have the resources to create our own. The problem can be circumvented by linking all SDSs in one scene together with a "same-scene-as" relational link, thus extensionally specifying a scene as the set of its SDSs.

The second simplification was imposed for several reasons. First, for this first annotation exercise we were primarily concerned with determining whether or not annotators could accurately identify and agree on SDS boundaries. The problem of determining what the narrative links between scenes are is very much a secondary problem compared to this. As it turns out, all but one of the scenes in the chosen texts consist of a single SDS and follow each other in temporal sequence. Single-SDS scenes are likely a characteristic of this level of children's story, and one reason why we chose such stories for this pilot, though we did not deliberately choose chapters because they had few scenes with multiple SDSs. Second, given that virtually all scenes consist of a single SDS, the need to link same-scene SDSs is low. This led us to ignore it altogether for this exercise, though clearly it will need to be addressed when dealing with more complex narrative structure.

Finally, given the absence of explicit scene elements in the annotation, the times, characters and locations needed to be annotated in each SDS participating in a scene. This has the disadvantage of forcing the annotator to annotate these things multiple times per scene when there are multiple SDSs per scene, but it means that (1) the relations linking the scene elements time, character and location to the scene can be inferred from the simple presence of annotated strings of these types within an SDS, and (2) the annotator, and subsequently the analyst, can validate that distinct scenes involve a difference in at least one of time, location and characters (in fact identity of these elements could be used as another method to indirectly link multiple SDSs in a single scene together, though annotator error could make this unreliable).

Figure 1 shows an example annotation from chapter 4; entity segments are highlighted.

Figure 1: Screenshot showing an example of the pilot annotation using Brat.
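Because scene elements themselves were not annotated, the link between an SDS and its time, location and characters is implicit in span containment, and the same information supports the validation described in point (2) above. The sketch below shows one way both could be recovered automatically; the input format (simple tuples of character offsets), and all names and values, are invented for illustration and do not correspond to Brat's actual standoff output.

```python
# Sketch: recover implicit SDS-element links from span containment and check
# that adjacent SDSs differ in at least one of time, location, characters.
# The input format is a simplification invented for illustration.
from collections import defaultdict

# (id, start, end) character offsets of each SDS, in document order (invented).
sds_spans = [("sds1", 0, 850), ("sds2", 850, 1700)]

# Annotated entity mentions: (type, start, end, normalised value) (invented).
mentions = [
    ("Time", 10, 22, "that evening"),
    ("Location", 40, 51, "the kitchen"),
    ("Character", 60, 65, "Tibbs"),
    ("Location", 900, 915, "the corridor"),
    ("Character", 930, 935, "Tibbs"),
]

# Link each mention to the SDS whose span contains it.
elements = defaultdict(lambda: defaultdict(set))
for etype, start, end, value in mentions:
    for sds_id, s_start, s_end in sds_spans:
        if s_start <= start and end <= s_end:
            elements[sds_id][etype].add(value)

# Two adjacent SDSs should realise distinct scenes only if at least one of
# time, location or character set differs.
for (a, _, _), (b, _, _) in zip(sds_spans, sds_spans[1:]):
    differs = any(elements[a][t] != elements[b][t]
                  for t in ("Time", "Location", "Character"))
    print(f"{a} -> {b}: scene change justified: {differs}")
```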
3.2 Results

Tables 1, 2 and 3 show the results of the annotation exercise. Table 1 gives the numbers of entity mentions annotated by each annotator and the averages of these numbers. Note that the chapters differ in size: chapter 3 is 124 sentences long (2756 words), while chapter 4 is only 65 sentences (1775 words). The columns SDS, Character, Location and Time indicate the number of entity mentions annotated for each category by each annotator. Two averages are computed: (1) entity mentions per chapter, averaged over annotators; (2) entity mentions per SDS for each annotator, averaged across all annotators.

Table 1: Statistics about the annotations. The SDS, Char, Time and Loc columns give the number of segments marked as each entity type per chapter for each annotator. Averages of these numbers by chapter / by SDS are also shown.

                   Chapter 3                    Chapter 4
             SDS   Char   Time   Loc      SDS   Char   Time   Loc
Ann1           4     19      2     5        5     14      4     6
Ann2           5     21      3     3        6     28      4     6
Ann3          13     30      8    12       10     28      4    10
Ann4           8     21      0     5        9     19      1     6
Av/Chapter   7.5  22.75   3.25  6.25      7.5  22.25   3.25  7.25
Av/SDS         –   3.67   0.40  0.78        –   3.33   0.55  1.11

Table 2 shows inter-annotator agreement results. Pair-wise inter-annotator agreement is computed using Cohen's kappa; averaged inter-annotator agreement between all annotators is computed using Fleiss's kappa. κ1 refers to the kappa score for segments of type SDS. For SDSs the kappa score is computed by considering each sentence as a potential candidate for a scene segment boundary: each sentence is represented by either a 1 or a 0, 1 if the sentence either contains a scene segment boundary or is preceded or followed by one, and 0 otherwise. κ2 refers to the kappa score computed for all other entity types (Time, Character and Location) together. Here we treat the problem of recognising these three entity types as a token classification problem, following the widely used named entity recognition approach of IOB tagging (see, e.g., [JM09]). Each word is tagged as either Time, Character, Location or Outside, where the Outside tag is given to words that are not part of any entity mention (for simplicity we do not include a 'B' tag, as instances of contiguous distinct entity mentions of the same type are extremely rare).

Table 2: Inter-annotator agreement results. κ1 refers to the kappa score for SDS, κ2 to the kappa score for all other entities together.

                 Ann2                        Ann3                        Ann4
            Ch3         Ch4            Ch3         Ch4            Ch3         Ch4
          κ1    κ2    κ1    κ2       κ1    κ2    κ1    κ2       κ1    κ2    κ1    κ2
Ann1    0.60  0.54  0.33  0.27     0.15  0.27  0.20  0.30     0.23  0.19  0.10  0.21
Ann2                               0.27  0.20  0.42  0.39     0.19  0.24  0.35  0.30
Ann3                                                          0.72  0.33  0.95  0.52

Average κ1 Ch3: 0.36   Average κ2 Ch3: 0.29   Average κ1 Ch4: 0.41   Average κ2 Ch4: 0.34

Table 3 shows percentage agreement results between each annotator pair for each entity type. These are computed by creating a confusion matrix of token entity type labels (including the type Outside) for each annotator pair and dividing each value on the diagonal of this confusion matrix by the corresponding row total.

Table 3: Percentage agreement results between annotator pairs for each entity type by token. Here O refers to the Outside tag.

                   Ann2                         Ann3                         Ann4
          Char   Loc     O   Time      Char   Loc     O   Time      Char   Loc     O   Time
Ann1      0.24  0.15  0.99   0.80      0.28  0.15  0.99   0.26      0.33  0.05  0.99   0.00
Ann2                                   0.36  0.09  0.99   0.33      0.48  0.05  0.98   1.00
Ann3                                                                0.73  0.88  0.98   1.00
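The two kappa variants and the percentage agreement figures described above are straightforward to compute once the annotations have been reduced to per-sentence boundary indicators and per-token labels. The sketch below shows one possible implementation of the pairwise scores using scikit-learn; the toy input vectors are invented, and the reduction from Brat standoff files is omitted.

```python
# Sketch of the pairwise agreement measures described above, on toy inputs.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# (a) kappa_1: each sentence is 1 if it contains or is adjacent to an SDS
# boundary placed by that annotator, 0 otherwise (invented example vectors).
sds_ann1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
sds_ann2 = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
kappa_1 = cohen_kappa_score(sds_ann1, sds_ann2)

# (b) kappa_2: token-level labels over Time, Character, Location, Outside.
tok_ann1 = ["O", "Character", "Character", "O", "Time", "O", "Location", "O"]
tok_ann2 = ["O", "Character", "O",         "O", "Time", "O", "Location", "O"]
kappa_2 = cohen_kappa_score(tok_ann1, tok_ann2)

# Per-type percentage agreement: diagonal of the confusion matrix divided by
# the corresponding row total, as in Table 3.
labels = ["Character", "Location", "O", "Time"]
cm = confusion_matrix(tok_ann1, tok_ann2, labels=labels)
row_totals = cm.sum(axis=1)
per_type = {lab: (cm[i, i] / row_totals[i] if row_totals[i] else float("nan"))
            for i, lab in enumerate(labels)}

print(f"kappa_1 = {kappa_1:.2f}, kappa_2 = {kappa_2:.2f}")
print(per_type)
```

Averaged agreement across all four annotators (the Fleiss's kappa figures reported under Table 2) could be computed analogously, for instance with the fleiss_kappa routine in statsmodels; that step is omitted here.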
3.3 Discussion

Here we discuss disagreement between the annotators with a view to determining how our annotation guidelines and/or processes should be improved and whether or not there are any underlying conceptual problems with our approach. In general the Cohen's kappa scores show what is commonly interpreted as fair (0.21–0.40) to moderate (0.41–0.60) agreement, with a few cases of slight and substantial agreement around the edges. However, as the qualitative interpretation of kappa scores is contentious, we use these scores primarily as a diagnostic tool, highlighting areas of relative agreement and disagreement. Looking at the scores overall, several observations can be made. First, the annotator pair (Ann3, Ann4) agree much more than any other annotator pair. Second, generally and on average, κ1 scores are higher than κ2 scores, i.e. agreement on SDSs is higher than agreement on entities.

Regarding the percentage agreement results on entity annotation in Table 3, it can be seen that in most cases Character and Time entities have higher agreement than Location entities (agreement for Outside tags will always be high given the unbalanced nature of the data, i.e. most tokens are outside of any entity mention). Note that these figures are heavily dependent on agreement in SDS annotation. If one annotator annotates two SDSs where another annotates just one, the first annotator will have twice the number of time, location and character annotations, since these entity types are to be annotated for each SDS. Hence low scores are to be expected where agreement on the number of SDSs is not high. Indeed we can see that for annotators 3 and 4, where SDS agreement is higher than for any other annotator pair, agreement on entities is also very much higher than for any other annotator pairing.

Analysis of the annotations reveals two underlying causes of the observed disagreement: (1) lack of understanding of the guidelines and task, and (2) lack of clarity or specificity in the guidelines. These are often not easy to distinguish.

Lack of understanding
It emerged in questioning following the annotation exercise that some annotators (Ann1 and Ann2) relied only on the authors' verbal explanation of the guidelines and task and had either not read the guidelines at all or had not read them carefully. E.g., in some places we find two distinct location entities tagged in the same SDS, in clear contradiction of the guidelines, suggesting that the annotator either did not pay enough attention or simply did not understand. Although our annotators are all good speakers of English and are studying at postgraduate level in well-respected English-speaking universities, being non-native speakers of English led to misinterpretations of some sentences or expressions in the text, which caused mistakes in annotation. For example, one annotator labelled "earth" as a scene location in the idiomatic expression "how on earth had he ...", where clearly it is not. In another case, in the sentence "Sorry, old chap, had an attack of the wobbles. Dashed embarrassing", the word "Dashed" was annotated as a character.

Lack of clarity and detail in the guidelines
Some annotators included definite articles in the annotated entity mention, e.g. "the stone ages" was annotated as a time by one annotator and "stone ages" by another (to assess the effect of such minor variation in annotation we re-measured inter-annotator agreement after removing all stop words and found that results improved slightly). Some time entities were annotated as the time of a scene when in fact they merely reference other times. E.g., in "Do you not have good fabrics in the future?" the word "future" was annotated as the time of the scene. Regarding the tagging of characters, confusion arose as to whether to annotate the fullest form of the character mention, the first mention, or every mention.
All of these small divergences need to be eliminated by addressing them more fully and explicitly in the guidelines.

Two deeper issues were detected with respect to scene boundary determination; these account for most of the variation between annotators in SDS boundary placement. One has to do with "scene transition segments", typically short phrases or a sentence or two that indicate a character is moving from one scene to another. For example: "and soon I emerged back into the corridor looking like a new man". Should this text be annotated as belonging to the preceding or the succeeding scene? Is it a scene in its own right, or part of no scene but rather a new "scene transition" element that should be added to the annotation scheme? The other issue has to do with the granularity of scene segmentation. E.g. if one minor character leaves a scene, does this imply a new scene, given our definition of a scene as "a unit of a story in which the elements of time, location, and main characters are constant"? Or should this be viewed as too minor a change to count as a scene change? Again, if, for example, a character goes into a dressing room off another space and his changing clothes in the dressing room is described while he continues talking to another character outside the dressing room ([Cor16], Ch. 4), is this a significant enough change of location to constitute a scene change? Further work needs to be done to articulate a clear position with respect to these edge cases.

4 Conclusion

We have presented a pilot annotation experiment in which annotators were asked to use SceneML to annotate several chapters of a children's story. Our aim was to assess the viability of the scene segmentation task and the adequacy of our guidelines. Results show that the task is feasible, but suggest that several changes need to be made to the annotation process and guidelines to improve inter-annotator agreement. First, annotators should be better trained, and filtered from the annotator pool if their understanding of English or of the task is too weak. This can be assessed by asking them to do a trial annotation exercise, reviewed by expert annotators, before allowing them to proceed with real annotation. Second, the guidelines should be refined to reduce confusion, addressing the specific issues identified in the preceding section. With these changes we are convinced that agreement between annotators on the task can be substantially increased.

Once a higher level of agreement between annotators has been assured, our planned future work includes a number of activities. First, we will extend the annotation to include all SceneML elements, including narrative progression links and explicit linking of SDSs. Second, we plan to apply the scheme to a much wider range of texts, including historical and contemporary fiction for adults, as well as biography. Non-textual narrative genres, such as film, will also be considered. Third, we plan to investigate automating the process of scene annotation once sufficient manually annotated data is available to train a predictive model using supervised machine learning techniques. Other potential future activities include investigating the application of the scheme to other languages. There is no reason in principle why SceneML should be restricted to English, though entity annotation guidelines for the target languages need to be available. Application to other languages would be welcome and would serve to test the universality of the approach.
References

[AGKK11] Rakesh Agrawal, Sreenivas Gollapudi, Anitha Kannan, and Krishnaram Kenthapadi. Enriching textbooks with images. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 1847–1856, New York, NY, USA, 2011. ACM.

[CL02] Charles B. Callaway and James C. Lester. Narrative prose generation. Artificial Intelligence, 139(2):213–252, August 2002.

[Cor16] J. Corcoran. Bunnies from the Future. www.freekidsbooks.org, 2016.

[DFUL17] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2970–2979, 2017.

[FL10] Yansong Feng and Mirella Lapata. Topic models for image annotation and text illustration. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 831–839. Association for Computational Linguistics, 2010.

[GA19] Robert Gaizauskas and Tarfah Alrashid. SceneML: A proposal for annotating scenes in narrative text. In Workshop on Interoperable Semantic Annotation (ISA-15), page 13, 2019.

[JM09] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.

[JWL04] Dhiraj Joshi, James Z. Wang, and Jia Li. The story picturing engine: finding elite images to illustrate a story using mutual reinforcement. In Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 119–126. ACM, 2004.

[KPD+11] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR. Citeseer, 2011.

[Sch10] Wolf Schmid. Narratology: An Introduction. Walter de Gruyter, Berlin, 2010.

[YTDIA11] Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 444–454. Association for Computational Linguistics, 2011.

[ZKZ+15] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 19–27, Washington, DC, USA, 2015. IEEE Computer Society.