136 Calibrating a Metric for Similarity of Stories against Human Judgment ∗ Raquel Hervás, Antonio A. Sánchez-Ruiz, Pablo Gervás, Carlos León Dep. Ingenierı́a del Software e Inteligencia Artificial Universidad Complutense de Madrid (Spain) raquelhb@fdi.ucm.es, antsanch@fdi.ucm.es, pgervas@sip.ucm.es, cleon@fdi.ucm.es Abstract. The identification of similarity is crucial for reusing expe- rience, where it provides the criterion for which elements to reuse in a given context, and for creativity, where generation of artifacts that are similar to those that already existed is not considered creative. Yet sim- ilarity is difficult to compute between complex artifacts such as stories. The present paper compares the judgment on similarity between stories explained by a human judge with a similarity metric for stories based on plan refinements. The need to identify the features that humans con- sider important when judging story similarity is paramount on the road to selecting appropriate metrics for the various tasks. Keywords: similarity, novelty, stories, plans. 1 Introduction Appropriate metrics for similarity are fundamental tools in many fields of Artifi- cial Intelligence. For instance, there are several data mining and machine learning methods that are based on the similarity between the elements being considered. In case-based reasoning, similarity metrics are crucial for the retrieval and reuse of previous cases. Similarity is also fundamental for computational creativity because artifacts that are very similar to previously existing ones might not be considered creative. For this reason, it is important to take into account whether the metrics considered for a particular task adequately represent the concept of similarity that humans faced with the same task would apply. The present pa- per compares the judgment on similarity between stories explained by a human judge with a particular similarity metric for stories. The main goal is to identify which of the features that a human considers when evaluating story similarity are already taken into account by the metric, and which ones are not. The results of this comparison should provide a check list that might later on be applied to evaluate the appropriateness of other metrics. ∗ The research reported in this paper was partially supported by the Project WHIM 611560 funded by the European Commission, Framework Programme 7, the ICT theme, and the Future and Emerging Technologies FET programme; and by the Spanish Ministry of Economy and Competitiveness under grant TIN2014-55006-R. Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2015 Workshops. Frankfurt, Germany. 137 We focus on the structural similarity of stories represented as plans composed of actions corresponding to the events in the story. In order to do so, we apply a similarity metric based on plan refinements and compare the obtained results for a pair of stories with the similarities found by a human expert. The key point of this comparison is that the metric does not only calculate a numerical similarity between the compared stories, but provides a report of the found similarities. This report is then compared with the observations obtained by the human expert. The comparison allows us to see if the automatic metric has been able to grasp the same features the expert considered important, and if structural similarity is enough for comparing computer-generated stories. 2 Previous Work on Similarity for Stories Existing work on similarity for stories has focused on two different axes: story similarity for retrieval and classification of stories, and story similarity applied to the assessment of their novelty in a computational creativity setting. 2.1 Similarity Metrics for Story Generation In general, there is relative consensus on the fact that comparing stories can be made at different levels. Comparing stories at a relatively abstract level is common, to the point of comparing not the exact sequence of events but the overall plot, or even the relations between the characters. This aspect of narrative has been addressed by structuralist and cognitive Narratology. In particular, comparing narratives has been a long term goal of Computa- tional Narrative, and several approaches have been taken with varying results [2, 10, 8]. Different aspects beyond pure literary composition have been tack- led: structure alignment in bioinformatics [1], event mapping [3], and other ap- proaches like considering story similarity in terms of the common summary that might be abstracted from the two stories being compared [9]. 2.2 Similarity Metrics for Assessing Novelty of Stories With respect to the assessment of creativity, a fundamental pillar is whether the results of a creative process have produced novel artifacts [14]. Research on the evaluation of creativity has addressed this point as an important requirement for the scientific exploration of creativity, and an important one for computational approaches. In [11], novelty of a given story is assessed in terms of new elements that appear in the story, or instances where existing elements have been replaced by elements of a different type. In [12], novelty of stories is considered in terms of their differences with an initial set of reference stories, based on the sequence of actions, the structure of the story in terms of emotional relations and tensions between the characters, and the occurrence of repetitive patterns. 138 Story 1 Story 2 Common structure shows id371 id372 declare-war id818 id819 declare-war id818 id819 offers-exchange id371 id372 id373 sings id207 murder decides-to-react ?x1 not-perform-service id373 decides-to-react id142 sets-out ?x1 negative-result id373 sets-out id142 wins ?x1 consumes id373 id44 wins id142 brings-peace ?x1 acquires id373 magical-abilities brings-peace id142 arrives ?x1 ?x2 declare-war id818 id819 arrives id142 id730 disguised ?x1 dispatches id189 id373 disguised id142 unrecognised ?x1 tells id189 id373 past-misfortune unrecognised id142 claims id672 won id818 decides-to-react id373 claims id672 won id818 sets ?x3 ?x1 sets-out id373 sets id165 id142 involves difficult-task ?x4 wins id373 involves difficult-task strength solve ?x1 difficult-task brings-peace id373 solve id142 difficult-task before dead-line arrives id373 id728 before dead-line returns ?x1 disguised id373 returns id142 arrives ?x1 id730 unrecognised id373 arrives id142 id730 disguised ?x1 claims id672 won id818 disguised id142 unrecognised ?x1 sets id161 id373 unrecognised id142 claims id672 won id818 involves difficult-task kissing claims id672 won id818 exposed id672 marked id373 exposed id672 not-solve id672 difficult-task solve id373 difficult-task not-solve id672 difficult-task before dead-line new-physical-appearance id142 returns id373 punished id818 arrives id373 id730 tied-to id818 horse-tail disguised id373 unrecognised id373 claims id672 won id818 exposed id672 not-solve id672 difficult-task Table 1: Table of events in each of the stories and the shared set of events. 3 A Callibration Exercise for Story Similarity Although there are many possible representations for stories and many different metrics have been considered for story similarity, the present effort has been focused on a particular representation format as used by an existing story gen- erator, and a specific metric that allows automatic computation. These choices were circumstantial on ease of access and are not considered optimal, but the effort should produce valuable insights that can later be extended to other al- ternatives. 3.1 Story Representation in the Propper System The Propper system [5] constitutes a computational implementation of a story generator based on Propp’s description of how his morphology might be used to generate stories [13]. It produces stories as a sequence of states described in terms of predicates that hold in the state. Characters, objects or locations are represented as unique identifiers in the predicates. This representation format has been considered generic enough to allow for an initial calibration exercise, considering that other formats may easily be converted into this one. The representation includes predicates representing narrative events and predicates describing properties of the characters that hold in particular states 139 of the story. These appear jointly in the stream of predicates for the story, but have been separated in the presentation of stories in this paper for clarity. The predicates presented here result from an effort of reverse engineering of the stories that Propp describes as examples of the application of his framework to analyse existing Russian folk tales. The first two columns of Table 1 present two examples of the stories produced by the Propper system. Predicates in this table describe actions or events in the story. Table 2 represents non-narrative facts that are true for the arguments of the actions in Table 1. Story 1 Story 2 Common structure hero id373 villain id818 villain id818 donor id371 victim id819 victim id819 magical-agent id372 hero id142 hero ?x1 magical-agent id44 seeker-hero id142 location ?x2 villain id818 location id730 false-hero id672 victim id819 court id730 unknown ?x3 seeker-hero id373 groom id142 task-type ?x4 dispatcher id189 false-hero id672 court id730 location id728 location id730 home id728 court id730 apprentice id373 artisan groom id142 false-hero id672 location id730 court id730 groom id373 Table 2: Table of characters, locations and objects in the two stories and the shared set 3.2 Human Interpretation of the Stories In order to compare the human interpretation of the stories with an automati- cally extracted report, we asked a human expert to write both stories in English and compare them. It is important to mention that the expert was familiar with this type of representation based on predicates, but she had to figure out the meaning of the predicates based solely on their names. Story 1 This story has the following main characters: a hero (373), a villain (818), and a false hero (672). In addition, a donor (371), a victim (819) and a dispatcher (189) appear as secondary characters. The hero (373) is first offered a magical agent by a donor (371) if he performs a service. He does not perform the service but he obtains another magical agent anyway, which he consumes to acquire magical abilities. Then, a villain (818) appears who declares war to a victim (819). The victim does not appear again. 140 Meanwhile, a dispatcher (189) talks about a past misfortune. The hero decides to react, sets out and wins (the war?), bringing peace with him. After that, the hero goes home, but he is disguised as the apprentice of an artisan and is not recognised. He finds a false hero (672) at home, who claims that he defeated the villain. The hero is marked, solves a difficult task and returns to the court, this time disguised but as a groom. The false hero still claims that he defeated the villain, but he is exposed and it is known that he did not solved a difficult task. Story 2 This story has the following main characters: a hero (142), a villain (818), and a false hero (672). In addition, a victim (819) appears only at the beginning. The story starts with the villain (818) declaring war to the victim (819). The hero (142) decides to react, becomes a seeker hero, sets out and wins (the war?). He brings peace and arrives to the court. But he is disguised as a groom and he is not recognized. At the court, the false hero (672) claims that he defeated the villain. Someone (165) sets the hero a difficult task that involves strength. He solves the difficult task before the deadline, and returns to the court. Again he is disguised as a groom and he is not recognized. And again, the false hero claims that he defeated the villain. However, the false hero is exposed and does not solve a difficult task. The hero gets a new physical appearance (undisguised?), and the villain is punished being tied to a horse tail. Next, we asked the expert to compare both stories and describe the main similarities and differences between them. Both stories are similar in their characters and roles: a hero, a villain, and a false hero who claims to have defeated the villain. In addition, in both stories the villain declares war to a victim, and the hero wins the war and brings peace. After that the hero returns (home or to the court) disguised (as a groom or as an apprentice), and he finds that a false hero claims to have defeated the villain. But at the end the false hero is exposed in both stories. Also, in both stories the hero makes two different journeys: one to win the war and return home/court, and one to solve a difficult task and then returning to court. From the point of view of the differences, Story 1 involves magic. The hero tries twice to obtain a magical agent, and the second time he achieves it and gets magical abilities. However, they are not used in the story. The main difference in Story 2 is that at the end the villain is explicitly punished by being tied to a horse tail. It is interesting to note that the first things mentioned by the expert both in the descriptions and the comparison are the characters, although in the com- parison only the most important characters are mentioned, as the others are considered less important for the plot. 141 In addition, the descriptions are based on the most important events in the story, so not all events are considered equally important. The comparison also shows that there is a high similarity between both stories in terms of characters and some of the narrative arcs. For example, the hero returns in both stories but to different places and with different disguises. However, these differences (place and disguise) are not considered as important and the expert finds similarity in what is happening even when the stories are not exactly the same. One of the main differences between the stories is that one of them involves magic, but it is not considered so important because magic is not used in the rest of the story. Finally, the differences in the endings are explicitly addressed in the comparison. This means that the end of the story is an important part of it. 3.3 Computing the Common Structure of Two Stories using Plan Refinements A story in its more basic form can be represented as a sequence of actions, i.e., as a plan. There are different approaches to compute the similarity of two plans. In this paper we use the similarity measure based on plan refinements presented in [15] because it does not only provide a numerical similarity value but an explicit description of the common structure shared by both plans. This common structure can be seen as a directed graph in which each node represents an action and each directed edge represents an ordering constraint. Two actions are connected in the graph only if both actions appear in that order in the plans being compared. Besides the actions and their order, this similarity measure also considers the action parameters and, if they are different in both plans, it is able to infer their common type according to a domain taxonomy. In this way, we are able to detect objects, characters and locations in different stories that have a different name but play the same role in the story. The similarity measure computes this common structure performing succes- sive refinements in the space of partial plans [7]. There are five different types of refinements that specialize a partial plan: to add a new action, to add a new ordering constraint between two existing actions, to specialize the type of a vari- able representing an action parameter according to a domain taxonomy, to unify two different variables, and to replace a variable with a domain constant. The similarity measure works as follows. Let us suppose we want to compare two plans (or stories) p1 and p2 . The similarity measure begins with an empty partial plan (a plan with no actions) that represents any possible plan and thus it is more general than p1 and p2 . Then the partial plan is specialized using a refinement operator (adding new actions and ordering constraints or specializing the action’s parameters) until we reach another partial plan that cannot be specialized anymore while being more general that both p1 and p2 . This partial plan is the most specific generalizer of p1 and p2 , M SGpp1 , p2 q, and represents the common structure shared by the two plans. The length of the refinement chain from the empty plan to the M SGpp1 , p2 q is an indicator of how similar 142 the two plans are. In the same way, the length of the refinement chain from the M SGpp1 , p2 q to each one of the two plans is an indicator of how much information is contained only in one of them but not in the other. The similarity value is computed as the ration between the amount of information shared and the total amount of information contained in the two plans. The last columns of Tables 1 and 2 show the common structure computed by the similarity measure. In this case, the two stories are very similar and the inferred common graph of actions is so simple that, in fact, it can be represented as a sequence of actions. Constants representing characters, locations and ob- jects common to both stories are kept in the common structure, and the other constants are replaced by variables with generalized types (variable names begin with ‘?’). The common structure of both stories could be summarized as follows. A villain declares war on a victim, what triggers the intervention of a hero that defeats him and brings peace back. Then the hero travels disguised and see how a false hero claims that he, and not the original hero, has defeated the villain. The hero leaves, solves a difficult task before some deadline, and comes back disguised. The false hero is exposed in court because he was not able to solve the difficult task. 4 Discussion There are a number of issues that the similarity metric considered here does not take into account. First, the point in the story in which a particular sequence of actions takes place may lead to different results. A marriage at the start of the story sets the scene for later actions, but at the end of the story it usually acts as a reward for the efforts of some character. This influence of context is not considered in the metric that has been described. Second, some events are more significant than others. The presence of a murder in a given story is more significant than that of more mundane events such as setting off on a journey. This aspect might be captured by some kind of weighting of the importance of specific events. The described metric does not allow for this type of behaviour. The judgment expressed by the human placed considerable emphasis on the relative importance of the elements that appear in the stories. Characters are mentioned first, then specific actions. In both cases, a certain degree of abstrac- tion is applied to identify conceptual similarity even between instances that are different. This suggests that taxonomical reasoning might be a useful tool for assessing similarity and that, as expected, abstraction is fundamental in story similarity. These two aspects suggest that automatic story comparison needs to address lifting between different levels of abstraction to be able to match those features that humans are able to match. It also seems that the abstract matching at different levels is a fundamental cognitive tool for comparing stories in humans. 143 This conclusion relates to the approach in [9] of considering similarity between stories in terms of a shared summary, but extended to summarisation with an important degree of abstraction. The work in [11], by virtue of being based on description logic ontologies, does include the possibility of taxonomical reasoning being applied in the process of measuring similarity. It is clear that this particular approach should be explored in more detail in future work. The version of the Propper system that has been employed here provides only limited description of the characters. The descriptions considered are restricted to specification of the roles played in the narrative by particular characters, and a number of properties of particular arguments that are relevant for the correct chaining of later actions with their context of occurrence via their set of preconditions. An important problem from the point of view of assessing the novelty of cre- ative processes is the need to consider an existing set of artifacts as a reference. Generated artifacts are only novel if they are not similar to existing ones. How- ever, from a computational point of view, the approach of keeping a record of all existing artifacts of a given type, and computing the similarity of any newly generated artifacts with this set is not practical [4]. Indexing solutions may be used to improve efficiency, but even so, solutions based on some level of abstrac- tion, away from specific instances and addressing more generic characterisations of the artifacts (in this particular case, stories) would prove more practical in this context. Conformance or departure from Concepts such as conventional end- ings, genre conventions, or character stereotypes may play a fundamental role in assessing the novelty of stories beyond sequences of actions. Overall, it seems that there are a number of aspects of stories that are relevant when attempting to establish similarity between two instances of story. Just how many such aspects should be included in a particular implementation as a similarity metric may depend substantially on the purpose for which it is intended. In the particular case of similarity metrics employed for case-based reasoning, the choice of which aspects of similarity to model should be guided by the particular aspects of the case that will be reused. If the cases are intended to provide story structure, the similarity should focus on story structure. If the cases are intended to inform decisions on the set of characters to employ, the similarity should focus on the set of characters. In relation to the point raised above concerning abstraction, it is important to note that focusing on particular aspects of story similarity may require specific types of abstraction to implement the described lifting operation. Where similarity metrics are used for evaluating novelty in Computational Creativity settings, their use is much broader and it becomes more difficult to focus on particular aspects. Nevetheless, as it is very important to consider issues of efficiency, abstraction as means of reducing the range of attributes that need to be compared will clearly play a fundamental role in practical implementations. 144 5 Conclusions and Future Work The present work describes a process by which a computational system for com- puting the similarity between narrative structures is compared and calibrated against human judgment. A number of issues considered by the human judge but not covered by the system have been discovered. These should be considered as a check list for the consideration of alternative metrics, and possibly as driving guidelines for the development of more elaborate metrics specific to the assessment of story similarity. The work described in this paper has addressed sequential single narrative threads. More complex narratives usually involve parallel story lines which merge or split at several points in the overall narrative. Whether the current metrics are valid for comparing similarity between this kind of narratives or not is yet an open question. Additionally, the use of different structures for stories also opens a new path, namely the application of the current process to stories that, while outputting an equivalent format, are generated by other story generation systems, probably conveying different semantics in the sequence of events, and possibly richer relations between characters. From this point of view, more recent versions of the Propper system [6] address specifically the description of characters as they occur in the story, and they should be explored in further work to extend the metric for similarity to consider differences between the characters of two stories. For that work, it may be necessary to focus on differences between characters fulfilling equivalent narrative roles in the different stories. State is also fundamental in narrative composition and analysis. Narrative understanding of statements like “John squashed the spider” heavily depend on the relation between John and the spider (was it his mascot?). This kind of information must be taken into account in a general model of story similarity. In all cases, further research must look into more metrics for story comparison and employ more experts to analyse how humans evaluate narratives. Following the intuition that we, as humans, perform a complex set of comparisons for evaluating similarity at different levels can lead to the discovery of plausible metrics and plausible aggregation methods into one single judgment. References 1. Fay, M.: Story comparison via simultaneous matching and alignment. In: Workshop on Computational Models of Narrative, 2012 Language Resources and Evaluation Conference (LREC’2012). Istambul, Turkey (2012) 2. Fisseni, B., Lowe, B.: Which dimensions of narratives are relevant for human judg- ments of story equivalence? In: Workshop on Computational Models of Narra- tive, 2012 Language Resources and Evaluation Conference (LREC’2012). Istambul, Turkey (2012) 3. Fisseni, B., Lowe, B.: Event mappings for comparing formal frameworks of narra- tives. Logique et Analyse (57) (2014) 145 4. Gervás, P.: Dynamic inspiring sets for sustained novelty in poetry generation. In: Second International Conference on Computational Creativity. México City, México (2011) 5. Gervás, P.: Propp’s morphology of the folk tale as a grammar for generation. In: Workshop on Computational Models of Narrative, a satellite workshop of CogSci 2013: The 35th meeting of the Cognitive Science Society. Universität Hamburg Hamburg, Germany (2013) 6. Gervás, P.: Computational drafting of plot structures for russian folk tales. Cog- nitive Computation (2015) 7. Kambhampati, S., Knoblock, C.A., Yang, Q.: Planning as refinement search: A uni- fied framework for evaluating design tradeoffs in partial-order planning. Artificial Intelligence 76(1), 167–238 (1995) 8. Krakauer, C., Winston, P.: Story retrieval and comparison using concept patterns. In: Workshop on Computational Models of Narrative, 2012 Language Resources and Evaluation Conference (LREC’2012). Istambul, Turkey (2012) 9. Kypridemou, E., Michael, L.: Narrative similarity as common summary: Eval- uation of behavioral and computational aspects. LLC 29(4), 532–560 (2014), http://dx.doi.org/10.1093/llc/fqu046 10. Michael, L.: Similarity of narratives. In: Workshop on Computational Models of Narrative, 2012 Language Resources and Evaluation Conference (LREC’2012). Is- tambul, Turkey (2012) 11. Peinado, F., Francisco, V., Hervás, R., Gervás, P.: Assessing the novelty of computer-generated narratives using empirical metrics. MINDS AND MACHINES 20(4), 588 (2010) 12. Pérez y Pérez, R., Ortiz, O., Luna, W.A., Negrete, S., Pealoza, E., Castellanos, V., vila, R.: A system for evaluating novelty in computer generated narratives. In: Proceedings of the Second International Conference on Computational Creativity. pp. 63–68. Mxico City, Mxico (2011) 13. Propp, V.: Morphology of the Folk Tale. Akademija, Leningrad (1928) 14. Ritchie, G.: Some Empirical Criteria for Attributing Creativity to a Computer Program. Minds & Machines 17, 67–99 (2007) 15. Sánchez-Ruiz, A.A., Ontañón, S.: Least common subsumer trees for plan retrieval. In: Case-Based Reasoning Research and Development - 22nd International Con- ference, ICCBR 2014, Cork, Ireland, September 29, 2014 - October 1, 2014. Pro- ceedings. pp. 405–419 (2014)