Empirical Evidence of the Limits of Automatic Assessment of Fictional Ideation

A. Tapscott1*, J. Gómez1, C. León1, J. Smailović2, M. Žnidaršič2, P. Gervás1
1 Facultad de Informática, Universidad Complutense de Madrid
2 Department of Knowledge Technologies, Jožef Stefan Institute

Abstract. Automatic evaluation of fictional ideation systems and their output is a topic relevant to Computational Creativity. Models and techniques have been proposed for this task, but their applicability to the field of fictional ideation is limited. In this paper we describe an evaluation procedure for fictional ideation, which compares human validation of the ideas with a number of automatically generated metrics obtained from them. We report on the observed limits of this procedure. The results suggest that, besides technical limitations, providing a stable evaluation method is fundamentally incomplete unless the full creative phenomenon is modelled, including aspects that are beyond current technical capabilities.

Keywords: Automatic evaluation, ideation, empirical study, narrative, computational creativity

1 Introduction

Evaluation of creative processes and artefacts is key to computational creativity. Explicitly reflecting on their relative value and novelty is crucial if machines are to produce content that would be deemed creative [6]. As such, addressing evaluation is fundamental for computational creativity that can successfully fulfill human needs. This crucial aspect contrasts with the relative scarcity of systems explicitly generating rich evaluation of their own generated material or inner processes. Some systems arguably control the quality of their artifacts by carrying out a process that ensures a minimum relative quality, but an explicit evaluation arguably represents a qualitative advantage, both theoretical (as studied by computational creativity frameworks [29]) and practical [4].
* Supported by the projects WHIM (611560) and PROSECCO (600653), funded by the European Commission, Framework Programme 7, the ICT theme, and the Future and Emerging Technologies (FET) programme. Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes.

Although the semantics of creativity are elusive and usually problematic, the view that quality and novelty influence the perception of the creativity of an artifact (at least from the point of view of observation) is commonly accepted. Still, quality and novelty vary depending on the domain and context. Theoretical discussion on this exists and is seminal in the field [1, 2], while other works attempt to offer either formal or procedural techniques for evaluating creativity [18, 25, 30]. These efforts address the evaluation of creativity in generic terms, and they are of limited applicability for the evaluation of the quality of specific artifacts generated automatically. It might be that the assumption of a global definition of creativity applicable to every creative domain does not hold, but more empirical evidence is still needed to establish whether this is so. Moreover, even when working within a domain in which there is an agreed definition of the characteristics assumed to play a role in creativity (say, quality), addressing explicit automatic evaluation can be a costly task, even more costly than creating the generative system being evaluated. It is not uncommon that generating appropriate artefacts is doable, while yielding an explicit, measurable evaluation is not (for instance, in images generated by evolutionary computing [15]). This paper reports on an empirical study in which the output of an automatic ideation system is assessed by computational means. When compared to human evaluation, the conceptual and practical limits of the approach became evident.
This led to an in-depth analysis of the challenges, which is provided in Section 5.

2 Previous Work

While all scientific exploration requires thorough evaluation of the steps taken, doing so in creativity represents a challenge. How to assess creativity itself is a commonly discussed aspect of the whole phenomenon of creative generation. While most authors agree on the correlation between a number of features and the perception of creativity, there is no consensus on what these features are or on how they really correlate. Moreover, adding computers to the problem makes it even more difficult to know whether a system has been successful or not. There is still debate on what parts should be evaluated, the influence of the programmer on the output, the very definition of creative behavior, the decision of whether to focus on the process or the artifacts (or both), and many other questions. The few examples present in the literature describing actual evaluation of automatic creative systems usually focus on less ambitious, more measurable aspects. This makes these systems less useful from a general perspective, but they nonetheless provide insight into the current capabilities of computer systems to assess their own production. There are, however, a number of proposals that try to provide guidelines for evaluating creative systems. For instance, Ritchie [24, 25] addresses the issue of deciding when a program can be considered creative by outlining a set of empirical criteria to measure the creativity of the program in terms of its output. He makes it very clear that he is restricting his analysis to the questions of what factors are to be observed and how these might relate to creativity, specifically stating that he does not intend to build a model of creativity.
Ritchie’s criteria are defined in terms of two observable properties of the results produced by the program: novelty (to what extent the produced item is dissimilar to existing examples of that genre) and quality (to what extent the produced item is a high-quality example of that genre). To measure these aspects, two rating schemes are introduced, which rate the typicality of a given item (“item is typical”) and its quality (“item is good”). Another important issue that affects the assessment of creativity in creative programs is the concept of the inspiring set, the set of (usually highly valued) artifacts that the programmer is guided by when designing a creative program. Ritchie’s criteria are phrased in terms of: what proportion of the results rates well according to each rating scheme, ratios between various subsets of the results (defined in terms of their ratings), and whether the elements in these sets were already present in the inspiring set. Ritchie’s criteria have been used in subsequent evaluations of the output of creative systems [7, 21, 8]. Pease et al. [19] discuss factors relevant to evaluating systems in terms of creativity. The proposed framework mainly takes into account the input provided, the output produced and the process employed. Each of these categories is analysed in depth, detailing its required measures. Before detailing the measurement methods, Pease et al. state their assumptions regarding creativity, admitting their “somewhat arbitrary” nature. The proposed evaluation tests deal with two main aspects: how closely a test predicts human evaluation of creativity, and how possible and practical it is to apply the test to a system. Overall, this work suggests that the very definition of creativity is subjective and that evaluating systems in a general way is problematic. Colton et al. [5] propose an extension of Ritchie’s criteria [24] that attempts to determine the impact of the input data on the creative artifact produced by a system.
This more agnostic approach attempts to obtain an objective measure by comparing the output of the system to the inspirational material used as input. This investigation attempts to discriminate systems that overfit or shuffle input data (fine-tuning) from those producing genuinely novel artifacts. Among other conclusions, the authors state that comparing creative systems might not be viable, suggesting that their criteria be used as guidelines for program construction rather than for post-hoc evaluation. The creative tripod framework, proposed by Colton [3], is built around the premise that a creative system must demonstrate skill, imagination and appreciation. These qualities are not required to be possessed by the system, but rather to be perceived as possessed by the system. This is an important remark by Colton to avoid debates around the definition of creativity. The framework also includes the programmer, the system and the consumer; however, Colton is only interested in the program’s behavior. Pease and Colton [18] propose an alternative to the Turing Test to assess the creativity of computational systems: the FACE (Frame, Aesthetic, Concept, Expression of concept) and IDEA (Iterative Development Execution Appreciation) models. The model includes creative acts and audiences, with relevant measures such as popularity, appeal, provocation, opinion, subversion and shock. By putting the focus on the reaction produced by the creative artifact, this model attempts to avoid the shortcomings of the Turing Test by going further than merely assessing the capacity of a creative system to imitate human behavior. By including the audience in the model, this approach acknowledges the highly subjective nature of creativity evaluation.
SPECS [9], introduced by Jordanous as “a standardised and systematic methodology for evaluating computational creativity”, represents a substantial effort to provide a standard for evaluating the creativity of a system in the field of computational creativity and to address the multi-faceted and subjective nature of creativity. Its flexible nature allows SPECS to adapt to the demands of the researchers’ field, applying the relevant requirements and standards. The methodology informs researchers of their system’s strengths and weaknesses, providing useful feedback for achieving creative results.

2.1 Evaluation of Automatically Generated Narrative

Automatic generation of narratives has been a long-standing goal of Artificial Intelligence since its very beginning. There are a number of systems described in the literature, but the evaluation of these systems – be it of their output, their creative process or any other aspect – is seldom found. This is most likely because the average quality or variety of the generated stories is not really comparable to that of stories written by most humans, not necessarily professional writers. The Mexica system [23] includes procedures for the dynamic assessment of the novelty of a story in progress with respect to previously known stories. Novelty is considered in terms of how the stories differ in the actions they include and their frequency of appearance. Pérez et al. [22] consider three different characteristics as relevant for measuring story novelty: the sequence of actions, the structure of the story, and the use of characters and actions. Peinado and Gervás [20] carried out an empirical study of how generated stories were perceived by a set of human volunteer evaluators. Human judges blindly compared one of the generated basic stories to two alternatives: one rendered directly from a stored fabula of the knowledge base and another randomly generated.
Values were collected for: linguistic quality (how well the text is written), coherence (how well the sequence of events is linked), interest (how interesting the topic of the story is for the reader) and originality (how different the story is from others). León and Gervás [11] propose a model of how quality is evaluated in stories, intended as a tool to drive automatic story generation. Their paper proposes a computational model for story evaluation in which an evaluation function receives stories and outputs a value as the rating for each story. The value of this function is computed from values assigned to: the accumulation of contributions from individual events depending on the meaning of the event (aspects such as whether the reader wants to continue reading the story, or how much danger or love the reader perceives in the story), the appearance of patterns or relationships between the events of a story (aspects such as causality, humour or relative chronology), and inference, which captures the ability to interpret stories by adding material that explains what is told, even if it is not explicitly present in the story. The evaluation function has been implemented as a rule-based system. Ware and Young [27] propose a formal model of narrative conflict with seven dimensions drawn from various narratological sources, meant to aid in distinguishing one conflict from another: participant, subject, duration, balance, directness, intensity and resolution. Their experimental results [28] suggest the model predicts these seven dimensions of narrative conflict similarly to human criteria. Their good results in predicting human-perceived narrative conflict suggest a similar approach may be viable for measures related to creativity.

3 Evaluating Automatic Ideation

Original ideation is central to any creative process. Coming up with innovative ideas that potentially trigger the creation of new material is fundamental to human creativity.
It is not uncommon to focus creative processes on the identification of a single, valuable idea that unlocks new paths leading to finished artifacts. Although human creative teams usually rely on pure ideation to foster creativity, until recently there have only been a few small, ad-hoc studies of how to automate ideation. Section 3.1 describes an effort to provide a system able to produce novel ideas.

3.1 The What-If Machine

Llano et al. have recently proposed an automatic ideation system [13, 14, 12]. This computational system is designed to produce relatively valuable and novel ideas autonomously. This system, the What-If Machine1, includes a module for analysing the ideas and generating narrative metrics, and a module for computing a predictive machine learning model. This model is trained against collected human evaluations of what-ifs, and is intended to learn a robust function from narrative metrics to perceived overall quality. Two main hypotheses guide the design of the What-If Machine and the presented research:

1. There is a strong correlation between the perceived overall quality and the perceived narrative potential, in the sense that if the audience perceives high narrative potential, it will also perceive high overall quality. The overall quality is defined in terms of the analysed response from humans (i.e. no specific model beyond what humans say about quality is assumed), and the narrative potential is assumed to be directly proportional to the amount and quality of the stories a certain what-if can trigger or inspire.
2. There is a set of computable metrics whose values correlate (directly or indirectly) with the overall quality and the narrative potential.

The What-If Machine is, to the best of our knowledge, the only attempt to implement a computer system able to produce novel what-if ideas.

1 The What-If Machine: http://www.whim-project.eu/.
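Hypothesis 2 essentially calls for finding computable metrics whose values correlate with human ratings. As a minimal illustration (not part of the What-If Machine implementation; the data and names below are made up), such a check amounts to computing a Pearson product-moment correlation between one metric and the averaged ratings per what-if:

```python
import statistics

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up per-what-if values: one computed metric and the average human rating.
metric_values = [0.10, 0.40, 0.35, 0.80, 0.55]
avg_quality = [1.2, 2.5, 2.1, 4.0, 3.1]

r = pearson(metric_values, avg_quality)
```

A metric would then be considered a candidate predictor when `r` is consistently far from zero across the evaluated what-ifs.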
The What-If Machine is a distributed computer system in which several modules collaborate in order to output rendered what-ifs. Five modules compose the system:

1. The ideation module produces, using a knowledge base, what-if ideas formalized as mini-narratives.
2. The mini-narratives are fed into the narrative-based metric generation module, which generates values for a set of metrics that hypothetically correlate with the human perception of quality. These metrics are based on narrative properties of the what-ifs.
3. The mini-narratives, now enriched with their corresponding metrics, are sent to a crowd-sourcing evaluation module, which applies machine learning to create and refine models for predicting overall quality against human ratings.
4. The world view creation module provides knowledge for what-if generation, story creation and metric computation.
5. The finished, filtered what-ifs are finally passed to a rendering module, which creates artifacts from the final what-ifs (stories, texts or images, for instance).

A subset of the What-If Machine (modules 1, 2 and 3) was used to generate the material for the study, which is described in detail in Section 4.

4 Study

A pilot study was performed to determine the feasibility of predicting the perceived quality and narrative potential of the artifacts created by a computational creative system. Both magnitudes have been introduced in the previous section, and in order to avoid influencing our subjects, no definition for them is provided in the questionnaires (as seen in Fig. 1). This naive approach is a result of our focus on the model and its capability to predict human assessment, instead of introducing our own views or definitions. The study was conducted to obtain human ratings of perceived quality and narrative potential. Using both measures, a machine learning process will search for correlations between a number of metrics (detailed in the next section) and the perceived quality and perceived narrative potential.
This should allow us to determine which measures are relevant to predict human-perceived quality and narrative potential, in order to produce what-ifs that present both qualities to human observers.

4.1 Metrics

Since we have no certainty about which metrics extracted from each what-if’s mini-narrative may impact the perceived quality and narrative potential, we focused on generating the maximum number of computable features. The impact of these features on the perceived quality and narrative potential may be obtained with machine learning techniques (we refer to these features as metrics). This approach is similar to the one used by Nowak for image classification [17], which generates a high number of arbitrary features from each image.

A mini-narrative is a structure that contains a set of narrative points linked to schemas like setting or resolution. Each narrative point is a set of narrative statements that provide information about characters or events through predicates (e.g. dog is old or dog learns to play a piano). Narrative statements may be related to one another (caused by or inferred by another statement).

The following list includes the set of implemented features along with their descriptions:

– Length: the number of narrative points in the mini-narrative.
– SettingQuality: the number of schemas divided by 3.
– ExplicitFact: the number of narrative statements in the mini-narrative.
– RatioCharacters: the character/statement ratio.
– Originality: hits returned for the full text of the mini-narrative by the Bing search engine.
– OriginalityAccurate: hits returned for the exact full text of the mini-narrative by the Bing search engine.
– Divergence: average hits returned for the mini-narrative statements by the Bing search engine.
– DivergenceMinimum: minimum hits returned for the mini-narrative statements by the Bing search engine.
– Evolution: the number of learnTo predicates found in the mini-narrative.
– Handicap: the number of negated capableOf predicates found in the mini-narrative.
– InterestingLife: the number of negated doesFor predicates found in the mini-narrative.
– TotalStoriesGenerated: the number of stories generated by the story generator from the current mini-narrative.
– StoryCharacters: the average number of characters in the generated stories.
– Names: StanfordNLP [16] queries for the what-if’s names.
– NamesRatio: the Names/ExplicitFact ratio.
– Valence: a sum over statements, each statement codified as +1 if a fact is positive, -1 if negative and 0 otherwise.
– ValenceAverage: the Valence/ExplicitFact ratio.
– JointWordsProbability: the average joint probability for each set of words using n-grams. For this metric we use the Project Oxford2 services.
– JointWordsProbabilityMinimum: the minimum joint probability for the set of words using n-grams from Project Oxford.
– RealityDistortionRatio: events in the mini-narrative that negate a fact from the knowledge base are considered reality distortions. This metric provides the reality distortion count/ExplicitFact ratio.
– FictionalAdditionsRatio: any event in the mini-narrative that is missing from the knowledge base is considered a fictional addition. This metric provides the fictional addition count/ExplicitFact ratio.
– FictionalRatio: (reality distortion count plus fictional addition count)/ExplicitFact.
– ResolutionTriggerRatio: resolution events solve conflicts in the mini-narrative. Provides the resolution event count/ExplicitFact ratio.
– MainCharacterEventsRatio: protagonist statements are statements in which this actor plays any role. This metric provides the protagonist statement count/ExplicitFact ratio.

2 https://www.projectoxford.ai/

4.2 Methodology

A set of 890 what-ifs was generated by the What-If Machine. All of their source mini-narratives were processed by the metric generation system. A total of 15 different questionnaires were created, each including 10 what-ifs rendered as text from the original set of 890. In total, 150 what-ifs were included in the evaluation set.
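The paper does not publish the implementation of these metrics. As a hedged sketch under stated assumptions, the purely structural ones (Length, ExplicitFact, RatioCharacters, Valence, ValenceAverage) could be computed from a mini-narrative representation along these lines; all class and field names here are hypothetical, not the What-If Machine’s actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    # One narrative statement, e.g. the predicate "dog learns to play a piano".
    predicate: str
    characters: list            # names of the characters involved
    valence: int = 0            # +1 positive fact, -1 negative, 0 otherwise

@dataclass
class MiniNarrative:
    # Narrative points, each a list of statements (schemas omitted for brevity).
    points: list = field(default_factory=list)

    def statements(self):
        return [s for point in self.points for s in point]

def length(mn):                 # Length: number of narrative points
    return len(mn.points)

def explicit_fact(mn):          # ExplicitFact: number of narrative statements
    return len(mn.statements())

def ratio_characters(mn):       # RatioCharacters: character/statement ratio
    chars = {c for s in mn.statements() for c in s.characters}
    return len(chars) / explicit_fact(mn)

def valence_sum(mn):            # Valence: sum of per-statement valences
    return sum(s.valence for s in mn.statements())

def valence_average(mn):        # ValenceAverage: Valence/ExplicitFact
    return valence_sum(mn) / explicit_fact(mn)

# A toy mini-narrative with two narrative points and three statements.
mn = MiniNarrative(points=[
    [Statement("dog is old", ["dog"], -1)],
    [Statement("dog learns to play a piano", ["dog"], +1),
     Statement("owner is proud", ["owner"], +1)],
])
```

The search-engine and language-model metrics (Originality, JointWordsProbability, etc.) would require external services and are omitted here.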
101 volunteers received by email a link that randomly redirected to one of the 15 questionnaires. Given the simplicity of the questions, Google Forms was our platform of choice. The platform was robust and stable, and all of the answers were automatically stored in a Google Sheets document. There was no active supervision of the subjects, given the remote nature and limitations of the Google Forms platform.

4.3 Questionnaire

The questionnaire informed subjects about their participation in a study related to computer-generated content (Figure 1). Some demographic information was collected (age, gender and English level), and subjects were then asked to evaluate the overall quality (on a 0-5 Likert scale) of each what-if, plus its narrative potential (yes/no binary answer). A text box accepting any comment was also provided in order to gather additional qualitative information.

You are about to evaluate some of the preliminary results of the “WHIM: The What-If Machine” research project from the European Union. The overall objective of the What-If Machine is to automatically generate fictional ideas with cultural value. You will be presented a number of what-if style ideas and we kindly ask you to rate them according to the following features:
– Overall quality: from 0 (no quality) to 5 (superb quality).
– Narrative potential (yes/no).
– Any observation you can provide.
Completing the questionnaire should not take more than 10 minutes. We really appreciate your contribution to the project.

Fig. 1. Information presented to the user in the evaluation questionnaire.

4.4 Results

101 subjects participated in the study. Statistical analysis of the results revealed no significant differences between evaluators in terms of English level, age or gender. For instance, the quality (Q) by gender yielded µ(Q)male = 2.66, σ(Q)male = 0.75 and µ(Q)female = 2.69, σ(Q)female = 0.89. The corresponding results for English level and age are comparable.
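Per-group comparisons of this kind (mean and standard deviation of quality by demographic group) need nothing beyond the standard library. A minimal sketch with made-up ratings, not the study data:

```python
import statistics
from collections import defaultdict

def per_group_stats(ratings):
    # Mean and sample standard deviation of quality ratings per group label.
    by_group = defaultdict(list)
    for group, quality in ratings:
        by_group[group].append(quality)
    return {g: (statistics.mean(v), statistics.stdev(v))
            for g, v in by_group.items()}

# Made-up (group label, 0-5 quality) pairs for illustration only.
sample = [("male", 2), ("male", 3), ("male", 3),
          ("female", 2), ("female", 3), ("female", 4)]
stats_by_group = per_group_stats(sample)
```

Comparable means and overlapping deviations across groups, as reported above, would suggest no evaluator-demographic effect, although a formal significance test would be needed to confirm it.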
Questionnaires provided 1,007 Quality and 1,004 Narrative Potential ratings for the 150 what-ifs used. Each what-if was rated between 1 and 27 times. For the Narrative Potential (P) measurements, we mapped “Yes” to +1, “Not sure” to 0, and “No” to -1. Overall measures resulted in µ(Q) = 2.4 and σ(Q) = 1.3 for Quality, and µ(P) = -0.05 and σ(P) = 0.89 for Narrative Potential. Aggregated rating values for individual what-ifs were used for calculating:

– Pairwise correlations between perceived Quality and perceived Narrative Potential, between perceived Quality or perceived Narrative Potential and the metrics, and between individual metrics.
– A global measure of attribute importance for these metrics in predictive modelling of the average perceived Quality or perceived Narrative Potential.

Pairwise correlations. Metrics that provided the same values for all what-ifs in the dataset were discarded. Correlation coefficients were calculated as Pearson product-moment correlations. There is a strong positive correlation between Quality and Narrative Potential averages (0.83) and medians (0.758). As seen in Table 1, both measures correlate positively with some metrics, such as MainCharacterEventsRatio and RatioCharacters, and negatively with others, such as ExplicitFact and Length.

Table 1. Correlation coefficients between average/median Quality (Q) or Narrative Potential (P) labels and the metrics, sorted by the correlation coefficient of the average Quality.

                         Avg Q   Mdn Q   Avg P   Mdn P
MainCharEventsRatio      0.371   0.346   0.379   0.329
RatioCharacters          0.354   0.296   0.368   0.307
ResolutionTriggerRatio   0.342   0.303   0.305   0.261
TotalStoriesGenerated    0.312   0.250   0.321   0.264
JointWordsProbMin        0.308   0.289   0.367   0.314
...                      ...     ...     ...     ...
ValenceAverage          -0.219  -0.188  -0.296  -0.249
ValenceSum              -0.258  -0.234  -0.323  -0.276
StoryCharacters         -0.283  -0.269  -0.327  -0.285
ExplicitFact            -0.379  -0.336  -0.406  -0.345
Length                  -0.379  -0.336  -0.406  -0.345

Importance for Predictive Modeling. In order to determine the importance of each metric in predicting perceived Quality and Narrative Potential, we used the Relief measure [10, 26], a method commonly used for feature selection in machine learning. This measure does not assume independence among the metrics but takes their possible interdependence into account. The more positive a Relief score is, the more the metric contributes to the prediction of the target value (in our case, the average Quality or the average Narrative Potential). Metrics that score close to zero are irrelevant, and those with negative values even have a negative impact. According to the results in Table 2, it seems that most of the metrics are of no use in predictive models of average Quality. For average Narrative Potential, however, most of the metrics seem to be slightly informative. According to the Relief ranks of the metrics, their usefulness for average Quality is to some extent inversely proportional to their usefulness for average Narrative Potential. The absolute values of the Relief scores depend on the characteristics of the data and the parameters of the assessment, which makes it difficult to use absolute thresholds for judgements on the relevance of features. However, the strong correlation between the Quality and Narrative Potential values, together with the mismatch of the Relief scores of the metrics for these two targets, indicates that the contributions of even the positively scored metrics are likely too low to be considered relevant.

Table 2. Relief measure results for average Quality (Relief Avg Q) and average Narrative Potential (Relief Avg P). Rows sorted by Relief Avg Q.
The best three results are in bold and the worst three are in italics.

Metric                          Relief Avg Q   Relief Avg P
Handicap                             0.027        -0.009
MainCharacterEventsRatio             0.007         0.004
NamesRatio                           0.001         0.006
DivergenceMinimum                    0.000         0.000
JointWordsProbabilityMinimum         0.000         0.000
Divergence                           0.000         0.000
Originality                         -0.006         0.013
...                                  ...           ...
FictionalAdditionsRatio             -0.075         0.028
InterestingLife                     -0.116         0.045
TotalStoriesGenerated               -0.116         0.045
OriginalityAccurate                 -0.126         0.024
FictionalRatio                      -0.142         0.039
RatioCharacters                     -0.142         0.039
SettingQuality                      -0.147         0.024
Names                               -0.147         0.024
ValenceSum                          -0.174         0.033

5 Relative Limits of Evaluating Quality

The results presented previously evidence a strong correlation between narrative potential and the perceived overall quality of a what-if, which indicates that focusing on narrative plausibility as one of the main factors of quality can lead to better results. Moreover, some of the metrics are weakly correlated with narrative potential. However, these results are still inconclusive, and there are a number of aspects worth mentioning for their influence on the results.

Automatically generating stories and computing useful values for metrics is heavily dependent on the available knowledge. The outcome of the system is constrained by the use of ConceptNet. The number of relations in ConceptNet that can be safely used is small, and the richness and depth of the chains of properties is limited with regard to its use as a source for narrative processing. This makes it necessary to address knowledge management from a different perspective. The WHIM project currently includes a whole module for providing robust knowledge to the rest of the modules, and the impact of this subsystem on the creation and evaluation of what-if ideas will be reported once the results are ready.

The generation process (for the what-ifs, the stories and the metrics) strongly influences the overall outcome.
Many design decisions were taken in order to provide a working, implemented prototype able to generate actual what-ifs, and these decisions determine the kind of what-ifs generated, the complexity of the stories and many other aspects. The reported results are thus the outcome of a specific implementation and do not claim any generality. However, the approach itself (namely the generation-metric computation-evaluation process) is presented as a generally applicable method for producing novel what-if ideas.

The metrics used for labeling narrative properties do not cover all computable features. There is a large number of aspects that can be extracted from a what-if, and the narrative-based feature extraction module of the What-If Machine does not currently cover all of them. This is considered not strictly relevant with regard to the methodology and scope of the study. To test the second hypothesis (the existence of a correlation between a certain set of metrics and the overall quality and plausibility), the metrics must be improved. For that purpose, the presented study gives valuable insight into which direction to take next.

The weak correlation between our metrics and the quality perceived by humans suggested that more sophisticated metrics were necessary. Some of them were considered:

1. Humanization: an approximation of how human-like the main character is, assuming that fictional scenarios use characters that, while behaving like humans, can be non-human.
2. Empathy: how much empathy a reader will feel for the characters.
3. Tragedy: the amount of tragedy in the story.
4. Reality: how real and current the context is; an approximation of fictionality in terms of context.
5. TimeSpan: the time span the story covers. It could be minutes, days or years.
Modelling and implementing these metrics proved to be beyond current technical capabilities, because they require complex, rich knowledge bases (1, 4), reliable text understanding systems (5), sophisticated emotional models (2) or formal versions of narratological models (3). None of these resources is currently available.

6 Conclusions

This paper has presented a pilot study aiming to gain insight into two hypotheses, namely that (1) human evaluation of the overall quality of what-if ideas correlates with the perception of narrative potential, and that (2) there is a set of computable metrics that also correlate with this perception. The study has evidenced that there is a strong correlation between quality and narrative potential for humans (1), but failed to prove such a strong correlation between the current metrics and the human ratings. These results have been analysed and discussed in terms of the limited potential of the current implementation of both the fictional ideation procedure and the method employed to evaluate it. Current implementations lack the required complexity to approximate evaluations with a relatively acceptable level of accuracy, mainly due to the limited technical capabilities of current computational solutions.

References

1. Boden, M.: Computational Models of Creativity. Handbook of Creativity, pp. 351–373 (1999)
2. Boden, M.: The Creative Mind: Myths and Mechanisms. Routledge, New York, NY (2003)
3. Colton, S.: Creativity Versus the Perception of Creativity in Computational Systems. In: Proceedings of the AAAI Spring Symposium on Creative Systems, pp. 14–20 (2008)
4. Colton, S.: The Painting Fool: Stories from building an automated painter. Computers and Creativity, pp. 3–38 (2012)
5. Colton, S., Pease, A., Ritchie, G.: The effect of input knowledge on creativity. Technical report (2001), http://www.inf.ed.ac.uk/publications/online/0055.pdf
6.
Colton, S., Wiggins, G.: Computational creativity: The final frontier? In: ECAI (2012)
7. Gervás, P.: Linguistic creativity at different levels of decision in sentence production. In: Proceedings of the AISB'02 Symposium on AI and Creativity in Arts and Science, Imperial College, pp. 79–88 (2002)
8. Haenen, J., Rauchas, S.: Investigating artificial creativity by generating melodies, using connectionist knowledge representation. In: The Third Joint Workshop on Computational Creativity (2006), http://ccg.doc.gold.ac.uk/events/ecai06/proceedings/Haenen.pdf
9. Jordanous, A.: A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What it is to be Creative. Cognitive Computation 4(3), 246–279 (2012), http://dblp.uni-trier.de/db/journals/cogcom/cogcom4.html#Jordanous12
10. Kira, K., Rendell, L.: A practical approach to feature selection. In: Proceedings of the Ninth International Workshop on Machine Learning, pp. 249–256 (1992)
11. León, C., Gervás, P.: The Role of Evaluation-Driven Rejection in the Successful Exploration of a Conceptual Space of Stories. Minds and Machines 20(4), 615–634 (2010)
12. Llano, M.T., Colton, S., Hepworth, R., Gow, J.: Automated Fictional Ideation via Knowledge Base Manipulation. Cognitive Computation, pp. 1–22 (2016)
13. Llano, M.T., Cook, M., Guckelsberger, C.: Towards the automatic generation of fictional ideas for games. In: Experimental AI in . . . (2014)
14. Llano, M.T., Hepworth, R.: Automating fictional ideation using ConceptNet. In: Proceedings of the . . . (2014)
15. Machado, P., Martins, T., Amaro, H., Abreu, P.: Beyond interactive evolution: Expressing intentions through fitness functions. Leonardo (2015)
16. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: ACL (System Demonstrations), pp. 55–60 (2014)
17.
Nowak, E., Jurie, F., Triggs, B.: Sampling Strategies for Bag-of-Features Image Classification. pp. 490–503. Springer Berlin Heidelberg (2006)
18. Pease, A., Colton, S.: On impact and evaluation in computational creativity: A discussion of the Turing test and an alternative proposal. In: AISB 2011: Computing and Philosophy, pp. 15–22 (2011)
19. Pease, A., Winterstein, D., Colton, S.: Evaluating machine creativity. In: 4th Workshop on Creative Systems (2001)
20. Peinado, F., Gervás, P.: Evaluation of Automatic Generation of Basic Stories. New Generation Computing, Computational Paradigms and Computational Intelligence, Special Issue: Computational Creativity 24(3), 289–302 (2006)
21. Pereira, F.C., Hervás, R., Gervás, P., Cardoso, A.: A Multiagent Text Generator with Simple Rhetorical Habilities. In: Proceedings of the AAAI-06 Workshop on Computational Aesthetics: AI Approaches to Beauty and Happiness. AAAI Press (2006)
22. Pérez y Pérez, R., Ortiz, O., Luna, W., Negrete, S.: A system for evaluating novelty in computer generated narratives. Creativity (2011)
23. Pérez y Pérez, R.: MEXICA: A Computer Model of Creativity in Writing. Ph.D. thesis, The University of Sussex (1999)
24. Ritchie, G.: Assessing creativity. In: Proceedings of the AISB Symposium on AI and Creativity in Arts and Science, pp. 3–11. York, UK
25. Ritchie, G.: Some Empirical Criteria for Attributing Creativity to a Computer Program. Minds & Machines 17, 67–99 (2007)
26. Robnik-Šikonja, M., Kononenko, I.: An adaptation of Relief for attribute estimation in regression. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML 1997), pp. 296–304 (1997)
27. Ware, S.G., Young, R.M.: Validating a Plan-Based Model of Narrative Conflict. In: Proceedings of the International Conference on the Foundations of Digital Games, pp. 220–227. ACM Press, New York (2012)
28.
Ware, S.G., Young, R.M., Harrison, B., Roberts, D.L.: Four Quantitative Metrics Describing Narrative Conflict. pp. 18–29. Springer Berlin Heidelberg (2012)
29. Wiggins, G.: A preliminary framework for description, analysis and comparison of creative systems. Knowledge-Based Systems 19(7) (2006)
30. Wiggins, G.: Searching for Computational Creativity. New Generation Computing, Computational Paradigms and Computational Intelligence, Special Issue: Computational Creativity 24(3), 209–222 (2006)