<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Empirical Evidence of the Limits of Automatic Assessment of Fictional Ideation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">A</forename><surname>Tapscott</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad Complutense de Madrid</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">J</forename><surname>Gómez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad Complutense de Madrid</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">C</forename><surname>León</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad Complutense de Madrid</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">J</forename><surname>Smailović</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Knowledge Technologies</orgName>
								<orgName type="institution">Jožef Stefan Institute</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">M</forename><surname>Žnidaršič</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Knowledge Technologies</orgName>
								<orgName type="institution">Jožef Stefan Institute</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">P</forename><surname>Gervás</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Facultad de Informática</orgName>
								<orgName type="institution">Universidad Complutense de Madrid</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Empirical Evidence of the Limits of Automatic Assessment of Fictional Ideation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4557B448362D758C732ABC0D1A608FF6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T03:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Automatic evaluation</term>
					<term>ideation</term>
					<term>empirical study</term>
					<term>narrative</term>
					<term>computational creativity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Automatic evaluation of fictional ideation systems and their output is a topic relevant to Computational Creativity. Models and techniques have been proposed for this task, but their applicability to the field of fictional ideation is limited. In this paper we describe an evaluation procedure for fictional ideation which compares human validation of the ideas with a number of automatically computed metrics obtained from them. We report on the observed limits of this procedure. The results suggest that, beyond technical limitations, any stable evaluation method remains fundamentally incomplete unless the full creative phenomenon is modelled, including aspects that are beyond current technical capabilities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Evaluation of creative processes and artefacts is key to computational creativity. Explicitly reflecting on the relative value and novelty of generated material is crucial if machines are to produce content that would be deemed creative <ref type="bibr" target="#b5">[6]</ref>. As such, addressing evaluation is fundamental if computational creativity is to successfully fulfill human needs.</p><p>This crucial aspect contrasts with the relative scarcity of systems that explicitly generate a rich evaluation of their own output or inner processes. Some systems arguably control the quality of their artifacts by carrying out a process that ensures a minimum relative quality, but an explicit evaluation represents a qualitative advantage, both theoretical (as studied by computational creativity frameworks <ref type="bibr" target="#b28">[29]</ref>) and practical <ref type="bibr" target="#b3">[4]</ref>.</p><p>Although the semantics of creativity are elusive and usually problematic, the view that quality and novelty influence the perception of the creativity of an artifact (at least from the point of view of observation) is commonly accepted. Still, quality and novelty vary depending on the domain and context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Previous Work</head><p>While all scientific exploration requires thorough evaluation of the steps taken, doing so in creativity represents a challenge. How to assess creativity itself is a commonly discussed aspect of the whole phenomenon of creative generation. While most authors agree on the correlation between a number of features and the perception of creativity, there is no consensus either on what these features are or on how they really correlate. Moreover, adding computers to the problem makes it even more difficult to know whether a system has been successful or not. There is still a debate on which parts should be evaluated, the influence of the programmer on the output, the very definition of creative behavior, the decision of whether to focus on the process or the artifacts (or both), and many other issues.</p><p>The few examples present in the literature describing actual evaluation of automatic creative systems usually focus on less ambitious, more measurable aspects. This makes these systems less useful from a general perspective, but they nonetheless provide insight into the current capabilities of computer systems to assess their own production.</p><p>There are, however, a number of proposals that try to provide guidelines for evaluating creative systems. For instance, Ritchie <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25]</ref> addresses the issue of deciding when a program can be considered creative by outlining a set of empirical criteria to measure the creativity of the program in terms of its output. He makes it very clear that he restricts his analysis to the questions of which factors are to be observed and how these might relate to creativity, specifically stating that he does not intend to build a model of creativity. 
Ritchie's criteria are defined in terms of two observable properties of the results produced by the program: novelty (to what extent the produced item is dissimilar to existing examples of that genre) and quality (to what extent the produced item is a high-quality example of that genre). To measure these aspects, two rating schemes are introduced, which rate the typicality of a given item (the item is typical) and its quality (the item is good). Another important issue that affects the assessment of creativity in creative programs is the concept of the inspiring set, the set of (usually highly valued) artifacts that guides the programmer when designing a creative program. Ritchie's criteria are phrased in terms of: what proportion of the results rates well according to each rating scheme, ratios between various subsets of the results (defined in terms of their ratings), and whether the elements in these sets were already present in the inspiring set. Ritchie's criteria have been used in subsequent evaluations of the output of creative systems <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b7">8]</ref>.</p><p>Pease et al. <ref type="bibr" target="#b18">[19]</ref> discuss factors relevant to evaluating systems in terms of creativity. The proposed framework mainly takes into account the input provided, the output produced and the process employed. Each of these categories is described in depth, along with its required measures. Before detailing the measurement methods, Pease et al. state their assumptions regarding creativity, admitting their 'somewhat arbitrary' nature. The evaluation tests proposed deal with two main aspects: how closely the test predicts human evaluation of creativity, and how feasible and practical it is to apply the test to a system. Overall, this work suggests that the very definition of creativity is subjective and that evaluating systems in a general way is problematic.</p><p>Colton et al. 
<ref type="bibr" target="#b4">[5]</ref> propose an extension of Ritchie's criteria <ref type="bibr" target="#b23">[24]</ref> that attempts to determine the impact of the input data on the creative artifact produced by a system. This more agnostic approach attempts to obtain an objective measure by comparing the output of the system to the inspirational material used as input. This investigation attempts to discriminate systems that overfit or shuffle input data (fine-tuning) from those producing genuinely novel artifacts. Among other conclusions, the authors state that comparing creative systems might not be viable, suggesting that their criteria be used as guidelines for program construction rather than for post-hoc evaluation.</p><p>The creative tripod framework, proposed by Colton <ref type="bibr" target="#b2">[3]</ref>, is built around the premise that a creative system must demonstrate skill, imagination and appreciation. These qualities are not required to be possessed by the system, but rather to be perceived as possessed by the system; this is an important remark by Colton, made to avoid debates around the definition of creativity. The framework also includes the programmer, the system and the consumer; however, Colton focuses only on the program's behavior.</p><p>Pease and Colton <ref type="bibr" target="#b17">[18]</ref> propose an alternative to the Turing Test for assessing the creativity of computational systems: the FACE (Frame, Aesthetic, Concept, Expression of concept) and IDEA (Iterative Development, Execution, Appreciation) models. The models include creative acts and audiences, with relevant measures such as popularity, appeal, provocation, opinion, subversion and shock. By putting the focus on the reaction produced by the creative artifact, this approach attempts to avoid the shortcomings of the Turing Test by going further than merely assessing the capacity of a creative system to imitate human behavior. 
By including the audience in the model, this approach acknowledges the highly subjective nature of creativity evaluation.</p><p>SPECS <ref type="bibr" target="#b8">[9]</ref>, introduced by Jordanous as "a standardised and systematic methodology for evaluating computational creativity", represents a substantial effort to provide a standard for evaluating the creativity of a system in the field of computational creativity and to address the multi-faceted and subjective nature of creativity. Its flexible nature allows SPECS to adapt to the demands and standards of each researcher's field. The methodology informs researchers of their system's strengths and weaknesses, providing useful feedback for achieving creative results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Evaluation of Automatically Generated Narrative</head><p>Automatic generation of narratives has been a long-standing goal of Artificial Intelligence since its very beginning. There are a number of systems described in the literature, but evaluation of these systems -be it of their output, their creative process or any other aspect -is seldom found. This is most likely due to the fact that the average quality or variety of the generated stories is not really comparable to that of stories written by most humans, even non-professional writers.</p><p>The Mexica system <ref type="bibr" target="#b22">[23]</ref> includes procedures for the dynamic assessment of the novelty of a story in progress with respect to previously known stories. Novelty is considered in terms of how the stories differ in the actions they include and their frequency of appearance.</p><p>Pérez et al. <ref type="bibr" target="#b21">[22]</ref> consider three different characteristics relevant for measuring story novelty: the sequence of actions, the structure of the story, and the use of characters and actions.</p><p>Peinado &amp; Gervás <ref type="bibr" target="#b19">[20]</ref> carried out an empirical study of how generated stories were perceived by a set of human volunteer evaluators. Human judges blindly compared one of the generated basic stories to two alternatives: one rendered directly from a stored fabula of the knowledge base and another generated randomly. Values were collected for: linguistic quality (how well the text is written), coherence (how well the sequence of events is linked), interest (how interesting the topic of the story is for the reader) and originality (how different the story is from others).</p><p>León &amp; Gervás <ref type="bibr" target="#b10">[11]</ref> propose a model, intended as a tool to drive automatic story generation, of how quality is evaluated in stories. 
That work proposes a computational model for story evaluation in which an evaluation function receives stories and outputs a value as the rating for each story. The value of this function is computed from: the accumulation of contributions from individual events depending on their meaning (aspects such as whether the reader wants to continue reading the story, or how much danger or love the reader perceives in it); the appearance of patterns or relationships between the events of a story (aspects such as causality, humour or relative chronology); and inference, which captures the ability to interpret stories by adding material to explain what is told, even when it is not explicitly present in the story. The evaluation function has been implemented as a rule-based system.</p><p>Ware et al. <ref type="bibr" target="#b26">[27]</ref> propose a formal model of narrative conflict with seven dimensions, drawn from various narratological sources and meant to aid in distinguishing one conflict from another: participant, subject, duration, balance, directness, intensity and resolution. Their experimental results <ref type="bibr" target="#b27">[28]</ref> suggest the model predicts these seven dimensions of narrative conflict similarly to human criteria. These good results in predicting human-perceived narrative conflict suggest that a similar approach may be viable for measures related to creativity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Evaluating Automatic Ideation</head><p>Original ideation is central to any creative process. Coming up with innovative ideas that potentially trigger the creation of new material is fundamental to human creativity. It is not uncommon to focus creative processes on the identification of a single, valuable idea that unlocks new paths leading to finished artifacts. Although human creative teams usually rely on pure ideation to foster creativity, until recently there had only been a few small, ad-hoc studies of how to automate ideation. Section 3.1 describes an effort to provide a system able to produce novel ideas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">The What-If Machine</head><p>Llano et al. have recently proposed an automatic ideation system <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b11">12]</ref>. This computational system is designed to autonomously produce relatively valuable and novel ideas. The system, the What-If Machine<ref type="foot" target="#foot_0">1</ref>, includes a module for analysing the ideas and generating narrative metrics, and a module for computing a predictive machine learning model. This model is trained against collected human evaluations of what-ifs, and is intended to learn a robust function from narrative metrics to perceived overall quality. Two main hypotheses guide the design of the What-If Machine and the research presented here:</p><p>1. There is a strong correlation between the perceived overall quality and the perceived narrative potential, in the sense that if the audience perceives high narrative potential, it will also perceive high overall quality. A subset of the What-If Machine (modules 1, 2 and 3) was used to generate the material for the study, which is described in detail in Section 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Study</head><p>A pilot study was performed to determine the feasibility of predicting the perceived quality and narrative potential of the artifacts created by a computational creative system. Both magnitudes were introduced in the previous section and, in order to avoid influencing our subjects, no definition of them is provided in the questionnaires (as seen in Fig. <ref type="figure" target="#fig_0">1</ref>). This naive approach is a result of our focus on the model and its capability to predict human assessment, rather than on introducing our own views or definitions. The study was conducted to obtain human ratings of perceived quality and narrative potential.</p><p>Using both measures, a machine learning process searches for correlations between a number of metrics (detailed in the next section) and the perceived quality and perceived narrative potential. This should allow us to determine which measures are relevant for predicting human-perceived quality and narrative potential, in order to produce what-ifs that present both qualities to human observers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Metrics</head><p>Since we have no certainty about which metrics extracted from each what-if's mini-narrative may have an impact on the perceived quality and narrative potential, we focused on generating the maximum number of computable features. The impact of these features on the perceived quality and narrative potential may then be determined with machine learning techniques (we refer to these features as metrics). This approach is similar to the one used by Nowak for image classification <ref type="bibr" target="#b16">[17]</ref>, which generates a high number of arbitrary features from each image.</p><p>A mini-narrative is a structure that contains a set of narrative points linked to schemas such as setting or resolution. Each narrative point is a set of narrative statements that provide information about characters or events through predicates (e.g. dog is old or dog learns to play a piano). Narrative statements may be related to one another (caused by or inferred by another statement).</p><p>The following list includes the set of implemented features along with their descriptions:</p><p>-Length: the number of narrative points in the mini-narrative.</p><p>-SettingQuality: the number of schemas divided by 3.</p><p>-ExplicitFact: the number of narrative statements in the mini-narrative.</p><p>-RatioCharacters: the character/statement ratio.</p><p>-Originality: the number of hits returned by the full text of the mini-narrative in the Bing search engine.</p><p>-OriginalityAccurate: the number of hits returned by the exact full text of the mini-narrative in the Bing search engine.</p><p>-Divergence: the average number of hits returned by the mini-narrative statements in the Bing search engine.</p><p>-DivergenceMinimum: the minimum number of hits returned by the mini-narrative statements in the Bing search engine.</p><p>-Evolution: the number of learnTo predicates found in the mini-narrative.</p><p>-Handicap: the number of negated capableOf predicates found in the mini-narrative.</p><p>-InterestingLife: the number of negated doesFor predicates found in the mini-narrative.</p><p>-TotalStoriesGenerated: the number of stories generated by the story generator from the current mini-narrative.</p><p>-StoryCharacters: the average number of characters in the generated stories.</p><p>-Names: the number of names found in the what-if via StanfordNLP <ref type="bibr" target="#b15">[16]</ref> queries.</p><p>-NamesRatio: the Names/ExplicitFact ratio.</p><p>-Valence: the sum over statements, each statement codified as +1 if its fact is positive, -1 if negative and 0 otherwise.</p><p>-ValenceAverage: the Valence/ExplicitFact ratio.</p><p>-JointWordsProbability: the average joint probability of each set of words, computed using n-grams. For this metric we use the Project Oxford<ref type="foot" target="#foot_1">2</ref> services.</p><p>-JointWordsProbabilityMinimum: the minimum joint probability of the sets of words, using n-grams from Project Oxford.</p><p>-RealityDistortionRatio: events in the mini-narrative that negate a fact from the knowledge base are considered reality distortions. This metric provides the reality distortion count/ExplicitFact ratio.</p><p>-FictionalAdditionsRatio: any event in the mini-narrative that is missing from the knowledge base is considered a fictional addition. This metric provides the fictional addition count/ExplicitFact ratio.</p><p>-FictionalRatio: (reality distortion count + fictional addition count)/ExplicitFact.</p><p>-ResolutionTriggerRatio: resolution events solve conflicts from the mini-narrative. This metric provides the resolution event count/ExplicitFact ratio.</p><p>-MainCharacterEventsRatio: protagonist statements are statements in which the protagonist plays any role. This metric provides the protagonist statement count/ExplicitFact ratio.</p></div>
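The purely structural metrics in the list above can be sketched as follows. This is an illustrative Python sketch only: the Statement and MiniNarrative classes and the example what-if are simplified assumptions, not the What-If Machine's actual data structures, and the search-engine and n-gram metrics are omitted.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified mini-narrative representation; the actual
# structures are richer (schemas, causal/inference links, etc.).
@dataclass
class Statement:
    predicate: str    # e.g. "learnTo", "capableOf", "isA"
    characters: list  # characters taking part in the statement
    negated: bool = False
    valence: int = 0  # +1 positive fact, -1 negative, 0 neutral

@dataclass
class MiniNarrative:
    points: list = field(default_factory=list)  # each point: list of Statement

    def statements(self):
        return [s for point in self.points for s in point]

def metrics(mn: MiniNarrative) -> dict:
    stmts = mn.statements()
    n = len(stmts) or 1  # guard against empty mini-narratives
    chars = {c for s in stmts for c in s.characters}
    return {
        "Length": len(mn.points),
        "ExplicitFact": len(stmts),
        "RatioCharacters": len(chars) / n,
        "Evolution": sum(s.predicate == "learnTo" for s in stmts),
        "Handicap": sum(s.predicate == "capableOf" and s.negated for s in stmts),
        "Valence": sum(s.valence for s in stmts),
        "ValenceAverage": sum(s.valence for s in stmts) / n,
    }

# "What if there was a dog who learned to play the piano?"
dog = MiniNarrative(points=[
    [Statement("isA", ["dog"])],
    [Statement("learnTo", ["dog"], valence=1)],
])
print(metrics(dog))
```

The ratio-style metrics all divide by ExplicitFact, so any per-statement feature can be added to the dictionary in the same way.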
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Methodology</head><p>A set of 890 what-ifs was generated by the What-If Machine. All of their source mini-narratives were processed by the metric generation system. A total of 15 different questionnaires were created, each including 10 what-ifs rendered as text from the original set of 890, so 150 what-ifs were included in the evaluation set. 101 volunteers received, via email, a link that randomly redirected to one of the 15 possible questionnaires. Given the simplicity of the questions, Google Forms was our platform of choice. The platform was robust and stable, and all of the answers were automatically stored in a Google Sheet document. There was no active supervision of subjects, given the remote nature of the study and the limitations of the Google Forms platform.</p></div>
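The sampling and assignment scheme described above can be sketched in a few lines of Python. This is a hypothetical reconstruction (the identifiers and the seed are invented for illustration), showing only the study's structure: 150 of the 890 what-ifs split into 15 questionnaires of 10, with each volunteer's link redirecting to a uniformly random questionnaire.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical identifiers for the 890 generated what-ifs.
all_what_ifs = [f"whatif-{i:03d}" for i in range(890)]

# Sample 150 what-ifs for evaluation and split them into
# 15 questionnaires of 10 items each, mirroring the study design.
evaluation_set = random.sample(all_what_ifs, 150)
questionnaires = [evaluation_set[i:i + 10] for i in range(0, 150, 10)]

def assign(volunteer_id: str) -> int:
    # Each volunteer's link redirects to a uniformly random questionnaire;
    # volunteer_id is only kept for logging purposes in this sketch.
    return random.randrange(len(questionnaires))

chosen = assign("volunteer-001")
print(f"volunteer-001 -> questionnaire {chosen}")
```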
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Questionnaire</head><p>The questionnaire informed subjects about their participation in a study related to computer-generated content (Figure <ref type="figure" target="#fig_0">1</ref>). Some demographic information was queried (age, gender and English level), and subjects were then asked to evaluate the overall quality (on a 0-5 Likert scale) of each what-if, plus its narrative potential (a yes/no binary answer). A text box accepting any comment was also provided in order to gather additional qualitative information.</p><p>You are about to evaluate some of the preliminary results of the "WHIM: The What-If Machine" research project from the European Union. The overall objective of the What-If Machine is to automatically generate fictional ideas with cultural value. You will be presented a number of what-if style ideas and we kindly ask you to rate them according to the following features:</p><p>-Overall quality: from 0 (no quality) to 5 (superb quality). -Narrative potential (yes/no). -Any observation you can provide.</p><p>Completing the questionnaire should not take more than 10 minutes. We really appreciate your contribution to the project. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Results</head><p>101 subjects participated in the study. Statistical analysis of the results revealed no significant differences between evaluators in terms of English level, age or gender. For instance, the quality (Q) statistics by gender yielded µ(Q) = 2.66 and σ(Q) = 0.75 for males, and µ(Q) = 2.69 and σ(Q) = 0.89 for females. The corresponding results for English level and age are comparable.</p><p>Questionnaires provided 1,007 Quality and 1,004 Narrative Potential ratings for the 150 What-Ifs used. Each What-If was rated between 1 and 27 times. For the Narrative Potential (P) measurements, we mapped "Yes" to +1, "Not sure" to 0, and "No" to -1. Overall measures resulted in µ(Q) = 2.4 and σ(Q) = 1.3 for Quality, and µ(P) = -0.05 and σ(P) = 0.89 for Narrative Potential. The aggregated rating values of individual What-Ifs were used for calculating:</p><p>-Pairwise correlations between perceived Quality and perceived Narrative Potential, between perceived Quality or perceived Narrative Potential and the metrics, and between individual metrics. -A global measure of attribute importance for these metrics in predictive modeling of the average perceived Quality or perceived Narrative Potential.</p><p>Pairwise correlations. Metrics that provided the same values for all What-Ifs in the dataset were discarded. Correlation coefficients were calculated using the Pearson product-moment method. There is a strong positive correlation between Quality and Narrative Potential averages (0.83) and medians (0.758). As seen in Table <ref type="table" target="#tab_1">1</ref>, both measures correlate positively with some metrics, such as MainCharacterEventsRatio and RatioCharacters, and negatively with others, such as ExplicitFact and Length. 
Importance for predictive modeling. In order to determine the importance of each metric in predicting perceived Quality and Narrative Potential, we used the Relief measure <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b25">26]</ref>, a method commonly used for feature selection in machine learning. This measure does not assume independence among the metrics, but takes their possible interdependence into account. The more positive a Relief score is, the more the corresponding metric contributes to the prediction of the target value (in our case, the average Quality or the average Narrative Potential). Metrics that score close to zero are irrelevant, and those with negative scores even have a negative impact.</p><p>According to the results in Table <ref type="table" target="#tab_2">2</ref>, it seems that most of the metrics are of no use in predictive models of average Quality. For the average Narrative Potential, however, most of the metrics seem to be slightly informative. According to the Relief ranks of the metrics, the usefulness of the metrics for average Quality is to some extent inversely proportional to their usefulness for average Narrative Potential. The absolute values of the Relief scores depend on the characteristics of the data and on the parameters of the assessment, which makes it difficult to use absolute thresholds for judging the relevance of features. However, the strong correlation between the Quality and Narrative Potential values, together with the mismatch of the Relief scores of the metrics for these two targets, indicates that even the contributions of the positively scored metrics are likely too low to be considered relevant. The results presented above evidence a strong correlation between the narrative potential and the perceived overall quality of a what-if, which indicates that focusing on narrative plausibility as one of the main factors of quality can lead to better results. 
Moreover, some of the metrics are weakly correlated with narrative potential. However, these results are still inconclusive, and there are a number of aspects worth mentioning for their influence on the results. Automatically generating stories and computing useful values for the metrics is heavily dependent on the available knowledge. The outcome of the system is constrained by the use of ConceptNet: the number of relations that can be safely used in ConceptNet is small, and the richness and depth of the chains of properties is limited with regard to its use as a source for narrative processing. This makes it necessary to address knowledge management from a different perspective. The WHIM project currently includes a whole module for providing robust knowledge to the rest of the modules, and the impact of this subsystem on the creation and evaluation of what-if ideas will be reported once the results are ready.</p><p>The generation process (for the what-ifs, the stories and the metrics) strongly influences the overall outcome. Many design decisions were taken in order to provide a working, implemented prototype able to generate actual what-ifs, and these decisions determine the kind of what-ifs generated, the complexity of the stories and many other aspects. The reported results are therefore the outcome of a specific implementation and do not claim any generality. However, the approach itself (namely the generation, metric computation and evaluation process) is presented as a generally applicable method for producing novel what-if ideas.</p><p>The metrics used for labeling narrative properties do not cover all computable features. There is a large number of aspects that can be extracted from a what-if, and the narrative-based feature extraction module of the What-If Machine does not currently provide coverage for all of them. This is considered not strictly relevant to the methodology and scope of the study. 
To test the second hypothesis (the existence of a correlation between a certain set of metrics and the overall quality and plausibility), the metrics must be improved. For that purpose, the presented study gives valuable insight into which direction to take next.</p><p>The weak correlation between our metrics and the quality perceived by humans suggested that more sophisticated metrics were necessary. Several were considered:</p><p>1. Humanization: an approximation of how human-like the main character is, assuming that fictional scenarios use characters that, while behaving like humans, can be non-human. 2. Empathy: how much empathy a reader will feel toward the characters. 3. Tragedy: the amount of tragedy in the story. 4. Reality: how real and current the context is; an approximation of fictionality in terms of context.</p></div>
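The pairwise-correlation step reported above can be illustrated with a minimal Pearson product-moment sketch. The per-What-If aggregate values below are hypothetical placeholders, not the study's data (which yielded 0.83 between the Quality and Narrative Potential averages): Quality is averaged over 0-5 ratings, and Narrative Potential is averaged after mapping Yes/Not sure/No to +1/0/-1.

```python
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    # Pearson product-moment correlation coefficient, as used in the
    # study to compare aggregated Quality and Narrative Potential values.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-What-If aggregates (five What-Ifs).
avg_quality = [3.2, 1.8, 2.5, 4.0, 0.9]      # mean 0-5 Quality ratings
avg_potential = [0.6, -0.4, 0.1, 0.8, -0.7]  # mean mapped Potential answers

r = pearson(avg_quality, avg_potential)
print(f"r = {r:.3f}")
```

The same function applied between each metric column and either aggregate reproduces the correlations of Table 1.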
<div xmlns="http://www.tei-c.org/ns/1.0"><p>5. TimeSpan: the time span the story covers; it could be minutes, days or years.</p><p>Modelling and implementing these metrics proved to be beyond current technical capabilities, because they require complex, rich knowledge bases (1, 4), reliable text understanding systems (5), sophisticated emotional models (2) or formal versions of narratological models (3). None of these resources are currently available.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>This paper has presented a pilot study aimed at gaining insight into two hypotheses, namely that (1) human evaluation of the overall quality of what-if ideas correlates with the perception of narrative potential, and that (2) there is a set of computable metrics that also correlate with this perception. The study has evidenced a strong correlation between quality and narrative potential for humans (hypothesis 1), but failed to prove such a strong correlation between the current metrics and the human ratings. These results have been analysed and discussed in terms of the limited potential of the current implementation of both the fictional ideation procedure and the method employed to evaluate it. The current implementations lack the complexity required to approximate human evaluations with an acceptable level of accuracy, mainly due to the limited technical capabilities of current computational solutions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Information presented to the user in the evaluation questionnaire.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>The overall quality is defined in terms of the analysed responses from humans (i.e. no specific model beyond what humans say about quality is assumed), and the narrative potential is assumed to be directly proportional to the amount and quality of the stories a certain what-if can trigger or inspire. 2. There is a set of computable metrics whose values correlate (directly or indirectly) with the overall quality and the narrative potential. The What-If Machine is, to the best of our knowledge, the only attempt to implement a computer system able to produce novel what-if ideas. It is a distributed system in which several modules collaborate to output rendered what-ifs. Five modules compose the system: 1. The ideation module produces, using a knowledge base, what-if ideas formalized as mini-narratives. 2. The mini-narratives are fed into the narrative-based metric generation module, which generates values for a set of metrics that hypothetically correlate with the human perception of quality. These metrics are based on narrative properties of the what-ifs. 3. The mini-narratives, now enriched with their corresponding metrics, are sent to a crowd-sourcing evaluation module, which applies machine learning to create and refine models for predicting overall quality against human ratings. 4. The world-view creation module provides knowledge for what-if generation, story creation and metric computation. 5. The finished, filtered what-ifs are finally passed to a rendering module, which creates artifacts from the final what-ifs (stories, texts or images, for instance).</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>The correlation coefficient between average/median Quality (Q) or Narrative Potential (P) labels and the metrics. The values are sorted by correlation coefficient values of the average Quality.</figDesc><table><row><cell></cell><cell cols="4">Avg Q Mdn Q Avg P Mdn P</cell></row><row><cell cols="5">MainCharEventsRatio 0.371 0.346 0.379 0.329</cell></row><row><cell>RatioCharacters</cell><cell cols="4">0.354 0.296 0.368 0.307</cell></row><row><cell cols="5">ResolutionTriggerRatio 0.342 0.303 0.305 0.261</cell></row><row><cell cols="5">TotalStoriesGenerated 0.312 0.250 0.321 0.264</cell></row><row><cell>JointWordsProbMin</cell><cell cols="4">0.308 0.289 0.367 0.314</cell></row><row><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell></row><row><cell>ValenceAverage</cell><cell cols="4">-0.219 -0.188 -0.296 -0.249</cell></row><row><cell>ValenceSum</cell><cell cols="4">-0.258 -0.234 -0.323 -0.276</cell></row><row><cell>StoryCharacters</cell><cell cols="4">-0.283 -0.269 -0.327 -0.285</cell></row><row><cell>ExplicitFact</cell><cell cols="4">-0.379 -0.336 -0.406 -0.345</cell></row><row><cell>Length</cell><cell cols="4">-0.379 -0.336 -0.406 -0.345</cell></row></table></figure>
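Table 1 reports correlation coefficients between each metric and the averaged or median human labels. The text does not name the estimator; assuming plain Pearson's r, a minimal sketch of the computation is:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between a metric vector and a label vector."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # centre both vectors
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

Values near ±0.3, as in the table, indicate only a weak linear relationship, which is consistent with the paper's conclusion that the metrics do not predict human ratings well.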
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Relief measure results for average Quality (Relief Avg Q) and average Narrative Potential (Relief Avg P ). Rows sorted by Relief Avg Q. The best three results are in bold and the worst three are in italics.</figDesc><table><row><cell>Metric</cell><cell cols="2">Relief Avg Q Relief Avg P</cell></row><row><cell>Handicap</cell><cell>0.027</cell><cell>-0.009</cell></row><row><cell>MainCharacterEventsRatio</cell><cell>0.007</cell><cell>0.004</cell></row><row><cell>NamesRatio</cell><cell>0.001</cell><cell>0.006</cell></row><row><cell>DivergenceMinimum</cell><cell>0.000</cell><cell>0.000</cell></row><row><cell>JointWordsProbabilityMinimum</cell><cell>0.000</cell><cell>0.000</cell></row><row><cell>Divergence</cell><cell>0.000</cell><cell>0.000</cell></row><row><cell>Originality</cell><cell>-0.006</cell><cell>0.013</cell></row><row><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell></row><row><cell>FictionalAdditionsRatio</cell><cell>-0.075</cell><cell>0.028</cell></row><row><cell>InterestingLife</cell><cell>-0.116</cell><cell>0.045</cell></row><row><cell>TotalStoriesGenerated</cell><cell>-0.116</cell><cell>0.045</cell></row><row><cell>OriginalityAccurate</cell><cell>-0.126</cell><cell>0.024</cell></row><row><cell>FictionalRatio</cell><cell>-0.142</cell><cell>0.039</cell></row><row><cell>RatioCharacters</cell><cell>-0.142</cell><cell>0.039</cell></row><row><cell>SettingQuality</cell><cell>-0.147</cell><cell>0.024</cell></row><row><cell>Names</cell><cell>-0.147</cell><cell>0.024</cell></row><row><cell>ValenceSum</cell><cell>-0.174</cell><cell>0.033</cell></row></table></figure>
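Table 2 reports Relief feature-quality estimates. The study uses the regression adaptation of Relief (25); as a simplified illustration of the underlying idea, the basic Relief algorithm (9) for binary labels rewards a feature that differs towards the nearest miss (different label) and penalises it for differing towards the nearest hit (same label). A minimal sketch, not the authors' implementation:

```python
import numpy as np

def relief_weights(X, y, n_samples=None, rng=None):
    """Basic Relief (binary labels): estimate a relevance weight per feature."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, m = X.shape
    if n_samples is None:
        n_samples = n
    # Normalise each feature to [0, 1] so distances are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span
    w = np.zeros(m)
    for i in rng.choice(n, size=n_samples, replace=False):
        d = np.abs(Xn - Xn[i]).sum(axis=1)   # L1 distance to every instance
        d[i] = np.inf                        # exclude the instance itself
        hit = np.where(y == y[i], d, np.inf).argmin()   # nearest same-label
        miss = np.where(y != y[i], d, np.inf).argmin()  # nearest other-label
        w += np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])
    return w / n_samples
```

Weights near zero (or negative), as for most metrics in Table 2, mean the feature separates instances with similar labels about as often as it separates instances with different labels.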
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The What-if Machine: http://www.whim-project.eu/.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.projectoxford.ai/</note>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Supported by the projects WHIM (611560) and PROSECCO (600653), funded by the European Commission under Framework Programme 7, the ICT theme, and the Future and Emerging Technologies (FET) programme.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Computational Models of Creativity</title>
		<author>
			<persName><forename type="first">M</forename><surname>Boden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of Creativity</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="351" to="373" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Creative Mind: Myths and Mechanisms</title>
		<author>
			<persName><forename type="first">M</forename><surname>Boden</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<publisher>Routledge</publisher>
			<pubPlace>New York, NY</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Creativity Versus the Perception of Creativity in Computational Systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Spring Symposium on Creative Systems</title>
				<meeting>the AAAI Spring Symposium on Creative Systems</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="14" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The painting fool: Stories from building an automated painter</title>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and Creativity</title>
		<imprint>
			<biblScope unit="page" from="3" to="38" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">The effect of input knowledge on creativity</title>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pease</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ritchie</surname></persName>
		</author>
		<ptr target="http://www.inf.ed.ac.uk/publications/online/0055.pdf" />
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Reports of the Navy Center for</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Computational creativity: The final frontier?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wiggins</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note>ECAI 2012</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Linguistic creativity at different levels of decision in sentence production</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gervás</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AISB 02 Symposium on AI and Creativity in Arts and Science</title>
		<title level="s">Imperial College</title>
		<meeting>the AISB 02 Symposium on AI and Creativity in Arts and Science</meeting>
		<imprint>
			<date type="published" when="2002">3rd-5th April 2002</date>
			<biblScope unit="page" from="79" to="88" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Investigating artificial creativity by generating melodies, using connectionist knowledge representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Haenen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rauchas</surname></persName>
		</author>
		<ptr target="http://ccg.doc.gold.ac.uk/events/ecai06/proceedings/Haenen.pdf" />
	</analytic>
	<monogr>
		<title level="m">The Third Joint Workshop on Computational Creativity</title>
				<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What it is to be Creative</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jordanous</surname></persName>
		</author>
		<ptr target="http://dblp.uni-trier.de/db/journals/cogcom/cogcom4.html#Jordanous12" />
	</analytic>
	<monogr>
		<title level="j">Cognitive Computation</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="246" to="279" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A practical approach to feature selection</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rendell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ninth international workshop on Machine learning</title>
				<meeting>the ninth international workshop on Machine learning</meeting>
		<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="249" to="256" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The Role of Evaluation-Driven rejection in the Successful Exploration of a Conceptual Space of Stories</title>
		<author>
			<persName><forename type="first">C</forename><surname>León</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gervás</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Minds and Machines</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="615" to="634" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Automated Fictional Ideation via Knowledge Base Manipulation</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Llano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hepworth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gow</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cognitive Computation</title>
		<imprint>
			<biblScope unit="page" from="1" to="22" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Towards the automatic generation of fictional ideas for games</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Llano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cook</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guckelsberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Experimental AI in</title>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Automating fictional ideation using ConceptNet</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Llano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hepworth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings</title>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Beyond interactive evolution: Expressing intentions through fitness functions</title>
		<author>
			<persName><forename type="first">P</forename><surname>Machado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Martins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Amaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abreu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Leonardo</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The stanford corenlp natural language processing toolkit</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Surdeanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Finkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mcclosky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL (System Demonstrations)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="55" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Sampling Strategies for Bag-of-Features Image Classification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Nowak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Jurie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Triggs</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Springer</publisher>
			<biblScope unit="page" from="490" to="503" />
			<pubPlace>Berlin Heidelberg</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">On impact and evaluation in computational creativity: A discussion of the Turing test and an alternative proposal</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pease</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computing and Philosophy</title>
		<imprint>
			<biblScope unit="page" from="15" to="22" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>AISB</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Evaluating machine creativity</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pease</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Winterstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Colton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Creative Systems</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page">4</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Evaluation of Automatic Generation of Basic Stories</title>
		<author>
			<persName><forename type="first">F</forename><surname>Peinado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gervás</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">New Generation Computing, Special Issue: Computational Creativity</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="289" to="302" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A Multiagent Text Generator with Simple Rhetorical Habilities</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hervás</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gervás</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cardoso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the AAAI-06 Workshop on Computational Aesthetics: AI Approaches to Beauty and Happiness</title>
				<meeting>of the AAAI-06 Workshop on Computational Aesthetics: AI Approaches to Beauty and Happiness</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2006-07">July 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A system for evaluating novelty in computer generated narratives</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Y</forename><surname>Pérez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ortiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Negrete</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Creativity</title>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">MEXICA: A Computer Model of Creativity in Writing</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pérez Y Pérez</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>The University of Sussex</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Assessing creativity</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ritchie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AISB Symposium on AI and Creativity in Arts and Science</title>
				<meeting>the AISB Symposium on AI and Creativity in Arts and Science<address><addrLine>York, UK</addrLine></address></meeting>
		<imprint>
			<biblScope unit="page" from="3" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Some Empirical Criteria for Attributing Creativity to a Computer Program</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ritchie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Minds &amp; Machines</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="67" to="99" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">An adaptation of relief for attribute estimation in regression</title>
		<author>
			<persName><forename type="first">M</forename><surname>Robnik-Šikonja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kononenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning: Proceedings of the Fourteenth International Conference (ICML 1997)</title>
				<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="296" to="304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Validating a Plan-Based Model of Narrative Conflict</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Ware</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Young</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on the Foundations of Digital Games</title>
				<meeting>the International Conference on the Foundations of Digital Games<address><addrLine>New York, New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM Press</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="220" to="227" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Four Quantitative Metrics Describing Narrative Conflict</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Ware</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Harrison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Roberts</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Springer</publisher>
			<biblScope unit="page" from="18" to="29" />
			<pubPlace>Berlin Heidelberg</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">A preliminary framework for description, analysis and comparison of creative systems</title>
		<author>
			<persName><forename type="first">G</forename><surname>Wiggins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">7</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Searching for Computational Creativity</title>
		<author>
			<persName><forename type="first">G</forename><surname>Wiggins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">New Generation Computing, Computational Paradigms and Computational Intelligence</title>
				<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="209" to="222" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
