<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Empirical Evidence of the Limits of Automatic Assessment of Fictional Ideation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Knowledge Technologies, Jozef Stefan Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Facultad de Informatica, Universidad Complutense de Madrid</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic evaluation of fictional ideation systems and their output is a topic relevant to Computational Creativity. Models and techniques have been proposed for this task, but their applicability to the field of fictional ideation is limited. In this paper we describe an evaluation procedure for fictional ideation, which compares human validation of the ideas with a number of automatically generated metrics obtained from them. We report on the observed limits of this procedure. The results suggest that, besides technical limitations, any stable evaluation method remains fundamentally incomplete unless the full creative phenomenon is modelled, including aspects that are beyond current technical capabilities.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic evaluation</kwd>
        <kwd>ideation</kwd>
        <kwd>empirical study</kwd>
        <kwd>narrative</kwd>
        <kwd>computational creativity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Evaluation of creative processes and artefacts is key to computational creativity.</p>
      <p>
        Explicitly reflecting on the relative value and novelty is crucial if machines are
to produce content that would be deemed creative [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As such, addressing
evaluation is fundamental for computational creativity that can successfully fulfil
human needs.
      </p>
      <p>This crucial aspect contrasts with the relative scarcity of systems explicitly
generating rich evaluation of their own generated material or inner processes.</p>
      <p>
        Some systems arguably control the quality of their artifacts by carrying out a
process that ensures a minimum relative quality, but an explicit evaluation
arguably represents a qualitative advantage, both theoretical (as studied by
computational creativity frameworks [29]) and practical ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>Although the semantics of creativity are elusive and usually problematic, the
view that quality and novelty influence the perception of the creativity of an
artifact (at least from the point of view of observation) is commonly accepted.</p>
      <p>
        Still, quality and novelty vary depending on the domain and context. Theoretical
discussion on this exists and it is seminal in the field [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], while other works
attempt to offer either formal or procedural techniques for evaluating creativity
[
        <xref ref-type="bibr" rid="ref18 ref25">18, 25, 30</xref>
        ]. These efforts address the evaluation of creativity in generic terms,
and they are of limited applicability for the evaluation of the quality of specific
artifacts generated automatically. It may be that no global definition of
creativity applicable to every creative domain is possible, but more empirical
evidence is still needed to establish whether this is so.
      </p>
      <p>Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>
        Moreover, even when working within a domain in which there is an agreed
definition of characteristics assumed to play a role in creativity (let us say quality),
addressing explicit automatic evaluation can be a costly task, even more costly
than creating the generative system that is being evaluated. It is not
uncommon that generating appropriate artefacts is feasible, while yielding
an explicit, measurable evaluation is not (for instance, in images generated by
evolutionary computing [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]).
      </p>
      <p>This paper reports on an empirical study in which the output of an automatic
ideation system is assessed by computational means. When compared to human
evaluation, the conceptual and practical limits of the approach became evident.</p>
      <p>This led to an in-depth analysis of the challenges, which is provided in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Previous Work</title>
      <p>While all scientific exploration requires thorough evaluation of the steps taken,
doing so in creativity represents a challenge. How to assess creativity itself is
a commonly discussed aspect of the whole phenomenon of creative generation.
While most authors agree on the correlation between a number of features and
the perception of creativity, there is no consensus either on what these features
are or how they really correlate. Moreover, adding computers to the problem
makes it even more difficult to know whether a system has been successful or
not. There is still a debate on what parts should be evaluated, the influence
of the programmer on the output, the very definition of creative behavior, the
decision of whether to focus on the process or the artifacts (or both), and many
others.</p>
      <p>The few examples present in the literature describing actual evaluation of
automatic creative systems usually focus on less ambitious, more measurable
aspects. This makes these systems less useful from a general perspective, but
they nonetheless provide insight on the current capabilities of computer systems
to assess their own production.</p>
      <p>
        There are, however, a number of proposals that try to provide guidelines to
evaluate creative systems. For instance, Ritchie [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ] addresses the issue of
evaluating when a program can be considered creative by outlining a set of
empirical criteria to measure the creativity of the program in terms of its output.
He makes it very clear that he is restricting his analysis to the questions of what
factors are to be observed, and how these might relate to creativity, specifically
stating that he does not intend to build a model of creativity. Ritchie's criteria
are defined in terms of two observable properties of the results produced by
the program: novelty (to what extent is the produced item dissimilar to existing
examples of that genre) and quality (to what extent is the produced item a
high-quality example of that genre). To measure these aspects, two rating schemes
are introduced, which rate the typicality of a given item (item is typical) and its
quality (item is good). Another important issue that affects the assessment of
creativity in creative programs is the concept of inspiring set, the set of (usually
highly valued) artifacts that the programmer is guided by when designing a
creative program. Ritchie's criteria are phrased in terms of: what proportion
of the results rates well according to each rating scheme, ratios between various
subsets of the result (defined in terms of their ratings), and whether the elements
in these sets were already present or not in the inspiring set. Ritchie's criteria
have been used in subsequent evaluations of creative systems output [
        <xref ref-type="bibr" rid="ref21 ref7 ref8">7, 21, 8</xref>
        ].
      </p>
      <p>
        Pease et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] discuss factors relevant to evaluating systems in terms of
creativity. The proposed framework mainly takes into account the input provided,
the output produced and the process employed. Each of these categories is described
in depth, together with its required measures. Before describing the measurement
methods, Pease et al. state their assumptions regarding creativity, admitting
their 'somewhat arbitrary' nature. The proposed evaluation tests deal with two
main aspects: how closely the test predicts human evaluation of creativity, and
how possible and practical it is to apply the test to a system. Overall, this work
suggests that the very definition of creativity is subjective and that evaluating
systems in a general way is problematic.
      </p>
      <p>
        Colton et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] propose an extension of Ritchie's criteria [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] that attempts
to determine the impact of the input data on the creative artifact produced by a
system. This more agnostic approach attempts to obtain an objective measure by
comparing the output of the system to the inspirational material used as input.
This investigation attempts to discriminate systems that overfit or shuffle input
data (fine-tuning) instead of producing genuinely novel artifacts. Among other
conclusions, the authors state that comparing creative systems might not be
viable, and suggest that their criteria be used as guidelines for program construction
rather than post-hoc evaluation.
      </p>
      <p>
        The creative tripod framework, proposed by Colton [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], is built around the
premise that a creative system must demonstrate skill, imagination and
appreciation. These qualities are not required to be possessed by the system, but rather
to be perceived as possessed by the system. This is an important remark by
Colton to avoid debates around the definition of creativity. The framework also
includes the programmer, the system and the consumer; however, Colton is only
interested in the program's behavior.
      </p>
      <p>
        Pease and Colton [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] propose an alternative to the Turing Test to assess
the creativity of computational systems: the FACE (Frame, Aesthetic, Concept,
Expression of concept) and IDEA (Iterative Development Execution Appreciation)
models. The models include creative acts and audiences, with relevant measures
such as popularity, appeal, provocation, opinion, subversion and shock. Putting
the focus on the reaction produced by the creative artifact, this approach attempts
to avoid the shortcomings of the Turing Test by going further than merely
assessing the capacity of a creative system to imitate human behavior. By including
the audience in the model, this approach acknowledges the highly subjective
nature of creativity evaluation.
      </p>
      <p>
        SPECS [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], introduced by Jordanous as “a standardised and systematic
methodology for evaluating computational creativity”, represents a substantial
effort to provide a standard for evaluating the creativity of a system in the field
of computational creativity and to address the multi-faceted and subjective nature
of creativity. Its flexible nature allows SPECS to adapt to the demands and
standards of the researchers' field. The
methodology informs researchers of their system's strengths and weaknesses, providing
useful feedback for achieving creative results.
      </p>
      <sec id="sec-2-1">
        <title>Evaluation of Automatically Generated Narrative</title>
        <p>Automatic generation of narratives has been a long-standing goal of Artificial
Intelligence since its very beginning. There are a number of systems described in
the literature, but evaluation of these systems (be it of their output, their creative
process or any other aspect) is seldom found. This is most likely due
to the fact that the average quality or variety of the generated stories is not
really comparable to that of stories written by most humans, not necessarily professional
writers.</p>
        <p>
          The Mexica system [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] includes procedures for the dynamic assessment of
the novelty of a story in progress with respect to previously known stories.
Novelty is considered in terms of how the stories differ in the actions
they include and their frequency of appearance.
        </p>
        <p>
          In Perez et al [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] three different characteristics are considered as relevant
for measuring story novelty: sequence of actions, structure of the story, and use
of characters and actions.
        </p>
        <p>
          Peinado &amp; Gervas [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] carried out an empirical study of how generated stories
were perceived by a set of human volunteer evaluators. Human judges blindly
compared one of the generated basic stories to two alternatives: one rendered
directly from a stored fabula of the knowledge base and another randomly
generated. Values were collected for: linguistic quality (how well is the text written),
coherence (how well is the sequence of events linked), interest (how interesting
is the topic of the story for the reader) and originality (how different is the story
from others).
        </p>
        <p>
          Leon &amp; Gervas [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] propose a model, intended as a tool to drive automatic
story generation, of how quality is evaluated in stories. In this
computational model, an evaluation function
receives a story and outputs a value as the rating for that story. This value
is computed from values assigned to three components: the accumulation of contributions
from individual events depending on the meaning of each event (aspects such as
whether the reader wants to continue reading the story, or how much danger or
love the reader perceives in the story); the appearance of patterns or relationships
between the events of a story (aspects such as causality, humour or relative
chronology); and inference, which captures the ability to interpret stories by
adding material to explain what they are told even if it is not explicitly present in
the story. The evaluation function has been implemented as a rule-based system.
        </p>
        <p>
          Ware, Young et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] propose a formal model for narrative conflict with
seven dimensions from various narratological sources meant to aid in
distinguishing one conflict from another: participant, subject, duration, balance, directness,
intensity and resolution. Their experimental results [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] suggest the model
predicts these seven dimensions of narrative conflict similarly to human criteria.
Their good results predicting human-perceived narrative conflict suggest a
similar approach may be viable for measures related to creativity.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluating Automatic Ideation</title>
      <p>Original ideation is central to any creative process. Coming up with innovative
ideas that potentially trigger the creation of new material is fundamental to
human creativity. It is not uncommon to focus creative processes on the
identification of a single, valuable idea that unlocks new paths leading to finished
artifacts. Although human creative teams usually rely on pure ideation to foster
creativity, until recently there have only been a few small, ad-hoc studies of how
to automate ideation. Section 3.1 describes an effort to provide a system
able to produce novel ideas.</p>
      <sec id="sec-3-1">
        <title>The What-If Machine</title>
        <p>
          Llano et al. have recently proposed an automatic ideation system [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">13, 14, 12</xref>
          ].
This computational system, the What-If Machine (http://www.whim-project.eu/), is
designed to produce relatively valuable and novel
ideas autonomously. It includes a module for
analysing the ideas and generating narrative metrics, and a module for
computing a predictive machine learning model. This model is trained against collected
human evaluations of what-ifs, and is intended to learn a robust function from
narrative metrics to perceived overall quality. Two main hypotheses guide the
design of the What-If Machine and the presented research:
1. There is a strong correlation between the perceived overall quality and the
perceived narrative potential, in the sense that if the audience perceives high
narrative potential, it will also perceive a high overall quality. The overall
quality is defined in terms of the analyzed response from humans (i.e. no
specific model beyond what humans say about quality is assumed), and the
narrative potential is assumed to be directly proportional to the number and
quality of the stories a certain what-if can trigger or inspire.
2. There is a set of computable metrics whose values correlate (directly or
indirectly) with the overall quality and the narrative potential.
        </p>
        <p>The What-If Machine is, to the best of our knowledge, the only attempt to
implement a computer system able to produce novel what-if ideas. It is a
distributed computer system in which several modules collaborate
in order to output rendered what-ifs. Five modules compose the system:
1. The ideation module produces, using a knowledge base, what-if ideas
formalized as mini-narratives.
2. The mini-narratives are fed into the narrative-based metric generation module,
which generates values for a set of metrics that hypothetically
correlate with human perception of quality. These metrics are based on narrative
properties of the what-ifs.
3. The mini-narratives, now enriched with their corresponding metrics, are sent to
a crowd-sourcing evaluation module, which applies machine learning to
create and refine models for predicting overall quality against human ratings.
4. The world view creation module provides knowledge for what-if generation, story
creation and metric computation.
5. The finished, filtered what-ifs are finally passed to a rendering module,
which creates artifacts from the final what-ifs (stories, texts or images, for
instance).</p>
        <p>A subset of the What-If Machine (modules 1, 2 and 3) was used to generate
the material for the study, which is described in detail in Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Study</title>
      <p>A pilot study was performed to determine the feasibility of predicting the
perceived quality and narrative potential of the artifacts created by a computational
creative system. Both magnitudes have been introduced in the previous section,
and in order to avoid influencing our subjects, no definition for them is provided
in the questionnaires (as seen in Fig. 1). This naive approach is a result of our
focus on the model and its capability to predict human assessment instead of
introducing our own views or definitions. The study was conducted to obtain
the human rating of perceived quality and narrative potential.</p>
      <p>Using both measures, a machine learning process searches for correlations
between a set of metrics (detailed in Section 4.1) and the perceived quality
and perceived narrative potential. This should allow us to determine which
measures are relevant for predicting human-perceived quality and narrative potential,
in order to produce what-ifs that present both qualities to human observers.</p>
      <sec id="sec-4-1">
        <title>Metrics</title>
        <p>
          Since we have no certainty about which metrics extracted from each what-if's
mini-narrative may have an impact on the perceived quality and narrative potential,
we focused on generating the maximum number of computable features (we refer to
these features as metrics). The
impact of these features on the perceived quality and narrative potential may be
obtained with machine learning techniques.
This approach is similar to the one used by Nowak for image classification [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
which generates a high number of arbitrary features from each image.
        </p>
        <p>A mini-narrative is a structure that contains a set of narrative points
linked to schemas like setting or resolution. Each narrative point is a set of
narrative statements that provide information about characters or events
through predicates (e.g. dog is old or dog learns to play a piano). Narrative
statements may be related to one another (caused by or inferred by another
statement).</p>
        <p>
          The following list includes the set of implemented features along with their
descriptions:
- Length: number of narrative points in the mini-narrative.
- SettingQuality: number of schemas divided by 3.
- ExplicitFact: number of narrative statements in the mini-narrative.
- RatioCharacters: the character/statement ratio.
- Originality: hits returned by the full text of the mini-narrative in the Bing
search engine.
- OriginalityAccurate: hits returned by the exact full text of the
mini-narrative in the Bing search engine.
- Divergence: average hits returned by the mini-narrative statements in the
Bing search engine.
- DivergenceMinimum: minimum hits returned by the mini-narrative
statements in the Bing search engine.
- Evolution: number of learnTo predicates found in the mini-narrative.
- Handicap: number of negated capableOf predicates found in the
mini-narrative.
- InterestingLife: number of negated doesFor predicates found in the
mini-narrative.
- TotalStoriesGenerated: number of stories generated by the story
generator from the current mini-narrative.
- StoryCharacters: average number of characters in the generated stories.
- Names: StanfordNLP [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] queries for the what-if's names.
- NamesRatio: Names/ExplicitFact ratio.
- Valence: sum over statements, each statement coded as +1 if the fact is
positive, -1 if negative and 0 otherwise.
- ValenceAverage: Valence/ExplicitFact ratio.
- JointWordsProbability: average joint probability for each set of words
using n-grams. For this metric we use the Project Oxford services
(https://www.projectoxford.ai/).
- JointWordsProbabilityMinimum: the minimum joint probability for the
set of words using n-grams from Project Oxford.
- RealityDistortionRatio: events in the mini-narrative that negate a fact
from the knowledge base are considered a reality distortion. This metric
provides the reality distortion amount/ExplicitFact ratio.
- FictionalAdditionsRatio: any event in the mini-narrative that is
missing from the knowledge base is considered a fictional addition. This metric
provides the fictional addition amount/ExplicitFact ratio.
- FictionalRatio: reality distortion amount plus fictional addition amount,
divided by ExplicitFact.
- ResolutionTriggerRatio: resolution events solve conflicts from the
mini-narrative. This metric provides the resolution event amount/ExplicitFact ratio.
- MainCharacterEventsRatio: protagonist statements are statements in
which this actor plays any role. This metric provides the protagonist
statement amount/ExplicitFact ratio.
        </p>
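        <p>As an illustration only, the ratio-style metrics in the list above can be sketched in a few lines of Python. The Statement record and its field names (polarity, distorts_reality, is_fictional_addition, involves_protagonist) are hypothetical and are not taken from the What-If Machine; FictionalRatio is read here as the sum of both counts divided by ExplicitFact, which is an assumption about the intended formula.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Statement:
    """Hypothetical narrative statement; all field names are assumptions."""
    text: str
    polarity: int = 0                    # +1 positive fact, -1 negative, 0 otherwise
    distorts_reality: bool = False       # negates a fact from the knowledge base
    is_fictional_addition: bool = False  # missing from the knowledge base
    involves_protagonist: bool = False   # the main character plays any role


def ratio_metrics(statements: List[Statement]) -> dict:
    """Compute the count- and ratio-based metrics for one mini-narrative."""
    explicit_fact = len(statements)
    if explicit_fact == 0:
        raise ValueError("a mini-narrative needs at least one statement")
    valence = sum(s.polarity for s in statements)
    distortions = sum(s.distorts_reality for s in statements)
    additions = sum(s.is_fictional_addition for s in statements)
    protagonist = sum(s.involves_protagonist for s in statements)
    return {
        "ExplicitFact": explicit_fact,
        "Valence": valence,
        "ValenceAverage": valence / explicit_fact,
        "RealityDistortionRatio": distortions / explicit_fact,
        "FictionalAdditionsRatio": additions / explicit_fact,
        # Assumed reading: (distortions + additions) / ExplicitFact.
        "FictionalRatio": (distortions + additions) / explicit_fact,
        "MainCharacterEventsRatio": protagonist / explicit_fact,
    }
```
]]></preformat>
        <p>Metrics that depend on external services (search-engine hits, n-gram probabilities, named-entity queries) are deliberately left out of this sketch.</p>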
      </sec>
      <sec id="sec-4-2">
        <title>Methodology</title>
        <p>A set of 890 what-ifs was generated by the What-If Machine. All of their source
mini-narratives were processed by the metric generation system. A total of 15
different questionnaires were created, each including 10 what-ifs rendered as
text from the original set of 890, so that 150 what-ifs were included in the evaluation
set. 101 volunteers received by email a link that randomly redirected to one of the 15
possible questionnaires. Given the simplicity of the questions,
Google Forms was our platform of choice. The platform was robust and stable
and all of the answers were automatically stored in a Google Sheets
document. There was no active supervision of the subjects, given the remote
nature and limitations of the Google Forms platform.</p>
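        <p>The allocation described above can be sketched as follows. The text does not state how the 150 what-ifs were selected or how the redirect was implemented, so uniform sampling without replacement and a uniformly random redirect are assumptions; the function names and the seed parameter are hypothetical.</p>
        <preformat><![CDATA[
```python
import random


def build_questionnaires(what_if_ids, n_forms=15, per_form=10, seed=0):
    """Sample n_forms * per_form distinct what-ifs and split them into forms."""
    rng = random.Random(seed)
    chosen = rng.sample(list(what_if_ids), n_forms * per_form)
    return [chosen[i * per_form:(i + 1) * per_form] for i in range(n_forms)]


def redirect(forms, rng=random):
    """Pick one questionnaire uniformly at random for an incoming volunteer."""
    return rng.choice(forms)


# 890 generated what-ifs, identified here simply by their index.
forms = build_questionnaires(range(890))
```
]]></preformat>
        <p>With a random redirect the number of answers per questionnaire is not controlled, so individual what-ifs can end up with very different numbers of ratings.</p>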
      </sec>
      <sec id="sec-4-3">
        <title>Questionnaire</title>
        <p>The questionnaire informed subjects about their participation in a study related
to computer-generated content (Figure 1). Some demographic information was
requested (age, gender and English level), and subjects were then asked to evaluate the
overall quality (on a 0-5 Likert scale) of each what-if plus its narrative potential
(yes/no binary answer). A text box accepting any comment was also provided
in order to gather additional qualitative information.</p>
        <p>You are about to evaluate some of the preliminary results of the “WHIM: The
What-If Machine” research project from the European Union. The overall
objective of the What-If Machine is to automatically generate fictional ideas with
cultural value. You will be presented a number of what-if style ideas and we kindly
ask you to rate them according to the following features:
- Overall quality: from 0 (no quality) to 5 (superb quality).
- Narrative potential (yes/no).
- Any observation you can provide.</p>
        <p>Completing the questionnaire should not take more than 10 minutes. We really
appreciate your contribution to the project.</p>
        <p>101 subjects participated in the study. Statistical analysis of the results
revealed no significant differences between evaluators in terms of English level,
age or gender. For instance, the quality (Q) for gender yielded μ(Q)male = 2.66,
σ(Q)male = 0.75; μ(Q)female = 2.69, σ(Q)female = 0.89. The corresponding
results for English level and age are comparable.</p>
        <p>
          Questionnaires provided 1,007 Quality and 1,004 Narrative Potential
ratings for the 150 what-ifs used. Each what-if was rated between 1 and 27 times. For
the Narrative Potential (P) measurements, we mapped “Yes” to +1, “Not sure”
to 0, and “No” to -1. Overall measures resulted in μ(Q) = 2.4 and σ(Q) = 1.3 for
Quality and μ(P) = 0.05 and σ(P) = 0.89 for Narrative Potential. The aggregated
rating values of individual what-ifs were used for calculating:
- Pairwise correlations between perceived Quality and perceived Narrative
Potential, between perceived Quality or perceived Narrative Potential and the metrics,
and between individual metrics.
- A global measure of attribute importance for these metrics in predictive
modeling of the average perceived Quality or perceived Narrative Potential.
        </p>
        <p>Pairwise correlations. Metrics that provided the same value for all what-ifs
in the dataset were discarded. Correlation coefficients were calculated with the
Pearson product-moment correlation. There is a strong positive correlation between Quality
and Narrative Potential averages (0.83) and medians (0.758). As seen in Table
1, both measures correlate positively with some metrics, such as
MainCharacterEventsRatio and RatioCharacters, and correlate negatively with others, such
as ExplicitFact and Length.</p>
        <p>
          Importance for predictive modeling. In order to determine the importance of
each metric in predicting perceived Quality and Narrative Potential we used the
Relief measure [
          <xref ref-type="bibr" rid="ref10 ref26">10, 26</xref>
          ], which is a method commonly used for feature selection
in machine learning. This measure does not assume independence among the
metrics, but takes their possible interdependence into account. The more positive the
Relief score of a metric, the more that metric contributes to the prediction of a target
value (in our case, the average Quality or the average Narrative Potential).
Metrics that score close to zero or below are irrelevant, and those with negative
scores even have a negative impact.
        </p>
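        <p>The aggregation and correlation steps described above can be sketched in plain Python as follows. The layout of the ratings dictionary is a hypothetical choice made for illustration; only the answer-to-score mapping and the use of the Pearson product-moment coefficient come from the text.</p>
        <preformat><![CDATA[
```python
from math import sqrt
from statistics import mean

# Mapping of Narrative Potential answers to numeric scores, as in the text.
POTENTIAL_SCORE = {"Yes": 1, "Not sure": 0, "No": -1}


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant sample (e.g. a metric with one value for all what-ifs)
        # has no defined correlation, hence such metrics are discarded.
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def aggregate(ratings):
    """ratings: {what_if_id: [(quality, potential_answer), ...]} (assumed layout).

    Returns two parallel lists with the per-what-if average Quality and
    average Narrative Potential.
    """
    quality = [mean(q for q, _ in rs) for rs in ratings.values()]
    potential = [mean(POTENTIAL_SCORE[p] for _, p in rs) for rs in ratings.values()]
    return quality, potential
```
]]></preformat>
        <p>Calling pearson on the two lists returned by aggregate gives the Quality versus Narrative Potential coefficient; the same function applies to any pair of metric and target columns.</p>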
        <p>According to the results in Table 2, most of the metrics seem to be of no
use in predictive models of average Quality. For the average Narrative Potential,
however, most of the metrics seem to be slightly informative. According to the Relief
ranks of the metrics, the usefulness of the metrics for average Quality is to
some extent inversely proportional to their usefulness for the average Narrative
Potential. The absolute values of the Relief scores depend on the characteristics
of the data and the parameters of the assessment, which makes it difficult to use
absolute thresholds for judgements on the relevance of features. However, the strong
correlation between the Quality and Narrative Potential values, combined with the mismatch
of the Relief scores of the metrics for these two targets, indicates that
the contributions of even the positively scored metrics are likely to be too low to
be considered relevant.</p>
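        <p>To make the sign convention of the Relief scores concrete, the following is a minimal sketch of the classic two-class Relief. It is not necessarily the variant used in the study: the targets there are continuous averages, whereas this sketch assumes a binarised target, Manhattan distance, and a pass over every instance instead of a random sample.</p>
        <preformat><![CDATA[
```python
def relief(features, labels):
    """Classic two-class Relief over all instances; one weight per feature."""
    n, d = len(features), len(features[0])
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    lo = [min(row[j] for row in features) for j in range(d)]
    hi = [max(row[j] for row in features) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]
    scaled = [[(row[j] - lo[j]) / span[j] for j in range(d)] for row in features]

    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and labels[k] == labels[i]]
        misses = [k for k in range(n) if labels[k] != labels[i]]
        if not hits or not misses:
            continue  # no neighbour of one kind; skip this instance
        hit = min(hits, key=lambda k: distance(scaled[i], scaled[k]))
        miss = min(misses, key=lambda k: distance(scaled[i], scaled[k]))
        for j in range(d):
            # Reward features that differ on the nearest miss,
            # penalise features that differ on the nearest hit.
            weights[j] += (abs(scaled[i][j] - scaled[miss][j])
                           - abs(scaled[i][j] - scaled[hit][j])) / n
    return weights
```
]]></preformat>
        <p>A feature that separates the two classes ends up with a positive weight, while a feature that varies mostly within a class drifts towards zero or below, which matches the reading of the scores given above.</p>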
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Relative Limits of Evaluating Quality</title>
      <p>The results presented above show that there is a strong correlation
between narrative potential and perceived overall quality of a what-if, which
indicates that focusing on narrative plausibility as one of the main factors of quality
can lead to better results. Moreover, some of the metrics are weakly correlated
with narrative potential. However, these results are still inconclusive, and there are
a number of aspects worth mentioning for their influence on the results.</p>
      <p>Automatically generating stories and computing useful values for metrics
is heavily dependent on the available knowledge. The outcome of the system is
constrained by the use of ConceptNet. The number of relations that can be safely
used in ConceptNet is small, and the richness and depth of the chains of properties
are limited with regard to its use as a source for narrative processing. This makes it
necessary to address knowledge management from a different perspective. The
WHIM project currently includes a whole module for providing robust knowledge
to the rest of the modules, and the impact of the application of this subsystem
on the creation and evaluation of what-if ideas will be reported once the results
are ready.</p>
      <p>The generation process (for the what-ifs, the stories and the metrics) strongly
influences the overall outcome. Many design decisions have been taken in order
to provide a working, implemented prototype able to generate actual what-ifs,
and these decisions set the kind of what-ifs generated, the complexity of the
stories and many other aspects. The provided results are therefore the outcome of
a specific implementation which does not claim any generality. However, the
approach itself (namely the generation, metric computation and evaluation process)
is presented as a generally applicable method for producing novel what-if ideas.</p>
      <p>The metrics used for labeling narrative properties do not cover all computable
features. There is a large number of aspects that can be extracted from a
what-if, and the narrative-based feature extraction module of the What-If Machine
does not currently provide coverage for all of them. This is not considered
strictly relevant with regard to the methodology and scope of the study. To test
the second hypothesis (the existence of a correlation between a certain set of
metrics and the overall quality and plausibility), the metrics must be improved.
For that purpose, the presented study gives valuable insight into which direction
to take next.</p>
      <p>The weak correlation between our metrics and the quality perceived by
humans suggested that more sophisticated metrics were necessary. Some
were considered:
1. Humanization: An approximation of how human-like the main
character is, assuming that fictional scenarios use characters that, while behaving
like humans, can be non-human.
2. Empathy: How much empathy a reader will feel for the characters.
3. Tragedy: The amount of tragedy in the story.
4. Reality: How real and current the context is. An approximation of fictionality
in terms of context.
5. TimeSpan: The time span the story covers. It could be minutes, days or
years.</p>
      <p>Modelling and implementing these metrics proved to be beyond current technical
capabilities because they require complex, rich knowledge bases (1, 4), reliable
text understanding systems (5), sophisticated emotional models (2) or formal
versions of narratological models (3). None of these resources is currently
available.
</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper has presented a pilot study aimed at gaining insight into two
hypotheses, namely that (1) human evaluation of the overall quality of what-if ideas
correlates with the perception of narrative potential and that (2) there is a set
of computable metrics that also correlates with this perception. The study
found a strong correlation between quality and narrative
potential for humans (1), but did not find a comparably strong correlation between the
current metrics and the human ratings (2). These results have been analysed and
discussed in terms of the limited potential of the current implementation of both
the fictional ideation procedure and the method employed to evaluate it. Current
implementations lack the complexity required to approximate human evaluations with
an acceptable level of accuracy, mainly due to the limited technical
capabilities of current computational solutions.
</p>
      <p>29. Wiggins, G.: A preliminary framework for description, analysis and comparison of
creative systems. Knowledge-Based Systems 19(7) (2006)</p>
      <p>30. Wiggins, G.: Searching for Computational Creativity. New Generation Computing,
Computational Paradigms and Computational Intelligence. Special Issue:
Computational Creativity 24(3), 209–222 (2006)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Boden</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Computational Models of Creativity. Handbook of Creativity pp.
          <volume>351</volume>
          –
          <issue>373</issue>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Boden</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Creative Mind: Myths and Mechanisms</article-title>
          . Routledge, New York, NY,
          <volume>10001</volume>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Creativity Versus the Perception of Creativity in Computational Systems</article-title>
          .
          <source>Proceedings of the AAAI Spring Symposium on Creative Systems (Colton</source>
          <year>2002</year>
          ),
          <volume>14</volume>
          –
          <fpage>20</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The painting fool: Stories from building an automated painter</article-title>
          .
          <source>Computers and Creativity</source>
          <volume>9783642317</volume>
          ,
          <issue>3</issue>
          –
          <fpage>38</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pease</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritchie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The effect of input knowledge on creativity. Technical Reports of the Navy Center for (</article-title>
          <year>2001</year>
          ), http://www.inf.ed.ac.uk/publications/online/0055.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiggins</surname>
          </string-name>
          , G.:
          <article-title>Computational creativity: The final frontier? ECAI (</article-title>
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gervas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Linguistic creativity at different levels of decision in sentence production</article-title>
          .
          <source>In: Proceedings of the AISB 02 Symposium on AI and Creativity in Arts and Science, 3rd-5th April</source>
          <year>2002</year>
          ,
          <string-name>
            <given-names>Imperial</given-names>
            <surname>College</surname>
          </string-name>
          . pp.
          <volume>79</volume>
          –
          <issue>88</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Haenen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauchas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Investigating artificial creativity by generating melodies, using connectionist knowledge representation</article-title>
          .
          <source>In: The Third Joint Workshop on Computational Creativity</source>
          (
          <year>2006</year>
          ), http://ccg.doc.gold.ac.uk/events/ecai06/proceedings/Haenen.pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jordanous</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What it is to be Creative</article-title>
          .
          <source>Cognitive Computation</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <volume>246</volume>
          –
          <fpage>279</fpage>
          (
          <year>2012</year>
          ), http://dblp.uni-trier.de/db/journals/cogcom/cogcom4.html#Jordanous12
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kira</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rendell</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A practical approach to feature selection</article-title>
          .
          <source>In: Proceedings of the ninth international workshop on Machine learning</source>
          . pp.
          <volume>249</volume>
          –
          <issue>256</issue>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Leon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gervas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Role of Evaluation-Driven rejection in the Successful Exploration of a Conceptual Space of Stories</article-title>
          .
          <source>Minds and Machines</source>
          <volume>20</volume>
          (
          <issue>4</issue>
          ),
          <volume>615</volume>
          –
          <fpage>634</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Llano</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hepworth</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gow</surname>
          </string-name>
          , J.:
          <source>Automated Fictional Ideation via Knowledge Base Manipulation. Cognitive</source>
          Computation pp.
          <volume>1</volume>
          –
          <issue>22</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Llano</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cook</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guckelsberger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards the automatic generation of fictional ideas for games. Experimental AI in</article-title>
          . . . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Llano</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hepworth</surname>
          </string-name>
          , R.:
          <article-title>Automating fictional ideation using ConceptNet</article-title>
          .
          <source>Proceedings of the . . .</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Machado</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martins</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amaro</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abreu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Beyond interactive evolution: Expressing intentions through fitness functions</article-title>
          .
          <source>Leonardo</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In: ACL (System Demonstrations)</source>
          . pp.
          <volume>55</volume>
          –
          <issue>60</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurie</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Triggs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Sampling Strategies for Bag-of-Features Image Classification</article-title>
          . pp.
          <volume>490</volume>
          –
          <fpage>503</fpage>
          . Springer Berlin Heidelberg (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Pease</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>On impact and evaluation in computational creativity: A discussion of the Turing test and an alternative proposal</article-title>
          .
          <source>AISB</source>
          <year>2011</year>
          :
          <article-title>Computing</article-title>
          and Philosophy pp.
          <volume>15</volume>
          –
          <issue>22</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Pease</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winterstein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colton</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Evaluating machine creativity</article-title>
          .
          <source>In: Workshop on Creative Systems</source>
          ,
          <volume>4th</volume>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Peinado</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gervas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Evaluation of Automatic Generation of Basic Stories. New Generation Computing, Computational Paradigms</article-title>
          and
          <string-name>
            <given-names>Computational</given-names>
            <surname>Intelligence</surname>
          </string-name>
          . Special issue:
          <source>Computational Creativity</source>
          <volume>24</volume>
          (
          <issue>3</issue>
          ),
          <volume>289</volume>
          –
          <fpage>302</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hervas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gervas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardoso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Multiagent Text Generator with Simple Rhetorical Habilities</article-title>
          .
          <source>In: Proc. of the AAAI-06 Workshop on Computational Aesthetics: AI</source>
          Approaches to Beauty and Happiness,
          <string-name>
            <surname>July</surname>
          </string-name>
          <year>2006</year>
          . AAAI Press (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Perez</surname>
          </string-name>
          , R.y.,
          <string-name>
            <surname>Ortiz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luna</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Negrete</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A system for evaluating novelty in computer generated narratives</article-title>
          .
          <source>Creativity</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Perez</surname>
          </string-name>
          y Perez, R.:
          <article-title>MEXICA: A Computer Model of Creativity in Writing</article-title>
          .
          <source>Ph.D. thesis</source>
          , The University of Sussex (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Ritchie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Assessing creativity</article-title>
          .
          <source>In: Proceedings of the AISB Symposium on AI and Creativity in Arts and Science</source>
          . pp.
          <volume>3</volume>
          –
          <fpage>11</fpage>
          . York, UK
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ritchie</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Some Empirical Criteria for Attributing Creativity to a Computer Program</article-title>
          .
          <source>Minds &amp; Machines</source>
          <volume>17</volume>
          ,
          <issue>67</issue>
          –
          <fpage>99</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Robnik-Sikonja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Kononenko, I.:
          <article-title>An adaptation of relief for attribute estimation in regression</article-title>
          .
          <source>In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML</source>
          <year>1997</year>
          ). pp.
          <volume>296</volume>
          –
          <issue>304</issue>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Ware</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          :
          <article-title>Validating a Plan-Based Model of Narrative Conflict</article-title>
          .
          <source>In: Proceedings of the International Conference on the Foundations of Digital Games</source>
          . pp.
          <volume>220</volume>
          –
          <fpage>227</fpage>
          . ACM Press, New York, New York, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Ware</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harrison</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Four Quantitative Metrics Describing Narrative Conflict</article-title>
          . pp.
          <volume>18</volume>
          –
          <fpage>29</fpage>
          . Springer Berlin Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>