Towards a structured evaluation of improv-bots: Improvisational theatre as a non-goal-driven dialogue system

Maria Skeppstedt 1,2, Magnus Ahltorp 3
1 Computer Science Department, Linnaeus University, Växjö, Sweden
2 Applied Computational Linguistics, University of Potsdam, Potsdam, Germany
3 Magnus Ahltorp Datakonsult, Stockholm, Sweden
maria.skeppstedt@lnu.se, magnus@ahltorpdata.se

Abstract

We have here suggested a structured procedure for evaluating artificially produced improvisational theatre dialogue. We have, in addition, provided some examples of dialogues generated within the evaluation framework suggested. Although the end goal of a bot that produces improvisational theatre should be to perform against human actors, we consider the task of having two improv-bots perform against each other as a setting for which it is easier to carry out a reliable evaluation. To better approximate the end goal of having two independent entities that act against each other, we suggest that these two bots should not be allowed to be trained on the same training data. In addition, we suggest the use of the two initial dialogue lines from human-written dialogues as input for the artificially generated scenes, and the use of the same human-written dialogues in the evaluation procedure for the artificially generated theatre dialogues.

1 Introduction

Improvisational theatre (or impro/improv) is an art form in which unscripted theatre is performed. Dialogue, characters and actions are typically created spontaneously. Through collaboratively creating a story, the actors can make a new scene evolve in front of the audience [Wikipedia contributors, 2018].

Seen from the perspective of artificial intelligence research, improvisational theatre is a sub-genre of human interaction that is more forgiving than interaction in general. Errors made in general interaction are typically seen as a failure, and in the case of a dialogue system, errors might lead to a dialogue breakdown. In contrast, errors made within an improvisational theatre scene are encouraged, and can form an input to how the scene evolves. It might, therefore, be interesting to find out how artificially constructed improvisational theatre bots, which are likely to make errors to a larger extent than a human, are perceived in this special setting.

Although there is previous work on the construction of artificially generated improvisational theatre, there are, to the best of our knowledge, no descriptions of structured methods for the evaluation of the dialogues created, and thereby no methods for comparing different approaches to dialogue generation. According to Serban et al. [2016], even the more general question of which evaluation method to use for non-goal-driven dialogue systems (of which improvisational theatre could be claimed to be a sub-category) is an open one. The aim of this paper is therefore to i) provide a suggestion for a structured procedure for evaluating artificially produced improvisational theatre dialogue, and ii) give some examples of dialogues generated within the evaluation framework suggested.

2 Previous work

Creating artificially generated human dialogue is a classical task within the research field of artificial intelligence, with the ultimate aim of a bot being able to pass the Turing test. Dialogue can be created either in the form of a goal-driven dialogue system, i.e., a system that is meant to be used to perform a specific task, such as booking a ticket, or in the form of a non-goal-driven system, for which no such task is given.

2.1 Conversational modelling and dialogue systems

One implementation method for the task of generating dialogue is to use actual lines (possibly slightly modified) from an existing dialogue corpus. This approach was, for instance, applied by Banchs and Li [2012]. They constructed a vector space model vector from the previous lines in the dialogue, i.e., lines either automatically generated or provided by the human dialogue participant, and measured its distance to vectors constructed in the same fashion from the dialogues in the dialogue corpus. The corpus dialogue with the closest vector representation was then retrieved, and the dialogue line from the corpus that was given in response to the retrieved lines was returned as the next line in the dialogue.

Another solution is to generate new sentences that do not necessarily have to have been present in the corpus used for training. For this task, neural network techniques are typically applied [Vinyals and Le, 2015; Li et al., 2016; Serban et al., 2016]. For instance, the seq2seq architecture (perhaps best known for its ability to carry out machine translation [Sutskever et al., 2014; Luong et al., 2017]) has been applied for conversational modelling/dialogue generation.

The second approach is intuitively more appealing, since it gives more flexibility to what kinds of lines can be generated. Previous studies have, however, shown examples of the generative approach resulting in utterances that are fairly general, as well as examples of the same utterances often being repeated. That is, the content that occurs most commonly in the training corpus is that which is most typically generated, and the potential for flexibility does not automatically lead to larger creativity. Instead, dialogue lines that are generated mainly on the basis of what is very representative of the corpus might thus be boring in the context of improvisational theatre (and possibly also in most other applications of non-goal-driven dialogue systems). It has been possible to mitigate the problem of repeated lines through the application of reinforcement learning that rewards diversity, but the examples provided in the paper describing this approach still include dialogue lines that are rather generic [Li et al., 2016].

In addition, we suspect that the generative approach might require larger dialogue corpora to give usable results, even though out-of-domain resources, such as large external monologue corpora used to initialise word embeddings, have been shown to be useful [Serban et al., 2016]. Li et al., for instance, used the OpenSubtitles parallel corpus, which consists of around 80 million source-target pairs, for their generative approach. Since it is relevant to be able to provide automatically generated improvisational dialogues also for languages for which there does not exist a large dialogue corpus, and possibly not even a large out-of-domain corpus, or for sub-genres within a language (e.g., improvised Shakespeare [The Improvised Shakespeare Company, 2018] or Strindberg [Strindbergs intima teater, 2012]), it is also important to explore the performance of methods that are less resource demanding. Therefore, along with exploring generative approaches, it might also be relevant to compare these (for different in-domain or out-of-domain training data sizes) to methods that create dialogues through the use of existing dialogue lines.

2.2 Artificial improvisational theatre

The use of artificial intelligence as a part of improvisational theatre has recently been explored by Mathewson and Mirowski [2017]. Their work included the creation of a dialogue system that allowed a human improvisation actor to communicate with a robot that produced lines in response to lines uttered by the human actor. Two versions of the robot dialogue were constructed, one version that selected existing lines in the training corpus, and one version that relied on text generation techniques.

The ambitious approach by Mathewson and Mirowski thus included the use of speech recognition and a text-to-speech system, which functioned in real time in front of an audience. We believe that this set-up is an appropriate goal for artificial intelligence-powered improvisational theatre, in particular their choice of including a human actor as one of the participants in the dialogue. We suspect, albeit without being able to provide any substantial basis for this suspicion, that watching a human produce lines in real time is one of the main fascinations of improvisational theatre, and that many audience members would quickly lose interest in a play if they were aware that it only included artificial actors and artificially generated dialogue.

We do, however, not consider this ambitious approach to be appropriate for the goal of objectively evaluating, and thereby in the long run improving, the generation of improvised dialogue. The main reason for this is that the competence of the human actor impacts the quality of the resulting dialogue, since skilful improvisers have a larger ability to fit strange utterances from a co-actor into an improvised scene. There is, for instance, an improvisational theatre game [improwiki, 2018b] where the actors are given a set of pre-written, out-of-context lines, which they are to incorporate in a natural way into the scene. A human actor in an improvisational theatre dialogue is thus very different from a human interacting with a standard, task-oriented dialogue system. In addition, the quality of the text-to-speech system and the speech recognition might influence the audience's perception of the dialogue, and thereby their evaluation of the quality of the dialogue content.

3 Evaluation procedure suggested

Given the problems of including a human actor in a more structured evaluation, we suggest the following procedure for evaluating automatically generated improvisational theatre, in which the task is narrowed down to the generation of dialogue, and in which the dialogue is initialised in a manner that increases the possibilities to carry out a reliable evaluation.

3.1 Interaction between two bots

A more reliable evaluation method needs to remove the human influence, and the easiest approach for achieving that would be to replace the human actor with another improvisation bot, i.e., the set-up would be two improvisation bots talking to each other. However, since the end goal is to construct a bot that is able to act against a human actor, the functionality of the bots should not be allowed to be dependent on any one of the bots having full knowledge of the other bot. Instead, the shared knowledge between the two bots should aim to approximate the shared knowledge between two human improvisational actors.

To approximate that level of shared knowledge, we suggest that the two bots that are to be evaluated should not be allowed to be trained on the same training data. The data could be taken from the same text genre, but it should not be the exact same data. That is, the situation resembles that of two humans who learn the same language: they are exposed to text from the same genre, i.e., the very wide genre of utterances from many different registers in the language, but are not exposed to the exact same utterances.

3.2 Starting the improvised scene

Improvisational theatre is often carried out with the use of a set of constraints, typically in the form of an input that the actors can use as a starting point for their scene. For instance, the audience could provide an input in the form of a suggestion for a location at which the scene is to take place. Another example is input in the form of body postures that the actors use as the starting point for a scene [Johnstone, 1999, pp. 186–187].

The evaluation method we suggest is to use the two initial dialogue lines from human-written dialogues as input for the scene. This is a form of input that can be easily automated on a larger scale (as opposed to using non-textual input such as body postures). In addition, the two initial lines provide background data that the dialogue bots can use for generating new lines, which simplifies the task somewhat.

Most importantly, however, using the first two lines of a human-written dialogue as input means that the artificially generated dialogue has a comparable human-written dialogue against which it can be evaluated. To make them as comparable as possible, the improvisation bots could be instructed to generate approximately the same number of dialogue lines as the number of lines included in the human dialogue.

3.3 Evaluating the scene from the perspective of its likelihood of having been produced by humans

With the use of this set-up for dialogue generation, for which there will be comparable human-written and automatically generated texts available, the evaluation can be carried out as follows:

The two initial dialogue lines are randomly sampled from a set of (preferably short) human-written dialogues, and one or several pairs of bot systems use these two initial lines to produce a generated dialogue.

A human evaluator is then presented with a set of short dialogue texts, of which some (e.g., half of them) have been selected from the human-written dialogues from which the two initial starter lines were sampled, and some from the automatically generated dialogues. The task of the human evaluator is then, for each text, to decide whether the dialogue has been generated by a machine or produced by humans. The same human evaluator should not be presented with a human-written dialogue and an automatically generated one that begin with the same two initial lines. With this restriction, the situation in which the evaluator carries out a direct comparison between the two texts is avoided. An evaluation through comparison would be a less realistic task, since the final aim is to produce a dialogue that could pass as human-produced, not a dialogue that is more human-like than a text that has actually been produced by a human. Employing at least three human evaluators would be a prerequisite for ensuring that all automatically generated texts are shown to a human, and that enough texts are annotated to allow for inter-annotator agreement calculations.

Naturally, the set of dialogues from which the two initial dialogue lines are sampled to use as evaluation data cannot be allowed to be included in the data sets used for training the improv-bots.

3.4 Evaluating the scene from other perspectives

There are, of course, other aspects than the resemblance to a human-produced dialogue for which the generated dialogues should be evaluated. Two parameters, mentioned in the background, are the level of diversity among the lines generated and how general the lines produced are. Repetitive and generic dialogue lines are both examples of phenomena that might produce a boring scene, and these two parameters might therefore be combined into a metric in the form of the entertainment value of the dialogues. The evaluator should, therefore, when estimating whether a dialogue has been produced by a human, also assess how entertaining the dialogue is. This is likely to be a more subjective measure. However, given a hypothetical situation in which the artificially generated dialogue often is perceived as being generated by a human, but these dialogues are consistently given a lower entertainment value score than the human-written dialogues, this would give an indication that something important is missing in the dialogues generated. The easiest solution is probably to use a binary score, e.g., to let the evaluator determine whether the dialogue was boring or not.

There are also other types of measures that could be applied for evaluating generated dialogues, e.g., measures that are related to techniques taught within improvisational theatre. An actor should, for instance, aim to be collaborative, e.g., give offers to and accept offers from the co-actors [Johnstone, 1987, pp. 94–108]. To help the audience follow a scene, what roles the actors play, what their relationship is, where the scene is played and what the objectives of the characters are should also be established early on in a scene [improwiki, 2018a]. It would be a very interesting task to construct an improvisational theatre bot that could achieve such improv-theatre tasks. With these more specific tasks, however, the system is perhaps no longer a non-goal-driven dialogue system, but starts to resemble a goal-driven system. Creating such a system is thus a separate task, for which a separate framework for evaluation should be developed.

4 Implementation

In the long run, we aim to implement and evaluate a resource-intensive method as well, e.g., a method that uses seq2seq to generate new text. However, to illustrate the evaluation method, we here implemented a dialogue creation strategy built on selecting the most appropriate line from a dialogue corpus. This method uses i) a moderate-size dialogue corpus, and ii) a distributional semantics space that is constructed from a very large out-of-domain corpus. We apply a dialogue generation method that is built on several different sub-ideas, which we hope might serve as inspiration for future work, but an evaluation of the contribution of each idea is not within the scope of this paper.

As corpus, we used the Cornell movie-dialogues corpus [Danescu-Niculescu-Mizil and Lee, 2011], and as distributional semantics space we used the word2vec space that has been pre-trained on a very large corpus of Google News and which has been made available by Mikolov et al. [2013; 2013].

Due to the spontaneous and collaborative nature of improvisational theatre, we believe that each dialogue line in this genre is, on average, likely to be shorter than lines in scripted theatre. We, therefore, extracted a subset of dialogue line triplets from the Cornell movie-dialogues corpus, where each of the lines had to conform to the following set of length criteria: A line was allowed to contain a maximum of two sentences, and in case it contained two sentences, the first of these two sentences was allowed to contain a maximum of two tokens. The last sentence (that is, the only sentence for one-sentence lines and the second for two-sentence lines) was allowed to contain a maximum of twelve tokens. Sentence splitting and tokenisation were carried out with NLTK [Bird, 2002].

In the Cornell movie-dialogues corpus, there were only 262 dialogues that contained at least six dialogue lines and for which all of the lines conformed to the length criteria we had established for the experiment. These 262 dialogues were, therefore, saved to use as the set of evaluation data, i.e., data which could be used in the evaluation of the automatically generated dialogues. Line triplets from the rest of the corpus were divided into two groups, one group to use as training data for Actor A and another group to use as training data for Actor B. We divided the triplets film-wise, so that all triplets from the same film were assigned as training data either to Actor A or to Actor B. In addition, 100 of the dialogues were not added to the training data set, but were used for an informal evaluation during development, i.e., used as the two first input lines to run the dialogue generation during development. A total of 10,322 line triplets were used to train the functionality for Actor A and a total of 10,884 line triplets for the functionality of Actor B.

A context in the form of the line most recently uttered in the dialogue and the line before that was used as input data for predicting the next line in the dialogue. The first two lines of each training data triplet were used to represent these two most recent lines, and the third line to represent the line to be predicted. The core of the method for prediction was thus to retrieve the training data triplet for which the two first lines were most similar to the two most recent lines in the generated dialogue, and to use the third line in the triplet as the next line in the generated dialogue. Similarity of dialogue line pairs was determined by converting the two lines into a semantic vector representation, and using the Euclidean distance between the vectors as the similarity measure.

The vector representations for the previous and the most recently uttered line in the generated dialogue (as well as for the first and second lines in the training data triplets) were constructed as follows: For the previous line, the average of the word2vec vectors representing the tokens in the line was used as the line representation. Tokens present in a standard English stop word list were removed before creating the average vector. For the most recently uttered line, the same representation was used, except that stop words were retained. We believe that words that are normally considered stop words are also important when interpreting the exact content of the most recently uttered dialogue line, while they might be less important for the content of an earlier line, which we included to provide a topical context.

In addition to the averaged vectors, we used the word2vec representations of the three first tokens in the most recently uttered line, as well as of the three last tokens in the line, as we believe that these might be more important than the other words for capturing the surface form of the conversation. All of these six vector representations were then concatenated into one long vector. The averaged vectors were slightly down-weighted, to give more importance to the vector representations for the three initial and ending tokens of the most recent line (the weights were determined by inspecting the output of the algorithm on the development data). Vector elements were also added to indicate whether a line contained any of the question words who, where, when, why, what, which, how, or a question mark.

When there were several dialogue line pairs in the training data that matched the lines in the generated dialogues equally well (allowing for a maximum Euclidean distance difference of 0.08 between different candidates), and which thus resulted in many candidates for the next line, we applied unsupervised outlier detection to this set of candidates, using scikit-learn's OneClassSVM [Pedregosa et al., 2011]. The set of outliers was then removed from the candidate list.

Among the candidates still present in the candidate list after the outliers had been removed, we tried to incorporate the co-operative spirit of improvisational theatre when selecting which of them to use. This was accomplished by selecting the candidate line for which, when this line (together with its preceding line) was submitted as input to the algorithm, the closest neighbour was found. The motivation for this was that when a line is selected to which the co-actor is more likely to find a good answer, the dialogue runs more smoothly, i.e., just as in real improvisational theatre.

We also applied two simple rules to improve the dialogues: i) to avoid ending a dialogue with a line ending with a question mark, and ii) to avoid repeating a line in the dialogue. These rules were, however, not strictly enforced, and when there were no other candidates of approximately the same quality as a line ending with a question mark or as a repeated line, these were still used.

Word2vec vectors were accessed through the Gensim library [Řehůřek and Sojka, 2010]. The search for dialogue line pairs in the training data, i.e., the dialogue line pairs that were closest to the data given when constructing new dialogues, was sped up by training a scikit-learn NearestNeighbors classifier [Pedregosa et al., 2011].

5 Example output

In Table 1, we present six generated dialogues, which were randomly sampled from the set of 262 dialogues that had been set aside as evaluation data. The first two lines are given from the corpus dialogue, and each generated version is presented together with the human-written corpus version. The last two examples show the output of our algorithm and the output presented by Li et al. [2016]. As when generating lines starting from human-written dialogue, we provided the first two lines in the dialogues published by Li et al. as input to our system.

Our suggested formal evaluation of these dialogues would thus be to present half of the dialogues in Table 1 to Evaluator 1 and the other half to Evaluator 2, who are to determine i) whether the dialogue is produced by a human or not, and ii) whether the dialogue is boring.
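The core retrieval step described in Section 4 can be sketched as follows. This is a minimal sketch, not the actual implementation: the token vectors below are random stand-ins for the pretrained Google News word2vec space, all function and class names are our own, and the stop-word handling, first- and last-token vectors, question-word indicators, down-weighting, outlier filtering and co-operative candidate selection described above are omitted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for a pretrained word2vec model: each token is mapped
# to a fixed random vector (the implementation uses the Google News space).
DIM = 50
_rng = np.random.default_rng(0)
_vectors = {}

def token_vector(token):
    if token not in _vectors:
        _vectors[token] = _rng.standard_normal(DIM)
    return _vectors[token]

def line_vector(line):
    # Average of the token vectors in the line.
    tokens = line.lower().split()
    return np.mean([token_vector(t) for t in tokens], axis=0)

def pair_vector(prev_line, last_line):
    # Concatenate the representations of the two context lines.
    return np.concatenate([line_vector(prev_line), line_vector(last_line)])

class TripletRetriever:
    """Retrieve the third line of the training triplet whose first two
    lines are closest (Euclidean distance) to the current context."""

    def __init__(self, triplets):
        self.triplets = triplets
        X = np.array([pair_vector(a, b) for a, b, _ in triplets])
        self.nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(X)

    def next_line(self, prev_line, last_line):
        query = pair_vector(prev_line, last_line).reshape(1, -1)
        _, idx = self.nn.kneighbors(query)
        return self.triplets[idx[0][0]][2]
```

Querying with the exact first two lines of a training triplet returns that triplet's third line, since the Euclidean distance to its own context vector is zero.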
When informally evaluating these dialogues, we would say that most of the human-written dialogues would pass as human-made, except the strange dialogue 2, while hardly any of the generated dialogues would be classified as produced by humans.

Table 1: The automatically generated dialogues compared to the human-written dialogues, and (for the last two examples) compared to the output of previously published generated dialogue examples. The same human evaluator would be shown either the computer-generated or the human-written version of a dialogue, and determine i) whether it has been produced by a human, and ii) whether it is boring.

Dialogue 1
Computer-generated:
A: I was first to respond.
B: What were you doing out here?
A: My debrief –
B: Change?
A: With a vengeance, I hated them.
B: She was the Queen of the Netherlands.
Human-written:
A: I was first to respond.
B: What were you doing out here?
A: Drive by patrol.
B: How is she?
A: She's tough.
B: Have to be. The shit she's gone through.

Dialogue 2
Computer-generated:
A: I NEED SOME
B: YOUNG BLOOD
A: On your sleeve.
B: Well, I'm sure she kept her word and left you satisfied.
A: No. No you're not.
B: Yes, Mr. Hillyer, I am.
A: Guess we better get going, huh?
B: Just a little something' somethin'.
Human-written:
A: I NEED SOME
B: YOUNG BLOOD
A: I NEED SOME
B: YOUNG BLOOD
A: I NEED IT NOW I NEED SOME
B: YOUNG BLOOD
A: I NEED SOME
B: YOUNG BLOOD

Dialogue 3
Computer-generated:
A: I'd better go in.
B: I only live up there and across the roof.
A: Did you find something?
B: Who?
A: If he's not out, why are they here?
B: I have no idea.
A: Why not?
Human-written:
A: I'd better go in.
B: I only live up there and across the roof.
A: Thanks anyway.
B: Listen, Edie, am I going to see you again?
A: What for?
B: I don't know.
A: I really don't know.

Dialogue 4
Computer-generated:
A: This is ridiculous.
B: Government cutbacks.
A: Office work?
B: You bet it did.
A: You drive away business roarin' in like that.
B: Bravo!
Human-written:
A: This is ridiculous.
B: Government cutbacks.
A: Why can't we do it through the mail?
B: We missed the deadline.
A: Can't we do it on the phone?
B: I don't think so.

Dialogue 5
Computer-generated:
A: Look at my ring.
B: That's a fine-looking ring.
A: Busy tomorrow?
B: Yes. Things are pretty hectic right now, but thanks.
A: I know!
B: I'm gonna have to ask you to leave.
Human-written:
A: Look at my ring.
B: That's a fine-looking ring.
A: A lady threw it to me.
B: Most surely that was a nice lady to give a ring to a little girl.
A: It's a pretty ring.
B: I wouldn't be surprised if it were a true wishing ring.

Dialogue 6
Computer-generated:
A: You all right?
B: Yes, thank you.
A: Hello, Helen.
B: I'm afraid I spoke out of turn last night.
A: From whom?
B: The main guy, Dieter–
A: I don't have time.
B: I can see why.
Human-written:
A: You all right?
B: Yes, thank you.
A: What was that all about?
B: He wants me to perform for him.
A: Perform?
B: You know.
A: You on the game?
B: God no. I'm a hairdresser.

Dialogue 7
Computer-generated here:
A: How old are you?
B: I'm 16. Why are you asking?
A: Why are you smiling?
B: To tell you that we're interested in the property.
A: Oh, Dmitri. I wish that I could.
B: He didn't say we need guns.
Computer-generated in previous studies:
A: How old are you?
B: I'm 16. Why are you asking?
A: I thought you were 12.
B: What made you think so?
A: I don't know what you are talking about.
B: You don't know what you are saying.

Dialogue 8
Computer-generated here:
A: Where are you going?
B: I'm going to the police station.
A: I'll help you.
B: I'll scream.
A: I mean, we get paid tonight, don't we?
B: What the fuck is that supposed –
A: What?
Computer-generated in previous studies:
A: Where are you going?
B: I'm going to the police station.
A: I'll come with you.
B: No, no, no, no, you're not going anywhere.
A: Why?
B: I need you to stay here.
A: I don't know what you are talking about.
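The evaluator assignment used in the suggested evaluation, where no evaluator sees both the generated and the human-written version of a dialogue that starts from the same two lines, could be sketched as follows. This is a hypothetical helper under our own naming, not part of the implementation described above.

```python
import random

def assign_to_evaluators(dialogue_pairs, seed=0):
    """Split (generated, human-written) dialogue pairs between two
    evaluators so that no evaluator sees both versions of a pair.
    Returns two shuffled evaluation sets of (text, is_human) items."""
    rng = random.Random(seed)
    eval1, eval2 = [], []
    for generated, human in dialogue_pairs:
        # Randomly decide which evaluator gets the generated version;
        # the other evaluator gets the human-written one.
        if rng.random() < 0.5:
            eval1.append((generated, False))
            eval2.append((human, True))
        else:
            eval1.append((human, True))
            eval2.append((generated, False))
    # Shuffle so the order does not reveal which texts are generated.
    rng.shuffle(eval1)
    rng.shuffle(eval2)
    return eval1, eval2
```

Each evaluator then labels every text as human-produced or machine-generated (and as boring or not), and the hidden is_human flags allow the accuracy of these judgements to be scored afterwards.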
None of the dialogues would, however, be classified as boring, except maybe the first of the two dialogues provided by Li et al. [2016], as it starts to generate very generic lines towards the end of the dialogue.

6 Conclusion and outlook

The generated dialogues presented here portray a collection of somewhat strange exchanges, and would not be useful in the context of simulating a real conversation. They might, however, function as absurd dialogues that, for instance, could be used as improvised scene starters. We believe, however, that the more structured form of evaluating a non-goal-driven dialogue system that we present and exemplify could be generally useful. The evaluation structure might be possible to apply in the setting of a shared task, in which the participants not only produce dialogues of this type, but also participate in the evaluation by classifying the dialogues produced by other participating groups.

The next step is to implement a more resource-intensive method, e.g., a method built on seq2seq or some other neural network-based technique. We also intend to extend our initial attempts at achieving dialogue generation with the help of a moderately sized dialogue corpus. We have, for instance, not yet attempted any post-processing of the selected lines to make them fit better into the dialogue, e.g., to make pronoun gender and number agree between the lines, or to match the use of helper verbs.

Although the ultimate goal would be to achieve an improv-bot that could act seamlessly with a human actor, it would also be interesting to explore the suspicion we introduced in the background, i.e., that an audience would quickly lose interest in a play if they were aware that it consisted solely of artificially generated dialogue. For instance, if two puppets were given two starting lines by the audience, and from these starting lines played a scene with automatically generated human-like dialogues, would the audience still find it interesting?

Acknowledgements

We would like to thank Jonas Sjöbergh, as well as the anonymous reviewers, for valuable input to the content of this paper.

References

[Banchs and Li, 2012] Rafael E. Banchs and Haizhou Li. Iris: a chat-oriented dialogue system based on the vector space model. In Proceedings of the Association for Computational Linguistics, System Demonstrations, 2012.

[Bird, 2002] Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[Danescu-Niculescu-Mizil and Lee, 2011] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011, 2011.

[improwiki, 2018a] improwiki. Crow. https://improwiki.com/en/wiki/improv/crow, 2018.

[improwiki, 2018b] improwiki. Drop a line. https://improwiki.com/en/wiki/improv/drop_a_line, 2018.

[Johnstone, 1987] Keith Johnstone. Impro: Improvisation and the Theatre. Routledge, New York, 1987.

[Johnstone, 1999] Keith Johnstone. Impro for Storytellers: Theatresports and the Art of Making Things Happen. Faber, London, new edition, 1999.

[Li et al., 2016] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas, November 2016. Association for Computational Linguistics.

[Luong et al., 2017] Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt, 2017.

[Mathewson and Mirowski, 2017] Kory W. Mathewson and Piotr Mirowski. Improvised theatre alongside artificial intelligences. In Proceedings of the AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence, 2017.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[Mikolov, 2013] Tomas Mikolov. word2vec on Google Code. https://code.google.com/archive/p/word2vec/, 2013.

[Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[Řehůřek and Sojka, 2010] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Paris, France, May 2010. European Language Resources Association (ELRA).

[Serban et al., 2016] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, 2016.

[Strindbergs intima teater, 2012] Strindbergs intima teater. http://strindbergsintimateater.se/festival-i-maj-2012/, 2012.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[The Improvised Shakespeare Company, 2018] The Improvised Shakespeare Company. http://improvisedshakespeare.com, 2018.

[Vinyals and Le, 2015] Oriol Vinyals and Quoc V. Le. A neural conversational model. CoRR, abs/1506.05869, 2015.

[Wikipedia contributors, 2018] Wikipedia contributors. Improvisational theatre. Wikipedia, the free encyclopedia, 2018. [Online; accessed 27-June-2018].