Towards a structured evaluation of improv-bots: Improvisational theatre as a non-goal-driven dialogue system

Maria Skeppstedt 1,2, Magnus Ahltorp 3
1 Computer Science Department, Linnaeus University, Växjö, Sweden
2 Applied Computational Linguistics, University of Potsdam, Potsdam, Germany
3 Magnus Ahltorp Datakonsult, Stockholm, Sweden
maria.skeppstedt@lnu.se, magnus@ahltorpdata.se

Abstract

We have here suggested a structured procedure for evaluating artificially produced improvisational theatre dialogue. We have, in addition, provided some examples of dialogues generated within the evaluation framework suggested. Although the end goal of a bot that produces improvisational theatre should be to perform against human actors, we consider the task of having two improv-bots perform against each other as a setting for which it is easier to carry out a reliable evaluation. To better approximate the end goal of having two independent entities that act against each other, we suggest that these two bots should not be allowed to be trained on the same training data. In addition, we suggest the use of the two initial dialogue lines from human-written dialogues as input for the artificially generated scenes, and the use of the same human-written dialogues in the evaluation procedure for the artificially generated theatre dialogues.

1 Introduction

Improvisational theatre (or impro/improv) is an art form in which unscripted theatre is performed. Dialogue, characters and actions are typically created spontaneously. Through collaboratively creating a story, the actors can make a new scene evolve in front of the audience [Wikipedia contributors, 2018].

Seen from the perspective of artificial intelligence research, improvisational theatre is a sub-genre of human interaction that is more forgiving than interaction in general. Errors made in general interaction are typically seen as a failure, and in the case of a dialogue system, errors might lead to a dialogue breakdown. In contrast, errors made within an improvisational theatre scene are encouraged, and can form an input to how the scene evolves. It might, therefore, be interesting to find out how artificially constructed improvisational theatre bots, which are likely to make errors to a larger extent than a human, are perceived in this special setting.

Although there is previous work on the construction of artificially generated improvisational theatre, there are, to the best of our knowledge, no descriptions of structured methods for the evaluation of the dialogues created, and thereby no methods for comparing different approaches to dialogue generation. According to Serban et al. [2016], even the more general question of which evaluation method to use for non-goal-driven dialogue systems (of which improvisational theatre could be claimed to be a sub-category) is an open one. The aim of this paper is therefore to i) provide a suggestion for a structured procedure for evaluating artificially produced improvisational theatre dialogue, and ii) give some examples of dialogues generated within the evaluation framework suggested.

2 Previous work

Creating artificially generated human dialogue is a classical task within the research field of artificial intelligence, with the ultimate aim of a bot being able to pass the Turing test. Dialogue can be created either in the form of a goal-driven dialogue system, i.e., a system that is meant to be used to perform a specific task, such as booking a ticket, or in the form of a non-goal-driven system, for which no such task is given.

2.1 Conversational modelling and dialogue systems

One implementation method for the task of generating dialogue is to use actual lines (possibly slightly modified) from an existing dialogue corpus. This approach was, for instance, applied by Banchs and Li [2012]. They constructed a vector space model vector from the previous lines in the dialogue, i.e., lines either automatically generated or provided by the human dialogue participant, and measured its distance to vectors constructed in the same fashion from the dialogues in the dialogue corpus. The corpus dialogue with the closest vector representation was then retrieved, and the dialogue line from the corpus that was given in response to the retrieved lines was returned as the next line in the dialogue.

Another solution is to generate new sentences that do not necessarily have to have been present in the corpus used for training. For this task, neural network techniques are typically applied [Vinyals and Le, 2015; Li et al., 2016; Serban et al., 2016]. For instance, the seq2seq architecture (perhaps best known for its ability to carry out machine translation [Sutskever et al., 2014; Luong et al., 2017]) has been applied for conversational modelling/dialogue generation.

The second approach is intuitively more appealing, since it gives more flexibility to what kinds of lines can be generated. Previous studies have, however, shown examples of the generative approach resulting in utterances that are fairly general, as well as examples of the same utterances often being repeated. That is, the content that occurs most commonly in the training corpus is that which is most typically generated, and the potential for flexibility does not automatically lead to larger creativity. Instead, dialogue lines that are generated mainly on the basis of what is very representative of the corpus might thus be boring in the context of improvisational theatre (and possibly also in most other applications of non-goal-driven dialogue systems). It has been possible to mitigate the problem of repeated lines through the application of reinforcement learning that rewards diversity, but the examples provided in the paper describing this approach still include dialogue lines that are rather generic [Li et al., 2016].

In addition, we suspect that the generative approach might require larger dialogue corpora to give usable results, even though out-of-domain resources, such as large external monologue corpora used to initialise word embeddings, have been shown to be useful [Serban et al., 2016]. Li et al., for instance, used the OpenSubtitles parallel corpus, which consists of around 80 million source-target pairs, for their generative approach. Since it is relevant to be able to provide automatically generated improvisational dialogues also for languages for which there does not exist a large dialogue corpus, and possibly not even a large out-of-domain corpus, or for sub-genres within a language (e.g., improvised Shakespeare [The Improvised Shakespeare Company, 2018] or Strindberg [Strindbergs intima teater, 2012]), it is also important to explore the performance of methods that are less resource demanding. Therefore, along with exploring generative approaches, it might also be relevant to compare these (for different in-domain or out-of-domain training data sizes) to methods that create dialogues through the use of existing dialogue lines.

2.2 Artificial improvisational theatre

The use of artificial intelligence as a part of improvisational theatre has recently been explored by Mathewson and Mirowski [2017]. Their work included the creation of a dialogue system that allowed a human improvisation actor to communicate with a robot that produced lines in response to lines uttered by the human actor. Two versions of the robot dialogue were constructed, one version that selected existing lines in the training corpus, and one version that relied on text generation techniques.

The ambitious approach by Mathewson and Mirowski thus included the use of speech recognition and a text-to-speech system, which functioned in real time in front of an audience. We believe that this set-up is an appropriate goal for artificial intelligence-powered improvisational theatre, in particular their choice of including a human actor as one of the participants in the dialogue. We suspect, albeit without being able to provide any substantial basis for this suspicion, that watching a human produce lines in real time is one of the main fascinations of improvisational theatre, and that many audience members would quickly lose interest in a play if they were aware that it only included artificial actors and artificially generated dialogue.

We do, however, not consider this ambitious approach to be appropriate for the goal of objectively evaluating, and thereby in the long run improving, the generation of improvised dialogue. The main reason for this is that the competence of the human actor impacts the quality of the resulting dialogue, since skilful improvisers have a larger ability to fit strange utterances from a co-actor into an improvised scene. There is, for instance, an improvisational theatre game [improwiki, 2018b] where the actors are given a set of pre-written, out-of-context lines, which they are to incorporate in a natural way into the scene. A human actor in an improvisational theatre dialogue is thus very different from a human interacting with a standard, task-oriented dialogue system. In addition, the quality of the text-to-speech system and the speech recognition might influence the audience's perception of the dialogue, and thereby their evaluation of the quality of the dialogue content.

3 Evaluation procedure suggested

Given the problems of including a human actor in a more structured evaluation, we suggest the following procedure for evaluating automatically generated improvisational theatre, in which the task is narrowed down to the generation of dialogue, and in which the dialogue is initialised in a manner that increases the possibilities to carry out a reliable evaluation.

3.1 Interaction between two bots

A more reliable evaluation method needs to remove the human influence, and the easiest approach for achieving that would be to replace the human actor with another improvisation bot, i.e., the set-up would be two improvisation bots talking to each other. However, since the end goal is to construct a bot that is able to act against a human actor, the functionality of the bots should not be allowed to be dependent on any one of the bots having full knowledge of the other bot. Instead, the shared knowledge between the two bots should aim to approximate the shared knowledge between two human improvisational actors.

To approximate that level of shared knowledge, we suggest that the two bots that are to be evaluated should not be allowed to be trained on the same training data. The data could be taken from the same text genre, but it should not be the exact same data. That is, the situation resembles that of two humans who learn the same language: they are exposed to text from the same genre, i.e., the very wide genre of utterances from many different registers in the language, but are not exposed to the exact same utterances.

3.2 Starting the improvised scene

Improvisational theatre is often carried out with the use of a set of constraints, typically in the form of an input that the actors can use as a starting point for their scene. For instance, the audience could provide an input in the form of a suggestion for a location at which the scene is to take place. Another example is input in the form of body postures that the actors use as the starting point for a scene [Johnstone, 1999, pp. 186–187].

The evaluation method we suggest is to use the two initial dialogue lines from human-written dialogues as input for the scene. This is a form of input that can be easily automated on a larger scale (as opposed to using non-textual input such as body postures). In addition, the two initial lines provide background data that the dialogue bots can use for generating new lines, which simplifies the task somewhat.

Most importantly, however, using the first two lines of a human-written dialogue as input means that the artificially generated dialogue has a comparable human-written dialogue against which it can be evaluated. To make them as comparable as possible, the improvisation bots could be instructed to generate approximately the same number of dialogue lines as the number of lines included in the human dialogue.

3.3 Evaluating the scene from the perspective of its likelihood of having been produced by humans

With the use of this set-up for dialogue generation, for which there will be comparable human-written and automatically generated texts available, the evaluation can be carried out as follows:

The two initial dialogue lines are randomly sampled from a set of (preferably short) human-written dialogues, and one or several pairs of bot systems use these two initial lines to produce a generated dialogue.

A human evaluator is then presented with a set of short dialogue texts, of which some (e.g., half of them) have been selected from the human-written dialogues from which the two initial starter lines were sampled, and some from the automatically generated dialogues. The task of the human evaluator is then, for each text, to decide whether the dialogue has been generated by a machine or produced by humans. The same human evaluator should not be presented with a human-written dialogue and an automatically generated one that begin with the same two initial lines. With this restriction, the situation in which the evaluator carries out a direct comparison between the two texts is avoided. An evaluation through comparison would be a less realistic task, since the final aim is to produce a dialogue that could pass as human-produced, not a dialogue that is more human-like than a text that has actually been produced by a human. Employing at least three human evaluators would be a prerequisite for ensuring that all automatically generated texts are shown to a human, and that enough texts are annotated to allow for inter-annotator agreement calculations.

Naturally, the set of dialogues from which the two initial dialogue lines are sampled to use as evaluation data cannot be allowed to be included in the data sets used for training the improv-bots.

3.4 Evaluating the scene from other perspectives

There are, of course, other aspects than the resemblance to a human-produced dialogue for which the generated dialogues should be evaluated. Two parameters, mentioned in the background, are the level of diversity among the lines generated and how general the lines produced are. Repetitive and generic dialogue lines are both examples of phenomena that might produce a boring scene, and these two parameters might therefore be combined into a metric in the form of the entertainment value of the dialogues. The evaluator should, therefore, when estimating whether a dialogue has been produced by a human, also assess how entertaining the dialogue is. This is likely to be a more subjective measure. However, given a hypothetical situation in which the artificially generated dialogue often is perceived as being generated by a human, but these dialogues are consistently given a lower entertainment value score than the human-written dialogues, this would give an indication that something important is missing in the dialogues generated. The easiest solution is probably to use a binary score, e.g., to let the evaluator determine whether the dialogue was boring or not.

There are also other types of measures that could be applied for evaluating generated dialogues, e.g., measures that are related to techniques taught within improvisational theatre. An actor should, for instance, aim to be collaborative, e.g., give offers to and accept offers from the co-actors [Johnstone, 1987, pp. 94–108]. To help the audience follow a scene, what roles the actors play, what their relationship is, where the scene is played and what the objectives of the characters are should also be established early on in a scene [improwiki, 2018a]. It would be a very interesting task to construct an improvisational theatre bot that could achieve such improv-theatre tasks. With these more specific tasks, however, the system is perhaps no longer a non-goal-driven dialogue system, but starts to resemble a goal-driven system. Creating such a system is thus a separate task, for which a separate framework for evaluation should be developed.

4 Implementation

In the long run, we aim to implement and evaluate a resource-intensive method as well, e.g., a method that uses seq2seq to generate new text. However, to illustrate the evaluation method, we here implemented a dialogue creation strategy built on selecting the most appropriate line from a dialogue corpus. This method uses i) a moderate-size dialogue corpus, and ii) a distributional semantics space that is constructed from a very large out-of-domain corpus. We apply a dialogue generation method that is built on several different sub-ideas, which we hope might serve as inspiration for future work, but an evaluation of the contribution of each idea is not within the scope of this paper.

As corpus, we used the Cornell movie-dialogues corpus [Danescu-Niculescu-Mizil and Lee, 2011], and as distributional semantics space we used the word2vec space that has been pre-trained on a very large corpus of Google News and which has been made available by Mikolov et al. [2013; 2013].

Due to the spontaneous and collaborative nature of improvisational theatre, we believe that each dialogue line in this genre is, on average, likely to be shorter than lines in scripted theatre. We, therefore, extracted a subset of dialogue line triplets from the Cornell movie-dialogues corpus, where each of the lines had to conform to the following set of length criteria: A line was allowed to contain a maximum of two sentences, and in case it contained two sentences, the first of these two sentences was allowed to contain a maximum of two tokens. The last sentence (that is, the only sentence for one-sentence lines and the second for two-sentence lines) was allowed to contain a maximum of twelve tokens. Sentence splitting and tokenisation were carried out with NLTK [Bird, 2002].

In the Cornell movie-dialogues corpus, there were only 262 dialogues that contained at least six dialogue lines and for which all of the lines conformed to the length criteria we had established for the experiment. These 262 dialogues were, therefore, saved to use as the set of evaluation data, i.e., data which could be used in the evaluation of the automatically generated dialogues. Line triplets from the rest of the corpus were divided into two groups, one group to use as training data for Actor A and another group to use as training data for Actor B. We divided the triplets film-wise, so that all triplets from the same film were assigned as training data either to Actor A or to Actor B. In addition, 100 of the dialogues were not added to the training data set, but were used for an informal evaluation during development, i.e., used as the two first input lines to run the dialogue generation during development. A total of 10,322 line triplets were used to train the functionality for Actor A and a total of 10,884 line triplets for the functionality of Actor B.

A context in the form of the line most recently uttered in the dialogue and the line before that was used as input data for predicting the next line in the dialogue. The first two lines of each training data triplet were used to represent these two most recent lines, and the third line to represent the line to be predicted. The core of the method for prediction was thus to retrieve the training data triplet for which the two first lines were most similar to the two most recent lines in the generated dialogue, and to use the third line in the triplet as the next line in the generated dialogue. Similarity of dialogue line pairs was determined by converting the two lines into a semantic vector representation, and using the Euclidean distance between the vectors as the similarity measure.

The vector representations for the previous and the most recently uttered line in the generated dialogue (as well as for the first and second lines in the training data triplets) were constructed as follows: For the previous line, the average of the word2vec vectors representing the tokens in the line was used as the line representation. Tokens present in a standard English stop word list were removed before creating the average vector. For the most recently uttered line, the same representation was used, except that stop words were retained. We believe that words that are normally considered stop words are also important when interpreting the exact content of the most recently uttered dialogue line, while they might be less important for the content of an earlier line, which we included to provide a topical context.

In addition to the averaged vectors, we used the word2vec representations of the three first tokens in the most recently uttered line, as well as of the three last tokens in the line, as we believe that these might be more important than the other words for capturing the surface form of the conversation. All of these six vector representations were then concatenated into one long vector. The averaged vectors were slightly down-weighted, to give more importance to the vector representations for the three initial and ending tokens of the most recent line (the weights were determined by inspecting the output of the algorithm on the development data). Vector elements were also added to indicate whether a line contained any of the question words who, where, when, why, what, which, how, or a question mark.

When there were several dialogue line pairs in the training data that matched the lines in the generated dialogues equally well (allowing for a maximum Euclidean distance difference of 0.08 between different candidates), and which thus resulted in many candidates for the next line, we applied unsupervised outlier detection to this set of candidates, using scikit-learn's OneClassSVM [Pedregosa et al., 2011]. The set of outliers was then removed from the candidate list.

Among the candidates still present in the candidate list after the outliers had been removed, we tried to incorporate the co-operative spirit of improvisational theatre when selecting which of them to use. This was accomplished by selecting the candidate line for which, when this line (together with its preceding line) was submitted as input to the algorithm, the closest neighbour was found. The motivation for this was that when a line is selected to which the co-actor is more likely to find a good answer, the dialogue runs more smoothly, i.e., just as in real improvisational theatre.

We also applied two simple rules to improve the dialogues: i) to avoid ending a dialogue with a line ending with a question mark, and ii) to avoid repeating a line in the dialogue. These rules were, however, not strictly enforced, and when there were no other candidates of approximately the same quality as a line ending with a question mark or as a repeated line, these were still used.

Word2vec vectors were accessed through the Gensim library [Řehůřek and Sojka, 2010]. The search for dialogue line pairs in the training data, i.e., the dialogue line pairs that were closest to the data given when constructing new dialogues, was sped up by training a scikit-learn NearestNeighbors classifier [Pedregosa et al., 2011].

5 Example output

In Table 1, we present six generated dialogues, which were randomly sampled from the set of 262 dialogues that had been set aside as evaluation data. The first two lines are given from the corpus dialogue, and each generated version is presented together with the human-written corpus version. The last two examples show the output of our algorithm and the output presented by Li et al. [2016]. As when generating lines starting from human-written dialogue, we provided the first two lines in the dialogues published by Li et al. as input to our system.

Our suggested formal evaluation of these dialogues would thus be to present half of the dialogues in Table 1 to Evaluator 1 and the other half to Evaluator 2, who are to determine i) whether the dialogue is produced by a human or not, and ii) whether the dialogue is boring.
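The core retrieval step described in Section 4 can be sketched as follows. This is a minimal sketch, not the actual implementation: the token vectors below are random stand-ins for the pretrained Google News word2vec space, all function and class names are our own, and the stop-word handling, first- and last-token vectors, question-word indicators, down-weighting, outlier filtering and co-operative candidate selection described above are omitted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for a pretrained word2vec model: each token is mapped
# to a fixed random vector (the implementation uses the Google News space).
DIM = 50
_rng = np.random.default_rng(0)
_vectors = {}

def token_vector(token):
    if token not in _vectors:
        _vectors[token] = _rng.standard_normal(DIM)
    return _vectors[token]

def line_vector(line):
    # Average of the token vectors in the line.
    tokens = line.lower().split()
    return np.mean([token_vector(t) for t in tokens], axis=0)

def pair_vector(prev_line, last_line):
    # Concatenate the representations of the two context lines.
    return np.concatenate([line_vector(prev_line), line_vector(last_line)])

class TripletRetriever:
    """Retrieve the third line of the training triplet whose first two
    lines are closest (Euclidean distance) to the current context."""

    def __init__(self, triplets):
        self.triplets = triplets
        X = np.array([pair_vector(a, b) for a, b, _ in triplets])
        self.nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(X)

    def next_line(self, prev_line, last_line):
        query = pair_vector(prev_line, last_line).reshape(1, -1)
        _, idx = self.nn.kneighbors(query)
        return self.triplets[idx[0][0]][2]
```

Querying with the exact first two lines of a training triplet returns that triplet's third line, since the Euclidean distance to its own context vector is zero.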
When informally evaluating these dialogues, we would say that most of the human-written dialogues would pass as human-made, except the strange dialogue 2, while hardly any of the generated dialogues would be classified as produced by humans.

Table 1: The automatically generated dialogues compared to the human-written dialogues, and (for the last two examples) compared to the output of previously published generated dialogue examples. The same human evaluator would be shown either the computer-generated or the human-written version of a dialogue, and determine i) whether it has been produced by a human, and ii) whether it is boring.

Dialogue 1
Computer-generated:
A: I was first to respond.
B: What were you doing out here?
A: My debrief –
B: Change?
A: With a vengeance, I hated them.
B: She was the Queen of the Netherlands.
Human-written:
A: I was first to respond.
B: What were you doing out here?
A: Drive by patrol.
B: How is she?
A: She's tough.
B: Have to be. The shit she's gone through.

Dialogue 2
Computer-generated:
A: I NEED SOME
B: YOUNG BLOOD
A: On your sleeve.
B: Well, I'm sure she kept her word and left you satisfied.
A: No. No you're not.
B: Yes, Mr. Hillyer, I am.
A: Guess we better get going, huh?
B: Just a little something' somethin'.
Human-written:
A: I NEED SOME
B: YOUNG BLOOD
A: I NEED SOME
B: YOUNG BLOOD
A: I NEED IT NOW I NEED SOME
B: YOUNG BLOOD
A: I NEED SOME
B: YOUNG BLOOD

Dialogue 3
Computer-generated:
A: I'd better go in.
B: I only live up there and across the roof.
A: Did you find something?
B: Who?
A: If he's not out, why are they here?
B: I have no idea.
A: Why not?
Human-written:
A: I'd better go in.
B: I only live up there and across the roof.
A: Thanks anyway.
B: Listen, Edie, am I going to see you again?
A: What for?
B: I don't know.
A: I really don't know.

Dialogue 4
Computer-generated:
A: This is ridiculous.
B: Government cutbacks.
A: Office work?
B: You bet it did.
A: You drive away business roarin' in like that.
B: Bravo!
Human-written:
A: This is ridiculous.
B: Government cutbacks.
A: Why can't we do it through the mail?
B: We missed the deadline.
A: Can't we do it on the phone?
B: I don't think so.

Dialogue 5
Computer-generated:
A: Look at my ring.
B: That's a fine-looking ring.
A: Busy tomorrow?
B: Yes. Things are pretty hectic right now, but thanks.
A: I know!
B: I'm gonna have to ask you to leave.
Human-written:
A: Look at my ring.
B: That's a fine-looking ring.
A: A lady threw it to me.
B: Most surely that was a nice lady to give a ring to a little girl.
A: It's a pretty ring.
B: I wouldn't be surprised if it were a true wishing ring.

Dialogue 6
Computer-generated:
A: You all right?
B: Yes, thank you.
A: Hello, Helen.
B: I'm afraid I spoke out of turn last night.
A: From whom?
B: The main guy, Dieter–
A: I don't have time.
B: I can see why.
Human-written:
A: You all right?
B: Yes, thank you.
A: What was that all about?
B: He wants me to perform for him.
A: Perform?
B: You know.
A: You on the game?
B: God no. I'm a hairdresser.

Dialogue 7
Computer-generated here:
A: How old are you?
B: I'm 16. Why are you asking?
A: Why are you smiling?
B: To tell you that we're interested in the property.
A: Oh, Dmitri. I wish that I could.
B: He didn't say we need guns.
Computer-generated in previous studies:
A: How old are you?
B: I'm 16. Why are you asking?
A: I thought you were 12.
B: What made you think so?
A: I don't know what you are talking about.
B: You don't know what you are saying.

Dialogue 8
Computer-generated here:
A: Where are you going?
B: I'm going to the police station.
A: I'll help you.
B: I'll scream.
A: I mean, we get paid tonight, don't we?
B: What the fuck is that supposed –
A: What?
Computer-generated in previous studies:
A: Where are you going?
B: I'm going to the police station.
A: I'll come with you.
B: No, no, no, no, you're not going anywhere.
A: Why?
B: I need you to stay here.
A: I don't know what you are talking about.
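The evaluator assignment used in the suggested evaluation, where no evaluator sees both the generated and the human-written version of a dialogue that starts from the same two lines, could be sketched as follows. This is a hypothetical helper under our own naming, not part of the implementation described above.

```python
import random

def assign_to_evaluators(dialogue_pairs, seed=0):
    """Split (generated, human-written) dialogue pairs between two
    evaluators so that no evaluator sees both versions of a pair.
    Returns two shuffled evaluation sets of (text, is_human) items."""
    rng = random.Random(seed)
    eval1, eval2 = [], []
    for generated, human in dialogue_pairs:
        # Randomly decide which evaluator gets the generated version;
        # the other evaluator gets the human-written one.
        if rng.random() < 0.5:
            eval1.append((generated, False))
            eval2.append((human, True))
        else:
            eval1.append((human, True))
            eval2.append((generated, False))
    # Shuffle so the order does not reveal which texts are generated.
    rng.shuffle(eval1)
    rng.shuffle(eval2)
    return eval1, eval2
```

Each evaluator then labels every text as human-produced or machine-generated (and as boring or not), and the hidden is_human flags allow the accuracy of these judgements to be scored afterwards.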
None of the dialogues would, however, be classified as boring, except maybe the first of the two dialogues provided by Li et al. [2016], as it starts to generate very generic lines towards the end of the dialogue.

6 Conclusion and outlook

The generated dialogues presented here portray a collection of somewhat strange exchanges, and would not be useful in the context of simulating a real conversation. They might, however, function as absurd dialogues that, for instance, could be used as improvised scene starters. We believe, however, that the more structured form of evaluating a non-goal-driven dialogue system that we present and exemplify could be generally useful. The evaluation structure might be possible to apply in the setting of a shared task, in which the participants not only produce dialogues of this type, but also participate in the evaluation by classifying the dialogues produced by other participating groups.

The next step is to implement a more resource-intensive method, e.g., a method built on seq2seq or some other neural network-based technique. We also intend to extend our initial attempts at achieving dialogue generation with the help of a moderately sized dialogue corpus. We have, for instance, not yet attempted any post-processing of the selected lines to make them fit better into the dialogue, e.g., to make pronoun gender and number agree between the lines, or to match the use of helper verbs.

Although the ultimate goal would be to achieve an improv-bot that could act seamlessly with a human actor, it would also be interesting to explore the suspicion we introduced in the background, i.e., that an audience would quickly lose interest in a play if they were aware that it consisted solely of artificially generated dialogue. For instance, if two puppets were given two starting lines by the audience, and from these starting lines played a scene with automatically generated human-like dialogues, would the audience still find it interesting?

Acknowledgements

We would like to thank Jonas Sjöbergh, as well as the anonymous reviewers, for valuable input to the content of this paper.

References

[Banchs and Li, 2012] Rafael E. Banchs and Haizhou Li. Iris: a chat-oriented dialogue system based on the vector space model. In Proceedings of the Association for Computational Linguistics, System Demonstrations, 2012.

[Bird, 2002] Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[Danescu-Niculescu-Mizil and Lee, 2011] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011, 2011.

[improwiki, 2018a] improwiki. Crow. https://improwiki.com/en/wiki/improv/crow, 2018.

[improwiki, 2018b] improwiki. Drop a line. https://improwiki.com/en/wiki/improv/drop_a_line, 2018.

[Johnstone, 1987] Keith Johnstone. Impro: Improvisation and the Theatre. Routledge, New York, 1987.

[Johnstone, 1999] Keith Johnstone. Impro for Storytellers: Theatresports and the Art of Making Things Happen. Faber, London, new edition, 1999.

[Li et al., 2016] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Austin, Texas, November 2016. Association for Computational Linguistics.

[Luong et al., 2017] Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt, 2017.

[Mathewson and Mirowski, 2017] Kory W. Mathewson and Piotr Mirowski. Improvised theatre alongside artificial intelligences. In Proceedings of the AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence, 2017.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[Mikolov, 2013] Tomas Mikolov. word2vec on Google Code. https://code.google.com/archive/p/word2vec/, 2013.

[Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[Řehůřek and Sojka, 2010] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Paris, France, May 2010. European Language Resources Association (ELRA).

[Serban et al., 2016] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, 2016.

[Strindbergs intima teater, 2012] Strindbergs intima teater. http://strindbergsintimateater.se/festival-i-maj-2012/, 2012.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[The Improvised Shakespeare Company, 2018] The Improvised Shakespeare Company. http://improvisedshakespeare.com, 2018.

[Vinyals and Le, 2015] Oriol Vinyals and Quoc V. Le. A neural conversational model. CoRR, abs/1506.05869, 2015.

[Wikipedia contributors, 2018] Wikipedia contributors. Improvisational theatre. Wikipedia, the free encyclopedia, 2018. [Online; accessed 27-June-2018].