=Paper=
{{Paper
|id=Vol-2848/HAI-GEN-Paper-3
|storemode=property
|title=How Novelists Use Generative Language Models: An Exploratory User Study
|pdfUrl=https://ceur-ws.org/Vol-2848/HAI-GEN-Paper-3.pdf
|volume=Vol-2848
|authors=Alex Calderwood,Vivian Qiu,Katy Ilonka Gero,Lydia B. Chilton
|dblpUrl=https://dblp.org/rec/conf/iui/CalderwoodQGC20
}}
==How Novelists Use Generative Language Models: An Exploratory User Study==
Alex Calderwood, Vivian Qiu, Katy Ilonka Gero, Lydia B. Chilton
Columbia University
{adc2181,vivian.qiu}@columbia.edu,{katy,chilton}@cs.columbia.edu
(a) Screen capture of the ‘Talk to’ interface being used by a study participant. The ‘Talk to’ interface requires the writer to press a button to generate a suggestion and displays the result as a ‘completion’ which cannot be edited.
(b) Screen capture of the ‘Write with’ interface being used by a study participant. The ‘Write with’ interface allows writers to trigger a suggestion using the ‘tab’ key. Suggestions are presented as a set of three options; if selected, a suggestion is inserted into the document as editable text.
Figure 1: Comparison of the two interfaces used in the user study. While the ‘Talk to’ interface (a) gave longer suggestions,
writers preferred ‘Write with’ (b) which allowed them to easily insert suggestions into the text document.
ABSTRACT
Generative language models are garnering interest as creative tools. We present a user study to explore how fiction writers use generative language models during their writing process. We had four professional novelists complete various writing tasks while having access to a generative language model that either finishes their sentence or generates the next paragraph of text. We report the primary ways that novelists interact with these models, including: to generate ideas for describing scenes and characters, to create antagonistic suggestions that force them to hone their descriptive language, and as a constraint tool for challenging their writing practice. We identify six criteria for evaluating creative writing assistants, and propose design guidelines for future co-writing tools.
KEYWORDS
Co-creativity; natural language processing; user interface; writing tools; user-study.
ACM Reference Format:
Alex Calderwood, Vivian Qiu, Katy Ilonka Gero, Lydia B. Chilton. How Novelists Use Generative Language Models: An Exploratory User Study. In IUI ’20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 5 pages.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
Spell checkers, auto-correct, and predictive keyboards have changed how, and what, we write [1, 9]. Recently, a new wave of language models (statistical models that are able to “predict” the next word in a sentence) has been garnering interest as creative generative tools. Websites that demo the abilities of language models such as GPT-2 [11] have gained popularity across the computer science landscape, but it remains unclear how professional writers view such systems. In 2019, two novelists described using similar language models to help them generate fresh ideas or surprisingly resonant descriptions. Their self-reported experiences suggest that these language models could act as creative partners for professional writers, but it remains unclear how well these anecdotes generalize. In the past, sentence completion-style tools for story writing have lacked the semantic coherence necessary to make them useful [4].
In this work, we run a formal, albeit exploratory, user study of four novelists writing in collaboration with a state-of-the-art
IUI ’20 Workshops, Cagliari, Italy,
Alex Calderwood, Vivian Qiu, Katy Ilonka Gero, Lydia B. Chilton
language model. Our goal is to understand what professional writers look for in suggestions, and in what ways these new language models do or do not meet this challenge. Figure 1 shows screen captures from our study in which the novelists are using two different writing interfaces.
We report the primary ways that novelists interact with generative language models, including: to generate ideas for describing scenes and characters, to create antagonistic suggestions that force them to hone their descriptive language, and as a constraint tool for challenging their writing practice. We also unpack elements of their criteria for evaluating creative writing assistants, and propose design guidelines for future co-writing tools.
2 BACKGROUND
In 2016, New York Times Fiction Best Seller Robin Sloan wrote about training a language model on a corpus of science fiction short stories [14]. He embedded this model in a text editor such that he could have it complete a sentence when he pressed ‘tab’. His vision for the tool as helper was “less Clippy, more séance”. He imagined that the model would push him to write in an unexpected direction and with fresh language. In 2018, the New York Times profiled Sloan, who has continued working on this project and is using the tool to write his third novel [15].
More recently, critically acclaimed novelist Sigal Samuel wrote about using a language model called GPT-2 [11] to help her write her next novel [13]. She thought that the near-human outputs of language models were ideal for fiction writers because they produced text that was close to, but not quite exactly, human writing. This near-human writing “can startle us into seeing things anew”. She discusses using GPT-2 to finish paragraphs from her previous novels; in one case she writes, “Reading this, I felt strangely moved. The AI had perfectly captured the emotionally and existentially strained tenor of the family’s home.”
Samuel makes it clear that she didn’t intend to copy-paste sentences written by a language model, and that the model itself contained all kinds of ephemera that didn’t advance the plot or belong in the story. Its use was primarily local, and tended to capture a certain tone or mood and extend that small conceit further.
These two writers demonstrate the potential for language models to act as aids for creative writers, and their anecdotal reports inspire the work we present here.
3 RELATED WORK
Common writing interfaces are beginning to include predictive text suggestions, notably next-word predictions in text messaging on smartphones and sentence completion in email composition [3]. Independent work has found that these suggestions skew positive in sentiment and influence the writer’s composition [1, 9], but this work is in its early stages; recently there has been a call to explicitly study ‘AI-mediated communication’ [8].
Others have noted the importance of shifting suggestions away from the most likely phrases, as participants tend to find these suggestions boring or trite [2]. Yet more unexpected suggestions are often incoherent. Roemmele and Gordon study the effect of model ‘temperature’ on suggestions in a story writing context, finding that higher temperature suggestions are more original but less coherent [12]. Manjavacas et al. fine-tune a language model on a specific author to improve stylistic coherence [10]. Gero and Chilton narrow the use-case to metaphor generation and find the constrained context dramatically improves coherence [7].
In the general fiction writing case, more often than not systems still fail to be both semantically coherent and artistically expressive. Recent breakthroughs in natural language processing such as the introduction of the ‘transformer’ neural network architecture [16] and BERT embeddings [6] have led to language models that are remarkable at understanding the semantics of written language and generating new text. Transformer models like GPT-2 [11] rely on massive datasets and can seemingly imitate the style of a reference text, with legible grammar and even some understanding of conceptual relations between characters and objects.
We draw on theoretical work on co-creative artistic tools that suggests “creativity emerges through the interaction of both the human and the computer” [5]. Improved language models such as GPT-2 may allow a more meaningful interaction to occur between creative writers and computers. This is what we study here.
4 EXPERIMENT DESIGN
We recruited four published novelists for our study, and observed them complete various tasks that had them interact with generative writing tools in individual hour-long sessions. Three of the writers had no previous exposure to these tools; one writer had been previously exposed but only briefly, and not for his professional writing. We first introduce the writing tools studied, and then describe the study procedure.
4.1 Interfaces
The adoption of co-creative writing technologies hinges on their ability to provide appropriate suggestions while being simple to understand and interact with. Small details in the generative system’s interface design will have ripple effects for its perceived utility among writers.
The two interfaces chosen for the study were Talk To Transformer¹ and Write With Transformer², later referred to in this paper as ‘Talk to’ and ‘Write with’ respectively. Both user interfaces rely on GPT-2 to predict the most likely sequence of words following some input text. Both take into account at most the last 256 sub-word tokens available, though in many cases there is not that much preceding text. GPT-2 was trained on the WebText corpus, which contains 40GB of text from over 8 million articles linked to by Reddit from before 2017 that received at least 3 votes [11].
¹ https://talktotransformer.com/
² https://transformer.huggingface.co/doc/gpt2-large
‘Talk to’ (Figure 1a) uses a text completion paradigm where the user writes into a small, centered text box and presses a button to have the system generate a completion. The completed text is around the same length as the input, though there is a max overall (input + output) length of 256 sub-word tokens. The completed text is also not editable, giving a sense of finality to the generated text, though pressing the button again restarts the text generation, replacing the previous output.
‘Write with’ (Figure 1b) has the user write into a page-like document, and requires that the user presses the tab key to trigger text
generation. Doing so will show a drop down menu with three short
suggestions, usually between 1 and 10 words. The length of the
suggestions is a function of the time allotted for the generation,
which in turn is a function of the amount of input text. This means
that toward the end of a longer document, suggestions often get
shorter. The user can select one of the suggestions with a mouse or
with arrow keys (or ignore the suggestions completely and continue
writing). The text that is generated appears directly in line with
their previous writing, highlighted blue, and is itself editable.
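The trigger-and-select interaction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind ‘Write with’: whitespace-separated words stand in for sub-word tokens, and `build_context`, `request_suggestions`, `accept`, and `stub_generate` are hypothetical names. The 256-token limit and the three suggestions per trigger are the values reported in this section.

```python
from typing import Callable, List

MAX_CONTEXT_TOKENS = 256  # context limit reported for both interfaces
NUM_SUGGESTIONS = 3       # 'Write with' offers three options per trigger


def build_context(document: str, max_tokens: int = MAX_CONTEXT_TOKENS) -> str:
    """Keep only the most recent tokens of the document as model input."""
    # Assumption: whitespace-separated words stand in for sub-word tokens.
    tokens = document.split()
    return " ".join(tokens[-max_tokens:])


def request_suggestions(document: str,
                        generate: Callable[[str], str],
                        n: int = NUM_SUGGESTIONS) -> List[str]:
    """Called when the writer presses tab; returns n candidate continuations."""
    context = build_context(document)
    return [generate(context) for _ in range(n)]


def accept(document: str, suggestion: str) -> str:
    """Insert the chosen suggestion inline; the result is ordinary editable text."""
    if not document or document.endswith(" "):
        return document + suggestion
    return document + " " + suggestion


def stub_generate(context: str) -> str:
    """Hypothetical stand-in for a call to a language model."""
    return "on a hill"


draft = "The castle stood"
options = request_suggestions(draft, stub_generate)
draft = accept(draft, options[0])
```

Ignoring the suggestions corresponds to never calling `accept`, which leaves the draft unchanged.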
Both ‘Write with’ and ‘Talk to’ differ from existing predictive
text interfaces, like next word suggestions on a mobile keyboard,
by the length of their suggested text and their interaction mode.
Most predictive text keyboards always surface suggestions, rather than requiring a user trigger, and their suggestions are generally only one word long.
‘Write with’ is somewhat similar to Gmail’s ‘Smart Compose’ feature [3], which shows suggested sentence endings when a user is composing an email. Unlike ‘Write with’, ‘Smart Compose’ doesn’t wait for a user trigger, but instead shows suggestions when the algorithm has high confidence in the suggested text; the ‘tab’ button allows the user to accept the suggestion.
Figure 2: A histogram that shows the number of words written in each sentence where a writer triggered the ‘Write with’ model, requesting it to insert text. The high ‘0’ bucket indicates that the writers frequently triggered it at the very beginning of sentences.
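The quantity plotted in Figure 2, the number of words already written in a sentence at the moment of a trigger, could be computed along the following lines. This is a minimal sketch, not the analysis code used in the study: it assumes the session log provides the document text and the character offset of each trigger, and it treats ‘.’, ‘!’ and ‘?’ as sentence boundaries. The function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable


def words_before_trigger(text: str, trigger_offset: int) -> int:
    """Count the words typed in the current sentence before a trigger."""
    before = text[:trigger_offset]
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentence_start = max(before.rfind(ch) for ch in ".!?") + 1
    return len(before[sentence_start:].split())


def trigger_histogram(text: str, trigger_offsets: Iterable[int]) -> Counter:
    """Map 'words into the sentence' to the number of triggers at that point."""
    return Counter(words_before_trigger(text, off) for off in trigger_offsets)


sample = "The storm came. The sky was red"
hist = trigger_histogram(sample, [15, 16, 31])
```

In this toy log, two triggers fall in the ‘0’ bucket (at the very start of the second sentence) and one falls four words into it.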
4.2 Study Procedure
Each writer was asked to complete a pre-defined set of tasks. During
the course of each task, each writer was periodically asked to com-
ment on the output of the tool they were using and its impact on
their writing process. After each task, the writer discussed with the
examiner their thoughts about their, and the tool’s, performance in
the task. Additionally, they were allowed to articulate any response
they had to the tools in a discussion with the examiner after the
completion of all tasks.
The procedure went as follows:
(1) Following a very brief description of the user interfaces, they were given an initial period of open-ended experimentation with the tools. (2-10 minutes)
(2) They were asked to write ‘the most interesting’ or ‘the best’ original piece of fiction that they were able to with the assistance of the tools. They were allowed to switch between the tools at will, but were asked to use both. (10-20 minutes)
(3) They were asked to work on an in-progress piece of writing with the assistance of the tools. They were told to try and solve an ‘issue’ they’d been having with a scene or description. (10-30 minutes)
(4) They were asked to again write ‘the best’ thing they could with ‘Write with’, with the constraint that they had to use a suggestion at least once every other sentence. (10-20 minutes)
We recorded and transcribed each session. Additionally, we recorded all text written, including text written by the machine, and for each generated suggestion annotated if it was ‘accepted’ by the writer.
5 RESULTS
To preserve anonymity, we refer to the four writers in our study as W1-W4. All four writers chose to use ‘Write with’ when asked to write ‘the best’ original piece that they could in the allotted time. To explain the preference, they generally cited the lack of control and the higher degree of randomness associated with the longer text generated from ‘Talk to’.
We first looked at when in a sentence writers were likely to trigger the system. Figure 2 shows that writers triggered ‘Write with’ at the beginning of a sentence 24% of the time, with a majority of triggers taking place less than 10 words into a sentence. As seen in Figure 3, longer suggestions were more likely to be accepted by the writers, though short suggestions were generated more frequently. Table 1 shows examples of generated suggestions; E2 and E4 are indicative of shorter suggestions.
Figure 3: ‘Write with’ most frequently generated suggestions that were two or fewer words long. Longer examples, however, were more likely to be accepted by the writers; shorter examples were often low in content.
We also noticed that writers often triggered ‘Write with’ multiple times at a single point in the text if the resulting suggestions were not what they wanted. We found that 25% of all triggers were a repeated trigger, suggesting that once a writer triggered the system, they were invested in finding a useful suggestion.
5.1 Incoherence and Plot Deviation
Unanimously, the writers pointed out that the tools appeared to deviate from the direction they were taking their writing, particularly referring to the ‘Talk to’ interface. All writers were quick to point
{| class="wikitable"
! !! Preceding Text !! Gen 1 !! Gen 2 !! Gen 3
|-
| E1 || Harold sat on the hotel room bed and in front of him || was the bedsheet, "which had stood || the woman who would one day become his || was a picture of his late son.
|-
| E2 || The storms colored the sky a shade || of red || of orange || >
|-
| E3 || The Castle Devocion was six leagues through the forest from the coast, where the fortress lay in disrepair. || A few days before the storm || There were no roads, no || The castle was a large castle
|-
| E4 || He [the man in the photograph] was holding a || pen. || baby in his || small silver
|}
Table 1: Examples of generated text from the user study. (‘>’ represents the model returning an empty suggestion.)
out instances where the system changed point of view (it seemed to prefer 1st person even when they were in 2nd or 3rd).
As related to novelist Sigal Samuel’s perspective of using tools to “make the familiar strange” (see Background), all of them were at one point or another struck by just how strange the machine’s responses were, but often to the point that it wasn’t useful to them. W3 said, “it’s like improv. You have to ‘yes, and.’” In other words, if the generated text does not incorporate the prior facts of the piece, it is not constructive.
W1 and W2 noted that the tools were much better at following them into ‘genre’ writing than into the more nuanced and stylized writing they were interested in. This is clear in Table 1, E3, where the writer set up a fantasy scene and the suggestions were more coherent than normal. Yet, at multiple points in Tasks 1, 2, and 4, all four writers allowed themselves to be steered by the tools as they introduced new characters or new plot devices that seemed unlike those preceding them. Repeatedly, they found these developments “interesting” or laughed at the suggestions, and were willing to adapt their writing to incorporate the change. They were more likely to take the suggestions during Tasks 2 and 4, when they weren’t writing something they had preconceived.
5.2 Observed Use Cases
5.2.1 Model As Antagonist. Because of its tendency to randomness, all participants initially expressed disappointment or resignation at times where the system’s output was not along the lines they anticipated. However, W1, W3, and W4 expressed the idea that this antagonism was in some ways constructive. W4 was very positive about this trait of the system, comparing triggering the system’s auto-complete to flipping a coin, where the coin flip makes you realize how you hope it will land, regardless of where it actually does. To that end, W4 was the most likely to reject the suggestion of ‘Write with’, but generally the most positive about its ability to help him determine what he wanted to write.
5.2.2 Description Creation. All four participants experimented with using ‘Write with’ to generate mid-sentence descriptions for items, scenes, or characters. All four writers learned through the course of the session that they could get ‘Write with’ to focus on filling in descriptions such as colors or character details by requesting suggestions after prepositions, and actions by requesting suggestions after a noun phrase. They rejected adjective descriptions like colors more often than any other type of suggestion, often dismissing them as “boring” and limited, though W4 and W1 noted that more than three suggestions could be useful at those moments.
The writers often didn’t see the usefulness of the tool as a meaningful generator for plot or for characters. W4 noted that he was not a “spiritualist” writer, meaning that rather than let the flow of ideas come to him during the writing process, he usually sat down with a set of “points to hit”. The majority of writers mentioned they could see something like this being useful for generating plot outlines for writing exercises.
5.2.3 As Constraint. Especially during Task 4, during which the participants were required to use the suggestions from ‘Write with’ at least every second sentence, the writers most often found the tool “fun” and “challenging”. During the post-trial discussion, all four participants returned to the unique challenge of integrating its responses into their writing.
They developed a number of strategies to get it to work well, including allowing it to begin sentences for them, most often reasoning that if it were to go in a new direction, doing so at the beginning of sentences allows them a chance to “steer back”, or follow it into a new place. W1 and W2 also frequently got it into situations where rather than generating content noun phrases, it only generated single words like “The” or “She”. Potential causes for this include the short suggestion length for long preceding text (see Section 4.1) and the writers’ non-standard literary style, resulting in low source probability under the language model.
5.2.4 The Unexpected. At one point, W1 set up ‘Write with’ to describe the color of the sky, and it suggested “dark blue”, “yellow”, and “a shade of dark”; he accepted the last suggestion. This is an example of the system steering from a direction that the writer clearly wanted to pursue (hue description) into a related, but separate concept, describing a shade instead, for stylistic effect.
Both systems frequently introduce characters or dialogue, which for Tasks 1, 2, and 4 produced comments like “I wasn’t going to go there, but that’s interesting”, especially when they brought into play family members (sister, wife, father), such as in Table 1, E1, where suggestions introduce variously a woman (perhaps wife) and a son.
6 DISCUSSION
6.1 Evaluation Criteria for Co-Writing Systems
These trials indicate that novelists hoping to use co-creative generative systems in their writing have a complicated evaluation criterion that includes the system’s ability to extrapolate reasonably well about character traits, settings, and events. They expect
the systems to match their style, verb tense, and perspective, in addition to providing a high degree of creative insight; picking a color from a spectrum they’d already considered is hardly ‘co-creative’.
Measures like predictive accuracy won’t do as evaluation criteria because writers engaged with co-creative systems are looking for creative insight, something not measured by perplexity or by a language model’s ability to solve the canonical downstream NLP tasks. We propose a series of evaluation questions, which could be answered computationally, to guide system design:
(1) Does a suggestion match the tense of the preceding text?
(2) Does a suggestion introduce new characters or objects, or does it reference preceding ones?
(3) Are new characters or objects coherent given the context?
(4) Does a suggestion include description?
(5) Does a suggestion include action?
(6) Given a single request, how diverse are the suggestions?
These questions highlight the kinds of considerations professional writers have when evaluating suggestions. Notably, they are not questions that have correct answers; rather they reflect important considerations we found through our user study.
6.2 Design Guidelines for Co-Writing Tools
Future systems should be aware that writers are interested in these tools not just for immediate injection of inline text, which most feel they are capable of producing on their own, but for a broad range of descriptive, antagonistic, or constraining effects on their writing.
By triggering the generative model, the user switches from writer to editor. Future design of these systems should continue to stress the nature of the generated text as dynamic and alterable, focusing on the suggestive element of these tools and allowing the writer to enter an editorial feedback loop. There should be very little overhead for querying the model.
The systems should provide many suggestions that may be swapped out and replaced frequently. Because of the high error rate of these tools, a small number of suggestions may not be useful. Similarly, extremely short suggestions are not useful.
At times, writers are looking for a specific category of suggestion, and any suggestion that does not fit inside those constraints is disruptive. That disruption may itself be the goal of triggering the system, as it forces them to explore a new range of possibilities or back up and consider the reasons the model ‘thought’ to suggest what it did. But to increase the odds that writers will use machine-generated text, future systems need to be more aware of what type of suggestion the writer is looking for, rather than providing general suggestions that lack any specific purpose.
Rather than a triggering event that tells the system “generate!” with no other context, we imagine an interface that is passively or actively aware of the type of suggestion that is being requested, its length, and how much it should adhere to the current scene or freely decide the trajectory of the writing to come. This awareness might be thought of as a list of parameters passed to the trigger, but it should be done without intruding on the ease of the request. In this way, the notion of co-creativity can be expanded further, and push the generation process further into the space of dynamic conversation between human and machine.
7 CONCLUSION
Through this study, we identified a number of considerations for designing co-writing systems, concerning both the interaction dynamics and the nature of the computer suggestions. Writers found value in being able to edit the systems’ output and quickly replace the generated output with something they preferred. They enjoyed using the model as a constraining device for challenging their writing, or as an antagonist that helped them refocus and refine their intent. We advise that future systems should provide many suggestions, do so with a better understanding of the writer’s intent, be editable, and regenerate with little to no mental overhead.
ACKNOWLEDGMENTS
Katy Ilonka Gero is supported by an NSF GRF (DGE - 1644869). Alex Calderwood is supported by The Brown Institute for Media Innovation (https://brown.columbia.edu/).
REFERENCES
[1] K Arnold, Krysta Chauncey, and Krzysztof Z Gajos. 2018. Sentiment bias in predictive text recommendations results in biased writing. In Proceedings of Graphics Interface. 33–40.
[2] Kenneth C Arnold, Krzysztof Z Gajos, and Adam T Kalai. 2016. On suggesting phrases vs. predicting words for mobile text composition. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. 603–608.
[3] Mia Xu Chen, Benjamin N Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M Dai, Zhifeng Chen, et al. 2019. Gmail Smart Compose: Real-Time Assisted Writing. arXiv preprint arXiv:1906.00080 (2019).
[4] Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A. Smith. 2018. Creative Writing with a Machine in the Loop: Case Studies on Slogans and Stories. In 23rd International Conference on Intelligent User Interfaces (IUI ’18). ACM, New York, NY, USA, 329–340. https://doi.org/10.1145/3172944.3172983
[5] Nicholas Mark Davis. 2013. Human-computer co-creativity: Blending human and computational creativity. In Ninth Artificial Intelligence and Interactive Digital Entertainment Conference.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
[7] Katy Ilonka Gero and Lydia B Chilton. 2019. Metaphoria: An Algorithmic Companion for Metaphor Creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 296.
[8] Jeffrey T Hancock, Mor Naaman, and Karen Levy. 2020. AI-Mediated Communication: Definition, Research Agenda, and Ethical Considerations. Journal of Computer-Mediated Communication (2020).
[9] Jess Hohenstein and Malte Jung. 2018. AI-Supported Messaging: An Investigation of Human-Human Text Conversation with AI Support. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems. 1–6.
[10] Enrique Manjavacas, Folgert Karsdorp, Ben Burtenshaw, and Mike Kestemont. 2017. Synthetic literature: Writing science fiction in a co-creative process. In Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG 2017). 29–37.
[11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
[12] Melissa Roemmele and Andrew S Gordon. 2018. Automated assistance for creative writing with an rnn language model. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion. 1–2.
[13] Sigal Samuel. 2019. How I’m using AI to write my next novel. https://www.vox.com/future-perfect/2019/8/30/20840194/ai-art-fiction-writing-language-gpt-2
[14] Robin Sloan. 2016. Writing with the machine. https://www.robinsloan.com/notes/writing-with-the-machine/
[15] David Streitfeld. 2018. Computer Stories: A.I. Is Beginning to Assist Novelists. https://www.nytimes.com/2018/10/18/technology/ai-is-beginning-to-assist-novelists.html
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
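As an illustration of the “list of parameters passed to the trigger” imagined in Section 6.2, the following Python sketch shows one way such a request might be represented. It is speculative and not part of the study: the `SuggestionRequest` fields, their default values, the category names, and the `generate` callable are all hypothetical, chosen only to mirror the suggestion type, length, and scene adherence discussed in the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SuggestionRequest:
    """Hypothetical parameters a writer (or the interface) attaches to a trigger."""
    kind: str = "any"        # e.g. "description", "action", or "any" (assumed labels)
    max_words: int = 10      # desired upper bound on suggestion length
    adherence: float = 0.5   # 0.0 = free to change trajectory, 1.0 = stay in the scene
    n: int = 3               # number of suggestions to return


def trigger(document: str,
            request: SuggestionRequest,
            generate: Callable[[str, SuggestionRequest], str]) -> List[str]:
    """Ask the (hypothetical) generator for suggestions that respect the request."""
    if not 0.0 <= request.adherence <= 1.0:
        raise ValueError("adherence must be between 0.0 and 1.0")
    suggestions = []
    for _ in range(request.n):
        words = generate(document, request).split()[: request.max_words]
        if words:  # skip empty suggestions
            suggestions.append(" ".join(words))
    return suggestions


def stub_generate(document: str, request: SuggestionRequest) -> str:
    """Stand-in for a language model call; returns a fixed phrase per kind."""
    return "a deep bruised violet" if request.kind == "description" else "She turned away"


results = trigger("The sky was", SuggestionRequest(kind="description", max_words=3), stub_generate)
```

Giving every field a default keeps the plain “generate!” trigger available as `SuggestionRequest()`, so the extra context need not intrude on the ease of the request.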