=Paper= {{Paper |id=Vol-2903/IUI21WS-HAIGEN-10 |storemode=property |title=How Data Scientists Improve Generated Code Documentation in Jupyter Notebooks |pdfUrl=https://ceur-ws.org/Vol-2903/IUI21WS-HAIGEN-10.pdf |volume=Vol-2903 |authors=Michael Muller,April Yi Wang,Steven I. Ross,Justin D. Weisz,Mayank Agarwal,Kartik Talamadupula,Stephanie Houde,Fernando Martinez,John Richards,Jaimie Drozdal,Xuye Liu,David Piorkowski,Dakuo Wang |dblpUrl=https://dblp.org/rec/conf/iui/MullerWRWATHMRD21 }} ==How Data Scientists Improve Generated Code Documentation in Jupyter Notebooks== https://ceur-ws.org/Vol-2903/IUI21WS-HAIGEN-10.pdf
How Data Scientists Improve Generated Code
Documentation in Jupyter Notebooks
Michael Muller (a), April Yi Wang (b), Steven I. Ross (c), Justin D. Weisz (d), Mayank Agarwal (e), Kartik Talamadupula (f), Stephanie Houde (g), Fernando Martinez (h), John Richards (i), Jaimie Drozdal (j), Xuye Liu (k), David Piorkowski (l) and Dakuo Wang (m)

(a) IBM Research AI, Cambridge, MA, USA
(b) University of Michigan, Ann Arbor, MI, USA
(c) IBM Research AI, Cambridge, MA 02142 USA
(d) IBM Research AI, Yorktown Heights, NY, USA
(e) IBM Research AI, Yorktown Heights, NY, USA
(f) IBM Research AI, Yorktown Heights, NY, USA
(g) IBM Research AI, Cambridge, MA, USA
(h) IBM Argentina, La Plata, Argentina
(i) IBM Research AI, Yorktown Heights, NY, USA
(j) Rensselaer Polytechnic Institute, Troy, NY, USA
(k) Rensselaer Polytechnic Institute, Troy, NY, USA
(l) IBM Research AI, Yorktown Heights, NY, USA
(m) IBM Research AI, Cambridge, MA, USA

The first two authors contributed equally to this paper.


Abstract
Generative AI models are capable of creating high-fidelity outputs, sometimes indistinguishable from what could be produced by human effort. However, some domains possess an objective bar of quality, and the probabilistic nature of generative models suggests that there may be imperfections or flaws in their output. In software engineering, for example, code produced by a generative model may not compile, or it may contain bugs or logical errors. Various models of human-AI interaction, such as mixed-initiative user interfaces, suggest that human effort ought to be applied to a generative model’s outputs in order to improve its quality. We report results from a controlled experiment in which data scientists used multiple models – including a GNN-based generative model – to generate and subsequently edit documentation for data science code within Jupyter notebooks. In analyzing their edit-patterns, we discovered various ways that humans made improvements to the generated documentation, and speculate that such edit data could be used to train generative models to not only identify which parts of their output might require human attention, but also how those parts could be improved.

Keywords
Code-documentation, Generative AI, Human-AI collaboration, Jupyter notebooks



Joint Proceedings of the ACM IUI 2021 Workshops, April 13-17, 2021, College Station, USA
michael1_muller@us.ibm.com (M. Muller); aprilww@umich.edu (A. Y. Wang); slross@us.ibm.com (S. I. Ross); jweisz@us.ibm.com (J. D. Weisz); Mayank.Agarwal@ibm.com (M. Agarwal); krtalamad@us.ibm.com (K. Talamadupula); Stephanie.Houde@ibm.com (S. Houde); martferc@ar.ibm.com (F. Martinez); ajtr@us.ibm.com (J. Richards); drozdj3@rpi.edu (J. Drozdal); liux27@rpi.edu (X. Liu); david.piorkowski@ibm.com (D. Piorkowski); Dakuo.Wang@ibm.com (D. Wang)
ORCID: 0000-0001-7860-163X (M. Muller)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

For several decades, scholars have explored how humans and computers might collaborate [1, 2, 3, 4, 5]. Early work largely focused on a zero-sum “trade-off” model in which a finite conceptual pool of “initiative” was to be split between human and computer. Typical approaches asked, in effect, “who goes first?” and many models went no further than a single cycle of human-initiates-and-AI-responds or AI-initiates-and-human-responds (e.g., [6]).

More recent work has deconstructed the older concept of unitary “initiative” into a flexible and collaborative framework in which increased initiative by one party (e.g., human) does not imply a decrease of initiative by the other (e.g., AI) [7]. In addition, the “mixed initiative creative interface” (MICI) framework analyzed by Deterding et al. [1] and Spoto and Oyelnik [5], and further developed by Muller et al. [3], specifically details how human and AI partners interact in creative tasks as a series of back-and-forth exchanges.

In this paper, we examine how humans interact with a generative AI model in the context of writing data science documentation. We specifically aim to extend
the human-initiates-and-AI-responds interaction pattern to include a step in which a human may make subsequent edits to the outputs of the model. Prior work by our team has explored how data scientists use various kinds of models – including a GNN-based generative model – to aid the task of adding documentation to data science code in Jupyter notebooks [8]. In this paper, we conduct a deeper analysis of the edits made by participants in that study, to understand the nature of their edit-patterns and how they “compensate” or “augment” the output of the generative models. Through a thematic analysis, we developed a classification of participants’ edit-patterns.

We found that 85% of participants’ edit-patterns fully accepted the algorithmically-generated text, or built upon the generated text; and only 15% of the instances involved a complete rewrite. At first glance, these results suggest that the generated text was well-accepted. However, participants modified the generated text in 41% of the cases. Thus, (1) a human requests the generation of text; (2) the AI provides that text; and then (3) the human makes a decision to use - or not to use - that text, and (4) may go on to edit the generated text. In the future, these three-to-four dialogic “moves” could become the basis for an extended conversation between human and AI.

In this paper, we develop a taxonomy of edit-patterns, discovering that some edits added missing details while other edits explained the function of the code. A third category of edits was primarily concerned with modifying the formatting or style of the documentation.

2. Background

We discuss recent work in the area of AI and machine learning applied to data science and software engineering, as well as the application of generative models to this domain. We also discuss recent studies on human interactions with generative models in software engineering.

2.1. AI and Machine Learning in Software Engineering

In recent years, techniques from AI and machine learning have been applied to various tasks in software engineering, including code completion [9, 10, 11, 12], code translation [13], code classification [14, 15], API recommendation [16, 17], variable and method naming [18, 19], type inference [20, 21], bug detection and repair [22, 23, 24, 25, 26, 27], comment description and generation [28, 29, 30, 31, 32, 33], code change summarization [34], and code clone detection [35]. Allamanis et al. [36] provide a comprehensive review of the use of AI and machine learning within data science and software engineering. The emergence of generative AI techniques for natural language, such as GPT-2 [37] and GPT-3 [38], has also been reflected in code-centric use cases: Brockschmidt et al. [39] proposed the use of generative models for source code, and Tufano et al. [40] used generative models to fix bugs. In this paper, we conduct a deeper analysis of participant interactions with the Themisto documentation-generation system of Wang et al. [8], which incorporates a GNN-based generative model for generating comments from code.

2.2. Human-AI Collaboration with Generative Models in Data Science and Software Engineering

In a recent study, Weisz et al. [41] examined the use of an unsupervised neural machine translation (NMT) model in addressing a task in application modernization, specifically translating code from a legacy language to a modern one. They found that software engineers would be tolerant of errors or mistakes in the output of an NMT model, as the code produced by that model would be subject to the same review and testing procedures as code produced by everyone else on their team. In addition, a code highlighting feature that indicated where in the code human attention might be needed, based on an aggregation of the model’s token-level confidences, was very desirable. Although a subsequent analysis by Agarwal et al. [42] demonstrated how current metrics of model confidence may not necessarily correlate with external truth of code quality (approximated by lint errors), the ability for a generative model to be able to “ask for help” via code highlights was nonetheless highly valued by software engineers. In this work, we push even further by seeking to understand whether a generative model can also specify what kind of help it needs from its human user.

3. Method

In order to understand the co-creation process of data science documentation, we examined Themisto [8], a prototype code documentation generation system that supports data scientists in writing documentation for computational notebooks (footnote 1). Wang et al. [8] conducted an evaluation of Themisto with 24 data science professionals. In this section, we briefly discuss how Themisto generates documentation from code, as well as the user study setup and data collection methodology.

Footnote 1: While Jupyter notebooks have been used in education, we note that these notebooks are increasingly the basis of commercial products [43, 44], as in offerings by IBM [https://developer.ibm.com/components/jupyter/] and Microsoft [https://notebooks.azure.com/].

Figure 1: The Themisto user interface [8] is implemented as a Jupyter Notebook plugin. (A) When the recommended documentation is ready, a lightbulb icon shows up to the left of the currently-focused code cell. (B-D) show the three options in the dropdown menu generated by Themisto: (B) a documentation candidate generated for the code with a deep-learning model, (C) a documentation candidate retrieved from the online API documentation for the source code, and (D) a prompt message that nudges users to write documentation on a given topic.



3.1. Themisto: A System for Automatic Documentation Generation

We implemented the automatic documentation generation system as an extension to JupyterLab (Figure 1). The extension generates three types of documentation for a given code snippet. The first type of documentation is generated using a Graph Neural Network based approach [45], which is commonly used in code summarization tasks. The second type of documentation is generated by retrieving relevant external API documentation for a code function (e.g., functions defined in Pandas (footnote 2), NumPy (footnote 3), and Scikit-learn (footnote 4)). Lastly, the extension also provides a prompt-based approach where users are given a short prompt to manually create the documentation. For example, if a code cell contains a graphic output, the extension would generate a prompt asking users to interpret the output.

Footnote 2: https://pandas.pydata.org/docs/reference/index.html
Footnote 3: https://numpy.org/doc/stable/reference/
Footnote 4: https://scikit-learn.org/stable/modules/classes.html

3.2. User Study Setup

We are interested in how users make revisions to the suggested explanations. Thus, we recruited 26 data scientists to add documentation to a given draft notebook using the prototype. We prepared two draft notebooks with the same length (9 code cells) and similar levels of difficulty, but for two different problems. Each participant was randomly given one of the two notebooks and asked to document the notebook within 12 minutes; two participants failed to complete the task within that time-period, and were excluded from further analysis. Before the study, we provided a quick demo of the functionality of the extension.

3.3. Data Collection

We collected the completed final notebooks (N=24) after participants finished the task. All study sessions were conducted remotely using a teleconferencing tool and we recorded participants’ screens with their permission. After the session, we conducted a retrospective interview to ask about their experience and feedback.

We wanted to understand how participants used the algorithmically-generated documentation. For orientation, we note that 13 participants made documentation choices and optional modifications in each of nine markdown cells in each of two notebooks ("Covid" and "house") - i.e., 13 participants for each notebook. In the data from each notebook, we discovered two participants (per notebook) who did not complete the task, or who created extra cells. We could not be certain how to "map" these extra cells to the structure that was common across the other 11 participants. Because we wanted to compare participants’ responses in a disciplined way, we treated each of these people (who created extra cells) as outliers, and excluded them from analysis. This exclusion left us with 11 participants per notebook who had worked with
Table 1
Edit-Patterns - Applicable as Modifications to Algorithmically-Generated Text

Source text | Participant text | Participant+Cell
Details - Current
### Evaluate a score by cross-validation | ### evaluate a score by [5-fold] cross-validation [using rmse] | P15-house+9
### Fill NA/NaN values using the specified method | ### [replace] na/nan values [with the mean] | P07-house+6



the algorithmically-generated documentation.

3.3.1. Preparatory Analysis

To prepare the documentation for analysis, we grouped all of the texts for each cell together (separately for each notebook). We then used a bag-of-words method to identify words (tokens) that were not included in the algorithmically-generated documentation. In the quotations in this paper, the participant-introduced words appear in bracketed [blue-ink] (footnote 5). Table 1 provides an illustrative example. This was a slightly conservative method for identifying new text, because we might fail to detect that a participant had typed "data" (for example) in their own usage, rather than including the word "data" from the algorithmically-generated text. However, we used this method only to orient ourselves to the texts.

We next read each text by each participant. After reading all of the texts, two researchers agreed upon a codebook of edit-patterns. One researcher then applied that codebook rigorously to all of the texts.

3.3.2. Reference Notation

We identify each text in terms of the participant number (1-26, with two non-completers and two outliers excluded), the notebook ("covid" or "house"), and the cell in the notebook (1-9). For example, "p21-house+4" refers to participant 21, in the "house" notebook, in the 4th markdown cell.

4. Results: High-Level Quantitative Comparisons

Participants accepted the algorithmically-generated documentation unchanged in 45% of the cells, and they edited the algorithmically-generated documentation in 41% of the cells. The remaining 9% of the cells were left blank. In this workshop paper, we present new analyses to examine the edit-patterns in the 41% of the cells with participant-edited documentation.

There were few statistically significant differences between participant data from the two notebooks. We briefly report those analyses here, before the qualitative analyses that are the core of this paper. Because we did not find differences between the two notebooks, we will then perform content analyses of participants’ text on the combined data from the two notebooks in Sections 5 and 6.

4.1. Starting Points for Documentation

We provided three different Sources of documentation: AI, Query, and Prompt. A chi-square analysis found no significant differences in the proportions of Sources chosen by participants in each notebook. We also looked at combinations of Sources - i.e., no discernible source vs. a single source vs. multiple sources. Again, a chi-square analysis showed no significant differences between the notebooks.

4.2. Edit-Patterns

We describe distinct Edit-Patterns below in Sections 5 and 6. Here, we briefly state that we used chi-square tests to examine whether participants used different edit-patterns between the two notebooks. Only one of the edit-patterns showed a significant difference, and that pattern was concerned with the format (not the content) of the documentation (i.e., levels of headers in the markdown cell).

Similarly to the "combinations" analysis of Section 4.1, we found no significant differences across notebooks for cells with zero edit-patterns, a single edit-pattern, or multiple edit-patterns.

Footnote 5: Readers who use a screenreader may want to consult https://doccenter.freedomscientific.com/doccenter/doccenter/rs25c51746a0cc/2012-06-20_TextFormatting/02_TextFormatting.htm for information on how to access font-attributes through JAWS.
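The bag-of-words comparison of Section 3.3.1, which produces the bracket notation used in the quotations of the following sections, could be sketched roughly as below. This is a minimal illustration and not the authors' actual script: the function name `mark_new_tokens`, the whitespace tokenization, and the lowercasing are all assumptions made for the example. The input strings are taken from the first row of Table 1.

```python
def mark_new_tokens(generated: str, edited: str) -> str:
    """Bracket runs of tokens in `edited` that never occur in `generated`.

    Hypothetical helper: tokens are lowercased, whitespace-separated words,
    and `generated` is treated as an unordered bag of words.
    """
    known = set(generated.lower().split())
    out, run = [], []
    for token in edited.lower().split():
        if token in known:
            if run:  # close the current run of participant-introduced tokens
                out.append("[" + " ".join(run) + "]")
                run = []
            out.append(token)
        else:
            run.append(token)
    if run:
        out.append("[" + " ".join(run) + "]")
    return " ".join(out)


generated = "### Evaluate a score by cross-validation"
edited = "### evaluate a score by 5-fold cross-validation using rmse"
print(mark_new_tokens(generated, edited))
# prints: ### evaluate a score by [5-fold] cross-validation [using rmse]
```

Because the generated text is treated as a bag of words, a word that the participant typed independently but that also occurs anywhere in the generated text is left unbracketed, which is the conservative behavior noted in Section 3.3.1.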
5. Results: Content-Related Edit-Patterns

The preceding quantitative analyses showed only a single, stylistic difference in participants’ work with the two notebooks. We therefore combine our qualitative analyses of edit-patterns across the two notebooks.

We manually coded the edit-patterns in each participant’s text in each markdown cell, according to our codebook (Section 3.3.1). For the 22 participants in nine markdown cells, we thus coded 99 texts in each notebook, for a total of 198 coded markdown cells. We applied an informal version of thematic analysis [46], noting Braun’s and Clarke’s advice that there are multiple ways of conducting a thematic analysis [47]. Previous grounded theory and thematic analysis studies have involved from 6 to 74 participants [48, 49], and so our sample of 22 participants is within that conventional range. Within this sample, we used the saturation practices of Guest et al. [50] (recommended by Ando et al. [46]) and Majid et al. [51], defining saturation as a code appearing in the texts of at least two participants. We made no restrictions on the number of codes that we identified in a single text. Thus a text might have zero codes if the participant simply accepted the algorithmically-generated documentation, or it might have as many as three or four different codes in complex cases.

5.1. Details Edit-Patterns (three subcategories)

Participants expanded on the generated documentation by adding details. There were three subcategories of details: Contextual information, information about This-step (the current step), and information about Subsequent steps.

5.1.1. Contextual Details

Contextual details could take several forms. P03-house clarified how the prior steps had produced materials that were used in the current step:

• ### create the target and the test data [re-create training] and test [datasets based on] the [size of] the [original training dataset] (p03-house+7)

By contrast, P21-covid focused on the treatment of missing values

• ### check for [any] missing values [note] that [province/state have quite a few] missing (p21-covid+5)

and P01-covid provided an even-more-detailed account of the same issue, with commentary on what they had observed:

• check for [missing/null] values for [some countries/regions there is no province/state data this is probably correct and not a flaw in] the [data] (p01-covid+5) (footnote 6)

Footnote 6: We note that P01-covid edited-out the markdown formatting command, "###". We will have more to say about this kind of stylistic edit-pattern, below.

5.1.2. This-step

This-step edit-patterns occurred in many distinct subcategories. The first subcategory is the addition of a few words to clarify the current step:

• ### importing libraries ### importing [the necessary] libraries (p20-covid+1)

While P20-covid’s addition might be only a matter of emphasis, other simple additions provided much more specific information about what was being done in the step. P08-covid changed the meaning of the generated documentation by adding specificity about what value was being computed:

• ### check [number of] the missing values (p08-covid+5)

P03-house provided a different kind of specificity about the types of datasets being used:

• ### read the [training and test datasets] (p03-house+2)

P13-covid engaged in the same type of edit-pattern (in the other notebook), but included much greater detail:

• ### read a comma-separated values (csv) file into dataframe [of training] data [and test] data return the first 5 rows [of] the [training] data (p13-covid+2)

We contrast P08-covid, P03-house, and P13-covid, who were adding information about what was being calculated or input, vs. P07-house and P26-house, who described how the operations were done:

• ### create [train] and test data [by splitting dataframe] (p07-house+7)

• ### create the target and the test data and [use slicing] (p26-house+7)

The preceding pair of examples suggests that participants may solve the same problem, in the same notebook-cell, in different ways. We found many examples of different strategies and/or different conceptions of what the intended reader would need to know, such as this contrast between P13-covid’s rather minimalist description, vs. P01-covid’s much more extensive description:

• ### replace a specified phrase [(_)] with another specified phrase [( ) then transform] the [datatype to int] (p13-covid+4)
• ### data [preparation in] the [training] data [set] replace the [dashes] with [spaces for] the [date column and] convert the data [type to] integer (P01-covid+4)

In some cases, participants added specific algorithmic details that none of the generated texts had included. A repeated example was to mention (and sometimes discuss) root mean square error and its importance:

• ### evaluate a score by cross-validation [uses rmse as an evaluation metric] with [5-fold] cross-validation (p03-house+9)

• ### evaluate a score by [5-fold] cross-validation [using rmse] (p15-house+9)

5.1.3. Next-Step

We found a third edit-pattern that anticipated the next step (i.e., a subsequent cell) in the notebook. P08-covid briefly stated the use that the inputted data would be put to:

• ### read [and sanity check] the data (p04-covid+2)

However, in other cases, the participant provided a much richer description of the next steps, as in this recitation by P05-covid:

• ### Model A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting the [first line below initiates] a model [instance] and the [second line] fits the model on the [training data] (p05-covid+8)

5.1.4. Details Patterns Summary

We have shown three edit-patterns in which participants have provided more detail than was available in the generated texts. These patterns might be considered as spanning an imagined audience’s reading experience. In some cases, participants wrote Contextual information into the generated texts. This contextual information was generally retrospective - i.e., what should the reader have known in order to understand the code? In contrasting cases, participants focused less on context, and more on content within the current cell (This-step). Finally, in a few cases, participants wrote to anticipate the next cell or cells. In the Discussion, we will think further about this kind of participants’ mental model of their audience’s experiences.

We also wish to acknowledge that some of our category boundaries are fuzzy. The next category of edit-patterns poses the question - what is the difference between a Details edit-pattern and an Explanation edit-pattern? With further research, we may need to redraw this boundary.

5.2. Explanation Edit-Patterns

Sometimes participants went beyond simple details, writing a more extended Explanation. P02-house and P15-house provided brief examples, in which they added operational explanations of how to perform the activity in the cell:

• ### return the first 5 rows [(defvalue=5)] (p02-house+3)

• ### [separate train] and test [subsets post feature engineering set] the target [as saleprice] (p15-house+7)

P02-covid and P14-house went further, documenting the nature of source data files and their formats, and also the functional significance of additional modules:

• read the [data: from] the [two files: ‘traincsv‘ and ‘testcsv‘ they contain] data [in csv format now ‘train‘ contains] the [train] data [and ‘test‘ contains] the [test] data [start on] the [train] data first (p02-covid+2)

• ### importing libraries - pandas for [dataframes (like excel spreadsheet)] - numpy for [fast vector operations - sklearn] for [simple] data analysis [(in] this [case linear model)] (p14-house+1)

The Explanation edit-pattern brings in different types of information, including operational aspects and extended explanatory material about data files and programmatic resources. As we noted above, with more data, we may discover that Explanations need to be combined with Details. Another possibility is that Explanations may turn out to be a subset of Tutorial edit-patterns, which we describe in the next subsection.

5.3. Tutorial Edit-Patterns

In a more complex pattern, participants appeared to be teaching the reader how to do the analysis. For example, P05-covid provided detailed explanations about how to carry out a series of operations in Python:

• ### convert [training] data [remove dashes (‘-‘) in] the [dates this is done by applying] the [‘replace‘ function ‘astype‘ sets] the [‘date‘ column to integer type] (p05-covid+4)

Similarly, P03-house gave instructions about how to work with several datasets, including which columns (factors) were involved and how to process those columns:

• [## dataset preparation] the [next few cells prepare] the [train and test datasets] ### concatenate the [train and test datasets] with a [subset of columns (mssubclass to salecondition)] is [format(a b)] (p03-house+4)

In a different cell, P05-covid explained the meaning of a function call and gave further instruction about how to use the results of the function:
• ### check the missing values detect the [number of]        Similarly, two people rewrote the contents of cell 9 of the
  missing values for [each column ‘isnull()‘ returns] an     Covid notebook:
  [array of indicators of whether each value in a column
  is] missing [and ‘sum()‘ calculates] the [total number     • ### [make predictions] (p17-covid+9)
  of] missing values [along] that [column] (p05-covid+5)     • ### [test] to [see how] the [model performs] (p20-
P11-house taught how a conceptual operation worked             covid+9)
and also gave advice about the naming of the statistical     While it initially appears that Rewritten cells were
action:                                                   relatively brief, we found that two other participants
• [#####] this code cell is for [handling] missing val- Rewrote the same cell (cell 9 of the Covid notebook) at
  ues [which are replaced with] the [mean value] for much greater length:
  [that feature] this is [also known as column-wise mean- • ### [run] the [model] to [generate predictions] on the
  imputation] (p11-house+6)                                 [test data] and [store them as a ‘dataframe‘] (p04-
Finally, we note that P24-covid took a somewhat different       covid+9)
tutorial strategy. They left the original generated text in- • use the [trained model] to [predict] the [target] on the
tact as provided by the algorithm, and then added a link to     [test data] (p05-covid+9)
more information about the python code-structures that
were used in the cell that the algorithm had described:         In some cases, partipants Rewrote the generated text
                                                             at a higher level of sophsitication:
• p24-covid/expt ### replace a specified phrase with
  another specified phrase [[for more information about • ### [one hot encode] the [features] (p15-house+5)
  lambda](https://realpythoncom/python-lambda/)] (p24-
  covid+4)                                                   And in one case, the participant Rewrote in a very sum-
                                                             mary fashion, only listing a series of steps:
   Tutorial edit-patterns went much further than Expla-
nation edit-patterns (which themselves had gone further • [modeling process - subsetting data - cleaning data -
than Details edit-patterns). Tutorial edit-patterns provide     getting rid of nulls -model training] (p14-house+4)
not only how-to information, but also interpretations of        In the Rewriting edit-pattern, we see diverse strategies,
the meaning or purpose of actions, and in one case a link    ranging from brief summaries to extensive new text, as
to further information.                                      well as high-level abstractions. Significantly, in multiple
                                                             cases, participants made distinctly different Rewritings
5.4. Rewriting Edit-Patterns                                 of the same generated text (i.e., in the same cell of the
                                                             notebook). Thus, while the category of the edit-pattern is
In the preceding subsection, we began to describe a di-
                                                             the same, the individual strategies can be quite different.
mension of increasing complexity in the information that
                                                             We recall that we saw analogous patterns in the This-step
participants provided, and also (we infer) increasing ef-
                                                             Details edit-patterns of subsection 5.1.2. While there
fort on the part of the informants to think "beyond" what
                                                             may be agreement among participants that the generated
was given in the generated text. The last edit-pattern in
                                                             text in certain cells requires a certain type of change,
this series involves even more "beyond" work: beyond
                                                             participants clearly adopt different strategies about how
previous transformations of the generated text, and prob-
                                                             to make those changes.
ably beyond previous levels of effort. We call this edit-
pattern "Rewriting," because it involves nearly complete
replacement of the generated text.                           5.5. Content Edit-Patterns Summary
   It may be that the generated text in certain cells led to In this part of Results, we have examined how partici-
more instances of Rewriting. For example, three different pants changed the contents of the generated texts. Figure
participants rewrote the generated text of cell 4 of the 2 summarizes a dimension that runs from simple Details,
House notebook:                                              to more complex Explanations, to instructive Tutorials,
• ### [join features from train and test into one df] (p15- and finally to complete Rewritings of the generated text.
  house+4)                                                  Collectively, participants have an extensive repertoire of
                                                            edit-patterns that they apply to particular problems in
• ### [transform and clean] the [data] (p22-house+4)        particular documentation cells. We next examine more
                                                            stylistic changes that participants applied to generated
• ### [concat train and test col salecondition] (p02- text.
  house+4)
Figure 2: Dimension of Content Edit-Patterns.
Figure description: Graphical display of four Content Edit-Patterns, in a two-dimensional abstract space. The horizontal
axis represents the complexity of the edit-patterns. The vertical axis represents the inferred amount of participant effort to
make the changes. The four classes of edit-patterns run from the lower left (low-low) to the upper right (high-high), in the
order of Details, Explanations, Tutorials, and Rewritings.
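
The bracketed words in the quoted examples of this section appear to mark text that participants added to the generated documentation. The following is a minimal sketch of how such markup could be produced with a bag-of-words comparison. It is not the analysis code used in the study; the function names `mark_added_words` and `deleted_words`, the whitespace tokenization, and the generated texts in the usage lines are assumptions made for illustration.

```python
from collections import Counter


def mark_added_words(generated: str, edited: str) -> str:
    """Wrap runs of words that appear in `edited` but are not
    accounted for by the bag of words of `generated` in brackets."""
    remaining = Counter(generated.lower().split())  # words still "available"
    out, run = [], []
    for word in edited.lower().split():
        if remaining[word] > 0:
            # Word was already in the generated text: close any open run.
            remaining[word] -= 1
            if run:
                out.append("[" + " ".join(run) + "]")
                run = []
            out.append(word)
        else:
            # Word is new: extend the current run of added words.
            run.append(word)
    if run:
        out.append("[" + " ".join(run) + "]")
    return " ".join(out)


def deleted_words(generated: str, edited: str) -> list[str]:
    """Words of `generated` that no longer appear in `edited`."""
    gone = Counter(generated.lower().split()) - Counter(edited.lower().split())
    return sorted(gone.elements())


# Hypothetical generated text; the edited text follows the p01-covid+8 example.
print(mark_added_words("### the model", "### fit the model"))
print(deleted_words("### the model", "### fit model"))
```

Because the comparison uses word counts only, it is insensitive to word order: a word that a participant moved to a different position is not marked as added.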



6. Results: Stylistic Edit-Patterns                             • ### model [creating] (p21-covid+8)

6.1. Modifying document hierarchies                             • ### model [training] (p24-covid+8)

Sometimes in combination with other edit-patterns, par-         However, participants also engaged in more complex
ticipants modified the markdown formatting from the             ways of completing a sentence. For example, P01-covid
generated texts. Initially, all markdown texts were pro-        both added a verb and changed the object of that verb:7
vided at the same hierarchical level (###). In multiple
                                                                • [generate] the [predictions] (p01-covid+9)
cases, participants modified those levels, placing texts in
super-order / sub-order relation to one another:              While editing the same cell, P02-house, P07-house, and
• [#####] this code cell is for [handling] missing val- P15-house added the same verb, but then made different
  ues [which are replaced with] the [mean value] for modifications to the object of that verb:
  [that feature] this is [also known as column-wise mean- • ### [fit regression] model (p02-house+8)
  imputation] (p11-house+6)
                                                              • ### [fit] a lasso linear model [to the training data]
P17-covid modified the header markdown specification
                                                                  (p07-house+8)
when combining the contents of two different forms of
generated text:                                               • ### [train] lasso [cv] linear model (p15-house+8)
• ### [create] a [classifier ####] random forest [classifier]
                                                              There were also even more complex cases, in which it
   (p17-covid+8)
                                                              is not clear if the participant’s purpose was to complete
Further research will be needed to understand if these a sentence. Here, we repeat one example of P17-covid
stylistic/formatting edits are related to changes to the from the previous subsection, which illustrates our point
words in the documentation.                                   about the ambiguity of complex cases:

                                                                • ### [create] a [classifier ####] random forest [classifier]
6.2. Completing a Sentence                                         (p17-covid+8)
Some of the changes to content appeared to clarify what
                                                                • ### [leverage] the random forest model and [fit] the
was being done in the code. The primary subcategory of
                                                                  model [with training] dataset [(a] random forest is
these changes was to add a verb to a noun-phrase:
                                                                  a meta estimator that fits a number of decision tree
• ### [fit] the model (p01-covid+8)                               classifiers on various sub-samples of the dataset and
                                                                  uses averaging to improve the predictive accuracy and
• ### [train the] model (p10-house+8)
                                                                  control [over-fitting)] (p13-covid+8)
Some participants added the verb in a different position
in the sentence. In these two examples, we see P21-covid             7
                                                                       We use ”verb” and ”object” in the technical senses of English-
and P24-covid modifying the same generated text, but            language grammar (e.g., [52]). A ”verb” performs an action. An
with different verbs:                                           ”object” receives the effect of that action.
6.3. Conversational Tone                                       Table 2
                                                               Percent of All Edit-Patterns in Edited Cells
We observed a further stylistic modification which ap-
peared to make the generated text more conversational.           Edit-Pattern            Percentage of All Edited Cells
To avoid asserting our own judgments of what ”conversa-          Details-Contextual                                 5.08%
tional” might mean, we show only examples in which the           Details-This-step                                 33.90%
participant added a personal pronoun - typically ”we” or         Details-Next-step                                    5.08
”you”. For ease in reading, we have bolded those pronouns        Explanations                                      11.02%
in the following examples:                                       Tutorial                                           8.47%
                                                                 Rewritten                                         14.41%
• importing libraries: [in] this code [segment you import        Markdown headers                                   5.08%
  the python] libraries [first that include ‘numpy‘] and         Complete sentence                                 12.71%
  [‘panda‘ you also import] a [class from sklean if you          Conversational tone                                4.24%
  need to display some warning import the warnings]              Note: A single cell could contain multiple edit-patterns.
  library [as shown] (p21-covid+1)                               Therefore, sums of percentages may not be meaningful.
• ### [define] and [configure] the model a random forest
  is a meta estimator that fits a number of decision tree
  classifiers on various sub-samples of the dataset and
                                                               8. Discussion
  uses averaging to improve the predictive accuracy and     These results have implications for the design of algo-
  control over-fitting [we also train] the model [with      rithmic documentation systems. Taken together with
  ‘fit()‘] (p04-covid+8)                                    our prior work on TransCoder, we can also see emergent
• [##### here we] evaluate the [square-root of] the [5- ideas about how people can understand and make use
  fold cross-validated mean-squared-error of] the [trained] of AI outcomes, even in the absence of formal explana-
  model with the [training set ‘(x_train y)‘ (p11-house+9) tory systems (e.g., Explainable AI, or XAI). Finally, these
                                                            two projects point us toward important questions in the
• ### [we now show] the [predicted values] (p24-covid+9) design of future human-AI collaborative systems.
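
The operational criterion used for the Conversational Tone examples (the participant added a personal pronoun, typically "we" or "you") can be sketched as a simple check on added words. This is an illustrative sketch rather than the study's code: the name `added_pronouns`, the regular-expression tokenizer, the pronoun set, and the generated text in the usage line are all assumptions.

```python
import re
from collections import Counter

# Assumption: only the two pronouns the paper names as typical are checked.
PRONOUNS = {"we", "you"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


def added_pronouns(generated: str, edited: str) -> list[str]:
    """Return personal pronouns present in `edited` beyond their
    count in `generated` (bag-of-words difference)."""
    added = Counter(tokenize(edited)) - Counter(tokenize(generated))
    return sorted(word for word in added if word in PRONOUNS)


# Hypothetical generated text; the edited text follows the p24-covid+9 example.
print(added_pronouns("### the predicted values",
                     "### we now show the predicted values"))
```

A cell would count toward the Conversational Tone pattern under this sketch whenever the returned list is non-empty.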

                                                               8.1. Learning from Participants’
6.4. Stylistic Edit-Patterns Summary                                Improvements
We acknowledge that the distinction between content
                                                               The results we reported on participants’ editing patterns
and style is far from clear (e.g., [53]). Therefore, we con-
                                                               lead us to think about a few implications to further im-
sider that our current categorization of Content-Related
                                                               prove the automatic documentation approach. First, the
edit-patterns and Stylistic edit-patterns may require re-
                                                               Details edit-patterns, Explanation edit-patterns, and Tu-
vision. With larger datasets, we may for example con-
                                                               torial edit-patterns are relevant to the purpose of the
clude that Conversational Tone is more related to Tutorial
                                                               notebook and the target audience. We believe that a
changes, and less related to ”style.” The same may occur
                                                               future version of the generative approach should tailor
with Completing a Sentence. We will also need to under-
                                                               the automatic documentation based on the usage sce-
stand better the relationship of header-styles to content
                                                               nario. Data scientists can benefit from more candidate
in brief documentation text snippets.
                                                               documentation in which the level of detail varies.
                                                                  With a larger dataset, we could associate these edit-
7. Results: Summary Statistics of                              patterns to particular patterns in the algorithmically-
                                                               generated texts. Based on those associations, we could
     Edit-Patterns                                             modify the algorithms to anticipate the kinds of edits
                                                               that humans have previously made (e.g., [54]). For exam-
We computed the percentage of the Content-edited cells
                                                               ple, if we can remove the need for Details-related and
in which each of the above Content edit-patterns appeared.
                                                               Conversational-Tone edits, then humans can focus on
Details edit-patterns were the most frequent. This may be
                                                               higher-value editing, such as Tutorials. We may then see
unsurprising, because these kinds of edit-patterns took
                                                               emergent categories of even more task-specific and/or
relatively little effort (Figure 2). In general, edit-patterns
                                                               domain-oriented edit-patterns, when humans no longer
that were most costly of effort (Tutorial, Rewritten) had
                                                               need to put work into less significant edits.
lower frequencies of occurrence, with Rewritten edit-
                                                                  One way to do this is to include a reinforcement
patterns occurring in fewer than 15% of the edited cells.
                                                               learning component in the algorithm that could learn from
                                                               users’ modifications to the proposed texts. Our current
GNN model relies on the size of the training dataset to en-    If the human edits the target code in the TransCoder
sure the quality of the results. However, data science code    project, then a secondary AI (e.g., [12]) might assist by
snippets are patternless and are of limited use for gener-     generating completion of the human’s new code or sug-
ating explanations. In the future, we can combine deep         gesting additional modifications to the target to remain
reinforcement learning with our current GNN model              consistent with the changes. This secondary AI would be
[55] to improve the performance and generalizability of        ”aware” of the original code, and could provide additional
the results. This can help provide consistent stylistic        type-ahead support, advice, or consultation as needed.
documentation in terms of the writing style, sentence          Similarly, if the human edits the generated text in the
structure, and level of detail.                                Documentation project, then a secondary AI (e.g., [56])
                                                               could provide assistance with Details types of edits, but
8.2. Flawed Generative Outcomes can be                         could also provide language-quality (Stylistic) support
                                                               for more complex Tutorial accounts, or even narratives
     Useful Outcomes                                           (e.g., [57]).
In our earlier generative documentation paper [8], we
learned that people accepted algorithmically-produced          8.4. How Does a Generative AI Model
documentation in 45% of the cells. In this workshop
                                                                    Ask for Help?
paper’s analysis of the 41% of edited cells, participants
retained at least part of the generated text in over 85% of    Our work highlights an opportunity for enriching human
the cells (Details, Explanations, Tutorials). The fact that    interactions with generative models. At a base level, a
they chose to do the extra work to Rewrite in only 15%         generative model takes input (e.g. code) and produces
of the cells, is evidence that they mostly chose to work       output (e.g. documentation for that code). Agarwal et al.
with imperfect text rather than to replace it.                 [42] demonstrate how a generative model can produce
   This outcome is consistent with our previous study of       confidence scores alongside its output, and Weisz et al.
code translation, in which engineers reported that they        [41] show the utility such scores can have in steering
preferred an imperfect translation to no translation at all.   human attention toward reviewing portions of the output
While we hope to improve our generative algorithms in          in which the model has low confidence. In this work, we
both research programs, we also envision future studies        demonstrate how having an understanding of the nature
in which we will calibrate the quality of the outcomes, to     of human edit-patterns to a generative model’s output
determine the threshold of ”poor quality” below which an       can enable a generative model to not only identify where
algorithm should not be deployed. We could then perhaps        human attention is needed, but also how human effort
provide a more ”skeletal” outcome, such as an outline of       can be used to improve the quality of its output.
documentation rather than full-text. We could also treat          One of the interesting questions will be exactly how
”poor-quality” instances as higher-priority opportunities      to choose among those alternatives - i.e., when is type-
for algorithmic improvement.                                   ahead useful, and how should a watchful but respectful
                                                               AI intervene with advice, and what dialogic or other com-
8.3. Deepening Human-AI Collaborations                         munication structures should be involved in an on-going
                                                               AI-human consultation [58, 59, 60]? Our experiences
We now consider each of our research programs (transla-        with the NMT algorithm in the TransCoder experiment
tion and documentation) in terms of published patterns         showed that, with a sufficiently broad beam-search [61],
of human-AI collaboration [6, 1, 2, 4, 7, 5]. In both of       we could generate a manageable set of alternative trans-
our studies, the human provides some initial information       lations, which could be compared using an algorithm like
(source code in both cases), and the generative algorithm      [62] to determine regions of agreement between the trans-
responds with a proposed outcome text (target code or          lations as well as regions of uncertainty along with the
documentation, respectively). After that, with minimalist      alternatives considered for the uncertain regions. These
support, the human has to make their own way - e.g., by        alternatives could then serve as informal explanations
choosing among alternative translations for sections of        - e.g., ”Q: Why is the output marked as uncertain in this
the target code, or by manually editing the documenta-         region? A: Because the algorithm considered multiple pos-
tion.                                                          sible translations at this point, and this is what they were.”
   These patterns remain consistent with simple initiative     The GNN in the Documentation project might also be
models (e.g., [2, 4]), and fall short of the richer on-going   modified to produce multiple possible texts, with sim-
interactions of some of the experimental MICI applica-         ilar explanatory power (”Q: Why is this documentation
tions [1, 5], with their potential of AI-augmentation in       marked as uncertain...” ).
support of skilled human work. We anticipate that fu-             If there are multiple, alternative outcomes with no
ture versions of both projects could move toward longer        emergent ”most-probable” alternative, then this could
and richer exchanges between human and AI (e.g., [6]).         become an initiation point at which the algorithm de-
tects the need for human assistance. One way to imple- directions for longer-term, on-going human-AI collabo-
ment this request-for-help is as a ”human-in-the-loop” rations.
paradigm, in which the human responds to the needs
of the algorithm. We also envision situations in which
the human may be in the midst of editing code or text, References
and may ask the algorithm to serve as a text-assistant
                                                             [1] S. Deterding, J. Hook, R. Fiebrink, M. Gillies, J. Gow,
for the human’s on-going editing work. This could be-
                                                                 M. Akten, G. Smith, A. Liapis, K. Compton, Mixed-
come an example of the ”AI in the loop” paradigm that
                                                                 initiative creative interfaces, in: Proceedings of
we discussed at last year’s workshop. In these ways, we
                                                                 the 2017 CHI Conference Extended Abstracts on
could move from the relatively single-process ”initiative”
                                                                 Human Factors in Computing Systems, 2017, pp.
models of [2, 4, 7], and toward a more collaborative and
                                                                 628–635.
on-going series of interactions as in [1, 5].
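
The idea in Section 8.4 of comparing alternative outputs to find regions of agreement and uncertainty, and of asking for help when no alternative clearly dominates, can be sketched as follows. This is only an illustration under stated assumptions: Python's standard `difflib.SequenceMatcher` stands in for the comparison algorithm cited as [62], the function names are hypothetical, and the `margin` threshold is an arbitrary placeholder rather than a value from either project.

```python
from difflib import SequenceMatcher


def uncertain_regions(best: list[str], alternatives: list[list[str]]) -> dict:
    """Map each span (start, end) of `best` that some alternative changes
    to the set of token tuples the alternatives propose for that span.
    Spans of `best` that never appear here are regions of agreement."""
    regions: dict = {}
    for alt in alternatives:
        matcher = SequenceMatcher(a=best, b=alt, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                regions.setdefault((i1, i2), set()).add(tuple(alt[j1:j2]))
    return regions


def needs_human_help(scores: list[float], margin: float = 0.1) -> bool:
    """True when the two best-scoring alternatives are within `margin`,
    i.e. there is no emergent most-probable alternative."""
    ranked = sorted(scores, reverse=True)
    return len(ranked) > 1 and ranked[0] - ranked[1] < margin


# Hypothetical token sequences and scores for three candidate outputs.
best = "x = a + b".split()
alternatives = ["x = a - b".split(), "x = a + b".split()]
print(uncertain_regions(best, alternatives))
print(needs_human_help([0.41, 0.39, 0.20]))
```

The alternatives collected for each uncertain span could then be shown to the user as the informal explanation described above ("the algorithm considered multiple possible translations at this point, and this is what they were").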
                                                             [2] E. Horvitz, Principles of mixed-initiative user in-
                                                                 terfaces, in: Proceedings of the SIGCHI conference
8.5. Limitations                                                 on Human Factors in Computing Systems, 1999, pp.
In order to conduct a controlled study, we sacrificed eco-       159–166.
logical validity. We asked participants to document some- [3] M. Muller, J. Weisz, W. Geyer, Mixed initiative
one else’s notebook. By contrast, the canonical case in          generative ai interfaces: An analytic framework for
Jupyter notebooks is to document one’s own code. A               generative ai applications (2020).
future goal should be a more naturalistic practice of doc-   [4] R. Parasuraman, T. B. Sheridan, C. D. Wickens, A
umenting my notebook.                                            model for types and levels of human interaction
   Paradoxically, for precision of evaluation, we may also       with  automation, IEEE Transactions on systems,
need to perform an even more controlled study, in which          man,   and cybernetics-Part A: Systems and Humans
each person receives only one algorithm’s text at a time.        30  (2000) 286–297.
This approach could help us to assess each algorithmic       [5] A.  Spoto,  N. Oyelnik, Library of mixed initiative
approach more independently than in the preliminary              creative interfaces, "http://mici.codingconduct.cc",
experiment in this workshop paper.                               2017. [Online; accessed 21-December-2020].
   We also note that our analytic method could be strength-  [6] J. A. Biles, Genjam: Evolution of a jazz improviser,
ened in future research. Our bag-of-words approach was           in: Creative evolutionary systems, Elsevier, 2002,
insensitive to word-order, and we looked only at pat-            pp. 165–187.
terns of added words. Future work should also examine        [7] B. Shneiderman, Human-centered artificial intelli-
patterns of deleted words.                                       gence: Reliable, safe & trustworthy, International
   Finally, we note the obvious sampling weaknesses. We          Journal  of Human–Computer Interaction 36 (2020)
conducted a relatively small study in a single institution.      495–504.
We hope to examine similar practices in other settings, [8] A. Y. Wang, D. Wang, J. Drozdal, M. Muller,
and with more participants.                                      S. Park, J. D. Weisz, X. Liu, L. Wu, C. Dugan,
                                                                 Themisto: Towards automated documentation gen-
                                                                 eration in computational notebooks, arXiv preprint
9. Conclusion                                                    arXiv:2102.12592 (2021).
                                                             [9] A. Hindle, E. T. Barr, Z. Su, M. Gabel, P. Devanbu,
In this workshop paper, we have addressed topics in              On the naturalness of software, in: 2012 34th In-
human-AI collaboration in data science and software en-          ternational Conference on Software Engineering
gineering. We reported text analytic results from a study        (ICSE), IEEE, 2012, pp. 837–847.
of generative documentation, showing that participants [10] V. Raychev, M. Vechev, E. Yahav, Code completion
accepted generated text with or without modification in          with statistical language models, in: Proceedings of
the majority of instances. These results are consistent          the 35th ACM SIGPLAN Conference on Program-
with our earlier work, in which engineers were enthusi-          ming Language Design and Implementation, 2014,
astic about using imperfect NMT-generated translations           pp. 419–428.
of software code. Similarly, participants in this study [11] M. Bruch, M. Monperrus, M. Mezini, Learning from
were also quite ready to accept or to work with imperfect        examples to improve code completion systems, in:
GNN-generated texts. We also analyzed the edit-patterns          Proceedings of the 7th joint meeting of the Euro-
in the generated text, developing categories that suggest        pean software engineering conference and the ACM
future work directions. Finally, going beyond early uni-         SIGSOFT symposium on the foundations of soft-
directional models of ”initiative,” we sketched promising        ware engineering, 2009, pp. 213–222.
[12] A. Svyatkovskiy, S. K. Deng, S. Fu, N. Sundaresan, Intellicode compose: Code generation using transformer, arXiv preprint arXiv:2005.08025 (2020).
[13] B. Roziere, M.-A. Lachaux, L. Chanussot, G. Lample, Unsupervised translation of programming languages, Advances in Neural Information Processing Systems 33 (2020).
[14] L. Mou, G. Li, L. Zhang, T. Wang, Z. Jin, Convolutional neural networks over tree structures for programming language processing, in: D. Schuurmans, M. P. Wellman (Eds.), Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, AAAI Press, 2016, pp. 1287–1293. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11775.
[15] V. Jayasundara, N. D. Q. Bui, L. Jiang, D. Lo, Treecaps: Tree-structured capsule networks for program source code processing, arXiv preprint arXiv:1910.12306 (2019).
[16] S. Gu, T. Lillicrap, I. Sutskever, S. Levine, Continuous deep q-learning with model-based acceleration, in: International Conference on Machine Learning, 2016, pp. 2829–2838.
[17] N. D. Q. Bui, Y. Yu, L. Jiang, SAR: learning cross-language API mappings with little knowledge, in: M. Dumas, D. Pfahl, S. Apel, A. Russo (Eds.), Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Tallinn, Estonia, August 26-30, 2019, ACM, 2019, pp. 796–806. URL: https://doi.org/10.1145/3338906.3338924. doi:10.1145/3338906.3338924.
[18] M. Allamanis, H. Peng, C. Sutton, A convolutional attention network for extreme summarization of source code, in: International conference on machine learning, 2016, pp. 2091–2100.
[19] U. Alon, M. Zilberstein, O. Levy, E. Yahav, code2vec: Learning distributed representations of code, Proceedings of the ACM on Programming Languages 3 (2019) 1–29.
[20] V. J. Hellendoorn, C. Bird, E. T. Barr, M. Allamanis, Deep learning type inference, in: Proceedings of the 2018 26th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, 2018, pp. 152–162.
[21] J. Wei, M. Goyal, G. Durrett, I. Dillig, Lambdanet: Probabilistic type inference using graph neural networks, arXiv preprint arXiv:2005.02161 (2020).
[22] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, P. Devanbu, On the "naturalness" of buggy code, in: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), IEEE, 2016, pp. 428–439.
[23] M. Pradel, K. Sen, Deepbugs: A learning approach to name-based bug detection, Proceedings of the ACM on Programming Languages 2 (2018) 1–25.
[24] M. Vasic, A. Kanade, P. Maniatis, D. Bieber, R. Singh, Neural program repair by jointly learning to localize and repair, arXiv preprint arXiv:1904.01720 (2019).
[25] E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, K. Wang, Hoppity: Learning graph transformations to detect and fix bugs in programs, in: International Conference on Learning Representations, 2019.
[26] M. White, M. Tufano, M. Martinez, M. Monperrus, D. Poshyvanyk, Sorting and transforming program repair ingredients via deep learning code similarities, in: 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2019, pp. 479–490.
[27] V. J. Hellendoorn, P. Maniatis, R. Singh, C. Sutton, D. Bieber, Global Relational Models of Source Code, in: International Conference on Learning Representations, 2020.
[28] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, K. Vijay-Shanker, Automatic generation of natural language summaries for java classes, in: 2013 21st International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 23–32.
[29] S. Iyer, I. Konstas, A. Cheung, L. Zettlemoyer, Summarizing source code using a neural attention model, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 2073–2083.
[30] S. Scalabrino, G. Bavota, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, R. Oliveto, Automatically assessing code understandability: How far are we?, in: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2017, pp. 417–427.
[31] X. Hu, G. Li, X. Xia, D. Lo, Z. Jin, Deep code comment generation, in: 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), IEEE, 2018, pp. 200–210.
[32] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, P. S. Yu, Improving automatic source code summarization via deep reinforcement learning, in: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 397–407.
[33] U. Alon, S. Brody, O. Levy, E. Yahav, code2seq: Generating sequences from structured representations of code, arXiv preprint arXiv:1808.01400 (2018).
[34] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, A. Marcus, G. Canfora, Automatic generation of release notes, in: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 484–495.
[35] M. White, M. Tufano, C. Vendome, D. Poshyvanyk, Deep learning code fragments for code clone detection, in: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2016, pp. 87–98.
[36] M. Allamanis, E. T. Barr, P. Devanbu, C. Sutton, A survey of machine learning for big code and naturalness, ACM Computing Surveys (CSUR) 51 (2018) 1–37.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[38] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[39] M. Brockschmidt, M. Allamanis, A. L. Gaunt, O. Polozov, Generative code modeling with graphs, arXiv preprint arXiv:1805.08490 (2018).
[40] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, D. Poshyvanyk, An empirical study on learning bug-fixing patches in the wild via neural machine translation, ACM Transactions on Software Engineering and Methodology (TOSEM) 28 (2019) 1–29.
[41] J. D. Weisz, M. Muller, S. Houde, J. Richards, S. L. Ross, F. Martinez, M. Agarwal, K. Talamadupula, Perfection not required? human-ai partnerships in code translation, in: Proceedings of IUI 2021, 2021.
[42] M. Agarwal, K. Talamadupula, S. Houde, F. Martinez, M. Muller, J. Richards, S. Ross, J. D. Weisz, Quality estimation & interpretability for code translation, arXiv preprint arXiv:2012.07581 (2020).
[43] M. B. Kery, M. Radensky, M. Arya, B. E. John, B. A. Myers, The story in the notebook: Exploratory data science using a literate programming tool, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–11.
[44] J. M. Perkel, Why jupyter is data scientists’ computational notebook of choice, Nature 563 (2018) 145–147.
[45] A. LeClair, S. Haque, L. Wu, C. McMillan, Improved code summarization via a graph neural network, arXiv preprint arXiv:2004.02843 (2020).
[46] H. Ando, R. Cousins, C. Young, Achieving saturation in thematic analysis: Development and refinement of a codebook, Comprehensive Psychology 3 (2014) 03–CP.
[47] V. Braun, V. Clarke, Using thematic analysis in psychology, Qualitative research in psychology 3 (2006) 77–101.
[48] C. M. Baker, L. R. Milne, R. E. Ladner, Understanding the impact of tvis on technology use and selection by children with visual impairments, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–13.
[49] A. Pradhan, B. Jelen, K. A. Siek, J. Chan, A. Lazar, Understanding older adults’ participation in design workshops, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–15.
[50] G. Guest, A. Bunce, L. Johnson, How many interviews are enough? an experiment with data saturation and variability, Field methods 18 (2006) 59–82.
[51] M. A. A. Majid, M. Othman, S. F. Mohamad, S. A. H. Lim, Achieving data saturation: evidence from a qualitative study of job satisfaction, Social and Management Research Journal 15 (2018) 66–77.
[52] B. S. Azar, D. A. Azar, R. S. Koch, Understanding and Using English Grammar: Workbook, Longman, 2000.
[53] G. Lakoff, A figure of thought, Metaphor and symbol 1 (1986) 215–225.
[54] K.-H. Zeng, M. Shoeybi, M.-Y. Liu, Style example-guided text generation using generative adversarial transformers, arXiv preprint arXiv:2003.00674 (2020).
[55] P. Almasan, J. Suárez-Varela, A. Badia-Sampera, K. Rusek, P. Barlet-Ros, A. Cabellos-Aparicio, Deep reinforcement learning meets graph neural networks: exploring a routing optimization use case, arXiv (2019) arXiv–1910.
[56] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[57] R. Kazman, G. Abowd, L. Bass, P. Clements, Scenario-based analysis of software architecture, IEEE software 13 (1996) 47–55.
[58] E. Horvitz, Uncertainty, action, and interaction: In pursuit of mixed-initiative computing, Intelligent Systems (1999) 17–20.
[59] S. Ross, E. Brownholtz, R. Armes, Voice user interface principles for a conversational agent, in: Proceedings of the 9th International Conference on Intelligent User Interfaces, 2004, pp. 364–365.
[60] S. Ross, E. Brownholtz, R. Armes, A multiple-application conversational agent, in: Proceedings of the 9th International Conference on Intelligent User Interfaces, 2004, pp. 319–321.
[61] C. Wilt, J. Thayer, W. Ruml, A comparison of greedy search algorithms (2010).
[62] E. Myers, An o(nd) difference algorithm and its variations, Algorithmica 1 (1986) 251–266.