Creative Research Question Generation for Human-Computer Interaction Research

Yiren Liu1,†, Mengxia Yu2,†, Meng Jiang2 and Yun Huang2

1 University of Illinois Urbana-Champaign, Champaign, IL, 61820, USA
2 University of Notre Dame, Notre Dame, IN, 46556, USA

Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
† These authors contributed equally.
yirenl2@illinois.edu (Y. Liu); myu2@nd.edu (M. Yu); mjiang2@nd.edu (M. Jiang); yunhuang@illinois.edu (Y. Huang)
ORCID: 0000-0003-1507-0303 (Y. Liu); 0000-0002-6627-2709 (M. Yu); 0000-0002-3009-519X (M. Jiang); 0000-0003-0399-8032 (Y. Huang)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
It is essential to develop innovative and original research questions/ideas for interdisciplinary research fields, such as Human-Computer Interaction (HCI). In this work, we focus on discussing how recent natural language generation (NLG) methodologies can be applied to promote the formulation of creative research questions. We collect and curate a dataset that contains the texts of RQs and related work sections from HCI papers, and introduce a new NLG task of automatic HCI research question (RQ) generation. In addition to applying common NLG metrics used to evaluate generation accuracy, including ROUGE and BERTScore, we propose two sets of new metrics for evaluating the creativity of generated RQs: 1) DistGain and DiffBS for novelty, and 2) PPLGain for the level of surprise. The task is challenging due to the lack of external knowledge. We investigate four approaches to enhance the generation models with (1) general world knowledge, (2) task knowledge, (3) transferred knowledge, and (4) retrieved knowledge. The experimental results indicate that incorporating additional knowledge benefits both the accuracy and creativity of RQ generation. The dataset used in this study can be found at: https://github.com/yiren-liu/HAI-GEN-release.

Keywords
datasets, text generation, creativity

1. Introduction

Asking novel research questions (RQs) is key to starting innovative scientific studies. As David Hilbert states, "he who seeks for methods without having a definite problem in mind seeks for the most part in vain". Proficient scientists read and analyze representative literature in a specific domain in order to identify the limitations of existing work and ask new RQs [1]. In computer science research, methodologies are often derived from a study's core research question(s) [1]. Research questions (RQs) are one of the most important components of HCI research and are often explicitly stated in research papers from the HCI domain. As an outline of the whole paper, RQs are often proposed in the beginning sections and stated in a unified format, e.g., "RQ1: ..., RQ2: ...". For example, in the HCI paper by Lee et al. [2], the authors listed two RQs at the end of the related work section:

"RQ1: How do different chatting styles influence people's self-disclosure? and RQ2: How do different chatting styles influence people's self-disclosure over time?"

Besides the important role of RQs, the interdisciplinary nature of HCI research motivates us to perform this study [3, 4]. There is a global trend of interdisciplinary research [5, 6]. The fact that HCI is a highly interdisciplinary field [3, 4] poses unique challenges [7, 8] to education and research. Depending on their interests and skills, students and scholars can conduct HCI work and contribute from various perspectives to different disciplinary areas [9]. Because of this interdisciplinary nature, HCI research can make contributions that are technical-driven, UX-focused, and/or method-oriented [10], which opens a wide door to innovation. The related work section of HCI papers often reviews the relevant literature, which can be used to drive the RQs.

Human researchers have to spend a lot of time reading and understanding large amounts of interdisciplinary literature. Artificial intelligence (AI) systems have demonstrated their ability to facilitate some types of scientific research tasks, such as summarizing scientific literature [11], recommending related work [12], and generating new biomedical hypotheses [13]. If a machine could generate RQs based on existing literature, it would help HCI researchers discover potential research topics, though they would need to verify the machine-suggested RQ candidates. However, to the best of our knowledge, there is no AI model or research on automating HCI RQ generation.
In this work, we propose a novel task of research question generation in the field of HCI research. Given the related work section (denoted as RelatedWork), the task aims to generate one or multiple research questions. We notice that, given a set of literature, it is easy to come up with plausible but overly generic RQs on broad research topics. Therefore, the challenge of our task lies in that when the same set of literature is surveyed from different perspectives, i.e., given different RelatedWork, the generated RQs should differ correspondingly.

To study this new problem, we build a dataset from HCI literature. We collect 8,904 HCI papers from arXiv and manually extract 158 data examples. Each example has the text of the related work section and the text of the research questions. In this study, we develop and evaluate four approaches: (1) prompting pre-trained GPT-3 [14], which has knowledge from its pre-training corpus; (2) BART [15], which is fine-tuned on our limited training examples; (3) transfer learning for knowledge augmentation, which warms up the model to generate paper titles, which are much more accessible than RQs; and (4) retrieval-based augmentation, which uses information from the HCI literature text we provide.

We evaluate the RQ generation quality based on three sets of automated metrics: (1) ROUGE and BERTScore with target RQs as references for accuracy, (2) DistGain and DiffBS for novelty, and (3) PPLGain for level of surprise. We propose to use these metrics for practical reasons. First, when RQs are not explicitly spelled out in HCI papers, the model that yields greater accuracy could be more effectively utilized. As researchers try to quickly form RQs given a large number of surveyed papers, the model could aid in boosting the efficiency of literature review for both research and learning purposes. Second, the model that leads to higher novelty and surprise could be used when the HCI papers already explicitly present RQs. In this case, researchers can compare the existing "ground truth" RQs with the generated RQs to explore "new" directions for future research.

The main contributions of this study are:

• We propose the task of HCI research question generation, collecting and releasing a dataset.
• We design and develop four types of models that leverage various knowledge to improve RQ generation.
• We evaluate the accuracy, novelty, and level of surprise of generated RQs and find that knowledge transfer is the most promising approach when the available task data size is small.

2. Related Work

2.1. Question Generation

Automatic question generation (QG) has been studied as a data augmentation approach for Question Answering [16] and Machine Reading Comprehension [17]. Most existing QG studies focus on factoid questions, whose answers are short pieces of text. The QG datasets are usually converted from question-answering datasets. Instead of factoid questions, RQs are open-ended questions, the generation of which is found to be more challenging in prior work [18], because it requires a deep understanding and needs to be addressed with long-form answers. Nevertheless, existing open-ended question generation tasks are conditioned on the answers. More research needs to be done in order to generate unsolved open-ended HCI research problems.

For educational domains, QG systems often aim at generating assessment questions, e.g., multiple-choice questions, to help students understand the learning materials and reduce the manual workload required from instructors. Emerging studies have proposed datasets for educational QG [19, 20, 21]. However, these works aim to generate questions that help with the comprehension of learning materials, not to explore potential unsolved research problems.
2.2. Scientific Text Generation

In order to reduce the burden of scientific writing or simulate scientists' behaviors, there is a line of research aiming at automatic scientific text generation. Since early work on abstract generation [22], various approaches have been proposed for scientific text summarization [23, 24, 11]. Spangler et al. [13] leverage text mining for scientific hypothesis generation. ReviewRobot [25] utilizes information extracted from knowledge graphs to construct synthetic paper reviews from templates. AutoCite [26] leverages multi-modal information to generate contextualized citation texts. PaperRobot [27] cascadingly generates abstracts, conclusions, future work, and titles for a follow-on paper. However, automatic HCI RQ generation has not been studied as an NLP task.

2.3. Evaluating Creativity in Text Generation

Methods for enhancing the ability of machine learning models to produce original content have been a crucial topic in the emerging research domain of computational creativity [28]. Franceschelli and Musolesi [29] summarized existing methods for creativity evaluation and discussed their potential application in recent deep learning models (e.g., VAE and GAN). However, most of these existing evaluation methods are highly subjective and require strong human intervention. With the recent advances in text generation methods based on pre-trained language models, additional research is needed to automatically and objectively evaluate the creativity of text generation models. Prior NLP research has discussed potential methods to automatically evaluate generation taking into consideration both the accuracy and diversity of the generated results [30]. In this work, we employ Boden's three criteria [31] for studying machine creativity, defined as "the ability to generate ideas or artifacts that are new, surprising and valuable", to propose new metrics for creativity evaluation in text generation tasks.

3. Problem Definition and Data

A research question refers to a question that a study or research project aims to address. In HCI research publications, RQs are often proposed after the survey of related work. Based on the understanding of the existing literature and on citation purposes, different papers will compose their related work sections differently, even if they cite the same set of literature. Correspondingly, their research questions should be different.

We formally define the task of HCI RQ generation with task variables as follows.

Definition 1 (HCI Research Question Generation). Given the RelatedWork of an HCI research paper, the generation model requires maximizing $P(RQ \mid RelatedWork)$.
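To make the objective in Definition 1 concrete, the following minimal sketch (our illustration, not the authors' released code) scores one RQ under a pre-trained sequence-to-sequence model; the BART checkpoint name and the example texts are assumptions used only for demonstration.

```python
# Sketch of Definition 1: the cross-entropy loss below is the negative
# log-likelihood of the RQ given the related work text, so fine-tuning on
# (RelatedWork, RQ) pairs maximizes P(RQ | RelatedWork).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")  # assumed checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

related_work = "Prior work has studied chatbots that encourage self-disclosure ..."
rq = "RQ1: How do different chatting styles influence people's self-disclosure?"

inputs = tokenizer(related_work, truncation=True, max_length=768, return_tensors="pt")
labels = tokenizer(rq, truncation=True, max_length=128, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # per-token -log P(RQ | RelatedWork)
print(float(loss))
```

Fine-tuning a model on (RelatedWork, RQ) pairs amounts to minimizing this loss over the training set.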
In real-life HCI research scenarios, researchers strive to propose highly novel and creative research questions based on existing work. Thus, we propose also to measure the creativity of generated research questions. Based on Boden's criteria [31], "the ability to generate ideas or artifacts that are new, surprising, and valuable", we construct the creativity measurement as a combination of two aspects: 1) novelty and 2) level of surprise. We do not evaluate the value of generated RQs since we believe it would require extensive expert knowledge and is hardly feasible without human intervention.

Definition 2 (Generation Creativity). 1) We measure the novelty of a set of RQs by comparing their similarity to the RQs of prior publications within our collected corpus; 2) We measure the level of surprise of a set of RQs based on their perplexity with respect to the perplexity of existing RQs using a large PLM (e.g., GPT-2).

To collect open-access HCI publications, we used papers available through arXiv. We collected the PDF files of papers under the category of Human-Computer Interaction (cs.HC, https://arxiv.org/list/cs.HC/recent) using the public API provided by arXiv. This resulted in a total of 8,904 HCI-related papers (https://github.com/yiren-liu/HAI-GEN-release). We then converted these papers from PDF to sectioned XML format using GROBID (https://github.com/kermitt2/grobid) and SciPDF Parser (https://github.com/titipata/scipdf_parser) in order to further analyze and filter them based on their textual content. The section and title information is preserved in the XML version of our collected papers. For research questions, we conducted pattern matching of question sentences starting with "RQ". In order to collect text from related work sections, i.e., RelatedWork, we extract sections with titles containing the keywords "related work". We remove RQs from RelatedWork if they appear there.
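A minimal sketch of this extraction step is shown below (our illustration; the released pipeline may differ). A regular expression keeps explicitly numbered question sentences, and sections whose titles contain "related work" form the RelatedWork input; the specific pattern and sentence splitting are assumed heuristics.

```python
# Illustrative pattern matching for RQs and related work sections.
import re

RQ_PATTERN = re.compile(r"^RQ\d+\s*[:.]?\s*.+\?$")

def extract_rqs(sentences):
    """Keep sentences that look like explicitly numbered research questions."""
    return [s.strip() for s in sentences if RQ_PATTERN.match(s.strip())]

def extract_related_work(sections):
    """sections: list of (title, text) pairs parsed from the sectioned XML."""
    return " ".join(text for title, text in sections
                    if "related work" in title.lower())

sections = [
    ("Introduction", "..."),
    ("Related Work", "Prior chatbot studies focus on engagement. "
                     "RQ1: How do different chatting styles influence people's self-disclosure?"),
]
related_work = extract_related_work(sections)
rqs = extract_rqs(re.split(r"(?<=[.?!])\s+", related_work))
for rq in rqs:                       # RQs found inside RelatedWork are removed
    related_work = related_work.replace(rq, "").strip()
print(rqs)
```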
The resulting dataset consists of 158 valid examples. We then split the dataset into train/dev/test sets with 108/25/25 examples. Note that the splits are carefully arranged in chronological order, i.e., papers in the dev and test sets are published later than those in the train split. This ensures that the RQs in the dev/test sets are the newest and are not revealed in the train set. The descriptive statistics of the final dataset can be found in Table 1.

Table 1: Descriptive statistics of the proposed dataset.

Split | Avg. # of words per Related Work | Avg. # of words per RQ | Avg. # of RQs per paper
train | 409.9 | 14.2 | 2.4
dev   | 369.9 | 13.8 | 2.4
test  | 332.4 | 15.7 | 2.2

4. Method

To tackle the lack-of-knowledge issue in HCI RQ generation, we investigate four types of approaches that leverage different types of knowledge. We present three sets of quantitative metrics to evaluate the quality of generated questions from three different aspects: accuracy, novelty, and level of surprise.

4.1. Generation Models with Various Knowledge

In this section, we describe the different models used for training and evaluating the RQ generation task.

Pre-trained GPT-3. As a large language model (LM) with 175 billion parameters, GPT-3 is a state-of-the-art learner that succeeds on many NLP tasks and has shown its capability in research paper writing [32], educational question generation [33], and open-domain QA [34]. GPT-3 is trained on 45 TB of text data from multiple sources, including Wikipedia and books, enabling the model to store a huge amount of general world knowledge.

Fine-tuned BART. We choose BART, a Transformer-based pre-trained generation model, as our backbone model. By fine-tuning BART on our RQ generation dataset, the model should acquire specific task knowledge, but the knowledge would be limited due to data scarcity.

Knowledge transfer from title generation. Transfer learning is an effective way to improve the model when only a limited amount of data on the target task is available. The available RQ data may be limited for a variety of reasons, e.g., errors during PDF parsing, or RQs that are not explicitly written in some papers. In contrast, paper titles are more accessible; the amount we extracted is 30 times that of the research questions. In semantic space, a paper's title represents its most significant contribution, which is strongly tied to its research topics. In most cases, paper titles can be considered a high-level summary of the solution to the research questions. Therefore, we propose to augment the BART model by transferring relevant task knowledge from title generation to RQ generation. The titles of papers in the train/dev/test sets are excluded and not used as input for the target task, so there is no data leakage.

Knowledge retrieval from HCI corpus. Knowledge retrieval is another promising solution to many knowledge-intensive NLP tasks [35] such as question answering [36] and information-seeking question generation [37]. To incorporate external domain knowledge, we apply the Dense Passage Retriever (DPR) [36] to retrieve the sentences most relevant to the input RelatedWork from the HCI corpus. The retrieved sentences are appended to the end of the original related work text as input.

4.2. Evaluation Methods for Novelty and Surprise

The task of HCI RQ generation aims to generate open-ended research questions to inspire researchers, which need to be highly creative. Recently, Computational Creativity has become an emerging field of study in the HCI domain [29]. Inspired by Boden's three criteria [31], "the ability to generate ideas or artifacts that are new, surprising and valuable", we introduce evaluation metrics to measure the novelty and level of surprise of generated RQs. We do not evaluate the value of generated RQs to HCI research since it would require extensive expert knowledge and human intervention.

4.2.1. Measuring novelty

To evaluate the novelty of generated RQs, i.e., how new/original the RQs are, we measure the difference between the generated RQs and prior RQs. We introduce two metrics: 1) an n-gram-based score, DistGain, and 2) an embedding-based score, DiffBS. We first build a set of prior RQs, denoted as $\{RQ_\mathrm{existing}\}$, from papers published earlier than the papers in the dev/test sets.

Distinct-k gain (DistGain or DG) is defined based on Distinct-k [38]. We calculate the average proportion of new unique n-grams in the newly generated RQ compared to the total number of n-grams in $\{RQ_\mathrm{existing}\}$. DistGain can be written as follows:

$$\mathrm{DistGain}_j = \frac{1}{M}\sum_{i=1}^{M}\frac{|\{y_j\} - \{x_i\}|}{|Y_j|}, \qquad (1)$$

where sequence $Y_j = (y_j)$ denotes the $j$-th generated RQ, sequence $X_i = (x_i)$ denotes the $i$-th RQ in $\{RQ_\mathrm{existing}\}$, and $M = |\{RQ_\mathrm{existing}\}|$. We average $\mathrm{DistGain}_j$ over all generated RQs to obtain an overall score, DistGain.

Difference in BERTScore (DiffBS or DBS). In order to measure the distance between a generated RQ and $\{RQ_\mathrm{existing}\}$, we calculate the cosine similarity of BERT embeddings [39] between the generated RQ and each $X_i \in \{RQ_\mathrm{existing}\}$. For each generated RQ, we calculate the F1-BERTScore for each pair $(Y_j, X_i)$ and average over all existing RQs:

$$\mathrm{DiffBS}_j = \frac{1}{M}\sum_{i=1}^{M}\bigl(1 - F_\mathrm{BERT}(Y_j, X_i)\bigr), \qquad (2)$$

where $F_\mathrm{BERT}(Y_j, X_i)$ denotes the F1-BERTScore calculated between $Y_j$ and $X_i$. The final DiffBS for each model is averaged over all generated RQs.
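As an illustration, the sketch below computes DistGain as we read Eq. (1); it is not the evaluation script, and the whitespace tokenization, the choice of n, and normalizing by the token count of the generated RQ are our assumptions.

```python
# DistGain (Eq. 1), read as: for each prior RQ, the fraction of the generated
# RQ's n-grams that the prior RQ does not contain, averaged over all prior RQs.
def ngrams(tokens, n=2):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def dist_gain(generated_rq, existing_rqs, n=2):
    y_tokens = generated_rq.lower().split()
    y_ngrams = ngrams(y_tokens, n)
    if not y_ngrams or not existing_rqs:
        return 0.0
    per_prior = [len(y_ngrams - ngrams(x.lower().split(), n)) / len(y_tokens)
                 for x in existing_rqs]           # |{y_j} - {x_i}| / |Y_j|
    return sum(per_prior) / len(per_prior)        # average over the M prior RQs

existing = ["RQ1: How do different chatting styles influence people's self-disclosure?"]
print(dist_gain("RQ1: How do maintainers respond to locked issues?", existing))
```

DiffBS (Eq. 2) can be computed in the same loop by replacing the n-gram difference with 1 − F1 from an off-the-shelf BERTScore implementation (e.g., the bert-score package).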
4.2.2. Measuring level of surprise

To measure the level of surprise, we refer to Boden's [31] definition of surprise: "an idea may be surprising because it's unfamiliar, or even unlikely". We propose a new automatic metric to measure the level of surprise in generated RQs.

Perplexity Gain (PPLGain). Perplexity, the inverse probability, is frequently used to measure how uncertain an LM is when generating the test data. Given a text, the higher the perplexity, the more uncertain the LM is about generating it. Assuming an LM has been successfully pre-trained with a sufficient amount of general text data, its perplexity reflects the unexpectedness, or level of surprise, of the LM with respect to the given text. Thus, we employ the perplexity of GPT-2 on the RQs:

$$\mathrm{ppl}(Y_j) = \exp\Bigl(-\frac{1}{T}\sum_{i=1}^{T}\log p(y_i \mid y_1, \ldots, y_{i-1})\Bigr). \qquad (3)$$

To measure the level of surprise, or unexpectedness, of the generated RQs, we calculate the difference between the perplexity of generated RQs and that of prior RQs. We define the perplexity gain as follows:

$$\mathrm{PPLGain}_j = \frac{\mathrm{ppl}(Y_j) - \frac{1}{M}\sum_{i=1}^{M}\mathrm{ppl}(X_i)}{\frac{1}{M}\sum_{i=1}^{M}\mathrm{ppl}(X_i)}. \qquad (4)$$

The final PPLGain score is averaged over all $Y_j$.
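A possible implementation of Eqs. (3)–(4) with the Hugging Face transformers library is sketched below; the GPT-2 checkpoint and the example RQs are placeholders, and the authors' actual evaluation code may differ.

```python
# Perplexity of an RQ under GPT-2 (Eq. 3) and PPLGain against prior RQs (Eq. 4).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token-level -log p(y_i | y_<i)
    return torch.exp(loss).item()            # exp of the mean NLL, as in Eq. (3)

def ppl_gain(generated_rq, existing_rqs):
    baseline = sum(perplexity(x) for x in existing_rqs) / len(existing_rqs)
    return (perplexity(generated_rq) - baseline) / baseline   # Eq. (4)

existing = ["RQ1: How do different chatting styles influence people's self-disclosure?"]
print(ppl_gain("RQ1: How do maintainers respond to locked issues?", existing))
```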
research and new research. Retrieval augmented mod- GPT-3. We prompt GPT-3 (text-davinci-002) with a els, i.e., BART-FT+retrieval, perform at the same level as one-shot example. We use a temperature of 0.7 and pick BART-FT on dev set and significantly outperforms BART- the top-1 generation. To align the output format with FT in terms of ROUGE-2 (20.9%↑) and ROUGE-L (4.2%↑) BART-based models, we post-process the GPT-3 output on the test set. The model also surpassed the baseline by replacing the question number. That means, “1.” or BART-FT in novelty and surprise metrics. Both knowl- “1)” will be replaced by “RQ1:”. edge augmentation methods improve the novelty and sur- BART-FT. We use the 𝑅𝑒𝑙𝑎𝑡𝑒𝑑𝑊 𝑜𝑟𝑘 section as input. prise of the generated RQs. This implies that introducing An HCI paper may have multiple research questions. additional knowledge from publications enables the lan- The latter ones are highly likely to be dependent on the guage model to generate RQs with new ideas outside the previous ones. Thus, instead of an individual RQ, our training set. Although both methods improve generation output is set as a sequence of concatenated RQs such as novelty and surprise, using knowledge transfer results in “RQ1: ..., RQ2: ...”. For all the experiments with the BART a higher increase. This might be because titles tend to re- model, the maximum input and output length is set as flect the contributions of studies in a self-contained and 768 and 128 tokens, respectively. abstractive manner. Similarity-based retrieved results BART-FT+transfer. To transfer knowledge from ti- tend to be individual sentences that might be confusing, tle generation, we first fine-tune the BART model on or even noisy when they are used as input, because they {𝑅𝑒𝑙𝑎𝑡𝑒𝑑𝑊 𝑜𝑟𝑘, title} pairs and then continue fine-tuning bring information outside the context paragraph. on {𝑅𝑒𝑙𝑎𝑡𝑒𝑑𝑊 𝑜𝑟𝑘, RQ} pairs. We carefully construct the dataset for title generation and avoid dev/test RQ data leaking in the training data of the title generation. 6. Discussion BART-FT+retrieval. To construct the retrieval cor- pus, we gather the abstract, introduction, and related 6.1. Case Study of Generated RQs work section of the existing papers that were published before dev/test papers, split the text into sentences, andTo further validate the proposed creativity metrics, we form an HCI corpus containing 310,955 sentences. We qualitatively compare examples of RQs generated by dif- retrieve top-3 sentences with pre-trained DPR using re- ferent models, as shown in Table. 3. It shows that RQs lated work as queries, and append the retrieved text to generated by GPT-3 appear to be less relevant compared input sequences. to other models, where the research topic is generalized from “GitHub issues” to “online discussion”. Meanwhile, the results generated by GPT-3 also suffered from repeti- 5.3. Results tion as the sequence of “incivility and toxicity in online Results on automatic evaluation are presented in Table 2. discussions” appeared twice in the given example. How- GPT-3 with general world knowledge increases ever, the language/words it uses could be new compared generation novelty, but under-performs fine-tuned to prior RQs. This implies that the incorporation of gen- models in accuracy and surprise level. Table 2 shows eral world knowledge generalizes the content of machine- that, compared to the BART models, GPT-3 performs created RQs to domains other than that of the target worse in terms of ROUGE and BERTScore on dev and paper. 
5.3. Results

The results of the automatic evaluation are presented in Table 2.

GPT-3 with general world knowledge increases generation novelty, but under-performs fine-tuned models in accuracy and surprise level. Table 2 shows that, compared to the BART models, GPT-3 performs worse in terms of ROUGE and BERTScore on dev and test, but it surpasses the other three models on DistGain and DiffBS, which are both measurements of generation novelty. However, all three BART-based models achieved higher PPLGain scores, which measure the level of surprise of generated RQs. As a large LM, GPT-3 possesses rich knowledge outside of the HCI research domain, which enables it to output words different from those in existing RQs, but those words may be off the research topic.

Knowledge augmentation is effective for HCI RQ generation. The transfer-learning-augmented model, i.e., BART-FT+transfer, outperforms the BART baselines in terms of ROUGE-2 (11.7%↑ on dev and 9.1%↑ on test) and ROUGE-L (4.7%↑ on dev and 3.1%↑ on test). The effectiveness of transfer learning shows that learning the task of title generation helps bridge the gap between existing research and new research. The retrieval-augmented model, i.e., BART-FT+retrieval, performs at the same level as BART-FT on the dev set and significantly outperforms BART-FT in terms of ROUGE-2 (20.9%↑) and ROUGE-L (4.2%↑) on the test set. The model also surpasses the baseline BART-FT on the novelty and surprise metrics. Both knowledge augmentation methods improve the novelty and surprise of the generated RQs. This implies that introducing additional knowledge from publications enables the language model to generate RQs with new ideas outside the training set. Although both methods improve generation novelty and surprise, using knowledge transfer results in a higher increase. This might be because titles tend to reflect the contributions of studies in a self-contained and abstractive manner, whereas similarity-based retrieved results tend to be individual sentences that might be confusing, or even noisy, when used as input, because they bring in information from outside the context paragraph.

Table 2: Automatic evaluation results of four models on HCI RQ generation. GPT-3 is prompted with one-shot examples. BART-FT denotes BART fine-tuned on our dataset. BART-FT+transfer denotes BART fine-tuned with transfer learning. BART-FT+retrieval denotes BART fine-tuned with knowledge retrieval. Metric notations: R-2: ROUGE-2, R-L: ROUGE-L, BS: BERTScore, DG: DistGain, DBS: DiffBS, PG: PPLGain. For all metrics, higher is better. T-tests are conducted for all BART-based models using GPT-3 as a baseline (with p < 0.001***, < 0.01**, < 0.05*).

Dev set:
Model | R-2 | R-L | BS | DG | DBS | PG
GPT-3 | 8.48 | 21.1 | 80.26 | 78.1 | 13.1 | -74.9
BART-FT | 12.92*** | 26.88*** | 83.48*** | 60.1 | 10.4 | -45.9***
BART-FT+transfer | 14.43*** | 28.14*** | 84.04*** | 65.1 | 10.8 | -37.1***
BART-FT+retrieval | 12.87** | 26.93** | 83.22*** | 60.5 | 10.5 | -42.1***

Test set:
Model | R-2 | R-L | BS | DG | DBS | PG
GPT-3 | 10.27 | 22.97 | 80.42 | 76.9 | 13.2 | -72.2
BART-FT | 12.37 | 26.65 | 82.57** | 59.9 | 10.5 | -39.9***
BART-FT+transfer | 13.5 | 27.48* | 82.97** | 64.5 | 11.1 | -27.6***
BART-FT+retrieval | 14.96 | 27.76* | 82.54*** | 63.2 | 10.8 | -35.5***

6. Discussion

6.1. Case Study of Generated RQs

To further validate the proposed creativity metrics, we qualitatively compare examples of RQs generated by different models, as shown in Table 3. It shows that the RQs generated by GPT-3 appear to be less relevant compared to the other models, as the research topic is generalized from "GitHub issues" to "online discussion". Meanwhile, the results generated by GPT-3 also suffered from repetition, as the sequence "incivility and toxicity in online discussions" appeared twice in the given example. However, the language/words it uses could be new compared to prior RQs. This implies that the incorporation of general world knowledge generalizes the content of machine-created RQs to domains other than that of the target paper. In this example, only BART-FT+transfer captured the information about "maintainers", which is critical in the ground-truth RQ2, showing the advantage of transfer learning. We also found that the output of BART achieved the highest PPLGain score (level of surprise), as its results mentioned interesting concepts including "code review" and "rejected patches".

Table 3: Generated RQs on a test example of the paper titled "How heated is it? Understanding GitHub locked issues".

Model | Generated RQs | R-2 | R-L | BS | DG | DBS | PG
Gold | RQ1: What are the characteristics of github locked issues? RQ2: What are the justifications given in the comments by project maintainers when locking issues as too heated? | - | - | - | - | - | -
GPT-3 | RQ1: How effective are github locked issues in curbing incivility and toxicity in online discussions? RQ2: How well do current machine learning detectors perform in detecting incivility and toxicity in online discussions? | 7.7 | 23.7 | 81.2 | 86.1 | 13.4 | -79.7
BART | RQ1: What kinds of incivility are present in code review discussions of rejected patches? RQ2: How common are code review issues locked as too heated? | 11.5 | 38.4 | 86.7 | 69.7 | 13.8 | -46.5
BART-FT+transfer | RQ1: How do maintainers respond to github issues locked as too heated? RQ2: What are maintainers' reactions to the locked issues? | 19.2 | 33.3 | 90.3 | 82.4 | 13.4 | -53.9
BART-FT+retrieval | RQ1: What kinds of incivility exist in github issues locked as too heated? RQ2: What are the most common types of incivility in github? | 23.0 | 39.2 | 87.5 | 90.9 | 13.5 | -59.9

6.2. Limitations and Future Work

Although the experimental results revealed RQ generation as a promising and meaningful task, several limitations exist in our current study. First, the training and evaluation of the generation methods were conducted on a relatively small-scale dataset, undermining the solidity of the conclusions yielded from the experiments. Future work should consider expanding the dataset by collecting more open-access publications and employing careful human annotation to expand the scale and improve the quality of the dataset. Second, the evaluation metrics used/proposed in this work did not fully consider the open-ended nature of the RQ generation task. In practice, a well-surveyed research topic should yield many open-ended creative research questions, while our evaluation was solely based on the comparison between the generated and ground-truth RQs. Further quantifiable human evaluation should be incorporated to validate the quality of generation. Additionally, the evaluation of GPT-3 as an RQ generation method only covered a one-shot scenario with an example manually selected by the researchers. Future work should take into consideration the potential impact of the demonstration selection method on the generation quality of GPT-3.

7. Conclusions

In this work, we proposed a novel NLP task of HCI RQ generation. We curated a dataset of 8,904 HCI publications and a collection of 158 examples of (related work, RQ) pairs.
In addition to accuracy metrics, we evaluated the creativity of RQ generation with metrics for novelty and surprise. We investigated the performance of four approaches that leverage different types of knowledge. Through experiments, we showed that general world knowledge in pre-trained LMs helped improve generation novelty, and that domain knowledge augmentation methods improved accuracy and level of surprise. Future studies could explore knowledge augmentation methods by incorporating different kinds of knowledge, e.g., general world knowledge, task knowledge, transferred domain knowledge, or retrieved textual knowledge.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 2119589. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] R. Elio, J. Hoover, I. Nikolaidis, M. Salavatipour, L. Stewart, K. Wong, About computing science research methodology, 2011.
[2] Y.-C. Lee, N. Yamashita, Y. Huang, W. Fu, "I hear you, I feel you": encouraging deep self-disclosure through a chatbot, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–12.
[3] H. R. Hartson, Human–computer interaction: Interdisciplinary roots and trends, Journal of Systems and Software 43 (1998) 103–118.
[4] J. Bardzell, S. Bardzell, Humanistic HCI, Synthesis Lectures on Human-Centered Informatics 8 (2015) 1–185.
[5] C. Kim, H. Kim, S. H. Han, C. Kim, M. K. Kim, S. H. Park, Developing a technology roadmap for construction R&D through interdisciplinary research efforts, Automation in Construction 18 (2009) 330–337.
[6] D. Rhoten, Interdisciplinary research: Trend or transition, Items and Issues 5 (2004) 6–11.
[7] C. Rusu, V. Rusu, Teaching HCI: a challenging intercultural, interdisciplinary, cross-field experience, in: International Workshop on Intercultural Collaboration, Springer, 2007, pp. 344–354.
[8] P. Dourish, J. Finlay, P. Sengers, P. Wright, Reflective HCI: Towards a critical technical practice, in: CHI '04 Extended Abstracts on Human Factors in Computing Systems, 2004, pp. 1727–1728.
[9] W. E. Mackay, A.-L. Fayard, HCI, natural science and design: a framework for triangulation across disciplines, in: Proceedings of the 2nd Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, 1997, pp. 223–234.
[10] J. Lazar, J. H. Feng, H. Hochheiser, Research Methods in Human-Computer Interaction, Morgan Kaufmann, 2017.
[11] M. Yasunaga, J. Kasai, R. Zhang, A. R. Fabbri, I. Li, D. Friedman, D. R. Radev, ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 7386–7393.
[12] M. Färber, A. Jatowt, Citation recommendation: approaches and datasets, International Journal on Digital Libraries 21 (2020) 375–405.
[13] S. Spangler, A. D. Wilkins, B. J. Bachman, M. Nagarajan, T. Dayaram, P. Haas, S. Regenbogen, C. R. Pickering, A. Comer, J. N. Myers, et al., Automated hypothesis generation based on mining scientific literature, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 1877–1886.
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[15] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
[16] N. Duan, D. Tang, P. Chen, M. Zhou, Question generation for question answering, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 866–874.
[17] R. Puri, R. Spring, M. Shoeybi, M. Patwary, B. Catanzaro, Training question answering models from synthetic data, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5811–5826. URL: https://aclanthology.org/2020.emnlp-main.468. doi:10.18653/v1/2020.emnlp-main.468.
[18] S. Cao, L. Wang, Controllable open-ended question generation with a new question type ontology, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6424–6439.
[19] G. Chen, J. Yang, C. Hauff, G.-J. Houben, LearningQ: a large-scale dataset for educational question generation, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[20] H. Gong, L. Pan, H. Hu, KHANQ: A dataset for generating deep questions in education, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 5925–5938.
[21] S. Pollak, V. Podpecan, J. Kranjc, B. Lesjak, N. Lavrac, Scientific question generation: Pattern-based and graph-based RoboCHAIR methods, in: ICCC, 2021, pp. 140–148.
[22] K. Ono, K. Sumita, S. Miike, Abstract generation based on rhetorical structure extraction, arXiv preprint cmp-lg/9411023 (1994).
[23] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, N. Goharian, A discourse-aware attention model for abstractive summarization of long documents, arXiv preprint arXiv:1804.05685 (2018).
[24] I. Cachola, K. Lo, A. Cohan, D. S. Weld, TLDR: Extreme summarization of scientific documents, arXiv preprint arXiv:2004.15011 (2020).
[25] Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, N. F. Rajani, ReviewRobot: Explainable paper review generation based on knowledge synthesis, arXiv preprint arXiv:2010.06119 (2020).
[26] Q. Wang, Y. Xiong, Y. Zhang, J. Zhang, Y. Zhu, AutoCite: Multi-modal representation fusion for contextual citation generation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 788–796.
[27] Q. Wang, L. Huang, Z. Jiang, K. Knight, H. Ji, M. Bansal, Y. Luan, PaperRobot: Incremental draft generation of scientific ideas, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1980–1991.
[28] A. Cardoso, T. Veale, G. A. Wiggins, Converging on the divergent: The history (and future) of the international joint workshops in computational creativity, AI Magazine 30 (2009) 15–15.
[29] G. Franceschelli, M. Musolesi, Creativity and machine learning: A survey, arXiv preprint arXiv:2104.02726 (2021).
[30] W. Yu, C. Zhu, T. Zhao, Z. Guo, M. Jiang, Sentence-permuted paragraph generation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5051–5062.
[31] M. A. Boden, The Creative Mind: Myths and Mechanisms, Routledge, 2004.
[32] M. Lee, P. Liang, Q. Yang, CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities, in: CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.
[33] S. Wang, Y. Liu, Y. Xu, C. Zhu, M. Zeng, Want to reduce labeling cost? GPT-3 can help, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4195–4205.
[34] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, M. Jiang, Generate rather than retrieve: Large language models are strong context generators, in: International Conference on Learning Representations (ICLR), 2023.
[35] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[36] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.
[37] M. Gaur, K. Gunaratna, V. Srinivasan, H. Jin, ISEEQ: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 10672–10680.
[38] J. Li, M. Galley, C. Brockett, J. Gao, W. B. Dolan, A diversity-promoting objective function for neural conversation models, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 110–119.
[39] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675 (2019).