<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generation for Human-Computer Interaction Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yiren Liu</string-name>
          <email>yirenl2@illinois.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mengxia Yu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meng Jiang</string-name>
          <email>mjiang2@nd.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yun Huang</string-name>
          <email>yunhuang@illinois.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Illinois Urbana-Champaign</institution>
          ,
          <addr-line>Champaign, IL, 61820</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Notre Dame</institution>
          ,
          <addr-line>Notre Dame, IN, 46556</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>It is essential to develop innovative and original research questions/ideas for interdisciplinary research fields, such as Human-Computer Interaction (HCI). In this work, we focus on discussing how recent natural language generation (NLG) methodologies can be applied to promote the formulation of creative research questions. We collect and curate a dataset that contains texts of RQs and related work sections from HCI papers, and introduce a new NLG task of automatic HCI research question (RQ) generation. In addition to applying common NLG metrics used to evaluate generation accuracy, including ROUGE and BERTScore, we propose two sets of new metrics for evaluating the creativity of generated RQs: 1) DistGain and DiffBS for novelty, and 2) PPLGain for the level of surprise. The task is challenging due to the lack of external knowledge. We investigate four approaches to enhance the generation models with (1) general world knowledge, (2) task knowledge, (3) transferred knowledge, and (4) retrieved knowledge. The results of the experiment indicate that the incorporation of additional knowledge benefits both the accuracy and creativity of RQ generation. The dataset used in this study can be found at: https://github.com/yiren-liu/HAI-GEN-release.</p>
      </abstract>
      <kwd-group>
        <kwd>Research</kwd>
        <kwd>datasets</kwd>
        <kwd>text generation</kwd>
        <kwd>creativity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Asking novel research questions (RQs) is key to starting
innovative scientific studies. As David Hilbert states, “he
who seeks for methods without having a definite
problem in mind seeks for the most part in vain”. Proficient
scientists read and analyze representative literature in
a specific domain in order to identify the limitations of
the existing work and ask new RQs [<xref ref-type="bibr" rid="ref1">1</xref>]. In computer
science research, methodologies are often derived from a
study’s core research question(s) [<xref ref-type="bibr" rid="ref1">1</xref>]. RQs are one of the most important components of HCI
research and are often explicitly stated in HCI research
papers. As an outline of the whole
paper, RQs are usually proposed in the beginning sections
and stated in a unified format, e.g., “RQ1: ..., RQ2:
...”. For example, in the HCI paper from Lee et al. [<xref ref-type="bibr" rid="ref2">2</xref>], the
authors listed two RQs at the end of the related work
section: “RQ1: How do different chatting styles
influence people’s self-disclosure? and RQ2:
How do different chatting styles influence
people’s self-disclosure over time?”
      </p>
      <p>Prior work has explored machine assistance for scientific tasks such as
summarizing scientific documents [<xref ref-type="bibr" rid="ref11">11</xref>], recommending related work [<xref ref-type="bibr" rid="ref12">12</xref>], and generating
new biomedical hypotheses [<xref ref-type="bibr" rid="ref13">13</xref>]. If a machine could
generate RQs based on existing literature, it would help HCI
researchers discover potential research topics, though
they would still need to verify the machine-suggested RQ
candidates. However, to the best of our knowledge, there is no
AI model or research on automating HCI RQ generation.</p>
      <p>In this work, we propose a novel task of research question
generation in the field of HCI research. Given the
related work section (denoted as <italic>RelatedWork</italic>), the task
aims to generate one or multiple research questions. We
notice that, given a set of literature, it is easy to come
up with plausible but too generic RQs on broad research
topics. Therefore, the challenge of our task lies in the fact that
when the same set of literature is surveyed from different
perspectives, i.e., given different <italic>RelatedWork</italic>, the
generated RQs should be different correspondingly.</p>
      <p>To study this new problem, we build a dataset from
HCI literature. We collect 8,904 HCI papers from Arxiv
and manually extract 158 data examples. Each example
has the text of the related work section and the text of the
research questions. In this study, we develop and evaluate
four approaches: (1) prompting pre-trained GPT-3 [<xref ref-type="bibr" rid="ref14">14</xref>],
which has knowledge from its pre-training corpus, (2) BART [<xref ref-type="bibr" rid="ref15">15</xref>]
fine-tuned on our limited training examples,
(3) transfer learning for knowledge augmentation, which
warms up the model by generating paper titles, which are
much more accessible than RQs, and (4) retrieval-based
augmentation that uses information from the HCI literature
text we provide.</p>
      <p>We evaluate the RQ generation quality based on three
sets of automated metrics: (1) ROUGE and BERTScore
with target RQs as references for accuracy, (2) DistGain
and DiffBS for novelty, and (3) PPLGain for level of surprise.
We propose to use these metrics for evaluation
for practical reasons. First, when RQs are not explicitly
spelled out in HCI papers, the model that yields greater
accuracy could be more effectively utilized. As researchers
try to quickly form the RQs given a large amount of
surveyed papers, the model could aid in boosting the efficiency
of literature review for both research and learning
purposes. Second, the model that leads to higher novelty
and surprise could be used when the HCI papers already
explicitly present RQs. In this case, researchers can compare
the existing “ground truth” RQs with the generated
RQs to explore “new” directions for future research.</p>
      <p>The main contributions of this study are:</p>
      <list list-type="bullet">
        <list-item>
          <p>We propose the task of HCI research question generation, collecting and releasing a dataset.</p>
        </list-item>
        <list-item>
          <p>We design and develop four types of models that leverage various knowledge to improve RQ generation.</p>
        </list-item>
        <list-item>
          <p>We evaluate the accuracy, novelty, and level of surprise of generated RQs and find that knowledge transfer is the most promising approach when the available task data size is small.</p>
        </list-item>
      </list>
      <p>Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia. †These authors contributed equally.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Question Generation</title>
        <p>Automatic question generation (QG) has been studied as
a data augmentation approach for Question Answering [<xref ref-type="bibr" rid="ref16">16</xref>]
and Machine Reading Comprehension [<xref ref-type="bibr" rid="ref17">17</xref>]. Most
existing QG studies focus on factoid questions, whose
answers are short pieces of text. The QG datasets are
usually converted from question-answering datasets.
Instead of factoid questions, RQs are open-ended questions,
the generation of which is found to be more challenging
in prior work [<xref ref-type="bibr" rid="ref18">18</xref>], because it requires a deep
understanding and needs to be addressed with long-form
answers. Nevertheless, the existing open-ended question
generation tasks are conditioned on the answers.
More research is needed in order to generate
unsolved open-ended HCI research problems.</p>
        <p>For educational domains, QG systems often aim at
generating assessment questions, e.g., multi-choice questions,
to help students understand the learning materials
and reduce the manual workload required from instructors.
Emerging studies have proposed datasets for educational
QG [<xref ref-type="bibr" rid="ref19 ref20 ref21">19, 20, 21</xref>]. However, these works aim
to generate questions that help with comprehension of
learning materials, not to explore potential unsolved
research problems.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Scientific Text Generation</title>
        <p>In order to reduce the burden of scientific writing or
simulate scientists’ behaviors, there is a line of research
aiming at automatic scientific text generation. Since early
work on abstract generation [<xref ref-type="bibr" rid="ref22">22</xref>], various approaches
have been proposed for scientific text summarization [<xref ref-type="bibr" rid="ref11">23, 24, 11</xref>].
Spangler et al. [<xref ref-type="bibr" rid="ref13">13</xref>] leverage text mining for
scientific hypothesis generation. ReviewRobot [25]
utilizes information extracted from knowledge graphs to
construct synthetic paper reviews from templates.
AutoCite [26] leverages multi-modal information to
generate contextualized citation texts. PaperRobot [27]
incrementally generates abstracts, conclusions, future work, and
titles for a follow-on paper. However, automatic HCI
RQ generation has not been studied as an NLP task.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Evaluating Creativity in Text Generation</title>
        <p>Methods for enhancing the ability of machine learning
models to produce original content have been a crucial
topic in the emerging research domain of computational
creativity [28]. Franceschelli and Musolesi [29]
summarized existing methods for creativity evaluation and
discussed their potential application in recent deep learning
models (e.g., VAE and GAN). However, most of these
existing evaluation methods are highly subjective and
require strong human intervention. With the recent
advances in text generation methods based on pre-trained
language models, additional research is still needed
in order to automatically and objectively
evaluate the creativity of text generation models. Prior NLP
research has discussed potential methods to
automatically evaluate generation, taking into consideration both
the accuracy and diversity of the generated results [30].</p>
        <p>In this work, we employ Boden’s three criteria [31] for
studying machine creativity, defined as “the ability to
generate ideas or artifacts that are new, surprising and
valuable”, to propose new metrics for creativity evaluation
in text generation tasks.</p>
        <p>Table 1. Descriptive statistics of the dataset by split (train, dev, test): average number of words per Related Work section, average number of words per RQ, and average number of RQs per paper.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Definition and Data</title>
      <p>A research question refers to a question that a study
or research project aims to address. In HCI research
publications, RQs are often proposed after the survey of
related work. Based on their understanding of existing
literature and their citation purposes, different papers will
compose the related work sections differently, even if
they cite the same set of literature. Correspondingly,
their research questions should be different.</p>
      <p>We formally define the task of HCI RQ generation with
task variables as follows.</p>
      <p>Definition 1 (HCI Research Question Generation).
Given the <italic>RelatedWork</italic> of an HCI research paper, the
generation model is required to maximize <italic>P</italic>(<italic>RQ</italic> | <italic>RelatedWork</italic>).</p>
      <p>In real-life HCI research scenarios, researchers strive
to propose highly novel and creative research questions
based on existing work. Thus, we also propose to measure
the creativity of generated research questions. Based
on Boden’s criteria [31], which describe creativity as “the ability to
generate ideas or artifacts that are new, surprising, and
valuable”, we construct the creativity measurement as a
combination of two aspects: 1) novelty and 2) level of surprise.
We do not evaluate the value of generated RQs since we
believe it would require extensive expert knowledge and
is hardly feasible without human intervention.</p>
      <p>Definition 2 (Generation Creativity). 1) We measure
the novelty of a set of RQs by comparing their similarity
to the RQs of prior publications within our collected corpus;
2) We measure the level of surprise of a set of RQs based
on their perplexity with respect to the perplexity of existing
RQs using a large PLM (e.g., GPT-2).</p>
      <p>To collect open-access HCI publications, we used
papers available through Arxiv. We collected PDF files of
papers under the category of Human-Computer
Interaction (cs.HC; https://arxiv.org/list/cs.HC/recent) using the public API provided by Arxiv.
This resulted in a total of 8,904 HCI-related papers
(https://github.com/yiren-liu/HAI-GEN-release). We
then converted these papers from PDF to sectioned XML
format using GROBID (https://github.com/kermitt2/grobid) and
SciPDF Parser (https://github.com/titipata/scipdf_parser) in order to
further analyze and filter them based on their textual context. The
section and title information is preserved in the XML
version of our collected papers. For research questions,
we conducted pattern matching of question sentences
starting with “RQ”. In order to collect text from related
work sections, i.e., <italic>RelatedWork</italic>, we extract sections with
titles containing the keywords “related work”. We
remove RQs from <italic>RelatedWork</italic> if they appear.</p>
      <p>The resulting dataset consists of 158 valid examples.
We then split the dataset into train/dev/test sets with
108/25/25 examples. Note that the splits are
arranged in chronological order, i.e., papers in the dev
and test sets are published later than those in the train
split. This is to ensure the RQs in the dev/test sets are
the newest and are not revealed in the train set. The
descriptive statistics of the final dataset can be found in
Table 1.</p>
      <sec id="sec-3-1">
        <title>4. Method</title>
        <p>To tackle the lack of knowledge issue in HCI RQ
generation, we investigate four types of approaches that
leverage different types of knowledge. We present three
sets of quantitative metrics to evaluate the quality of
generated questions from three different aspects: accuracy,
novelty and level of surprise.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.1. Generation Models with Various Knowledge</title>
        <p>In this section, we describe the different models used for
training and evaluating the RQ generation task.</p>
        <p>Pre-trained GPT-3. As a large language model (LM)
with 175 billion parameters, GPT-3 is the
state-of-the-art learner succeeding on many NLP tasks and shows
its capability in research paper writing [32], educational
question generation [33] and open-domain QA [34].
GPT-3 is trained on 45 TB of text data from multiple sources,
which include Wikipedia and books, enabling the model
to store a huge amount of general world knowledge.</p>
        <p>Fine-tuned BART. We choose BART, a
Transformer-based pretrained generation model, as our backbone
model. By fine-tuning BART on our RQ generation
dataset, the model should acquire specific task
knowledge, but the knowledge would be limited due to data
scarcity.</p>
        <p>Knowledge transfer from title generation.
Transfer learning is an effective way to improve the model
when only a limited amount of data on the target task
is available. The available RQ data may be limited for a
variety of reasons, e.g., errors during PDF parsing, or RQs
that are not explicitly written in some papers. In contrast,
paper titles are more accessible: the amount we
extracted is 30 times that of research questions. In
semantic space, a paper’s title represents its most significant
contribution, which is strongly tied to its research topics.
In most cases, paper titles can be considered as a high-level
summary of the solution to the research questions.
Therefore, we propose to augment the BART model by
transferring relevant task knowledge from title generation
to RQ generation. The titles in the train/dev/test sets are
excluded, and titles are not used as input for the target task,
so there is no data leakage.</p>
        <p>Knowledge retrieval from HCI corpus. Knowledge
retrieval is another promising solution to many
knowledge-intensive NLP tasks [35] such as question
answering [36] and information-seeking question
generation [37]. To incorporate external domain knowledge,
we apply the Dense Passage Retriever (DPR) [36] to
retrieve sentences most relevant to the input <italic>RelatedWork</italic>
from the HCI corpus. The retrieved sentences are appended
to the end of the original related work text as
input.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2. Evaluation Methods for Novelty and Surprise</title>
        <p>The task of HCI RQ generation aims to generate
open-ended research questions to inspire researchers, which
need to be highly creative. Recently, Computational
Creativity has become an emerging field of study in the HCI
domain [29]. Inspired by Boden’s three criteria [31] “the
ability to generate ideas or artifacts that are new,
surprising and valuable”, we introduce evaluation metrics to
measure the novelty and level of surprise of generated
RQs. We do not evaluate the value of generated RQs
to HCI research since it would require extensive expert
knowledge and human intervention.</p>
        <p>4.2.1. Measuring novelty</p>
        <p>To evaluate the novelty of generated RQs, i.e., how
new/original the RQs are, we measure the difference between
the generated RQs and prior RQs. We introduce two
metrics: 1) an <italic>n</italic>-gram-based score, DistGain, and 2) an
embedding-based score, DiffBS. We first make a set of
prior RQs, denoted as {<italic>RQ</italic><sub>existing</sub>}, from papers published
earlier than the papers in the dev/test sets.</p>
        <p>Distinct-<italic>n</italic> gain (DistGain or DG) is defined based
on Distinct-<italic>n</italic> [38]. We calculate the average proportion
of new unique <italic>n</italic>-grams in the newly generated RQ compared
to the total number of <italic>n</italic>-grams in the {<italic>RQ</italic><sub>existing</sub>}.
DistGain can be written as follows:
<disp-formula id="eq-1"><label>(1)</label><tex-math>DG_i = \frac{1}{M} \sum_{j=1}^{M} \frac{\left| \{x_i\} - \{y_j\} \right|}{\left| y_j \right|},</tex-math></disp-formula>
where the sequence <italic>x</italic><sub><italic>i</italic></sub> denotes the <italic>i</italic>-th generated RQ,
the sequence <italic>y</italic><sub><italic>j</italic></sub> denotes the <italic>j</italic>-th RQ in {<italic>RQ</italic><sub>existing</sub>},
{<italic>x</italic><sub><italic>i</italic></sub>} and {<italic>y</italic><sub><italic>j</italic></sub>} denote their sets of unique <italic>n</italic>-grams,
and <italic>M</italic> = |{<italic>RQ</italic><sub>existing</sub>}|. We average the <italic>DG</italic><sub><italic>i</italic></sub> of all
generated RQs to obtain an overall score <italic>DG</italic>.</p>
        <p>Difference in BERTScore (DiffBS or DBS): In
order to measure the distance between the generated RQ
and {<italic>RQ</italic><sub>existing</sub>}, we calculate the cosine similarity of BERT
embeddings [39] between the generated RQ and each
RQ in {<italic>RQ</italic><sub>existing</sub>}. For each generated RQ, we calculate
the F1-BERTScore for each pair (<italic>x</italic><sub><italic>i</italic></sub>, <italic>y</italic><sub><italic>j</italic></sub>), and average over
all existing RQs:
<disp-formula id="eq-2"><label>(2)</label><tex-math>DBS_i = \frac{1}{M} \sum_{j=1}^{M} \left( 1 - F_{\mathrm{BERT}}(x_i, y_j) \right),</tex-math></disp-formula>
where <italic>F</italic><sub>BERT</sub>(<italic>x</italic><sub><italic>i</italic></sub>, <italic>y</italic><sub><italic>j</italic></sub>) denotes the F1-BERTScore calculated
between <italic>x</italic><sub><italic>i</italic></sub> and <italic>y</italic><sub><italic>j</italic></sub>. The final <italic>DBS</italic>
for each model is averaged over all generated RQs.</p>
        <p>4.2.2. Measuring level of surprise</p>
        <p>To measure the level of surprise, we refer to Boden’s [31]
definition of surprise: “an idea may be surprising because
it’s unfamiliar, or even unlikely”. We propose a new
automatic metric to measure the level of surprise in generated
RQs.</p>
        <p>Perplexity Gain (PPLGain). Perplexity, the length-normalized inverse
probability of a text, is frequently used to measure how uncertain
an LM is when generating the test data. Given a text, the higher
the perplexity is, the more uncertain the LM is about
generating it. Assuming an LM is successfully pre-trained
with a sufficient amount of general text data, the
perplexity reflects the unexpectedness, or level of surprise, of
the LM to the given text. Thus, we employ the perplexity
of GPT-2 on an RQ <italic>x</italic> = (<italic>x</italic><sub>1</sub>, ..., <italic>x</italic><sub><italic>T</italic></sub>):
<disp-formula id="eq-3"><label>(3)</label><tex-math>PPL(x) = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_1, \ldots, x_{t-1}) \right).</tex-math></disp-formula></p>
        <p>To measure the level of surprise, or unexpectedness, of
the generated RQs, we calculate the difference between
the perplexity of generated RQs and prior RQs. We define
the perplexity gain as follows:
<disp-formula id="eq-4"><label>(4)</label><tex-math>PPLGain = \frac{\frac{1}{N} \sum_{i=1}^{N} PPL(x_i)}{\frac{1}{M} \sum_{j=1}^{M} PPL(y_j)} - 1.</tex-math></disp-formula>
The final PPLGain score is averaged over all generated RQs, where <italic>N</italic> is the number of generated RQs.</p>
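        <p>The novelty and surprise metrics described in this section can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors’ implementation: whitespace tokenization, the function names dist_gain, diff_bs, perplexity and ppl_gain, and the per-pair normalization used in dist_gain are assumptions. The similarity argument is a placeholder for an F1-BERTScore function, and the per-token log-probabilities are assumed to come from a language model such as GPT-2.</p>
        <preformat>
```python
import math


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dist_gain(generated, existing, n=2):
    """Distinct-n gain (assumed form): for each generated RQ, count its unique
    n-grams that do not occur in a prior RQ, divide by the number of n-grams
    in that prior RQ, average over prior RQs, then over generated RQs."""
    per_rq = []
    for x in generated:
        x_set = set(ngrams(x.split(), n))
        ratios = []
        for y in existing:
            y_grams = ngrams(y.split(), n)
            new_grams = x_set - set(y_grams)
            # max(..., 1) guards against prior RQs shorter than n tokens.
            ratios.append(len(new_grams) / max(len(y_grams), 1))
        per_rq.append(sum(ratios) / len(ratios))
    return sum(per_rq) / len(per_rq)


def diff_bs(generated, existing, similarity):
    """Average of (1 - similarity) over prior RQs, then over generated RQs.
    `similarity(x, y)` is a placeholder for an F1-BERTScore function."""
    per_rq = [
        sum(1.0 - similarity(x, y) for y in existing) / len(existing)
        for x in generated
    ]
    return sum(per_rq) / len(per_rq)


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of each token given its prefix."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_gain(generated_logprobs, existing_logprobs):
    """Relative gain of mean perplexity of generated RQs over prior RQs."""
    mean_generated = sum(perplexity(lp) for lp in generated_logprobs) / len(generated_logprobs)
    mean_existing = sum(perplexity(lp) for lp in existing_logprobs) / len(existing_logprobs)
    return mean_generated / mean_existing - 1.0


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    prior = ["how do users trust chatbots", "how do users perceive voice assistants"]
    new = ["how do users trust chatbots", "why do maintainers close issues"]

    def jaccard(x, y):
        # Crude token-overlap stand-in; this is NOT BERTScore.
        a, b = set(x.split()), set(y.split())
        return len(a.intersection(b)) / len(a.union(b))

    print("DistGain:", dist_gain(new, prior, n=2))
    print("DiffBS (Jaccard stand-in):", diff_bs(new, prior, jaccard))
    print("PPLGain:", ppl_gain([[math.log(0.25)] * 3], [[math.log(0.5)] * 2]))
```
        </preformat>
        <p>In practice, the similarity argument would be replaced by an F1-BERTScore function and the token log-probabilities would be obtained from GPT-2; both are left outside the sketch to keep it free of external dependencies.</p>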
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <sec id="sec-4-1">
        <title>5.1. Evaluation Methods</title>
        <p>We evaluate the generation quality with three sets of
metrics: (1) ROUGE and BERTScore for measuring
accuracy; (2) DistGain and DiffBS for measuring novelty; (3)
PPLGain for measuring surprise.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Experimental Settings</title>
        <p>We evaluated four text generation models with different
types of knowledge over our proposed dataset.</p>
        <p>GPT-3. We prompt GPT-3 (text-davinci-002) with a
one-shot example. We use a temperature of 0.7 and pick
the top-1 generation. To align the output format with
BART-based models, we post-process the GPT-3 output
by replacing the question number. That is, “1.” or
“1)” is replaced by “RQ1:”.</p>
        <p>BART-FT. We use the <italic>RelatedWork</italic> section as input.
An HCI paper may have multiple research questions.
The later ones are highly likely to be dependent on the
previous ones. Thus, instead of an individual RQ, our
output is set as a sequence of concatenated RQs such as
“RQ1: ..., RQ2: ...”. For all the experiments with the BART
model, the maximum input and output lengths are set to
768 and 128 tokens, respectively.</p>
        <p>BART-FT+transfer. To transfer knowledge from
title generation, we first fine-tune the BART model on
{<italic>RelatedWork</italic>, title} pairs and then continue fine-tuning
on {<italic>RelatedWork</italic>, RQ} pairs. We carefully construct the
dataset for title generation to avoid leaking dev/test RQ data
into the training data of the title generation.</p>
        <p>BART-FT+retrieval. To construct the retrieval
corpus, we gather the abstract, introduction, and related
work sections of the existing papers that were published
before the dev/test papers, split the text into sentences, and
form an HCI corpus containing 310,955 sentences. We
retrieve the top-3 sentences with pre-trained DPR using
related work as queries, and append the retrieved text to the
input sequences.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.3. Results</title>
        <p>Results on automatic evaluation are presented in Table 2.</p>
        <p>GPT-3 with general world knowledge increases
generation novelty, but under-performs fine-tuned
models in accuracy and surprise level. Table 2 shows
that, compared to the BART models, GPT-3 performs
worse in terms of ROUGE and BERTScore on dev and
test, but it surpasses the other three models on DistGain
and DiffBS, which are both measurements of generation
novelty. However, all three BART-based models
achieved higher PPLGain scores, which measure the level
of surprise of generated RQs. As a large LM, GPT-3
possesses rich knowledge outside of the HCI research
domain, which enables it to output different words from
existing RQs, but those words may be off the research
topic.</p>
        <p>Knowledge augmentation is effective on HCI RQ
generation. The transfer-learning-augmented model, i.e.,
BART-FT+transfer, outperforms the BART baseline in terms
of ROUGE-2 (11.7%↑ on dev and 9.1%↑ on test) and
ROUGE-L (4.7%↑ on dev and 3.1%↑ on test). The
effectiveness of transfer learning shows that learning the task
of title generation helps bridge the gap between existing
research and new research. The retrieval-augmented
model, i.e., BART-FT+retrieval, performs at the same level as
BART-FT on the dev set and significantly outperforms
BART-FT in terms of ROUGE-2 (20.9%↑) and ROUGE-L (4.2%↑)
on the test set. The model also surpasses the baseline
BART-FT in novelty and surprise metrics. Both
knowledge augmentation methods improve the novelty and
surprise of the generated RQs. This implies that introducing
additional knowledge from publications enables the
language model to generate RQs with new ideas outside the
training set. Although both methods improve generation
novelty and surprise, using knowledge transfer results in
a higher increase. This might be because titles tend to
reflect the contributions of studies in a self-contained and
abstractive manner. Similarity-based retrieved results
tend to be individual sentences that might be confusing,
or even noisy, when they are used as input, because they
bring in information from outside the context paragraph.</p>
        <p>Table 2. Results of automatic evaluation for GPT-3, BART-FT, BART-FT+transfer, and BART-FT+retrieval (recoverable column headers: R-2, BS).</p>
      </sec>
      <sec id="sec-4-5">
        <title>6. Discussion</title>
        <sec id="sec-4-5-1">
          <title>6.1. Case Study of Generated RQs</title>
          <p>To further validate the proposed creativity metrics, we
qualitatively compare examples of RQs generated by
different models, as shown in Table 3. It shows that RQs
generated by GPT-3 appear to be less relevant compared
to other models, where the research topic is generalized
from “GitHub issues” to “online discussion”. Meanwhile,
the results generated by GPT-3 also suffered from
repetition, as the sequence “incivility and toxicity in online
discussions” appeared twice in the given example.
However, the language/words it uses could be new compared
to prior RQs. This implies that the incorporation of
general world knowledge generalizes the content of
machine-created RQs to domains other than that of the target
paper. In this example, only BART-FT+transfer captured
the information about “maintainers”, which is critical in
the ground truth RQ2, showing the advantage of
transfer learning. We also found that the output of BART
achieved the highest PPLGain score (level of surprise),
as the results mentioned interesting concepts including</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>6.2. Limitations and Future Work</title>
        <p>Although the experimental results revealed RQ
generation as a promising and meaningful task, several
limitations exist in our current study. First, the training
and evaluation of generation methods were conducted
on a relatively small-scale dataset, undermining the
solidity of the conclusions drawn from the experiments.
Future work should consider expanding the dataset by
collecting more open-access publications and employing
careful human annotation to expand the scale and
improve the quality of the dataset. Second, the evaluation
metrics used/proposed in this work did not fully
consider the open-ended nature of the RQ generation task.
In practice, a well-surveyed research topic should yield
many open-ended creative research questions, while our
evaluation was solely based on the comparison between
the generated and ground-truth RQs. Further quantifiable
human evaluation should be incorporated to validate
the quality of generation. Additionally, the evaluation
of GPT-3 as an RQ generation method only covered a
one-shot scenario with an example manually selected by
researchers. Future work should take into
consideration the potential impact of the demonstration selection
method on the generation quality of GPT-3.</p>
      </sec>
      <sec id="sec-4-6">
        <title>7. Conclusions</title>
        <p>In this work, we proposed a novel NLP task of HCI RQ
generation. We curated a dataset of 8,904 HCI
publications and a collection of 158 examples of (related work,
RQ)-pairs. In addition to accuracy metrics, we evaluated
the creativity of RQ generation with metrics for
novelty and surprise. We investigated the performance of
four approaches that leverage different types of
knowledge. Through experiments, we showed that general world
knowledge in a pre-trained LM helped improve generation
novelty, and that domain knowledge augmentation methods
improved accuracy and level of surprise. Future studies
could explore knowledge augmentation methods by
incorporating different kinds of knowledge, e.g., general
world knowledge, task knowledge, transferred domain
knowledge, or retrieved textual knowledge.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the
National Science Foundation under Grant No. 2119589. Any
opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National
Science Foundation.</p>
      <p>[22] […] based on rhetorical structure extraction, arXiv preprint cmp-lg/9411023 (1994).</p>
      <p>[23] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, N. Goharian, A discourse-aware attention model for abstractive summarization of long documents, arXiv preprint arXiv:1804.05685 (2018).</p>
      <p>[24] I. Cachola, K. Lo, A. Cohan, D. S. Weld, Tldr: Extreme summarization of scientific documents, arXiv preprint arXiv:2004.15011 (2020).</p>
      <p>[25] Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, N. F. Rajani, Reviewrobot: Explainable paper review generation based on knowledge synthesis, arXiv preprint arXiv:2010.06119 (2020).</p>
      <p>[26] Q. Wang, Y. Xiong, Y. Zhang, J. Zhang, Y. Zhu, Autocite: Multi-modal representation fusion for contextual citation generation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 788–796.</p>
      <p>[27] Q. Wang, L. Huang, Z. Jiang, K. Knight, H. Ji, M. Bansal, Y. Luan, Paperrobot: Incremental draft generation of scientific ideas, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1980–1991.</p>
      <p>[28] A. Cardoso, T. Veale, G. A. Wiggins, Converging on the divergent: The history (and future) of the international joint workshops in computational creativity, AI magazine 30 (2009) 15–15.</p>
      <p>[29] G. Franceschelli, M. Musolesi, Creativity and machine learning: A survey, arXiv preprint arXiv:2104.02726 (2021).</p>
      <p>[30] W. Yu, C. Zhu, T. Zhao, Z. Guo, M. Jiang, Sentence-permuted paragraph generation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5051–5062.</p>
      <p>[31] M. A. Boden, The creative mind: Myths and mechanisms, Routledge, 2004.</p>
      <p>[32] M. Lee, P. Liang, Q. Yang, Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities, in: CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.</p>
      <p>[33] S. Wang, Y. Liu, Y. Xu, C. Zhu, M. Zeng, Want to reduce labeling cost? gpt-3 can help, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4195–4205.</p>
      <p>[34] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, M. Jiang, Generate rather than retrieve: Large language models are strong context generators, in: International Conference for Learning Representation (ICLR), 2023.</p>
      <p>[35] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.</p>
      <p>[36] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.</p>
      <p>[37] M. Gaur, K. Gunaratna, V. Srinivasan, H. Jin, Iseeq: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 10672–10680.</p>
      <p>[38] J. Li, M. Galley, C. Brockett, J. Gao, W. B. Dolan, A diversity-promoting objective function for neural conversation models, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 110–119.</p>
      <p>[39] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Elio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoover</surname>
          </string-name>
          , I. Nikolaidis,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salavatipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <source>About computing science research methodology</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yamashita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , W. Fu,
          <article-title>“I hear you, I feel you”: encouraging deep self-disclosure through a chatbot</article-title>
          ,
          <source>in: Proceedings of the 2020 CHI conference on human factors in computing systems</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Hartson</surname>
          </string-name>
          ,
          <article-title>Human-computer interaction: Interdisciplinary roots and trends</article-title>
          ,
          <source>Journal of systems and software</source>
          <volume>43</volume>
          (
          <year>1998</year>
          )
          <fpage>103</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bardzell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bardzell</surname>
          </string-name>
          , Humanistic hci,
          <source>Synthesis Lectures on Human-Centered Informatics</source>
          <volume>8</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Developing a technology roadmap for construction r&amp;d through interdisciplinary research efforts</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>18</volume>
          (
          <year>2009</year>
          )
          <fpage>330</fpage>
          -
          <lpage>337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rhoten</surname>
          </string-name>
          , Interdisciplinary research: Trend or transition,
          <source>Items and Issues</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>6</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <article-title>Teaching hci: a challenging intercultural, interdisciplinary, cross-field experience</article-title>
          , in: International Workshop on Intercultural Collaboration, Springer,
          <year>2007</year>
          , pp.
          <fpage>344</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dourish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finlay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sengers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <article-title>Reflective hci: Towards a critical technical practice</article-title>
          , in:
          <source>CHI'04 extended abstracts on Human factors in computing systems</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>1727</fpage>
          -
          <lpage>1728</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Mackay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Fayard</surname>
          </string-name>
          ,
          <article-title>Hci, natural science and design: a framework for triangulation across disciplines</article-title>
          ,
          <source>in: Proceedings of the 2nd conference on Designing interactive systems: processes, practices, methods, and techniques</source>
          ,
          <year>1997</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Feng</surname>
          </string-name>
          , H. Hochheiser,
          <article-title>Research methods in human-computer interaction</article-title>
          , Morgan Kaufmann,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Fabbri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <article-title>Scisummnet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>7386</fpage>
          -
          <lpage>7393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <article-title>Citation recommendation: approaches and datasets</article-title>
          ,
          <source>International Journal on Digital Libraries</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>375</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Spangler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Wilkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Bachman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dayaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Regenbogen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Pickering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Comer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Myers</surname>
          </string-name>
          , et al.,
          <article-title>Automated hypothesis generation based on mining scientific literature</article-title>
          ,
          <source>in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1886</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Question generation for question answering</article-title>
          ,
          <source>in: Proceedings of the 2017 conference on empirical methods in natural language processing</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>866</fpage>
          -
          <lpage>874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Puri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Spring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shoeybi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          ,
          <article-title>Training question answering models from synthetic data</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5811</fpage>
          -
          <lpage>5826</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.468. doi:10.18653/v1/2020.emnlp-main.468.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Controllable open-ended question generation with a new question type ontology</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6424</fpage>
          -
          <lpage>6439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-J.</given-names>
            <surname>Houben</surname>
          </string-name>
          ,
          <article-title>Learningq: a large-scale dataset for educational question generation</article-title>
          , in:
          <source>Twelfth International AAAI Conference on Web and Social Media</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Khanq: A dataset for generating deep questions in education</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Computational Linguistics</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5925</fpage>
          -
          <lpage>5938</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pollak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Podpecan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kranjc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lesjak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrac</surname>
          </string-name>
          ,
          <article-title>Scientific question generation: Pattern-based and graph-based robochair methods</article-title>
          , in: ICCC,
          <year>2021</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sumita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Research</surname>
          </string-name>
          , D. Center, T. C.
          <article-title>Komukai-Toshiba-cho</article-title>
          , et al.,
          <source>Abstract generation</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>