Creative Research Question Generation for Human-Computer Interaction Research

Yiren Liu1,†, Mengxia Yu2,†, Meng Jiang2 and Yun Huang2

1 University of Illinois Urbana-Champaign, Champaign, IL, 61820, USA
2 University of Notre Dame, Notre Dame, IN, 46556, USA

Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia
† These authors contributed equally.
yirenl2@illinois.edu (Y. Liu); myu2@nd.edu (M. Yu); mjiang2@nd.edu (M. Jiang); yunhuang@illinois.edu (Y. Huang)
ORCID: 0000-0003-1507-0303 (Y. Liu); 0000-0002-6627-2709 (M. Yu); 0000-0002-3009-519X (M. Jiang); 0000-0003-0399-8032 (Y. Huang)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
It is essential to develop innovative and original research questions/ideas for interdisciplinary research fields, such as Human-Computer Interaction (HCI). In this work, we focus on discussing how recent natural language generation (NLG) methodologies can be applied to promote the formulation of creative research questions. We collect and curate a dataset that contains the texts of RQs and related work sections from HCI papers, and introduce a new NLG task of automatic HCI research question (RQ) generation. In addition to applying common NLG metrics used to evaluate generation accuracy, including ROUGE and BERTScore, we propose two sets of new metrics for evaluating the creativity of generated RQs: 1) DistGain and DiffBS for novelty, and 2) PPLGain for the level of surprise. The task is challenging due to the lack of external knowledge. We investigate four approaches to enhance the generation models with (1) general world knowledge, (2) task knowledge, (3) transferred knowledge, and (4) retrieved knowledge. The experimental results indicate that incorporating additional knowledge benefits both the accuracy and creativity of RQ generation. The dataset used in this study can be found at: https://github.com/yiren-liu/HAI-GEN-release.

Keywords
datasets, text generation, creativity

1. Introduction

Asking novel research questions (RQs) is key to starting innovative scientific studies. As David Hilbert states, "he who seeks for methods without having a definite problem in mind seeks for the most part in vain". Proficient scientists read and analyze representative literature in a specific domain in order to identify the limitations of existing work and ask new RQs [1]. In computer science research, methodologies are often derived from a study's core research question(s) [1]. Research questions (RQs) are one of the most important components of HCI research and are often explicitly stated in research papers from the HCI domain. As an outline of the whole paper, RQs are often proposed in the beginning sections and stated in a unified format, e.g., "RQ1: ..., RQ2: ...". For example, in the HCI paper by Lee et al. [2], the authors listed two RQs at the end of the related work section:

"RQ1: How do different chatting styles influence people's self-disclosure? and RQ2: How do different chatting styles influence people's self-disclosure over time?"

Besides the important role of RQs, the interdisciplinary nature of HCI research motivates us to perform this study [3, 4]. There is a global trend of interdisciplinary research [5, 6]. The fact that HCI is a highly interdisciplinary field [3, 4] poses unique challenges [7, 8] to education and research. Depending on their interests and skills, students and scholars can conduct HCI work and contribute from various perspectives to different disciplinary areas [9]. Because of this interdisciplinary nature, HCI research can make contributions that are technical-driven, UX-focused, and/or method-oriented [10], which opens a wide door to innovation. The related work section of HCI papers often reviews the relevant literature, which can be used to drive the RQs.

Human researchers have to spend a lot of time reading and understanding large amounts of interdisciplinary literature. Artificial intelligence (AI) systems have demonstrated their ability to facilitate some types of scientific research tasks, such as summarizing scientific literature [11], recommending related work [12], and generating new biomedical hypotheses [13]. If a machine could generate RQs based on existing literature, it would help HCI researchers discover potential research topics, though they would need to verify the machine-suggested RQ candidates. However, to the best of our knowledge, there is no AI model or research on automating HCI RQ generation.
In this work, we propose a novel task of research question generation in the field of HCI research. Given the related work section (denoted as RelatedWork), the task aims to generate one or multiple research questions. We notice that, given a set of literature, it is easy to come up with plausible but overly generic RQs on broad research topics. Therefore, the challenge of our task lies in that when the same set of literature is surveyed from different perspectives, i.e., given different RelatedWork, the generated RQs should differ correspondingly.

To study this new problem, we build a dataset from HCI literature. We collect 8,904 HCI papers from arXiv and manually extract 158 data examples. Each example has the text of the related work section and the text of the research questions. In this study, we develop and evaluate four approaches: (1) prompting pre-trained GPT-3 [14], which has knowledge from its pre-training corpus; (2) BART [15], which is fine-tuned on our limited training examples; (3) transfer learning for knowledge augmentation, which warms up the model to generate paper titles, which are much more accessible than RQs; and (4) retrieval-based augmentation, which uses information from the HCI literature text we provide.

We evaluate the RQ generation quality based on three sets of automated metrics: (1) ROUGE and BERTScore with target RQs as references for accuracy, (2) DistGain and DiffBS for novelty, and (3) PPLGain for level of surprise. We propose to use these metrics for practical reasons. First, when RQs are not explicitly spelled out in HCI papers, the model that yields greater accuracy could be more effectively utilized. As researchers try to quickly form RQs given a large number of surveyed papers, the model could aid in boosting the efficiency of literature review for both research and learning purposes. Second, the model that leads to higher novelty and surprise could be used when the HCI papers already explicitly present RQs. In this case, researchers can compare the existing "ground truth" RQs with the generated RQs to explore "new" directions for future research.

The main contributions of this study are:

• We propose the task of HCI research question generation, collecting and releasing a dataset.
• We design and develop four types of models that leverage various knowledge to improve RQ generation.
• We evaluate the accuracy, novelty, and level of surprise of generated RQs and find that knowledge transfer is the most promising approach when the available task data size is small.

2. Related Work

2.1. Question Generation

Automatic question generation (QG) has been studied as a data augmentation approach for Question Answering [16] and Machine Reading Comprehension [17]. Most existing QG studies focus on factoid questions, whose answers are short pieces of text. The QG datasets are usually converted from question-answering datasets. Instead of factoid questions, RQs are open-ended questions, the generation of which is found to be more challenging in prior work [18], because it requires a deep understanding and needs to be addressed with long-form answers. Nevertheless, existing open-ended question generation tasks are conditioned on the answers. More research needs to be done in order to generate unsolved open-ended HCI research problems.

For educational domains, QG systems often aim at generating assessment questions, e.g., multiple-choice questions, to help students understand the learning materials and reduce the manual workload required from instructors. Emerging studies have proposed datasets for educational QG [19, 20, 21]. However, these works aim to generate questions that help with the comprehension of learning materials, not to explore potential unsolved research problems.
2.2. Scientific Text Generation

In order to reduce the burden of scientific writing or simulate scientists' behaviors, there is a line of research aiming at automatic scientific text generation. Since early work on abstract generation [22], various approaches have been proposed for scientific text summarization [23, 24, 11]. Spangler et al. [13] leverage text mining for scientific hypothesis generation. ReviewRobot [25] utilizes information extracted from knowledge graphs to construct synthetic paper reviews from templates. AutoCite [26] leverages multi-modal information to generate contextualized citation texts. PaperRobot [27] cascadingly generates abstracts, conclusions, future work, and titles for a follow-on paper. However, automatic HCI RQ generation has not been studied as an NLP task.

2.3. Evaluating Creativity in Text Generation

Methods for enhancing the ability of machine learning models to produce original content have been a crucial topic in the emerging research domain of computational creativity [28]. Franceschelli and Musolesi [29] summarized existing methods for creativity evaluation and discussed their potential application in recent deep learning models (e.g., VAE and GAN). However, most of these existing evaluation methods are highly subjective and require strong human intervention. With the recent advances in text generation methods based on pre-trained language models, additional research is needed to automatically and objectively evaluate the creativity of text generation models. Prior NLP research has discussed potential methods to automatically evaluate generation taking into consideration both the accuracy and diversity of the generated results [30]. In this work, we employ Boden's three criteria [31] for studying machine creativity, defined as "the ability to generate ideas or artifacts that are new, surprising and valuable", to propose new metrics for creativity evaluation in text generation tasks.

3. Problem Definition and Data

A research question refers to a question that a study or research project aims to address. In HCI research publications, RQs are often proposed after the survey of related work. Based on the understanding of the existing literature and on citation purposes, different papers will compose their related work sections differently, even if they cite the same set of literature. Correspondingly, their research questions should be different.

We formally define the task of HCI RQ generation with task variables as follows.

Definition 1 (HCI Research Question Generation). Given the RelatedWork of an HCI research paper, the generation model requires maximizing $P(RQ \mid RelatedWork)$.
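To make the objective in Definition 1 concrete, the following minimal sketch (our illustration, not the authors' released code) scores one RQ under a pre-trained sequence-to-sequence model; the BART checkpoint name and the example texts are assumptions used only for demonstration.

```python
# Sketch of Definition 1: the cross-entropy loss below is the negative
# log-likelihood of the RQ given the related work text, so fine-tuning on
# (RelatedWork, RQ) pairs maximizes P(RQ | RelatedWork).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")  # assumed checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

related_work = "Prior work has studied chatbots that encourage self-disclosure ..."
rq = "RQ1: How do different chatting styles influence people's self-disclosure?"

inputs = tokenizer(related_work, truncation=True, max_length=768, return_tensors="pt")
labels = tokenizer(rq, truncation=True, max_length=128, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # per-token -log P(RQ | RelatedWork)
print(float(loss))
```

Fine-tuning a model on (RelatedWork, RQ) pairs amounts to minimizing this loss over the training set.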
In real-life HCI research scenarios, researchers strive to propose highly novel and creative research questions based on existing work. Thus, we propose also to measure the creativity of generated research questions. Based on Boden's criteria [31], "the ability to generate ideas or artifacts that are new, surprising, and valuable", we construct the creativity measurement as a combination of two aspects: 1) novelty and 2) level of surprise. We do not evaluate the value of generated RQs since we believe it would require extensive expert knowledge and is hardly feasible without human intervention.

Definition 2 (Generation Creativity). 1) We measure the novelty of a set of RQs by comparing their similarity to the RQs of prior publications within our collected corpus; 2) We measure the level of surprise of a set of RQs based on their perplexity with respect to the perplexity of existing RQs using a large PLM (e.g., GPT-2).

To collect open-access HCI publications, we used papers available through arXiv. We collected the PDF files of papers under the category of Human-Computer Interaction (cs.HC, https://arxiv.org/list/cs.HC/recent) using the public API provided by arXiv. This resulted in a total of 8,904 HCI-related papers (https://github.com/yiren-liu/HAI-GEN-release). We then converted these papers from PDF to sectioned XML format using GROBID (https://github.com/kermitt2/grobid) and SciPDF Parser (https://github.com/titipata/scipdf_parser) in order to further analyze and filter them based on their textual content. The section and title information is preserved in the XML version of our collected papers. For research questions, we conducted pattern matching of question sentences starting with "RQ". In order to collect text from related work sections, i.e., RelatedWork, we extract sections with titles containing the keywords "related work". We remove RQs from RelatedWork if they appear there.
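A minimal sketch of this extraction step is shown below (our illustration; the released pipeline may differ). A regular expression keeps explicitly numbered question sentences, and sections whose titles contain "related work" form the RelatedWork input; the specific pattern and sentence splitting are assumed heuristics.

```python
# Illustrative pattern matching for RQs and related work sections.
import re

RQ_PATTERN = re.compile(r"^RQ\d+\s*[:.]?\s*.+\?$")

def extract_rqs(sentences):
    """Keep sentences that look like explicitly numbered research questions."""
    return [s.strip() for s in sentences if RQ_PATTERN.match(s.strip())]

def extract_related_work(sections):
    """sections: list of (title, text) pairs parsed from the sectioned XML."""
    return " ".join(text for title, text in sections
                    if "related work" in title.lower())

sections = [
    ("Introduction", "..."),
    ("Related Work", "Prior chatbot studies focus on engagement. "
                     "RQ1: How do different chatting styles influence people's self-disclosure?"),
]
related_work = extract_related_work(sections)
rqs = extract_rqs(re.split(r"(?<=[.?!])\s+", related_work))
for rq in rqs:                       # RQs found inside RelatedWork are removed
    related_work = related_work.replace(rq, "").strip()
print(rqs)
```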
The resulting dataset consists of 158 valid examples. We then split the dataset into train/dev/test sets with 108/25/25 examples. Note that the splits are carefully arranged in chronological order, i.e., papers in the dev and test sets are published later than those in the train split. This ensures that the RQs in the dev/test sets are the newest and are not revealed in the train set. The descriptive statistics of the final dataset can be found in Table 1.

Table 1: Descriptive statistics of the proposed dataset.

Split | Avg. # of words per Related Work | Avg. # of words per RQ | Avg. # of RQs per paper
train | 409.9 | 14.2 | 2.4
dev   | 369.9 | 13.8 | 2.4
test  | 332.4 | 15.7 | 2.2

4. Method

To tackle the lack-of-knowledge issue in HCI RQ generation, we investigate four types of approaches that leverage different types of knowledge. We present three sets of quantitative metrics to evaluate the quality of generated questions from three different aspects: accuracy, novelty, and level of surprise.

4.1. Generation Models with Various Knowledge

In this section, we describe the different models used for training and evaluating the RQ generation task.

Pre-trained GPT-3. As a large language model (LM) with 175 billion parameters, GPT-3 is a state-of-the-art learner that succeeds on many NLP tasks and has shown its capability in research paper writing [32], educational question generation [33], and open-domain QA [34]. GPT-3 is trained on 45 TB of text data from multiple sources, including Wikipedia and books, enabling the model to store a huge amount of general world knowledge.

Fine-tuned BART. We choose BART, a Transformer-based pre-trained generation model, as our backbone model. By fine-tuning BART on our RQ generation dataset, the model should acquire specific task knowledge, but the knowledge would be limited due to data scarcity.

Knowledge transfer from title generation. Transfer learning is an effective way to improve the model when only a limited amount of data on the target task is available. The available RQ data may be limited for a variety of reasons, e.g., errors during PDF parsing, or RQs that are not explicitly written in some papers. In contrast, paper titles are more accessible; the amount we extracted is 30 times that of the research questions. In semantic space, a paper's title represents its most significant contribution, which is strongly tied to its research topics. In most cases, paper titles can be considered a high-level summary of the solution to the research questions. Therefore, we propose to augment the BART model by transferring relevant task knowledge from title generation to RQ generation. The titles of papers in the train/dev/test sets are excluded and not used as input for the target task, so there is no data leakage.

Knowledge retrieval from HCI corpus. Knowledge retrieval is another promising solution to many knowledge-intensive NLP tasks [35] such as question answering [36] and information-seeking question generation [37]. To incorporate external domain knowledge, we apply the Dense Passage Retriever (DPR) [36] to retrieve the sentences most relevant to the input RelatedWork from the HCI corpus. The retrieved sentences are appended to the end of the original related work text as input.

4.2. Evaluation Methods for Novelty and Surprise

The task of HCI RQ generation aims to generate open-ended research questions to inspire researchers, which need to be highly creative. Recently, Computational Creativity has become an emerging field of study in the HCI domain [29]. Inspired by Boden's three criteria [31], "the ability to generate ideas or artifacts that are new, surprising and valuable", we introduce evaluation metrics to measure the novelty and level of surprise of generated RQs. We do not evaluate the value of generated RQs to HCI research since it would require extensive expert knowledge and human intervention.

4.2.1. Measuring novelty

To evaluate the novelty of generated RQs, i.e., how new/original the RQs are, we measure the difference between the generated RQs and prior RQs. We introduce two metrics: 1) an n-gram-based score, DistGain, and 2) an embedding-based score, DiffBS. We first build a set of prior RQs, denoted as $\{RQ_\mathrm{existing}\}$, from papers published earlier than the papers in the dev/test sets.

Distinct-k gain (DistGain or DG) is defined based on Distinct-k [38]. We calculate the average proportion of new unique n-grams in the newly generated RQ compared to the total number of n-grams in $\{RQ_\mathrm{existing}\}$. DistGain can be written as follows:

$$\mathrm{DistGain}_j = \frac{1}{M}\sum_{i=1}^{M}\frac{|\{y_j\} - \{x_i\}|}{|Y_j|}, \qquad (1)$$

where sequence $Y_j = (y_j)$ denotes the $j$-th generated RQ, sequence $X_i = (x_i)$ denotes the $i$-th RQ in $\{RQ_\mathrm{existing}\}$, and $M = |\{RQ_\mathrm{existing}\}|$. We average $\mathrm{DistGain}_j$ over all generated RQs to obtain an overall score, DistGain.

Difference in BERTScore (DiffBS or DBS). In order to measure the distance between a generated RQ and $\{RQ_\mathrm{existing}\}$, we calculate the cosine similarity of BERT embeddings [39] between the generated RQ and each $X_i \in \{RQ_\mathrm{existing}\}$. For each generated RQ, we calculate the F1-BERTScore for each pair $(Y_j, X_i)$ and average over all existing RQs:

$$\mathrm{DiffBS}_j = \frac{1}{M}\sum_{i=1}^{M}\bigl(1 - F_\mathrm{BERT}(Y_j, X_i)\bigr), \qquad (2)$$

where $F_\mathrm{BERT}(Y_j, X_i)$ denotes the F1-BERTScore calculated between $Y_j$ and $X_i$. The final DiffBS for each model is averaged over all generated RQs.
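As an illustration, the sketch below computes DistGain as we read Eq. (1); it is not the evaluation script, and the whitespace tokenization, the choice of n, and normalizing by the token count of the generated RQ are our assumptions.

```python
# DistGain (Eq. 1), read as: for each prior RQ, the fraction of the generated
# RQ's n-grams that the prior RQ does not contain, averaged over all prior RQs.
def ngrams(tokens, n=2):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def dist_gain(generated_rq, existing_rqs, n=2):
    y_tokens = generated_rq.lower().split()
    y_ngrams = ngrams(y_tokens, n)
    if not y_ngrams or not existing_rqs:
        return 0.0
    per_prior = [len(y_ngrams - ngrams(x.lower().split(), n)) / len(y_tokens)
                 for x in existing_rqs]           # |{y_j} - {x_i}| / |Y_j|
    return sum(per_prior) / len(per_prior)        # average over the M prior RQs

existing = ["RQ1: How do different chatting styles influence people's self-disclosure?"]
print(dist_gain("RQ1: How do maintainers respond to locked issues?", existing))
```

DiffBS (Eq. 2) can be computed in the same loop by replacing the n-gram difference with 1 − F1 from an off-the-shelf BERTScore implementation (e.g., the bert-score package).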
4.2.2. Measuring level of surprise

To measure the level of surprise, we refer to Boden's [31] definition of surprise: "an idea may be surprising because it's unfamiliar, or even unlikely". We propose a new automatic metric to measure the level of surprise in generated RQs.

Perplexity Gain (PPLGain). Perplexity, the inverse probability, is frequently used to measure how uncertain an LM is when generating the test data. Given a text, the higher the perplexity, the more uncertain the LM is about generating it. Assuming an LM has been successfully pre-trained with a sufficient amount of general text data, its perplexity reflects the unexpectedness, or level of surprise, of the LM with respect to the given text. Thus, we employ the perplexity of GPT-2 on the RQs:

$$\mathrm{ppl}(Y_j) = \exp\Bigl(-\frac{1}{T}\sum_{i=1}^{T}\log p(y_i \mid y_1, \ldots, y_{i-1})\Bigr). \qquad (3)$$

To measure the level of surprise, or unexpectedness, of the generated RQs, we calculate the difference between the perplexity of generated RQs and that of prior RQs. We define the perplexity gain as follows:

$$\mathrm{PPLGain}_j = \frac{\mathrm{ppl}(Y_j) - \frac{1}{M}\sum_{i=1}^{M}\mathrm{ppl}(X_i)}{\frac{1}{M}\sum_{i=1}^{M}\mathrm{ppl}(X_i)}. \qquad (4)$$

The final PPLGain score is averaged over all $Y_j$.
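A possible implementation of Eqs. (3)–(4) with the Hugging Face transformers library is sketched below; the GPT-2 checkpoint and the example RQs are placeholders, and the authors' actual evaluation code may differ.

```python
# Perplexity of an RQ under GPT-2 (Eq. 3) and PPLGain against prior RQs (Eq. 4).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean token-level -log p(y_i | y_<i)
    return torch.exp(loss).item()            # exp of the mean NLL, as in Eq. (3)

def ppl_gain(generated_rq, existing_rqs):
    baseline = sum(perplexity(x) for x in existing_rqs) / len(existing_rqs)
    return (perplexity(generated_rq) - baseline) / baseline   # Eq. (4)

existing = ["RQ1: How do different chatting styles influence people's self-disclosure?"]
print(ppl_gain("RQ1: How do maintainers respond to locked issues?", existing))
```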
research and new research. Retrieval augmented mod- GPT-3. We prompt GPT-3 (text-davinci-002) with a els, i.e., BART-FT+retrieval, perform at the same level as one-shot example. We use a temperature of 0.7 and pick BART-FT on dev set and significantly outperforms BART- the top-1 generation. To align the output format with FT in terms of ROUGE-2 (20.9%↑) and ROUGE-L (4.2%↑) BART-based models, we post-process the GPT-3 output on the test set. The model also surpassed the baseline by replacing the question number. That means, “1.” or BART-FT in novelty and surprise metrics. Both knowl- “1)” will be replaced by “RQ1:”. edge augmentation methods improve the novelty and sur- BART-FT. We use the 𝑅𝑒𝑙𝑎𝑡𝑒𝑑𝑊 𝑜𝑟𝑘 section as input. prise of the generated RQs. This implies that introducing An HCI paper may have multiple research questions. additional knowledge from publications enables the lan- The latter ones are highly likely to be dependent on the guage model to generate RQs with new ideas outside the previous ones. Thus, instead of an individual RQ, our training set. Although both methods improve generation output is set as a sequence of concatenated RQs such as novelty and surprise, using knowledge transfer results in “RQ1: ..., RQ2: ...”. For all the experiments with the BART a higher increase. This might be because titles tend to re- model, the maximum input and output length is set as flect the contributions of studies in a self-contained and 768 and 128 tokens, respectively. abstractive manner. Similarity-based retrieved results BART-FT+transfer. To transfer knowledge from ti- tend to be individual sentences that might be confusing, tle generation, we first fine-tune the BART model on or even noisy when they are used as input, because they {𝑅𝑒𝑙𝑎𝑡𝑒𝑑𝑊 𝑜𝑟𝑘, title} pairs and then continue fine-tuning bring information outside the context paragraph. on {𝑅𝑒𝑙𝑎𝑡𝑒𝑑𝑊 𝑜𝑟𝑘, RQ} pairs. We carefully construct the dataset for title generation and avoid dev/test RQ data leaking in the training data of the title generation. 6. Discussion BART-FT+retrieval. To construct the retrieval cor- pus, we gather the abstract, introduction, and related 6.1. Case Study of Generated RQs work section of the existing papers that were published before dev/test papers, split the text into sentences, andTo further validate the proposed creativity metrics, we form an HCI corpus containing 310,955 sentences. We qualitatively compare examples of RQs generated by dif- retrieve top-3 sentences with pre-trained DPR using re- ferent models, as shown in Table. 3. It shows that RQs lated work as queries, and append the retrieved text to generated by GPT-3 appear to be less relevant compared input sequences. to other models, where the research topic is generalized from “GitHub issues” to “online discussion”. Meanwhile, the results generated by GPT-3 also suffered from repeti- 5.3. Results tion as the sequence of “incivility and toxicity in online Results on automatic evaluation are presented in Table 2. discussions” appeared twice in the given example. How- GPT-3 with general world knowledge increases ever, the language/words it uses could be new compared generation novelty, but under-performs fine-tuned to prior RQs. This implies that the incorporation of gen- models in accuracy and surprise level. Table 2 shows eral world knowledge generalizes the content of machine- that, compared to the BART models, GPT-3 performs created RQs to domains other than that of the target worse in terms of ROUGE and BERTScore on dev and paper. 
5.3. Results

The results of the automatic evaluation are presented in Table 2.

GPT-3 with general world knowledge increases generation novelty, but under-performs fine-tuned models in accuracy and surprise level. Table 2 shows that, compared to the BART models, GPT-3 performs worse in terms of ROUGE and BERTScore on dev and test, but it surpasses the other three models on DistGain and DiffBS, which are both measurements of generation novelty. However, all three BART-based models achieved higher PPLGain scores, which measure the level of surprise of generated RQs. As a large LM, GPT-3 possesses rich knowledge outside of the HCI research domain, which enables it to output words different from those in existing RQs, but those words may be off the research topic.

Knowledge augmentation is effective for HCI RQ generation. The transfer-learning-augmented model, i.e., BART-FT+transfer, outperforms the BART baselines in terms of ROUGE-2 (11.7%↑ on dev and 9.1%↑ on test) and ROUGE-L (4.7%↑ on dev and 3.1%↑ on test). The effectiveness of transfer learning shows that learning the task of title generation helps bridge the gap between existing research and new research. The retrieval-augmented model, i.e., BART-FT+retrieval, performs at the same level as BART-FT on the dev set and significantly outperforms BART-FT in terms of ROUGE-2 (20.9%↑) and ROUGE-L (4.2%↑) on the test set. The model also surpasses the baseline BART-FT on the novelty and surprise metrics. Both knowledge augmentation methods improve the novelty and surprise of the generated RQs. This implies that introducing additional knowledge from publications enables the language model to generate RQs with new ideas outside the training set. Although both methods improve generation novelty and surprise, using knowledge transfer results in a higher increase. This might be because titles tend to reflect the contributions of studies in a self-contained and abstractive manner, whereas similarity-based retrieved results tend to be individual sentences that might be confusing, or even noisy, when used as input, because they bring in information from outside the context paragraph.

Table 2: Automatic evaluation results of four models on HCI RQ generation. GPT-3 is prompted with one-shot examples. BART-FT denotes BART fine-tuned on our dataset. BART-FT+transfer denotes BART fine-tuned with transfer learning. BART-FT+retrieval denotes BART fine-tuned with knowledge retrieval. Metric notations: R-2: ROUGE-2, R-L: ROUGE-L, BS: BERTScore, DG: DistGain, DBS: DiffBS, PG: PPLGain. For all metrics, higher is better. T-tests are conducted for all BART-based models using GPT-3 as a baseline (with p < 0.001***, < 0.01**, < 0.05*).

Dev set:
Model | R-2 | R-L | BS | DG | DBS | PG
GPT-3 | 8.48 | 21.1 | 80.26 | 78.1 | 13.1 | -74.9
BART-FT | 12.92*** | 26.88*** | 83.48*** | 60.1 | 10.4 | -45.9***
BART-FT+transfer | 14.43*** | 28.14*** | 84.04*** | 65.1 | 10.8 | -37.1***
BART-FT+retrieval | 12.87** | 26.93** | 83.22*** | 60.5 | 10.5 | -42.1***

Test set:
Model | R-2 | R-L | BS | DG | DBS | PG
GPT-3 | 10.27 | 22.97 | 80.42 | 76.9 | 13.2 | -72.2
BART-FT | 12.37 | 26.65 | 82.57** | 59.9 | 10.5 | -39.9***
BART-FT+transfer | 13.5 | 27.48* | 82.97** | 64.5 | 11.1 | -27.6***
BART-FT+retrieval | 14.96 | 27.76* | 82.54*** | 63.2 | 10.8 | -35.5***

6. Discussion

6.1. Case Study of Generated RQs

To further validate the proposed creativity metrics, we qualitatively compare examples of RQs generated by different models, as shown in Table 3. It shows that the RQs generated by GPT-3 appear to be less relevant compared to the other models, as the research topic is generalized from "GitHub issues" to "online discussion". Meanwhile, the results generated by GPT-3 also suffered from repetition, as the sequence "incivility and toxicity in online discussions" appeared twice in the given example. However, the language/words it uses could be new compared to prior RQs. This implies that the incorporation of general world knowledge generalizes the content of machine-created RQs to domains other than that of the target paper. In this example, only BART-FT+transfer captured the information about "maintainers", which is critical in the ground-truth RQ2, showing the advantage of transfer learning. We also found that the output of BART achieved the highest PPLGain score (level of surprise), as its results mentioned interesting concepts including "code review" and "rejected patches".

Table 3: Generated RQs on a test example of the paper titled "How heated is it? Understanding GitHub locked issues".

Model | Generated RQs | R-2 | R-L | BS | DG | DBS | PG
Gold | RQ1: What are the characteristics of github locked issues? RQ2: What are the justifications given in the comments by project maintainers when locking issues as too heated? | - | - | - | - | - | -
GPT-3 | RQ1: How effective are github locked issues in curbing incivility and toxicity in online discussions? RQ2: How well do current machine learning detectors perform in detecting incivility and toxicity in online discussions? | 7.7 | 23.7 | 81.2 | 86.1 | 13.4 | -79.7
BART | RQ1: What kinds of incivility are present in code review discussions of rejected patches? RQ2: How common are code review issues locked as too heated? | 11.5 | 38.4 | 86.7 | 69.7 | 13.8 | -46.5
BART-FT+transfer | RQ1: How do maintainers respond to github issues locked as too heated? RQ2: What are maintainers' reactions to the locked issues? | 19.2 | 33.3 | 90.3 | 82.4 | 13.4 | -53.9
BART-FT+retrieval | RQ1: What kinds of incivility exist in github issues locked as too heated? RQ2: What are the most common types of incivility in github? | 23.0 | 39.2 | 87.5 | 90.9 | 13.5 | -59.9

6.2. Limitations and Future Work

Although the experimental results revealed RQ generation as a promising and meaningful task, several limitations exist in our current study. First, the training and evaluation of the generation methods were conducted on a relatively small-scale dataset, undermining the solidity of the conclusions yielded from the experiments. Future work should consider expanding the dataset by collecting more open-access publications and employing careful human annotation to expand the scale and improve the quality of the dataset. Second, the evaluation metrics used/proposed in this work did not fully consider the open-ended nature of the RQ generation task. In practice, a well-surveyed research topic should yield many open-ended creative research questions, while our evaluation was solely based on the comparison between the generated and ground-truth RQs. Further quantifiable human evaluation should be incorporated to validate the quality of generation. Additionally, the evaluation of GPT-3 as an RQ generation method only covered a one-shot scenario with an example manually selected by the researchers. Future work should take into consideration the potential impact of the demonstration selection method on the generation quality of GPT-3.

7. Conclusions

In this work, we proposed a novel NLP task of HCI RQ generation. We curated a dataset of 8,904 HCI publications and a collection of 158 examples of (related work, RQ) pairs.
In addition to accuracy metrics, we evaluated the creativity of RQ generation with metrics for novelty and surprise. We investigated the performance of four approaches that leverage different types of knowledge. Through experiments, we showed that general world knowledge in pre-trained LMs helped improve generation novelty, and that domain knowledge augmentation methods improved accuracy and level of surprise. Future studies could explore knowledge augmentation methods by incorporating different kinds of knowledge, e.g., general world knowledge, task knowledge, transferred domain knowledge, or retrieved textual knowledge.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 2119589. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] R. Elio, J. Hoover, I. Nikolaidis, M. Salavatipour, L. Stewart, K. Wong, About computing science research methodology, 2011.
[2] Y.-C. Lee, N. Yamashita, Y. Huang, W. Fu, "I hear you, I feel you": encouraging deep self-disclosure through a chatbot, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–12.
[3] H. R. Hartson, Human–computer interaction: Interdisciplinary roots and trends, Journal of Systems and Software 43 (1998) 103–118.
[4] J. Bardzell, S. Bardzell, Humanistic HCI, Synthesis Lectures on Human-Centered Informatics 8 (2015) 1–185.
[5] C. Kim, H. Kim, S. H. Han, C. Kim, M. K. Kim, S. H. Park, Developing a technology roadmap for construction R&D through interdisciplinary research efforts, Automation in Construction 18 (2009) 330–337.
[6] D. Rhoten, Interdisciplinary research: Trend or transition, Items and Issues 5 (2004) 6–11.
[7] C. Rusu, V. Rusu, Teaching HCI: a challenging intercultural, interdisciplinary, cross-field experience, in: International Workshop on Intercultural Collaboration, Springer, 2007, pp. 344–354.
[8] P. Dourish, J. Finlay, P. Sengers, P. Wright, Reflective HCI: Towards a critical technical practice, in: CHI '04 Extended Abstracts on Human Factors in Computing Systems, 2004, pp. 1727–1728.
[9] W. E. Mackay, A.-L. Fayard, HCI, natural science and design: a framework for triangulation across disciplines, in: Proceedings of the 2nd Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, 1997, pp. 223–234.
[10] J. Lazar, J. H. Feng, H. Hochheiser, Research Methods in Human-Computer Interaction, Morgan Kaufmann, 2017.
[11] M. Yasunaga, J. Kasai, R. Zhang, A. R. Fabbri, I. Li, D. Friedman, D. R. Radev, ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 7386–7393.
[12] M. Färber, A. Jatowt, Citation recommendation: approaches and datasets, International Journal on Digital Libraries 21 (2020) 375–405.
[13] S. Spangler, A. D. Wilkins, B. J. Bachman, M. Nagarajan, T. Dayaram, P. Haas, S. Regenbogen, C. R. Pickering, A. Comer, J. N. Myers, et al., Automated hypothesis generation based on mining scientific literature, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 1877–1886.
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[15] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
[16] N. Duan, D. Tang, P. Chen, M. Zhou, Question generation for question answering, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 866–874.
[17] R. Puri, R. Spring, M. Shoeybi, M. Patwary, B. Catanzaro, Training question answering models from synthetic data, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5811–5826. URL: https://aclanthology.org/2020.emnlp-main.468. doi:10.18653/v1/2020.emnlp-main.468.
[18] S. Cao, L. Wang, Controllable open-ended question generation with a new question type ontology, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6424–6439.
[19] G. Chen, J. Yang, C. Hauff, G.-J. Houben, LearningQ: a large-scale dataset for educational question generation, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[20] H. Gong, L. Pan, H. Hu, KHANQ: A dataset for generating deep questions in education, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 5925–5938.
[21] S. Pollak, V. Podpecan, J. Kranjc, B. Lesjak, N. Lavrac, Scientific question generation: Pattern-based and graph-based RoboCHAIR methods, in: ICCC, 2021, pp. 140–148.
[22] K. Ono, K. Sumita, S. Miike, Abstract generation based on rhetorical structure extraction, arXiv preprint cmp-lg/9411023 (1994).
[23] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, N. Goharian, A discourse-aware attention model for abstractive summarization of long documents, arXiv preprint arXiv:1804.05685 (2018).
[24] I. Cachola, K. Lo, A. Cohan, D. S. Weld, TLDR: Extreme summarization of scientific documents, arXiv preprint arXiv:2004.15011 (2020).
[25] Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, N. F. Rajani, ReviewRobot: Explainable paper review generation based on knowledge synthesis, arXiv preprint arXiv:2010.06119 (2020).
[26] Q. Wang, Y. Xiong, Y. Zhang, J. Zhang, Y. Zhu, AutoCite: Multi-modal representation fusion for contextual citation generation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 788–796.
[27] Q. Wang, L. Huang, Z. Jiang, K. Knight, H. Ji, M. Bansal, Y. Luan, PaperRobot: Incremental draft generation of scientific ideas, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1980–1991.
[28] A. Cardoso, T. Veale, G. A. Wiggins, Converging on the divergent: The history (and future) of the international joint workshops in computational creativity, AI Magazine 30 (2009) 15–15.
[29] G. Franceschelli, M. Musolesi, Creativity and machine learning: A survey, arXiv preprint arXiv:2104.02726 (2021).
[30] W. Yu, C. Zhu, T. Zhao, Z. Guo, M. Jiang, Sentence-permuted paragraph generation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5051–5062.
[31] M. A. Boden, The Creative Mind: Myths and Mechanisms, Routledge, 2004.
[32] M. Lee, P. Liang, Q. Yang, CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities, in: CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.
[33] S. Wang, Y. Liu, Y. Xu, C. Zhu, M. Zeng, Want to reduce labeling cost? GPT-3 can help, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4195–4205.
[34] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, M. Jiang, Generate rather than retrieve: Large language models are strong context generators, in: International Conference on Learning Representations (ICLR), 2023.
[35] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[36] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.
[37] M. Gaur, K. Gunaratna, V. Srinivasan, H. Jin, ISEEQ: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 10672–10680.
[38] J. Li, M. Galley, C. Brockett, J. Gao, W. B. Dolan, A diversity-promoting objective function for neural conversation models, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 110–119.
[39] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675 (2019).