
Generation for Human-Computer Interaction Research

Yiren Liu

yirenl2@illinois.edu 0

Mengxia Yu

Meng Jiang

mjiang2@nd.edu 1

Yun Huang

yunhuang@illinois.edu 1

0 University of Illinois Urbana-Champaign, Champaign, IL, 61820, USA

1 University of Notre Dame, Notre Dame, IN, 46556, USA

It is essential to develop innovative and original research questions/ideas for interdisciplinary research fields, such as Human-Computer Interaction (HCI). In this work, we focus on discussing how recent natural language generation (NLG) methodologies can be applied to promote the formulation of creative research questions. We collect and curate a dataset that contains texts of RQs and related work sections from HCI papers, and introduce a new NLG task of automatic HCI research question (RQ) generation. In addition to applying common NLG metrics used to evaluate generation accuracy, including ROUGE and BERTScore, we propose two sets of new metrics for evaluating the creativity of generated RQs: 1) DistGain and DifBS for novelty, and 2) PPLGain for the level of surprise. The task is challenging due to the lack of external knowledge. We investigate four approaches to enhance the generation models with (1) general world knowledge, (2) task knowledge, (3) transferred knowledge, and (4) retrieved knowledge. The results of the experiment indicate that the incorporation of additional knowledge benefits both the accuracy and creativity of RQ generation. The dataset used in this study can be found at: https://github.com/yiren-liu/HAI-GEN-release.

Research datasets text generation creativity

1. Introduction

Asking novel research questions (RQs) is key to starting innovative scientific studies. As David Hilbert states, “he who seeks for methods without having a definite problem in mind seeks for the most part in vain”. Proficient scientists read and analyze representative literature in a specific domain in order to identify the limitations of the existing work and ask new RQs [1]. In computer science research, methodologies are often derived from a study’s core research question(s) [1]. RQs are one of the most important components of HCI research and are often explicitly stated in HCI papers. As an outline of the whole paper, RQs are usually proposed in the beginning sections and stated in a unified format, e.g., “RQ1: ..., RQ2: ...”. For example, in the HCI paper by Lee et al. [2], the authors listed two RQs at the end of the related work section: “RQ1: How do different chatting styles influence people’s self-disclosure? and RQ2: How do different chatting styles influence people’s self-disclosure over time?”

[...] [11], recommending related work [12], and generating new biomedical hypotheses [13]. If a machine could generate RQs based on existing literature, it would help HCI researchers discover potential research topics, though they would still need to verify the machine-suggested RQ candidates. However, to the best of our knowledge, there is no AI model or research on automating HCI RQ generation.

In this work, we propose a novel task of research question generation in the field of HCI research. Given the related work section (denoted as $RW$), the task aims to generate one or multiple research questions. We notice that, given a set of literature, it is easy to come up with plausible but overly generic RQs on broad research topics. The challenge of our task therefore lies in the fact that when the same set of literature is surveyed from different perspectives, i.e., given different $RW$, the generated RQs should differ correspondingly.

To study this new problem, we build a dataset from HCI literature. We collect 8,904 HCI papers from Arxiv and manually extract 158 data examples. Each example has the text of the related work section and the text of the research questions. In this study, we develop and evaluate four approaches: (1) prompting pre-trained GPT-3 [14], which has knowledge from its pre-training corpus, (2) BART [15] fine-tuned on our limited training examples, (3) transfer learning for knowledge augmentation, which warms up the model by generating paper titles, which are much more accessible than RQs, and (4) retrieval-based augmentation, which uses information from the HCI literature text we provide.

We evaluate the RQ generation quality based on three sets of automated metrics: (1) ROUGE and BERTScore with target RQs as references for accuracy, (2) DistGain and DifBS for novelty, and (3) PPLGain for level of surprise. We propose these metrics for practical reasons. First, when RQs are not explicitly spelled out in HCI papers, the model that yields greater accuracy could be more effectively utilized. As researchers try to quickly form RQs from a large number of surveyed papers, the model could help boost the efficiency of literature review for both research and learning purposes. Second, the model that leads to higher novelty and surprise could be used when the HCI papers already explicitly present RQs. In this case, researchers can compare the existing “ground truth” RQs with the generated RQs to explore “new” directions for future research.

The main contributions of this study are:

• We propose the task of HCI research question generation, collecting and releasing a dataset.

• We design and develop four types of models that leverage various knowledge to improve RQ generation.

• We evaluate the accuracy, novelty, and level of surprise of generated RQs and find that knowledge transfer is the most promising approach when the available task data size is small.

Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia

†These authors contributed equally.

2. Related Work

2.1. Question Generation

Automatic question generation (QG) has been studied as a data augmentation approach for Question Answering [16] and Machine Reading Comprehension [17]. Most existing QG studies focus on factoid questions, whose answers are short pieces of text. The QG datasets are usually converted from question-answering datasets. Instead of factoid questions, RQs are open-ended questions, the generation of which is found to be more challenging in prior work [18], because it requires a deep understanding and needs to be addressed with long-form answers. Nevertheless, the existing open-ended question generation tasks are conditioned on the answers. More research is needed in order to generate unsolved open-ended HCI research problems.

For educational domains, QG systems often aim at generating assessment questions, e.g., multi-choice questions, to help students understand the learning materials and reduce the manual workload required from instructors. Emerging studies have proposed datasets for educational QG [19, 20, 21]. However, these works aim to generate questions that help with comprehension of learning materials, not to explore potential unsolved research problems.

2.2. Scientific Text Generation

In order to reduce the burden of scientific writing or simulate scientists’ behaviors, there is a line of research aiming at automatic scientific text generation. Since early work on abstract generation [22], various approaches have been proposed for scientific text summarization [23, 24, 11]. Spangler et al. [13] leverage text mining for scientific hypothesis generation. ReviewRobot [25] utilizes information extracted from knowledge graphs to construct synthetic paper reviews from templates. AutoCite [26] leverages multi-modal information to generate contextualized citation texts. PaperRobot [27] incrementally generates abstracts, conclusions, future work, and titles for a follow-on paper. However, automatic HCI RQ generation has not been studied as an NLP task.

2.3. Evaluating Creativity in Text Generation

Methods for enhancing the ability of machine learning models to produce original content have been a crucial topic in the emerging research domain of computational creativity [28]. Franceschelli and Musolesi [29] summarized existing methods for creativity evaluation and discussed their potential application in recent deep learning models (e.g., VAE and GAN). However, most of these existing evaluation methods are highly subjective and require strong human intervention. With the recent advances in text generation methods based on pre-trained language models, additional research is still needed in order to automatically and objectively evaluate the creativity of text generation models. Prior NLP research has discussed potential methods to automatically evaluate generation, taking into consideration both the accuracy and diversity of the generated results [30].

In this work, we employ Boden’s three criteria [31] for studying machine creativity, defined as “the ability to generate ideas or artifacts that are new, surprising and valuable”, to propose new metrics for creativity evaluation in text generation tasks.

Table 1: Dataset statistics for the train, dev, and test splits: avg. # of words per Related Work, avg. # of words per RQ, and avg. # of RQs per paper.

3. Problem Definition and Data

A research question refers to a question that a study or research project aims to address. In HCI research publications, RQs are often proposed after the survey of related work. Based on their understanding of existing literature and their citation purposes, different papers will compose the related work sections differently, even if they cite the same set of literature. Correspondingly, their research questions should be different.

We formally define the task of HCI RQ generation with task variables as follows.

Definition 1 (HCI Research Question Generation). Given the related work section $RW$ of an HCI research paper, the generation model is required to maximize $P(RQ \mid RW)$.

In real-life HCI research scenarios, researchers strive to propose highly novel and creative research questions based on existing work. Thus, we also propose to measure the creativity of generated research questions. Based on Boden’s criteria [31], “the ability to generate ideas or artifacts that are new, surprising, and valuable”, we construct the creativity measurement as a combination of two aspects: 1) novelty and 2) level of surprise. We do not evaluate the value of generated RQs since we believe it would require extensive expert knowledge and is hardly feasible without human intervention.

Definition 2 (Generation Creativity). 1) We measure the novelty of a set of RQs by comparing their similarity to the RQs of prior publications within our collected corpus; 2) We measure the level of surprise of a set of RQs based on their perplexity with respect to the perplexity of existing RQs using a large PLM (e.g., GPT-2).

To collect open-access HCI publications, we used papers available through Arxiv. We collected PDF files of papers under the category of Human-Computer Interaction (cs.HC)¹ using the public API provided by Arxiv. This resulted in a total of 8,904 HCI-related papers². We then converted these papers from PDF to sectioned XML format using GROBID³ and SciPDF Parser⁴ in order to further analyze and filter them based on their textual content. The section and title information is preserved in the XML version of our collected papers. For research questions, we conducted pattern matching of question sentences starting with “RQ”. In order to collect text from related work sections, i.e., $RW$, we extract sections with titles containing the keywords “related work”. We remove RQs from $RW$ if they appear.

The resulting dataset consists of 158 valid examples. We then split the dataset into train/dev/test sets with 108/25/25 examples. Note that the splits are carefully arranged in chronological order, i.e., papers in the dev and test sets are published later than those in the train split. This is to ensure the RQs in the dev/test sets are the newest and are not revealed in the train set. The descriptive statistics of the final dataset can be found in Table 1.

¹ https://arxiv.org/list/cs.HC/recent

² https://github.com/yiren-liu/HAI-GEN-release

³ https://github.com/kermitt2/grobid

⁴ https://github.com/titipata/scipdf_parser

4. Method

To tackle the lack of knowledge issue in HCI RQ generation, we investigate four types of approaches that leverage different types of knowledge. We present three sets of quantitative metrics to evaluate the quality of generated questions from three different aspects: accuracy, novelty, and level of surprise.

4.1. Generation Models with Various Knowledge

In this section, we describe the different models used for training and evaluating the RQ generation task.

Pre-trained GPT-3. As a large language model (LM) with 175 billion parameters, GPT-3 is a state-of-the-art learner that succeeds on many NLP tasks and shows its capability in research paper writing [32], educational question generation [33], and open-domain QA [34]. GPT-3 is trained on 45 TB of text data from multiple sources, including Wikipedia and books, enabling the model to store a huge amount of general world knowledge.

Fine-tuned BART. We choose BART, a Transformer-based pre-trained generation model, as our backbone model. By fine-tuning BART on our RQ generation dataset, the model should acquire specific task knowledge, but the knowledge would be limited due to data scarcity.

Knowledge transfer from title generation. Transfer learning is an effective way to improve the model when only a limited amount of data on the target task is available. The available RQ data may be limited for a variety of reasons, e.g., errors during PDF parsing, or RQs that are not explicitly written in some papers. In contrast, paper titles are more accessible: the number of titles we extracted is 30 times that of research questions. In semantic space, a paper’s title represents its most significant contribution, which is strongly tied to its research topics. In most cases, paper titles can be considered as a high-level summary of the solution to the research questions. Therefore, we propose to augment the BART model by transferring relevant task knowledge from title generation to RQ generation. The titles in train/dev/test sets are excluded. They are not used as input for the target task, so there is no data leaking.

Knowledge retrieval from HCI corpus. Knowledge retrieval is another promising solution to many knowledge-intensive NLP tasks [35] such as question answering [36] and information-seeking question generation [37]. To incorporate external domain knowledge, we apply the Dense Passage Retriever (DPR) [36] to retrieve the sentences most relevant to the input from the HCI corpus. The retrieved sentences are appended to the end of the original related work text as input.

4.2. Evaluation Methods for Novelty and Surprise

The task of HCI RQ generation aims to generate open-ended research questions to inspire researchers, which need to be highly creative. Recently, Computational Creativity has become an emerging field of study in the HCI domain [29]. Inspired by Boden’s three criteria [31], “the ability to generate ideas or artifacts that are new, surprising and valuable”, we introduce evaluation metrics to measure the novelty and level of surprise of generated RQs. We do not evaluate the value of generated RQs to HCI research since it would require extensive expert knowledge and human intervention.

4.2.1. Measuring novelty

To evaluate the novelty of generated RQs, i.e., how new/original the RQs are, we measure the difference between the generated RQs and prior RQs. We introduce two metrics: 1) an $n$-gram-based score, DistGain, and 2) an embedding-based score, DifBS. We first make a set of prior RQs, denoted as $\{RQ_{existing}\}$, from papers published earlier than the papers in the dev/test sets.

Distinct-$n$ gain (DistGain or DG) is defined based on Distinct-$n$ [38]. We calculate the average proportion of new unique $n$-grams in the newly generated RQ compared to the total number of $n$-grams in the existing RQs. DistGain can be written as follows:

$$\mathrm{DistGain}(\hat{y}_i) = \frac{1}{M} \sum_{j=1}^{M} \frac{\left|\{n\text{-gram}(\hat{y}_i)\} - \{n\text{-gram}(y_j)\}\right|}{\left|n\text{-gram}(y_j)\right|}, \tag{1}$$

where $\hat{y}_i$ denotes the $i$-th generated RQ, $y_j$ denotes the $j$-th RQ in $\{RQ_{existing}\}$, and $M = |\{RQ_{existing}\}|$. We average the DistGain of all generated RQs to obtain an overall score.

Difference in BERTScore (DifBS or DBS): In order to measure the distance between the generated RQ and $\{RQ_{existing}\}$, we calculate the cosine similarity of BERT embeddings [39] between the generated RQ and each $y_j \in \{RQ_{existing}\}$. For each generated RQ, we calculate the F1-BERTScore for each pair $(\hat{y}_i, y_j)$, and average over all existing RQs:

$$\mathrm{DifBS}(\hat{y}_i) = \frac{1}{M} \sum_{j=1}^{M} \left(1 - \mathrm{BERT}(\hat{y}_i, y_j)\right), \tag{2}$$

where $\mathrm{BERT}(\hat{y}_i, y_j)$ denotes the F1-BERTScore calculated for each pair. The final DifBS for each model is averaged over all generated RQs.

4.2.2. Measuring level of surprise

To measure the level of surprise, we refer to Boden’s [31] definition of surprise: “an idea may be surprising because it’s unfamiliar, or even unlikely”. We propose a new automatic metric to measure the level of surprise in generated RQs.

Perplexity Gain (PPLGain). Perplexity, the inverse probability of a text normalized by its length, is frequently used to measure how uncertain an LM is when generating the test data. Given a text, the higher the perplexity is, the more uncertain the LM is about generating it. Assuming an LM is successfully pre-trained with a sufficient amount of general text data, the perplexity reflects the unexpectedness, or level of surprise, of the LM with respect to the given text. Thus, we use GPT-2 to compute the perplexity of an RQ $y = (y_1, \ldots, y_T)$:

$$\mathrm{PPL}(y) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1})\right). \tag{3}$$

To measure the level of surprise, or unexpectedness, of the generated RQs, we calculate the difference between the perplexity of generated RQs and prior RQs. We define the perplexity gain as follows:

$$\mathrm{PPLGain}(\hat{y}_i) = \mathrm{PPL}(\hat{y}_i) - \frac{1}{M} \sum_{j=1}^{M} \mathrm{PPL}(y_j). \tag{4}$$

The final PPLGain score is averaged over all generated RQs.
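As a rough illustration of the three creativity metrics, the sketch below computes them for a single generated RQ in plain Python. It is a minimal reading of the definitions, not the authors' implementation: whitespace tokenization, the helper names (`ngrams`, `dist_gain`, `dif_bs`, `perplexity`, `ppl_gain`), and the choice to normalize new n-grams by each prior RQ's n-gram count are assumptions. The `similarity` argument stands in for an F1-BERTScore function, and the token log-probabilities are assumed to come from a language model such as GPT-2.

```python
import math
from typing import Callable, Sequence


def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Word n-grams of a text (assumption: lowercase whitespace tokens)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dist_gain(generated: str, existing: Sequence[str], n: int = 2) -> float:
    """Average share of unique n-grams in `generated` that are new with
    respect to each prior RQ, normalized by that prior RQ's n-gram count."""
    gen_set = set(ngrams(generated, n))
    ratios = []
    for prior in existing:
        prior_ngrams = ngrams(prior, n)
        if not prior_ngrams:  # prior RQ shorter than n tokens: skip it
            continue
        ratios.append(len(gen_set - set(prior_ngrams)) / len(prior_ngrams))
    return sum(ratios) / len(ratios) if ratios else 0.0


def dif_bs(generated: str, existing: Sequence[str],
           similarity: Callable[[str, str], float]) -> float:
    """Mean of (1 - similarity) over prior RQs. `similarity` is a
    hypothetical stand-in for an F1-BERTScore function."""
    if not existing:
        raise ValueError("need at least one prior RQ")
    return sum(1.0 - similarity(generated, p) for p in existing) / len(existing)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_gain(generated_logprobs: Sequence[float],
             existing_logprobs: Sequence[Sequence[float]]) -> float:
    """Perplexity of the generated RQ minus the mean perplexity of prior RQs."""
    baseline = sum(perplexity(lp) for lp in existing_logprobs) / len(existing_logprobs)
    return perplexity(generated_logprobs) - baseline
```

Each function scores one generated RQ; a model-level score would then be the mean over all generated RQs, as described in the text.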

5. Experiments

5.1. Evaluation Methods

We evaluate the generation quality with three sets of metrics: (1) ROUGE and BERTScore for measuring accuracy; (2) DistGain and DifBS for measuring novelty; (3) PPLGain for measuring surprise.

5.2. Experimental Settings

We evaluated four text generation models with different types of knowledge on our proposed dataset.

GPT-3. We prompt GPT-3 (text-davinci-002) with a one-shot example. We use a temperature of 0.7 and pick the top-1 generation. To align the output format with BART-based models, we post-process the GPT-3 output by replacing the question number: “1.” or “1)” is replaced by “RQ1:”.

BART-FT. We use the $RW$ section as input. An HCI paper may have multiple research questions, and the later ones are highly likely to depend on the previous ones. Thus, instead of an individual RQ, our output is set as a sequence of concatenated RQs such as “RQ1: ..., RQ2: ...”. For all the experiments with the BART model, the maximum input and output lengths are set to 768 and 128 tokens, respectively.

BART-FT+transfer. To transfer knowledge from title generation, we first fine-tune the BART model on {$RW$, title} pairs and then continue fine-tuning on {$RW$, RQ} pairs. We carefully construct the dataset for title generation and avoid leaking dev/test RQ data into the training data of title generation.

BART-FT+retrieval. To construct the retrieval corpus, we gather the abstract, introduction, and related work sections of the existing papers that were published before the dev/test papers, split the text into sentences, and form an HCI corpus containing 310,955 sentences. We retrieve the top-3 sentences with pre-trained DPR using related work as queries, and append the retrieved text to the input sequences.

5.3. Results

Results on automatic evaluation are presented in Table 2.

GPT-3 with general world knowledge increases generation novelty, but under-performs fine-tuned models in accuracy and surprise level. Table 2 shows that, compared to the BART models, GPT-3 performs worse in terms of ROUGE and BERTScore on dev and test, but it surpasses the other three models on DistGain and DifBS, which are both measurements of generation novelty. However, all three BART-based models achieved higher PPLGain scores, which measure the level of surprise of generated RQs. As a large LM, GPT-3 possesses rich knowledge outside of the HCI research domain, which enables it to output different words from existing RQs, but those words may be out of the research topic.

Knowledge augmentation is effective on HCI RQ generation. The transfer learning augmented model, i.e., BART-FT+transfer, outperforms the BART baselines in terms of ROUGE-2 (11.7%↑ on dev and 9.1%↑ on test) and ROUGE-L (4.7%↑ on dev and 3.1%↑ on test). The effectiveness of transfer learning shows that learning the task of title generation helps bridge the gap between existing research and new research. The retrieval augmented model, i.e., BART-FT+retrieval, performs at the same level as BART-FT on the dev set and significantly outperforms BART-FT in terms of ROUGE-2 (20.9%↑) and ROUGE-L (4.2%↑) on the test set. The model also surpassed the baseline BART-FT in novelty and surprise metrics. Both knowledge augmentation methods improve the novelty and surprise of the generated RQs. This implies that introducing additional knowledge from publications enables the language model to generate RQs with new ideas outside the training set. Although both methods improve generation novelty and surprise, using knowledge transfer results in a higher increase. This might be because titles tend to reflect the contributions of studies in a self-contained and abstractive manner. Similarity-based retrieved results tend to be individual sentences that might be confusing, or even noisy, when they are used as input, because they bring in information from outside the context paragraph.

Table 2: Automatic evaluation results of GPT-3, BART-FT, BART-FT+transfer, and BART-FT+retrieval on the dev and test sets.

6. Discussion

6.1. Case Study of Generated RQs

To further validate the proposed creativity metrics, we qualitatively compare examples of RQs generated by different models, as shown in Table 3. It shows that RQs generated by GPT-3 appear to be less relevant compared to other models, where the research topic is generalized from “GitHub issues” to “online discussion”. Meanwhile, the results generated by GPT-3 also suffered from repetition, as the sequence “incivility and toxicity in online discussions” appeared twice in the given example. However, the language/words it uses could be new compared to prior RQs. This implies that the incorporation of general world knowledge generalizes the content of machine-created RQs to domains other than that of the target paper. In this example, only BART-FT+transfer captured the information about “maintainers”, which is critical in the ground truth RQ2, showing the advantage of transfer learning. We also found that the output of BART achieved the highest PPLGain score (level of surprise), as the results mentioned interesting concepts including [...]
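The output-alignment step from the experimental settings, where a question number such as “1.” or “1)” in the GPT-3 output is rewritten as “RQ1:”, can be sketched with a regular expression. This is only an illustration under stated assumptions: the function name `align_rq_format` is hypothetical, and the sketch assumes that each numbered question starts on its own line, which the paper does not specify.

```python
import re

# A question number such as "1." or "2)" at the start of a line.
# The (?!\d) guard avoids rewriting decimals such as "1.5" (an added assumption).
_QUESTION_NUMBER = re.compile(r"^[ \t]*(\d+)[.)](?!\d)[ \t]*", flags=re.MULTILINE)


def align_rq_format(text: str) -> str:
    """Rewrite leading "N." / "N)" markers as "RQN: " to match the
    "RQ1: ..., RQ2: ..." format used by the BART-based models."""
    return _QUESTION_NUMBER.sub(lambda m: f"RQ{m.group(1)}: ", text)


if __name__ == "__main__":
    print(align_rq_format("1. How do users chat?\n2) Why do they stop?"))
```

Text without a leading question number passes through unchanged, so the function can be applied to every generation regardless of how the model numbered its output.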


6.2. Limitations and Future Work

Although the experimental results revealed RQ generation as a promising and meaningful task, several limitations exist in our current study. First, the training and evaluation of generation methods were conducted on a relatively small-scale dataset, undermining the solidity of the conclusions yielded from the experiments. Future work should consider expanding the dataset by collecting more open-access publications and employing careful human annotation to expand the scale and improve the quality of the dataset. Second, the evaluation metrics used/proposed in this work did not fully consider the open-ended nature of the RQ generation task. In practice, a well-surveyed research topic should yield many open-ended creative research questions, while our evaluation was solely based on the comparison between the generated and ground-truth RQs. Further quantifiable human evaluation should be incorporated to validate the quality of generation. Additionally, the evaluation of GPT-3 as an RQ generation method only covered a one-shot scenario with an example manually selected by researchers. Future work should take into consideration the potential impact of the demonstration selection method on the generation quality of GPT-3.

7. Conclusions

In this work, we proposed a novel NLP task of HCI RQ generation. We curated a dataset of 8,904 HCI publications and a collection of 158 examples of (related work, RQ) pairs. In addition to accuracy metrics, we evaluated the creativity of RQ generation with metrics for novelty and surprise. We investigated the performance of four approaches that leverage different types of knowledge. Through experiments, we showed that general world knowledge in pre-trained LMs helped improve generation novelty, and that domain knowledge augmentation methods improved accuracy and level of surprise. Future studies could explore knowledge augmentation methods by incorporating different kinds of knowledge, e.g., general world knowledge, task knowledge, transferred domain knowledge, or retrieved textual knowledge.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 2119589. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] Elio, Hoover, I. Nikolaidis, Salavatipour, Stewart, Wong, About computing science research methodology, 2011.

[2] Y.-C. Lee, Yamashita, Huang, W. Fu, “I hear you, I feel you”: encouraging deep self-disclosure through a chatbot, in: Proceedings of the 2020 CHI conference on human factors in computing systems, 2020, pp. 1–12.

[3] H. R. Hartson, Human-computer interaction: Interdisciplinary roots and trends, Journal of systems and software 43 (1998) 103–118.

[4] Bardzell, Bardzell, Humanistic hci, Synthesis Lectures on Human-Centered Informatics 8 (2015) 1–185.

[5] Kim, S. H. Han, Kim, M. K. Kim, S. H. Park, Developing a technology roadmap for construction r&d through interdisciplinary research efforts, Automation in Construction 18 (2009) 330–337.

[6] Rhoten, Interdisciplinary research: Trend or transition, Items and Issues 5 (2004) 6–11.

[7] Rusu, Rusu, Teaching hci: a challenging intercultural, interdisciplinary, cross-field experience, in: International Workshop on Intercultural Collaboration, Springer, 2007, pp. 344–354.

[8] Dourish, Finlay, Sengers, Wright, Reflective hci: Towards a critical technical practice, in: CHI'04 extended abstracts on Human factors in computing systems, 2004, pp. 1727–1728.

[9] W. E. Mackay, A.-L. Fayard, Hci, natural science and design: a framework for triangulation across disciplines, in: Proceedings of the 2nd conference on Designing interactive systems: processes, practices, methods, and techniques, 1997, pp. 223–234.

[10] Lazar, J. H. Feng, H. Hochheiser, Research methods in human-computer interaction, Morgan Kaufmann, 2017.

[11] Yasunaga, Kasai, Zhang, A. R. Fabbri, Li, Friedman, D. R. Radev, Scisummnet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks, in: Proceedings of the AAAI conference on artificial intelligence, volume 33, 2019, pp. 7386–7393.

[12] Färber, Jatowt, Citation recommendation: approaches and datasets, International Journal on Digital Libraries 21 (2020) 375–405.

[13] Spangler, A. D. Wilkins, B. J. Bachman, Nagarajan, Dayaram, Haas, Regenbogen, C. R. Pickering, Comer, J. N. Myers, et al., Automated hypothesis generation based on mining scientific literature, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 1877–1886.

[14] Brown, Mann, Ryder, Subbiah, J. D. Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.

[15] Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.

[16] Duan, Tang, Chen, Zhou, Question generation for question answering, in: Proceedings of the 2017 conference on empirical methods in natural language processing, 2017, pp. 866–874.

[17] Puri, Spring, Shoeybi, Patwary, Catanzaro, Training question answering models from synthetic data, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5811–5826. URL: https://aclanthology.org/2020.emnlp-main.468. doi:10.18653/v1/2020.emnlp-main.468.

[18] Cao, Wang, Controllable open-ended question generation with a new question type ontology, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6424–6439.

[19] Chen, Yang, Hauff, G.-J. Houben, Learningq: a large-scale dataset for educational question generation, in: Twelfth International AAAI Conference on Web and Social Media, 2018.

[20] Gong, Pan, Hu, Khanq: A dataset for generating deep questions in education, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 5925–5938.

[21] Pollak, Podpecan, Kranjc, Lesjak, Lavrac, Scientific question generation: Pattern-based and graph-based robochair methods, in: ICCC, 2021, pp. 140–148.

[22] Ono, Sumita, S. M. Research, D. Center, T. C. Komukai-Toshiba-cho, et al., Abstract generation based on rhetorical structure extraction, arXiv preprint cmp-lg/9411023 (1994).

[23] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, N. Goharian, A discourse-aware attention model for abstractive summarization of long documents, arXiv preprint arXiv:1804.05685 (2018).

[24] I. Cachola, K. Lo, A. Cohan, D. S. Weld, Tldr: Extreme summarization of scientific documents, arXiv preprint arXiv:2004.15011 (2020).

[25] Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, N. F. Rajani, Reviewrobot: Explainable paper review generation based on knowledge synthesis, arXiv preprint arXiv:2010.06119 (2020).

[26] Q. Wang, Y. Xiong, Y. Zhang, J. Zhang, Y. Zhu, Autocite: Multi-modal representation fusion for contextual citation generation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 788–796.

[27] Q. Wang, L. Huang, Z. Jiang, K. Knight, H. Ji, M. Bansal, Y. Luan, Paperrobot: Incremental draft generation of scientific ideas, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1980–1991.

[28] A. Cardoso, T. Veale, G. A. Wiggins, Converging on the divergent: The history (and future) of the international joint workshops in computational creativity, AI magazine 30 (2009) 15–15.

[29] G. Franceschelli, M. Musolesi, Creativity and machine learning: A survey, arXiv preprint arXiv:2104.02726 (2021).

[30] W. Yu, C. Zhu, T. Zhao, Z. Guo, M. Jiang, Sentence-permuted paragraph generation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5051–5062.

[31] M. A. Boden, The creative mind: Myths and mechanisms, Routledge, 2004.

[32] M. Lee, P. Liang, Q. Yang, Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities, in: CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.

[33] S. Wang, Y. Liu, Y. Xu, C. Zhu, M. Zeng, Want to reduce labeling cost? gpt-3 can help, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4195–4205.

[34] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, M. Jiang, Generate rather than retrieve: Large language models are strong context generators, in: International Conference for Learning Representation (ICLR), 2023.

[35] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.

[36] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.

[37] M. Gaur, K. Gunaratna, V. Srinivasan, H. Jin, Iseeq: Information seeking question generation using dynamic meta-information retrieval and knowledge graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 10672–10680.

[38] J. Li, M. Galley, C. Brockett, J. Gao, W. B. Dolan, A diversity-promoting objective function for neural conversation models, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 110–119.

[39] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019).