1st Workshop on AI + Informetrics - AII2021

Automatic Related Work Section Generation by Sentence Extraction and Reordering

Zekun Deng[0000-0001-7297-4056], Zixin Zeng, Weiye Gu, Jiawen Ji, and Bolin Hua
Department of Information Management, Peking University, China
dzk@pku.edu.cn

Abstract. The related work section is essential in a scientific publication, as it reviews past studies relevant to the topic and compares them with the current work. Automatically generating the related work section of a scientific paper is a meaningful yet challenging task. While prior works have achieved encouraging results, they have not fully addressed the issue of informativeness or the difficulty of obtaining citation sentences caused by publication delay. In this paper, we introduce SERGE, a novel and effective system for generating descriptive related work sections automatically by sentence extraction and reordering. Our system first employs a BERT-based ensemble model to select the most salient sentences in reference papers, and then uses a similar model to reorder these sentences for better readability. Automatic evaluation shows that SERGE significantly outperforms existing baselines on ROUGE metrics, with improvements of 18% to 56% in recall and 4% to 33% in F-score. Human evaluation shows that SERGE attains a higher informativeness score than both the human-written gold standard and the baseline, indicating its ability to provide valuable information that matches the real interests of researchers. In contrast to existing methods, our system is free from the delayed-citation problem and yields high informativeness, showing great potential for various applications.

Keywords: Related work section · Literature review · Scientific documents · Summarization

1 Introduction

Scientific papers usually contain a related work section, also known as a literature review. It summarizes previous work relevant to the research topic in order to establish the link between existing knowledge and new findings[1]. Very often, authors of scientific papers cite existing papers in this section to show the appropriateness of their research question, to justify their adopted methods, and/or to present the creativeness and superiority of their ideas. However, it is quite challenging to produce a high-quality related work section, since doing so involves identifying crucial points in long papers and reorganizing them in a neat and logical way.

It is generally accepted that there are two distinct styles of literature review: descriptive and integrative[2, 3]. Descriptive literature reviews focus on individual papers and provide more detailed descriptions of the methods, results, and implications of each study. They describe previous research with high accuracy and are thus more objective and rigorous. In contrast, integrative literature reviews focus more on the synthesis of ideas. Although they include fewer details of individual studies, integrative literature reviews provide more high-level critical summaries of topics and are thus more condensed and structurally complex[4]. In this paper, we particularly focus on the generation of descriptive related work sections. On this matter, Cohan and Goharian[5] have proposed a sentence ranking algorithm that takes advantage of citation context to summarize scientific papers.
Abura'ed et al.[6] have proposed a citation-based summarizer for scientific documents based on supervised learning and achieved competitive results in the CL-SciSumm-17 challenge. However, most existing studies require citing sentences (a.k.a. "citances") from citing publications as input. These strategies are therefore limited by publication delay: a new publication is usually not widely recognized and cited within a short period of time, so it is quite hard to obtain citing sentences that mention it.

To address this problem, we propose a novel method for the automatic generation of descriptive related work sections in scientific papers that extracts salient sentences from the scientific literature and rearranges them into a logical order. In contrast to most existing methods, which suffer from the citation delay problem, our method does not require any citances to achieve its goal, making it applicable even when no citation data is available. The main contributions of this paper are as follows:

1. We propose a novel and effective approach to automatic descriptive related work section generation based on extractive document summarization techniques, including sentence extraction and reordering.
2. Our method does not need any citation data to achieve its goal, which means it neither suffers from the delay of citing publications nor requires citation data as input. This characteristic broadens the range of potential applications of the proposed method.

2 Related Work

Automatic related work section generation is a special case of multi-document summarization tailored to scientific articles[7]. Multi-document summarization can be either extractive or abstractive, depending on whether the summary contains sentences taken from the source articles[8]. Partly due to the scarcity of training data and computational challenges, a large proportion of previous research is in the extractive track, which typically consists of a sentence classification sub-task and a sentence reordering sub-task[7–9]. Common approaches for extracting relevant sentences include graph-based ranking algorithms[10] and neural classification models[11]. Subsequently, extracted sentences are reordered based on heuristic criteria or neural architectures with sentence ordering mechanisms[12, 13].

Automatic related work generation differs from the summarization of generic texts in the following aspects: 1) summarization of generic texts usually focuses on the content of the source documents, whereas a related work section should also delineate the contributions and limitations of the reference papers (i.e., the cited papers) as well as the relationship between the reference papers and the current paper; 2) compared to generic texts, scientific articles contain more domain-specific concepts and technical terms, which poses great challenges for language modeling; and 3) scientific articles are more structured than generic texts and reference prior research[8, 14]. Accordingly, various specialized approaches have been put forward. For instance, Jaidka, Khoo and Na[15] proposed a literature review generation framework that imitates human writing behavior. Many other algorithms are based on citing behavior in scientific articles. Hu and Wan[16] used Probabilistic Latent Semantic Analysis (pLSA) to rank sentences from a set of reference papers.
Chen and Zhuge[17] analyzed citation sentences to detect common facts, which were then used to find relevant sentences. More recently, Saggion, Shvets and Bravo[3] exploited a pointer-generator architecture with a copy-attention technique and a coverage mechanism to produce descriptive related work sections.

However, we believe that prior research on related work generation, with ROUGE scores as the most popular evaluation metric, has not fully addressed the issue of informativeness: the property of conveying useful information[9]. This is potentially problematic because the ROUGE metric may penalize summaries that contain relevant sentences not included in the gold-standard summary[8]. Also, as discussed in Section 1, most previous methods are citation-based and are thus infeasible in the case of delayed citation. Therefore, this paper proposes a novel method to tackle these problems.

3 System Design

3.1 Overall Architecture

Here, we introduce our system, SERGE, which stands for "Sentence extraction and rEordering based Related work section GEnerator". The overall architecture of SERGE is briefly illustrated in Fig. 1. Given a set of papers from which a related work section is to be automatically generated, the system takes the full text of these papers as input and generates a descriptive related work section covering all of them as output. SERGE consists of two main parts: a classification model and a reordering model. The classification model determines whether a sentence is sufficiently salient to be included in the generated output. For each input sentence, the classification model produces a probability value indicating the salience of the sentence. The sentences with the highest probability values are then fed into the reordering model, which determines their best order (sequence of sentences). Lastly, the sentences, sorted by the reordering model into the most sensible order, are modified with citation tags and proper pronouns, forming the final output of the system.

Fig. 1. Overall architecture of SERGE

3.2 Classification Model

The task of the classification model is formally stated as follows: given a sentence $s$, the model is supposed to assign a label $Y(s) \in \{0, 1\}$ to the sentence, or rather, compute a probability value $y(s) \in [0, 1]$ indicating the salience of sentence $s$. To accomplish this task, we adopt an ensemble model consisting of two sub-models: a deep neural network model and a bag-of-words (BOW) sentence matching model. The input sentence $s$ is fed into both models simultaneously.

The architecture of the deep neural network model follows Google's original BERT paper[13], as illustrated in Fig. 2. The input sequence $s$ is first processed by BERT with pre-trained parameters. Then, the final hidden vector of the BERT model $C \in \mathbb{R}^H$ corresponding to the first input token ([CLS]) is fed into a classification layer on top. The classification layer computes a vector $Z = \mathrm{softmax}(CW^T)$, where $W \in \mathbb{R}^{2 \times H}$ is learnable and $Z = (z_0, z_1)$ is a 2-dimensional vector in which $z_i$ is the estimated probability that the true label of the input sequence is $i$ ($i = 0, 1$).

Fig. 2. Architecture of the BERT classification model
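For concreteness, this classification head admits a compact implementation. The sketch below is a minimal illustration using the Hugging Face transformers library and the bert-base-uncased checkpoint, both of which are our own assumptions (the paper specifies only a pre-trained BERT with a learnable layer $W$ on top of the [CLS] vector); in the actual system, $W$, and typically BERT itself, would be fine-tuned on the labeled data described next.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the paper only says "pre-trained parameters".
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Learnable classification layer W in R^{2 x H}; H = 768 for BERT-base.
# In SERGE this layer is trained on the automatically labeled dataset.
W = torch.nn.Linear(bert.config.hidden_size, 2, bias=False)

def salience_probability(sentence: str) -> float:
    """Return z_1, the estimated probability that the sentence is salient."""
    inputs = tokenizer(sentence, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    # C: final hidden vector of the first ([CLS]) token.
    C = outputs.last_hidden_state[:, 0, :]
    Z = torch.softmax(W(C), dim=-1)  # Z = softmax(C W^T) = (z_0, z_1)
    return Z[0, 1].item()            # z_1
```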
The training of the model requires annotated data pairs. However, due to the lack of a suitable training corpus, we opt to annotate a new dataset automatically by leveraging the ScisummNet corpus[14], a large annotated dataset containing 1,000 ACL Anthology papers with their citations. For each paper in the corpus, the dataset includes the full text and the incoming citation sentences. Based on the generally accepted assumption that citation sentences usually underscore the most important aspects of the cited paper and highlight its key contributions, we use the citation sentences of each paper in the corpus to produce a gold label indicating whether a sentence in that paper is salient. An algorithm similar to that of Nallapati et al.[11] is applied to label each sentence in a paper, as follows:

(1) Join all citation sentences of the paper together to form a benchmark paragraph $P^*$, and create an empty paragraph $P$ containing no sentences.
(2) From the abstract or conclusion of the paper, select and append to $P$ the sentence that maximizes the ROUGE score between the updated paragraph $P$ and $P^*$ and that has not been appended to $P$ before.
(3) Repeat Step 2 until the ROUGE score no longer increases.
(4) Label all sentences included in $P$ as 1 (salient) and all other sentences as 0 (not salient).
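A minimal sketch of this greedy labeling procedure follows. The rouge_fn scorer is left abstract because the paper does not state which ROUGE variant it maximizes; the function and variable names are illustrative.

```python
from typing import Callable, List

def greedy_label(candidates: List[str],
                 citation_sentences: List[str],
                 rouge_fn: Callable[[str, str], float]) -> List[int]:
    """Label candidate sentences (from a paper's abstract/conclusion)
    as salient (1) or not (0) against the paper's incoming citations.

    rouge_fn(candidate_text, reference_text) returns a ROUGE score;
    the paper does not specify which ROUGE variant is used.
    """
    benchmark = " ".join(citation_sentences)  # benchmark paragraph P*
    selected: List[str] = []                  # paragraph P, initially empty
    best_score = 0.0
    while True:
        best_sentence = None
        for s in candidates:
            if s in selected:
                continue
            # Score of P with s tentatively appended.
            score = rouge_fn(" ".join(selected + [s]), benchmark)
            if score > best_score:
                best_score, best_sentence = score, s
        if best_sentence is None:             # no further increase: stop
            break
        selected.append(best_sentence)
    return [1 if s in selected else 0 for s in candidates]
```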
By employing the greedy annotating algorithm described above, we obtain an annotated dataset that can be used to train our neural classification model. The dataset includes 11,954 training samples, of which 3,541 are positive.

The bag-of-words sentence matching model simply checks whether the input sentence $s$ contains any of the words in a curated feature word set $B$. We manually choose the words contained in $B$ according to the findings of Shin[18], who proposed a dictionary for detecting innovative points in academic literature. Examples of words in $B$ are "novel", "propose", and "improve". Denoting the predicted label of this BOW model as $r(s)$, we have $r(s) = 1$ if and only if $s$ contains at least one word in $B$; otherwise $r(s) = 0$.

The combined output of the ensemble model is defined by the following equations:

$$y(s) = \begin{cases} 0, & 0 \le z_1 \le \tau_L \\ r(s) \cdot z_1, & \tau_L < z_1 < \tau_H \\ z_1, & \tau_H \le z_1 \le 1 \end{cases} \qquad (1)$$

$$Y(s) = \begin{cases} 0, & 0 \le y(s) \le 0.5 \\ 1, & 0.5 < y(s) \le 1 \end{cases} \qquad (2)$$

Essentially, Eq. 1 embodies a trade-off between precision and recall. By setting $\tau_L = 0.2$ and $\tau_H = 0.4$, the ensemble model achieves the most desirable overall performance, with a precision of 0.622 and a recall of 0.793.

3.3 Reordering Model

The task of the reordering model is formally stated as follows: given a set of sentences $S = (s_1, s_2, \ldots, s_n)$, the model is supposed to find the optimal arrangement $s_{i_1}, s_{i_2}, \ldots, s_{i_n}$ (where $i_r \neq i_t$ for all $r \neq t$ and $i_k \in \{1, 2, \ldots, n\}$) that maximizes the probability of the sequence, $P(s_{i_1}, s_{i_2}, \ldots, s_{i_n})$. However, considering the sheer scale of the search space, it is practically impossible to solve this problem directly. Therefore, we decompose the big problem into much smaller ones using a method similar to that of Chen et al.[12]. By the definition of conditional probability,

$$P(s_{i_1}, s_{i_2}, \ldots, s_{i_n}) = P(s_{i_1}) \prod_{k=2}^{n} P(s_{i_k} \mid s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) \qquad (3)$$

To simplify the calculation, let

$$P(s_{i_k} \mid s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = P(s_{i_k} \mid s_{i_{k-1}}) \qquad (4)$$

where $k \in \{2, 3, \ldots, n\}$. We also assume $P(s_{i_1}) = 1$. Thus, Eq. 3 becomes

$$P(s_{i_1}, s_{i_2}, \ldots, s_{i_n}) = \prod_{k=2}^{n} P(s_{i_k} \mid s_{i_{k-1}}) \qquad (5)$$

As Eq. 5 shows, the probability of every possible arrangement can be obtained simply by computing the conditional probability of each sentence appearing after another, which reduces the complexity of the problem significantly.
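Since the system computes the probability of all possible arrangements and, as noted below, at most $n_{max} = 3$ sentences reach this stage, the maximization of Eq. 5 can be done by exhaustive search over permutations. Below is a minimal sketch, where p_next(a, b) stands for the estimate of $P(b \mid a)$ produced by the fine-tuned next-sentence-prediction model described next; the function names are our own.

```python
from itertools import permutations
from typing import Callable, List, Sequence

def best_order(sentences: Sequence[str],
               p_next: Callable[[str, str], float]) -> List[str]:
    """Return the ordering of `sentences` that maximizes Eq. 5.

    p_next(a, b) estimates P(b | a): the probability that sentence b
    immediately follows sentence a. With n <= n_max = 3 sentences,
    trying all n! permutations is inexpensive.
    """
    def sequence_prob(order: Sequence[str]) -> float:
        prob = 1.0                      # P(s_{i_1}) is assumed to be 1
        for prev, cur in zip(order, order[1:]):
            prob *= p_next(prev, cur)   # multiply pairwise conditionals
        return prob

    return list(max(permutations(sentences), key=sequence_prob))
```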
We use a MobileBERT[19]-based model to compute the conditional probabilities. MobileBERT is a compact, task-agnostic BERT that runs more than five times faster than BERT-Base while achieving comparable results on a variety of benchmarks. We adopt MobileBERT instead of BERT mainly for practical reasons: the system's running time would be intolerable if BERT were used in a real-world application.

To obtain the desired output, MobileBERT is fine-tuned on the next sentence prediction (NSP) task, which produces a probability indicating whether the first sentence in the input is followed by the second in the source document. The architecture of our NSP model is identical to that of the neural classification model described in Section 3.2, except that the BERT layers are replaced by MobileBERT. The training data for the NSP model is likewise extracted from the ScisummNet corpus, whose writing style closely matches the expected input. To build the dataset, we take every pair of neighboring sentences from the 1,000 papers as a positive sample, combined with roughly the same number of negative samples in which the two sentences are not adjacent, yielding a total of 360,509 training samples.

Finally, to find the optimal sentence sequence, the model computes the probability of all possible arrangements of the sentences in $S$ using Eq. 5. For the sake of concision of the final output, if the output of the classification model exceeds $n_{max} = 3$ sentences, only the ones with the highest predicted values $y(s)$ are kept, and the rest are discarded.

4 Evaluation and Results

Both automatic and human evaluation are employed to assess the performance of SERGE. Ten descriptive related work paragraphs are randomly extracted from 8 computer science papers published in journals or proceedings. The papers cited in each paragraph are collected to form a set of reference papers. MEAD[20] is used as the baseline. In the automatic evaluation, SERGE and the baseline system each generate a related work section for every set of reference papers. The outputs of the two systems are then compared with the human-written paragraphs in the published papers by computing the ROUGE scores between them. In the human evaluation, 3 computer science experts are instructed to grade the paragraphs generated by the two systems, as well as the gold standard, on three aspects: informativeness (INF), fluency (FLU), and succinctness (SUC). The experts are not informed of the authorship of the texts. Scores range from 1 to 5.

The results of the automatic evaluation are presented in Table 1. Except for ROUGE-1 precision, SERGE outperforms the baseline on all metrics. Notably, our system achieves significantly higher recall on all three ROUGE metrics, with relative gains of 20%, 56%, and 18%, respectively.

Table 1. Mean scores of the automatic evaluation of SERGE and the baseline

            ROUGE-1             ROUGE-2             ROUGE-L
System      P     R     F       P     R     F       P     R     F
SERGE      .258  .361  .297    .045  .064  .052    .228  .290  .254
MEAD       .275  .302  .283    .038  .041  .039    .226  .245  .233

The results of the human evaluation are presented in Table 2. SERGE attains an informativeness score of 4.23, about 6% higher than the baseline and 3% higher than the gold standard. SERGE also outperforms the baseline on succinctness. In short, these results indicate that our system performs significantly better than the baseline in numerous aspects and even yields higher informativeness than the human-written gold standard.

Table 2. Mean scores of the human evaluation of SERGE, the baseline, and the gold standard

Source          INF    FLU    SUC
SERGE           4.23   3.83   3.83
MEAD            3.97   3.97   3.80
Gold standard   4.10   4.03   4.00

5 Discussion and Conclusion

In this paper, we propose a novel method for the automatic generation of descriptive related work sections in scientific papers by extracting salient sentences from past literature and reordering them into a smooth paragraph, which is made possible by two BERT-based neural models. The performance of our method is evaluated both with automatic metrics and by human experts. The results show that our method achieves a substantial improvement over the baseline and a degree of informativeness comparable to that of human authors.

Our method addresses the problems of existing works in two ways. First, it significantly improves the informativeness of automatically generated related work sections, providing more of the valuable information in the existing literature that matches the real interests of researchers. Second, it is immune to the citation delay problem, suggesting its prospects for a wider range of applications. There are several limitations in the current study. For example, the evaluation is not sufficiently robust due to the high cost of human assessment. Also, our method is not necessarily optimal in fluency and a few other respects. These issues leave room for future exploration.

The results of this study clearly show the effectiveness of our novel method for related work section generation. Although the corpus used in this study is limited to the computer science field, it is straightforward to adapt our method to other disciplines. Considering its universality and adaptability, our method shows tremendous potential to become an intelligent and helpful tool that can increase the efficiency of researchers and boost scientific innovation. In the future, we may continue to explore new methods for this task via various paths, for instance abstractive summarization approaches or entity extraction. We are also interested in integrating the summarization problem with knowledge bases, which would bring more intelligence to automatic systems.

Acknowledgements

This work was supported in part by the National Social Science Foundation of China (No. 17BTQ066).

References

1. Webster, J., Watson, R.T.: Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly 26(2), xiii–xxiii (2002), http://www.jstor.org/stable/4132319
2. Jaidka, K., Khoo, C.S., Na, J.C.: Literature review writing: how information is selected and transformed. In: Aslib Proceedings: New Information Perspectives. vol. 65, pp. 303–325. Emerald (2013). https://doi.org/10.1108/00012531311330665
3. Saggion, H., Shvets, A., Bravo, À., et al.: Automatic related work section generation: experiments in scientific document abstracting. Scientometrics 125(3), 3159–3185 (2020).
https://doi.org/10.1007/s11192-020-03630-2
4. Khoo, C.S., Na, J.C., Jaidka, K.: Analysis of the macro-level discourse structure of literature reviews. Online Information Review 35(2), 255–271 (2011). https://doi.org/10.1108/14684521111128032
5. Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article's discourse structure. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 390–400 (2015). https://doi.org/10.18653/v1/D15-1045
6. Chiruzzo, L., Saggion, H., Accuosto, P., Bravo, À., et al.: LaSTUS/TALN @ CLSciSumm-17: Cross-document sentence matching and scientific text summarization systems. In: BIRNDL@SIGIR (2) (2017)
7. Teslyuk, A.: The concept of a system for automated scientific literature reviews generation. In: Krzhizhanovskaya, V.V., Závodszky, G., Lees, M.H., Dongarra, J.J., Sloot, P.M.A., Brissos, S., Teixeira, J. (eds.) Computational Science – ICCS 2020. pp. 437–443. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-50420-5_32
8. Ibrahim Altmami, N., El Bachir Menai, M.: Automatic summarization of scientific articles: A survey. Journal of King Saud University - Computer and Information Sciences (2020). https://doi.org/10.1016/j.jksuci.2020.04.020, https://www.sciencedirect.com/science/article/pii/S1319157820303554
9. Liu, Y., Lapata, M.: Hierarchical Transformers for multi-document summarization. arXiv e-prints arXiv:1905.13164 (May 2019)
10. Erkan, G., Radev, D.R.: LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, 457–479 (2004). https://doi.org/10.1613/jair.1523
11. Nallapati, R., Zhai, F., Zhou, B.: SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. Proceedings of the AAAI Conference on Artificial Intelligence 31(1) (Feb 2017), https://ojs.aaai.org/index.php/AAAI/article/view/10958
12. Chen, X., Qiu, X., Huang, X.: Neural sentence ordering. arXiv e-prints arXiv:1607.06952 (Jul 2016)
13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv e-prints arXiv:1810.04805 (Oct 2018)
14. Yasunaga, M., Kasai, J., Zhang, R., Fabbri, A., Li, I., Friedman, D., Radev, D.: ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks. In: Proceedings of AAAI 2019 (2019). https://doi.org/10.1609/aaai.v33i01.33017386
15. Jaidka, K., Khoo, C., Na, J.C.: Deconstructing human literature reviews – a framework for multi-document summarization. In: Proceedings of the 14th European Workshop on Natural Language Generation. pp. 125–135. Association for Computational Linguistics, Sofia, Bulgaria (Aug 2013), https://www.aclweb.org/anthology/W13-2116
16. Hu, Y., Wan, X.: Automatic generation of related work sections in scientific papers: An optimization approach. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1624–1633. Association for Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1170, https://www.aclweb.org/anthology/D14-1170
17. Chen, J., Zhuge, H.: Automatic generation of related work through summarizing citations. Concurrency and Computation: Practice and Experience 31(3), e4261 (2019). https://doi.org/10.1002/cpe.4261, https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4261
18. Shin, Y.: Research on Innovative Point Identification and Mining of Academic Literature. Master's thesis, Peking University (Jun 2020)
19. Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 2158–2170. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.195, https://www.aclweb.org/anthology/2020.acl-main.195
20. Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J., Çelebi, A., Dimitrov, S., Drabek, E.F., Hakim, A., Lam, W., Liu, D., et al.: MEAD - a platform for multi-document multilingual text summarization. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04) (2004). https://doi.org/10.7916/D8MG7XZT