Automatic Generation of Research Highlights from Scientific Abstracts

Tohida Rehman, tohida.rehman@gmail.com, Jadavpur University, Kolkata, India

Debarshi Kumar Sanyal, debarshisanyal@gmail.com, Indian Association for the Cultivation of Science, Kolkata, India

Samiran Chattopadhyay, samirancju@gmail.com, TCG CREST; Jadavpur University, Kolkata, India

Plaban Kumar Bhowmick, plaban@cet.iitkgp.ac.in, IIT Kharagpur, India

Partha Pratim Das, IIT Kharagpur, India

CCS CONCEPTS

Information systems → Information extraction; Summarization

KEYWORDS

Pointer-generator network; Deep learning; Natural language generation

The rapid growth in scientific publications makes it difficult for researchers to keep track of new research, even in narrow sub-fields. While an abstract is the traditional way to present a high-level view of a paper, it is increasingly supplemented with research highlights that explicitly identify the important findings of the paper. In this poster, we aim to automatically construct research highlights given the abstract of a paper. We use deep neural network-based models for this purpose and achieve high ROUGE and METEOR scores on a large corpus of computer science papers.

INTRODUCTION

The number of scientific publications doubles roughly every 9 years [10], making it hard for researchers to keep up even with their own fields. One recent trend is to provide research highlights, a bulleted list of the main contributions of the paper, along with the abstract and the main text. Highlights are potentially easier to read than abstracts, especially on mobile devices, and focus more on findings than on background. Additionally, research highlights could be useful for other tasks such as finding surrogates for access-restricted papers [5,7] and keyphrase extraction [6]. We use a pointer-generator network with a coverage mechanism to automatically generate highlights given the abstract of a research paper. Unlike prior work [2], which classifies sentences in the full text as highlights or not, we focus on generating highlights.

METHODOLOGY

We use a dataset released by Collins et al. [2] containing URLs of 10142 computer science publications from ScienceDirect (https://www.sciencedirect.com/). Each example in the dataset is organized as a pair (abstract, author-written research highlights): 8115 pairs are used for training, 1014 pairs for validation and 1013 pairs for testing. In this dataset, the average abstract length is 186 words while that of the highlights is 52 words; for 98% of the papers, the highlights are at least 1.5 times shorter than the abstract.

EEKE '21, September 30, 2021, Online
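As a quick sanity check on the dataset figures, the following minimal sketch splits a list of (abstract, highlights) pairs into the reported partition sizes and computes the abstract-to-highlights length ratio. The function names, the sequential split order and the whitespace tokenization are assumptions made for illustration; the text does not state how the partitions were drawn or how words were counted.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (abstract, author-written highlights)


def split_pairs(pairs: Sequence[Pair], n_train: int = 8115, n_val: int = 1014):
    """Split pairs sequentially into train/validation/test (assumed order)."""
    train: List[Pair] = list(pairs[:n_train])
    val: List[Pair] = list(pairs[n_train:n_train + n_val])
    test: List[Pair] = list(pairs[n_train + n_val:])
    return train, val, test


def length_ratio(abstract: str, highlights: str) -> float:
    """Abstract length divided by highlights length, in whitespace tokens."""
    n_highlight_words = len(highlights.split())
    if n_highlight_words == 0:
        raise ValueError("highlights must contain at least one word")
    return len(abstract.split()) / n_highlight_words


def share_with_ratio_at_least(pairs: Sequence[Pair], threshold: float = 1.5) -> float:
    """Fraction of pairs whose abstract is at least `threshold` times longer."""
    if not pairs:
        return 0.0
    hits = sum(1 for a, h in pairs if length_ratio(a, h) >= threshold)
    return hits / len(pairs)
```

Applied to 10142 pairs, `split_pairs` yields partitions of 8115, 1014 and 1013 examples. The reported average lengths (186 and 52 words) correspond to a ratio of about 3.6.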

We use three deep learning-based models to generate research highlights. Model 1 is the sequence-to-sequence (seq2seq) model with attention [3]. Each abstract is tokenized and the tokens are converted to 128-dimensional GloVe vectors [4] that are fed sequentially into the encoder, which is a single-layer bidirectional Long Short-Term Memory (BiLSTM) network. The decoder is a single-layer unidirectional LSTM. The model uses neural attention [1] to attend to the words in the source document while generating the target words for the summary. Model 2 is a pointer-generator network [8], which augments the above seq2seq model with a copying mechanism. When generating each word, the decoder probabilistically decides between generating a new word from the vocabulary (i.e., from the training corpus) and copying a word from the input abstract (by sampling from the attention distribution). While the generator helps in novel paraphrasing, copying helps to handle out-of-vocabulary (OOV) words. Model 3 augments the second model with the coverage mechanism of Tu et al. [9] to avoid erroneously repeating the same words during decoding. For all the models, we use the same vocabulary of around 50K tokens, beam search in the decoder with a beam size of 4, a maximum input length of 400 tokens and a maximum output length of 100 tokens.
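To make the copy and coverage mechanisms concrete, here is a minimal pure-Python sketch of one decoding step, following the formulation of See et al. [8]. The final word distribution mixes the vocabulary distribution and the attention distribution using a generation probability, and the coverage vector accumulates past attention. The function names and the toy numbers are illustrative assumptions. In a real model, `p_gen`, `vocab_dist` and `attention` would be produced by the trained network, and whether the models here use exactly this coverage penalty is also an assumption.

```python
from typing import List, Sequence, Tuple


def final_distribution(p_gen: float,
                       vocab_dist: Sequence[float],
                       attention: Sequence[float],
                       source_ids: Sequence[int],
                       extended_vocab_size: int) -> List[float]:
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on copies of w.

    `source_ids` maps each source position to an id in the extended
    vocabulary (fixed vocabulary plus OOV words of this abstract).
    """
    if len(attention) != len(source_ids):
        raise ValueError("attention and source_ids must have equal length")
    final = [0.0] * extended_vocab_size
    # Generation part: only in-vocabulary words receive this mass.
    for word_id, prob in enumerate(vocab_dist):
        final[word_id] = p_gen * prob
    # Copy part: attention mass is added to the word at each source position.
    for attn, word_id in zip(attention, source_ids):
        final[word_id] += (1.0 - p_gen) * attn
    return final


def coverage_step(coverage: Sequence[float],
                  attention: Sequence[float]) -> Tuple[float, List[float]]:
    """Return the coverage penalty sum_i min(a_i, c_i) and the updated coverage."""
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    updated = [c + a for c, a in zip(coverage, attention)]
    return loss, updated


# Toy example: fixed vocabulary of 4 words (ids 0..3); source position 1
# holds an OOV word that gets id 4 in the extended vocabulary.
vocab_dist = [0.1, 0.2, 0.3, 0.4]
attention = [0.5, 0.3, 0.2]
source_ids = [1, 4, 1]
dist = final_distribution(0.6, vocab_dist, attention, source_ids,
                          extended_vocab_size=5)

# Attending to the same positions twice is penalized on the second step.
loss_1, coverage = coverage_step([0.0, 0.0, 0.0], attention)
loss_2, coverage = coverage_step(coverage, attention)
```

In the toy example, the OOV word (id 4) can only receive probability through copying, which is how the pointer mechanism handles OOV words. The second call to `coverage_step` returns a positive penalty because the decoder attends again to positions it has already covered.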

RESULTS & ANALYSIS

Results are shown in Table 1 for ROUGE-1, ROUGE-2, ROUGE-L and METEOR as (R)ecall, (P)recision and (F1)-score. Author-written highlights are used as the gold-standard output. Model 3 (pointer-generator model with coverage mechanism) achieves the highest F1-score on every metric. In the case study in Fig. 1, Model 1 generates many OOV words and factual errors. Model 2 generates more meaningful research highlights and even relevant novel words that capture the context of the paper much better. Model 2 sometimes outputs repeated words, but Model 3 reduces them. The first sentence from Model 3 contains words ('without object ... properties') that do not fit the context, but its other highlights are meaningful.

Table 1: Comparison of models for research highlight generation.

         ROUGE-1              ROUGE-2              ROUGE-L              METEOR (synonym/paraphrase/stem)
         R      P      F1     R      P      F1     R      P      F1     R      P      F1     Final score
Model 1  20.90  20.47  19.90  2.02   2.02   1.93   19.49  19.16  18.58  17.86  17.69  17.78  7.39
Model 2  30.99  32.07  30.9   7.48   8.06   7.55   28.66  30.34  28.62  25.53  26.61  26.06  11.04
Model 3  31.6   33.32  31.46  8.52   9.2    8.57   29.2   30.9   29.14  27.64  29.26  28.43  12.01

Title: Multiscale computation for transient heat conduction problem with radiation boundary condition in porous materials

Abstract: This paper reports a multiscale asymptotic analysis and computation for predicting heat transfer performance of periodic porous materials with radiation boundary condition. In these porous materials thermal radiation effect at micro-scale have an important impact on the macroscopic temperature field, which is our particular interest in this study. The multiscale asymptotic expansions for computing temperature field of the problem are constructed, and associated explicit convergence rates are obtained on some regularity hypothesis. Finally, the corresponding finite element algorithms based on the multiscale method are brought forward and some numerical results are given in details. The numerical tests indicate that the developed method is feasible and valid for predicting the heat transfer performance of periodic porous materials, and support the approximate convergence results proposed in this paper.

Author-written highlights: A novel multiscale analysis and computation is proposed. Heat transfer problem of periodic porous materials with radiation boundary condition are considered. Error estimates of the multiscale approximate solution are derived on some regularity hypothesis. Some numerical results are given in details to validate the multiscale method.

Output of Model 1: A non-intrusive numerical tool is developed for solar artery supply planning. The results were analyzed, based on the fe modeling of finite element model. The approaches provides practical advantages of the cohort and accuracy of concrete equipment. Agent-based fe technology neural network procedures are used for assessment assessment. Results obtained from a real composite sample are considered and discussed.

Output of Model 2: This paper reports a multiscale asymptotic analysis and developed protocol. The proposed approach is based on regularity hypothesis expansions. The proposed method is robust and can achieve predicting heat transfer performance. The proposed method is robust and efficient for given bone microstructure samples.

Output of Model 3: Reports a multiscale asymptotic analysis without object propagation using minimal porous properties. Predicting heat transfer performance of periodic porous materials with radiation boundary condition. Finite element algorithms and computation of approximate convergence results.

Figure 1: Original abstract, author-written research highlights and model-generated research highlights. The meaning of the colors (e.g., green = correct) is explained in main text. Abstract taken from https://www.sciencedirect.com/science/article/abs/pii/S0168874X15000621

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

CONCLUSION

We applied three different deep neural models to generate research highlights from the abstract of a research paper. The pointer-generator network with coverage mechanism achieved the best performance.

However, the predicted research highlights are not yet perfect. One simple post-processing step would be to remove generated sentences that contain entities absent from the given abstract. We are currently exploring this and other techniques to improve the system.
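The post-processing idea raised in the conclusion (dropping generated sentences that mention entities absent from the abstract) could look roughly like the sketch below. It is a hedged illustration only: instead of a real named-entity recognizer it uses a crude proxy, treating every non-stopword token of four or more characters as a candidate entity. The stopword list, the function names and the `max_missing` tolerance are all assumptions, not part of the described system.

```python
import re
from typing import Iterable, List, Set

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS: Set[str] = {
    "this", "that", "with", "without", "from", "into", "using",
    "based", "which", "have", "been", "were", "some", "these",
}


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens of the text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_entities(sentence: str, min_len: int = 4) -> Set[str]:
    """Crude entity proxy: longer tokens that are not stopwords."""
    return {t for t in tokens(sentence)
            if len(t) >= min_len and t not in STOPWORDS}


def filter_highlights(abstract: str,
                      highlights: Iterable[str],
                      max_missing: int = 0) -> List[str]:
    """Keep sentences with at most `max_missing` candidates absent from the abstract."""
    abstract_vocab = set(tokens(abstract))
    kept = []
    for sentence in highlights:
        missing = candidate_entities(sentence) - abstract_vocab
        if len(missing) <= max_missing:
            kept.append(sentence)
    return kept
```

A filter this strict also removes legitimate novel paraphrases, so it trades abstractiveness for faithfulness. Raising `max_missing` or swapping in a proper entity recognizer would soften that trade-off.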

ACKNOWLEDGMENTS

This work is supported by a research grant from the Department of Science and Technology, Government of India, at the Indian Association for the Cultivation of Science, Kolkata, and by the National Digital Library of India Project sponsored by the Ministry of Education, Government of India, at IIT Kharagpur.

REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[2] Ed Collins, Isabelle Augenstein, and Sebastian Riedel. 2017. A supervised approach to extractive summarisation of scientific papers. In CoNLL.
[3] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.
[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
[5] TYSS Santosh, Debarshi Kumar Sanyal, Plaban Kumar Bhowmick, and Partha Pratim Das. 2018. Surrogator: A tool to enrich a digital library with open access surrogate resources. In JCDL.
[6] Tokala Yaswanth Sri Sai Santosh, Debarshi Kumar Sanyal, Plaban Kumar Bhowmick, and Partha Pratim Das. 2020. DAKE: Document-Level Attention for Keyphrase Extraction. In ECIR.
[7] Debarshi Kumar Sanyal, Plaban Kumar Bhowmick, Partha Pratim Das, Samiran Chattopadhyay, and TYSS Santosh. 2019. Enhancing access to scholarly publications with surrogate resources. Scientometrics 121, 2 (2019).
[8] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
[9] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL.
[10] Richard Van Noorden. 2014. Global scientific output doubles every nine years. Nature news blog (2014).