Improving Scientific Article Visibility by Neural Title Simplification

Alexander Shvets
alexander.shvets@upf.edu
Universitat Pompeu Fabra, 08018 Barcelona, Spain

Keywords: Scientific Text Summarization, Machine Translation, Recommender Systems, Personalized Simplification

The rapidly growing amount of data that scientific content providers must deliver to users drives them to create effective recommendation tools. The title of an article is often the only element shown to attract people's attention. We offer an approach to automatically generating titles with various levels of informativeness to benefit different categories of users. Statistics from ResearchGate used to bias the training datasets, together with a specially designed post-processing step applied to neural sequence-to-sequence models, make it possible to reach the desired variety of simplified titles and to achieve a trade-off between the attractiveness and the transparency of a recommendation.

Introduction

The amount of information the scientific community produces daily means that researchers need proper guidance in the digital space. This virtual assistance is provided by various scientometric systems, research paper recommender systems (Haruna et al., 2017), and different kinds of search engines. Shvets et al. (2015) summarize the most common types of systems for scientometric analysis. A recent trend in scientific paper delivery is purpose-specific web resources, blogs, and e-journals, often coupled with email subscriptions. They often provide personalized recommendations based on users' behavior and preferences.

The recommendation usually takes the form of an imprint, often limited to a title alone (as in email subscriptions, where space is limited and there is little time to attract people's attention). Ultimately, the success of a recommendation depends on the informativeness of the article title relative to the user's intentions and familiarity with a given scientific field. This points to the need for a way of varying the title of the same paper for different categories of users.

The focus of this paper is on developing models that create a variety of simplified versions of scientific article titles that are condensed yet informative enough and, at the same time, correspond to the original topic of the paper so as to maintain users' loyalty. We aim to support two scenarios of personalized simplification: the first ensures a narrow focus on specific scientific concepts for goal-oriented experts, and the second provides a general overview for researchers working on the edge of a topic who are willing to expand their horizons. The second case should not be treated as the generation of clickbait (catchy, short, misleading headings), which is to be blocked using efficient machine learning approaches (Biyani et al., 2016).

A variety of algorithms could be used for title simplification, which is a rapidly growing research area (Bouayad-Agha et al., 2009; Saggion et al., 2015; Guo et al., 2018). Since the defined task is similar to text compression and abstractive summarization, we opted for encoder-decoder neural architectures (Nallapati et al., 2016; Nikolov et al., 2018).

The remainder of the paper is structured as follows. In Section 2, we propose a method for scientific title diversification and simplification. Section 3 describes the datasets used for training. Section 4 describes the experiment setup. Section 5 provides the results of numerical experiments. Section 6 is devoted to human evaluation. Finally, in Section 7, we discuss the results and outline future work.

Method

Recent advances in neural machine translation (NMT) encourage solving the task in a supervised manner, controlling the style of a title by conditioning the training data. The method we propose comprises the following steps: a) selecting a subset of an abstract-to-title dataset to impose conditions that force a model to generate hypotheses with desirable properties; b) training a sequence-to-sequence (seq2seq) model; c) applying the model to title-to-title generation; d) performing a post-processing step to remove unnecessarily repeated tokens; e) filtering out titles with improper structure. The remainder of this section describes each step in detail.

To create titles of different styles for various categories of researchers, several datasets should be used. A set of highly popular scientific titles may help to generate attractive headings for users whose interests are peripheral to the subject of a paper. Requiring a multi-word noun phrase NPmw in the target text avoids producing overly shortened, pointless titles. If each training example contains a reference text Rt and a target text Tt that have similar NPmw-s (at least 2 common terms), a model might learn to preserve the most important concepts from the original titles, which experts need. Figure 1 shows a training example with similar NPmw-s in Rt and Tt.
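The conditioning described above can be sketched as a filter over training pairs. This is a minimal illustration under stated assumptions, not the author's implementation: noun phrases are assumed to be already extracted (for instance by a chunker), the stop-word list is a small hypothetical one, and the names `content_terms`, `similar`, `count_similar_pairs`, and `keep_pair` are invented for this example. Two multi-word noun phrases are treated as similar when they share at least two content terms.

```python
from typing import List, Set

# Hypothetical stop-word list; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and"}


def content_terms(phrase: str) -> Set[str]:
    """Lower-cased words of a noun phrase without stop words."""
    return {w for w in phrase.lower().split() if w not in STOP_WORDS}


def is_multiword(phrase: str) -> bool:
    """A phrase counts as multi-word if it has at least two content terms."""
    return len(content_terms(phrase)) >= 2


def similar(np_a: str, np_b: str, min_common: int = 2) -> bool:
    """Two noun phrases are similar if they share >= min_common terms."""
    return len(content_terms(np_a) & content_terms(np_b)) >= min_common


def count_similar_pairs(ref_nps: List[str], tgt_nps: List[str]) -> int:
    """Number of (reference, target) pairs of similar multi-word phrases."""
    ref_mw = [p for p in ref_nps if is_multiword(p)]
    tgt_mw = [p for p in tgt_nps if is_multiword(p)]
    return sum(similar(r, t) for r in ref_mw for t in tgt_mw)


def keep_pair(ref_nps: List[str], tgt_nps: List[str], min_pairs: int = 1) -> bool:
    """Keep a training example only if enough similar phrase pairs exist.

    A target without any multi-word noun phrase yields zero pairs and is
    therefore rejected. The default min_pairs=1 is an assumption.
    """
    return count_similar_pairs(ref_nps, tgt_nps) >= min_pairs


# Noun phrases segmented by hand from the training example of Figure 1.
ref = ["the order of presentation", "a conditional reasoning task"]
tgt = ["order of presentation", "conditional reasoning"]
print(count_similar_pairs(ref, tgt), keep_pair(ref, tgt, min_pairs=2))
```

Raising `min_pairs` or `min_common` gives a stricter condition and a smaller dataset; lowering them gives a weaker condition and a larger one.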

Input sequence (an abstract, lower case, tokenized, truncated): the main goal of this research is to study whether or not the order of presentation of the premises in a logical argument form , such as a conditional reasoning task , could affect

Target sequence (a title, tokenized, lower case): effects of order of presentation on conditional reasoning

In particular, we used the implementation in the OpenNMT toolkit (Klein et al., 2018) with the pointer enabled, which allows copying tokens from the reference text. The trained model is to be applied to new unseen titles, which, in contrast to abstracts (cut off after 50 tokens in our experiments), are not truncated. Since the task differs from the general NMT and summarization tasks in that alignment need not be tracked, the traditional coverage mechanism (Wu et al., 2016), which discourages repetitions, is not included, so as not to impose potentially harmful restrictions or overcomplicate the model. Instead, we introduce the post-processing step PS as follows. First, each repetition of a term is removed, leaving only the occurrence closest to the beginning of the text. Second, all auxiliary tokens without required terms in between or after them are eliminated. Finally, we iteratively remove the last token of the text if it is an adjective or an auxiliary token and, in addition, capitalize the title.
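The three parts of PS can be sketched as follows. This is one possible reading of the description, not the author's code: the auxiliary-token list `AUX` is a hypothetical one, adjectives are passed in as a set instead of being detected by a POS tagger, and "auxiliary tokens without required terms in between" is interpreted as an auxiliary token repeated with no term since its previous occurrence.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical list of auxiliary tokens; the paper does not enumerate them.
AUX: FrozenSet[str] = frozenset(
    {"a", "an", "the", "of", "for", "with", "in", "on", "to", "and", "or", "by", "from", "at"}
)


def postprocess(tokens: Iterable[str],
                aux: FrozenSet[str] = AUX,
                adjectives: FrozenSet[str] = frozenset()) -> str:
    # Step 1: keep only the first occurrence of every term (non-auxiliary token).
    seen = set()
    kept: List[str] = []
    for tok in tokens:
        if tok in aux:
            kept.append(tok)
        elif tok not in seen:
            seen.add(tok)
            kept.append(tok)

    # Step 2: drop an auxiliary token repeated with no term in between.
    cleaned: List[str] = []
    run = set()  # auxiliary tokens seen since the last term
    for tok in kept:
        if tok in aux:
            if tok in run:
                continue
            run.add(tok)
        else:
            run.clear()
        cleaned.append(tok)

    # Step 3: iteratively strip trailing auxiliary tokens and adjectives.
    while cleaned and (cleaned[-1] in aux or cleaned[-1] in adjectives):
        cleaned.pop()

    # Capitalize terms; auxiliary tokens stay lower case except at the start.
    words = [t if (t in aux and i > 0) else t.capitalize()
             for i, t in enumerate(cleaned)]
    return " ".join(words)


hypothesis = ("knowledge management system for knowledge competitiveness "
              "with one stop knowledge service with one stop service "
              "with one stop service with").split()
print(postprocess(hypothesis))
```

On the repetitive hypothesis from Table 2 (without its trailing ellipsis), this sketch yields the resulting title reported there.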

The last step consists in filtering out improper titles, i.e., generated sequences that have fewer than two NPmw-s similar to some NPmw-s of the source title. In use cases where even potentially pointless output is required, this step should be skipped.
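A minimal sketch of this filter is given below, assuming that noun phrases of the source and generated titles are already available and that similarity means sharing at least two content terms. The stop-word list and the function names are hypothetical, and the phrase segmentation in the example is done by hand.

```python
from typing import List, Set

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "with"}  # assumption


def content_terms(phrase: str) -> Set[str]:
    """Lower-cased words of a noun phrase without stop words."""
    return {w for w in phrase.lower().split() if w not in STOP_WORDS}


def similar(np_a: str, np_b: str, min_common: int = 2) -> bool:
    """Phrases sharing >= min_common content terms are treated as similar."""
    return len(content_terms(np_a) & content_terms(np_b)) >= min_common


def passes_filter(source_nps: List[str], generated_nps: List[str],
                  min_similar: int = 2) -> bool:
    """True if at least min_similar generated phrases match a source phrase."""
    matched = sum(any(similar(g, s) for s in source_nps) for g in generated_nps)
    return matched >= min_similar


source = ["knowledge management system", "knowledge competitiveness",
          "one stop knowledge service"]
print(passes_filter(source, ["knowledge management system", "one stop service"]))
print(passes_filter(source, ["knowledge"]))
```

A single-word phrase can never share two terms with anything, so overly short outputs are rejected without a separate length check.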

Datasets

We chose the ResearchGate¹ platform as a source of data. It has a recommender system and therefore openly counts the number of times a page with a paper was visited in order to provide reasonable recommendations, which motivates authors to be more visible.

We selected 150K imprints of articles on various topics using a wide list of general scientific words (Osipov et al., 2014) as an entry point to the articles. To detect noun phrases, we used the spaCy chunker (Honnibal and Montani, 2017), which we extended to detect complex phrases that map to single concepts (e.g., "vertex energy of a graph", which is a lexical variation of the concept "graph energy").

Experiment Setup

Selecting the first 75 characters of the reference text is generally used as a baseline in summarization tasks; cf., e.g., (Rush et al., 2015). We added a subsequent cut-off after the last noun in the phrase. This improved baseline is referred to as MBase henceforth. Several seq2seq models (M1, M2, …) with the above-described architecture, differing in the number of layers, were applied to various datasets to bias the style of the output text. They were then extended with post-processing PS (M1ps, M2ps, …) and filtering steps, which are novel to the best of the author's knowledge; cf. Table 1 for details.
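The MBase baseline can be sketched as follows. Noun detection is left to a caller-supplied predicate `is_noun`, which is a hypothetical stand-in for a POS tagger; the hand-written noun set in the example is likewise only illustrative.

```python
from typing import Callable


def mbase(title: str, is_noun: Callable[[str], bool], max_chars: int = 75) -> str:
    """Take the first max_chars characters, then cut after the last noun."""
    tokens = title[:max_chars].split()
    # Walk backwards to the last token recognized as a noun.
    for i in range(len(tokens) - 1, -1, -1):
        if is_noun(tokens[i]):
            return " ".join(tokens[: i + 1])
    # No noun found: fall back to the plain character cut-off.
    return " ".join(tokens)


# Hand-written noun set for the example title only (an assumption).
NOUNS = {"multisensor", "development", "content", "integration",
         "technologies", "journalism", "media", "monitoring", "support"}


def toy_is_noun(token: str) -> bool:
    return token.strip(".,:;").lower() in NOUNS


original = ("Multisensor: Development of multimedia content integration "
            "technologies for journalism, media monitoring and international "
            "exporting decision support")
print(mbase(original, toy_is_noun))
```

For the original title in Table 4, the 75-character prefix ends in "for", so the cut-off falls after "technologies", which agrees (up to capitalization) with the MBase title shown there.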

Results

Most of the basic models performed reasonably: the produced titles were in general shorter than the originals, and multi-word noun phrases from the reference title covered a significant part of the generated title (NPdiff-p = 0.68 on average). However, some models, especially M5, introduced many repetitions (at all checkpoints), which BLEU reflected: it was 0.18 for M5, while the average for the remaining models was 0.35. Since BLEU depends on the number of occurrences of the same word, its increase by 24% on average due to PS attests to the usefulness of this step (cf. Table 2). The filtering step allowed dropping less informative titles, so that one can take advantage even of poor models while reducing the risk of presenting misleading eye-catching headings or generic topics to a user (cf. Figure 3 for examples of generated texts).

Fig. 3. Filtering step (panels: Uninformative titles, Final titles)

The extension of the basic models led to an increase in NPdiff-p of 9% and in rouge-L-f of 11% on average. Table 3 gives an idea of how the titles of different models vary in style and compression rate.

It is worth noting that the 1-layer models M1 and M2, trained on conditioned datasets, reached higher values on the majority of measures than models M3 and M4, which were fed with generic data. This shows that pre-directing the training is reasonable.

Human Evaluation

For human evaluation, we selected five papers of the NLP research group (TALN UPF) with titles longer than 93 characters (10-18 words). Their authors, who hold Ph.D. degrees, were asked to rank the output titles for these papers, including the original title, by how likely they would be to click on a title seen briefly in a daily email digest. To cover different decision criteria, assessors worked with papers of their own authorship (simulating expert behavior) and with papers of their colleagues (the expanding-horizons use case). If some titles in a set were the same, or assessors had no preference between two similar titles, they were allowed to rank them equally. The top models sorted by average rank, with examples of titles from one set, are listed in Table 4.

Discussion and Future Work

The noted final increase in NPdiff-p and rouge-L-f indicates that common subsequences became longer relative to the length of the titles, meaning that the proposed post-processing step with filtering plays an important role in forming fluent text. At the same time, the output should not be just one of the original subsequences; therefore, we did not aim at reaching too high precision values. Pure state-of-the-art seq2seq models without the post-processing step got low ranks in human evaluation. The models M1ps and M2ps rank higher, with an average rank of 6. Their titles are well-formed and combine original multi-word expressions (cf. Table 3 for relatively high scores of rouge-2-r); however, they correspond less closely to the topic, which is partly reflected by comparatively lower values of rouge-L-p. The outputs of models M3ps and M5ps were often preferred to the original titles. With titles 1.3 times shorter than those of M5ps, the conditionally trained M6ps achieved almost the same average score. The baseline has the highest rank since it often preserves the meaning better, although it does not always form a complete phrase. Its main drawback is that it usually only generalizes a title to some extent (in the case of a well-turned subsequence) and misses details experts might need.

The close average ranks of the models, together with rouge-L-f at the same level for all models, point to an opportunity to overcome the general problem of the lack of variability in neural seq2seq generation. Different title styles make it possible to reach a preferable trade-off between the conciseness of a title and its transparency.

For future work, we plan to draw on methods of paraphrasing (Cao et al., 2017), advanced simplification (Zhang and Lapata, 2017; Štajner and Saggion, 2018), and surface realization for deep input representations (Belz et al., 2018) to obtain diverse, semantically close outputs that differ from text reformulated with mostly the same words. Fake-paper detection (Byrne and Labbé, 2017) and quality assessment of scientific texts (Shvets, 2015) will help to avoid training the models on misleading titles. Finally, pre-existing taxonomies (e.g., JEL codes in Economics, the ACM taxonomy in Computer Science, the Web of Science categories attached to journals) and meta-information of papers, such as authors' keywords or KeywordsPlus items inferred from the references cited (Garfield and Sher, 1993), are to be used for preselecting the most relevant concepts to bias the training.

Fig. 1. Training example with similar noun phrases in reference and target text

Figure 2 shows the correlation between the number of paper views Nv and the title length Lt (in characters) in the collection. The top-viewed articles along the negative correlation formed the desired set of highly popular titles. The whole pool of imprints formed a generic dataset. A random split into training and validation sets (93/7) was carried out. A set of 1000 imprints with Nv = 1 and Lt > 100 was used for testing the models.
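The split and the test-set selection can be illustrated with a short sketch. The record fields `title` and `views`, the function names, and the fixed seed are assumptions made for this example; whether test imprints were excluded from the training pool is not stated in the text and is not modeled here.

```python
import random
from typing import Dict, List, Tuple

Imprint = Dict[str, object]  # hypothetical record: {"title": str, "views": int}


def split_train_valid(imprints: List[Imprint], train_share: float = 0.93,
                      seed: int = 0) -> Tuple[List[Imprint], List[Imprint]]:
    """Random 93/7 split into training and validation parts."""
    shuffled = list(imprints)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def select_test(imprints: List[Imprint]) -> List[Imprint]:
    """Imprints viewed exactly once (Nv = 1) with long titles (Lt > 100)."""
    return [x for x in imprints
            if x["views"] == 1 and len(str(x["title"])) > 100]


# Synthetic records only, to show the calls.
data = [{"title": "x" * (90 + i), "views": i % 3} for i in range(100)]
train, valid = split_train_valid(data)
print(len(train), len(valid), len(select_test(data)))
```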

Fig. 2. Dynamics for titles in the professional social network ResearchGate (Nv > 5, Lt > 20)

The texts were pre-processed on the fly, applying language detection with langid.py² and sentence detection with tokenization from NLTK³. Cleaning of training and vali-

Table 1. Distinctive details of basic and extended models

| Model | #layers | Dataset |
|---|---|---|
| M1 / M1ps | 1 | conditioned (Rt and Tt have at least 2 pairs of similar NPmw), 11K |
| M2 / M2ps | 1 | strongly conditioned (Rt and Tt have at least 1 pair of equal NPmw), 5.5K |
| M3 / M3ps | 1 | weakly conditioned (Rt and Tt have a common term), 66K |
| M4 / M4ps | 1 | top-views weakly conditioned (Rt and Tt have a common term), 18K |
| M5 / M5ps | 2 | weakly conditioned (Rt and Tt have a common term), 66K |
| M6 / M6ps | 2 | conditioned (Rt and Tt have at least 2 pairs of similar NPmw), 11K |

For the final model assessment, we used the measures BLEU (Papineni et al., 2002), ROUGE-1, ROUGE-2, ROUGE-L (Lin, 2004), and the specially designed NPdiff-p, i.e., an NPmw-based precision evaluated as rouge-L-p but considering only one occurrence of similar NPmw-s in a hypothesis. The intermediate models created at checkpoints during training were assessed, and the best ones by NPdiff-p were selected as the resulting models.
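For orientation, the standard ROUGE-L recall, precision, and F-score for a single title pair can be computed from the longest common subsequence (LCS) as in the sketch below. It covers only plain ROUGE-L; NPdiff-p is not specified in enough detail here to reproduce. The F-score is taken as the harmonic mean of precision and recall, which is an assumption (ROUGE implementations may weight the two differently), and whitespace tokenization is a simplification.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference: str, hypothesis: str) -> Tuple[float, float, float]:
    """Return (recall, precision, f) of ROUGE-L for one pair of texts."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, hyp)
    recall, precision = lcs / len(ref), lcs / len(hyp)
    f = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f


reference = ("A Study on Knowledge Management System for Knowledge "
             "Competitiveness with One Stop Knowledge Service")
hypothesis = "Knowledge Management System for Competitiveness with One Stop Service"
print(rouge_l(reference, hypothesis))
```

For the title pair from Table 2, every hypothesis token lies on a common subsequence with the reference, so precision is 1 while recall is 9/14. This illustrates why compressed titles tend to have high rouge-L-p and lower rouge-L-r.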

Table 2. Improvement of a title by post-processing step PS

Original title (reference): A Study on Knowledge Management System for Knowledge Competitiveness with One Stop Knowledge Service
Initial hypothesis before PS: knowledge management system for knowledge competitiveness with one stop knowledge service with one stop service with one stop service with one stop service with one stop service with one stop service with…
Resulting title: Knowledge Management System for Competitiveness with One Stop Service

Example titles from Fig. 3:
picking-eye: Addiction and the New Black?; The Romans Know?; Spain: a Focus
generic topics: Consumer Loyalty; Financial Cooperation
other titles shown: Active Learning for Biomedical Data Classification; Access to Specialist Medical Services: a Pilot Study

Table 3. ROUGE measures for inspected models

| model | rouge-1-r | rouge-1-p | rouge-1-f | rouge-2-r | rouge-2-p | rouge-2-f | rouge-L-r | rouge-L-p | rouge-L-f | rouge-L-f (basic Mn) |
|---|---|---|---|---|---|---|---|---|---|---|
| MBase | 0.60 | 1.00 | 0.74 | 0.54 | 1.00 | 0.69 | 0.60 | 1.00 | 0.66 | 0.64 |
| M1ps+F | 0.59 | 0.99 | 0.73 | 0.41 | 0.76 | 0.52 | 0.52 | 0.88 | 0.57 | 0.55 |
| M2ps+F | 0.58 | 0.98 | 0.72 | 0.42 | 0.78 | 0.53 | 0.53 | 0.89 | 0.58 | 0.53 |
| M3ps+F | 0.50 | 0.99 | 0.65 | 0.36 | 0.83 | 0.49 | 0.48 | 0.95 | 0.53 | 0.43 |
| M4ps+F | 0.52 | 1.00 | 0.67 | 0.34 | 0.75 | 0.46 | 0.47 | 0.89 | 0.52 | 0.43 |
| M5ps+F | 0.65 | 0.99 | 0.77 | 0.51 | 0.84 | 0.62 | 0.62 | 0.94 | 0.67 | 0.64 |
| M6ps+F | 0.50 | 1.00 | 0.65 | 0.38 | 0.89 | 0.52 | 0.48 | 0.96 | 0.52 | 0.48 |

Table 4. Top models according to the average rank given by assessors

| Model | Final Title | RAVG |
|---|---|---|
| MBase | Multisensor: Development of Multimedia Content Integration Technologies | 1.9 |
| M3ps | Multimedia Content Integration Technologies for Journalism | 3.7 |
| M5ps | Development of Multimedia Content Integration for Journalism, Media and International Exporting and Decision Support | 4.2 |
| MOrig | Multisensor: Development of multimedia content integration technologies for journalism, media monitoring and international exporting decision support | 4.3 |
| M6ps | Multimedia Content Integration Technologies for Journalism, Media | 4.4 |
| M4ps | Multimedia Content Integration for Journalism | 5.7 |

BIR Workshop on Bibliometric-enhanced Information Retrieval

¹ https://www.researchgate.net/
² https://github.com/saffsd/langid.py
³ https://www.nltk.org

Acknowledgments

The presented work was supported by the European Commission under the contract numbers H2020-700024-RIA, H2020-700475-IA, H2020-779962-RIA, H2020-786731-RIA, and H2020-825079-RIA and by the Russian Foundation for Basic Research under the contract number 18-37-00198. Many thanks to the four anonymous reviewers for their valuable comments, and to the five postdoctoral researchers for their high responsiveness in the evaluation and insightful feedback.

References

Haruna, K., Ismail, M.A., Damiasih, D., Sutopo, J., Herawan, T.: A collaborative approach for research paper recommender system. PloS one 12(10), e0184516 (2017)
Shvets, A., Devyatkin, D., Sochenkov, I., Tikhomirov, I., Popov, K., Yarygin, K.: Detection of current research directions based on full-text clustering. In: Science and Information Conference (SAI) 2015. IEEE (2015)
Biyani, P., Tsioutsiouliklis, K., Blackmer, J.: "8 Amazing Secrets for Getting More Clicks": Detecting Clickbaits in News Streams Using Article Informality. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
Bouayad-Agha, N., Casamayor, G., Ferraro, G., Wanner, L.: Simplification of patent claim sentences for their paraphrasing and summarization. In: 22nd FLAIRS Conference (2009)
Saggion, H., Štajner, S., Bott, S., Mille, S., Rello, L., Drndarevic, B.: Making it simplext: Implementation and evaluation of a text simplification system for spanish. ACM Transactions on Accessible Computing (TACCESS) 6(4), 14 (2015)
Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B.: Abstractive text summarization using sequence-to-sequence rnns and beyond. In: CoNLL 2016, 280 (2016)
Guo, H., Pasunuru, R., Bansal, M.: Dynamic Multi-Level Multi-Task Learning for Sentence Simplification. In: Proceedings of the 27th International Conference on Computational Linguistics (2018)
Nikolov, N.I., Pfeiffer, M., Hahnloser, R.H.: Data-driven Summarization of Scientific Articles. In: Proc. of the 7th International Workshop on Mining Scientific Publications, LREC 2018 (2018)
Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)
Gu, J., Lu, Z., Li, H., Li, V.O.: Incorporating copying mechanism in sequence-to-sequence learning. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016)
See, A., Liu, P.J., Manning, C.D.: Get to the point: Summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017)
Klein, G., Kim, Y., Deng, Y., Nguyen, V., Senellart, J., Rush, A.M.: OpenNMT: Neural Machine Translation Toolkit. In: Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (2018)
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., ..., Klingner, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
Osipov, G., Smirnov, I., Tikhomirov, I., Sochenkov, I., Shelmanov, A., Shvets, A.: Information retrieval for R&D support. In: Professional search in the modern world, 8830. Springer (2014)
Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing (2017)
Rush, A.M., Chopra, S., Weston, J.: In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (2002)
Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
Cao, Z., Luo, C., Li, W., Li, S.: Joint copying and restricted generation for paraphrase. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
Zhang, X., Lapata, M.: Sentence Simplification with Deep Reinforcement Learning. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017)
Štajner, S., Saggion, H.: Data-Driven Text Simplification. In: Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts (2018)
Belz, A., Bohnet, B., Pitler, E., Wanner, L., Mille, S.: The First Multilingual Surface Realisation Shared Task (SR'18): Overview and Evaluation Results (2018)
Byrne, J.A., Labbé, C.: Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines. Scientometrics 110(3) (2017)
Shvets, A.: A Method of Automatic Detection of Pseudoscientific Publications. In: Intelligent Systems'2014, 323 (2015)
Garfield, E., Sher, I.H.: Key Words Plus [TM]-Algorithmic Derivative Indexing. Journal-American Society For Information Science 44 (1993)