=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_27
|storemode=property
|title=The DELICES Project: Indexing Scientific Literature Through Semantic Expansion
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_27.pdf
|volume=Vol-2621
|authors=Florian Boudin,Béatrice Daille,Evelyne Jacquey,Jian-Yun Nie
|dblpUrl=https://dblp.org/rec/conf/circle/BoudinDJN20
}}
==The DELICES Project: Indexing Scientific Literature Through Semantic Expansion==
Florian Boudin (LS2N, Université de Nantes, Nantes, France; florian.boudin@univ-nantes.fr), Béatrice Daille (LS2N, Université de Nantes, Nantes, France; beatrice.daille@univ-nantes.fr), Evelyne Jacquey (CNRS, Université de Lorraine, Nancy, France; evelyne.jacquey@atilf.fr), Jian-Yun Nie (RALI, Université de Montréal, Montréal, Canada; nie@iro.umontreal.ca)

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

===Abstract===
Scientific digital libraries play a critical role in the development and dissemination of scientific literature. Despite dedicated search engines, retrieving relevant publications from the ever-growing body of scientific literature remains challenging and time-consuming. Indexing scientific articles is indeed a difficult matter, and current models rely solely on a small portion of the articles (title and abstract) and on author-assigned keyphrases when available. This results in frustratingly limited access to scientific knowledge. The goal of the DELICES project is to address this pitfall by exploiting semantic relations between scientific articles to both improve and enrich indexing. To this end, we will rely on the latest advances in semantic representations to both increase the relevance of keyphrases extracted from the documents and extend indexing to new terms borrowed from semantically similar documents.

===CCS Concepts===
• Information systems → Digital libraries and archives; Information retrieval.

===1 Introduction===
Scientific digital libraries (e.g. arXiv, ACM Digital Library) play a critical role in the development and dissemination of scientific literature. They provide researchers with access to millions of scientific articles, as well as an effective way to communicate their findings. The picture, however, is not so rosy when it comes to searching and navigating through this ever-growing body of scientific literature. Indeed, even with dedicated search engines, retrieving relevant publications is becoming increasingly challenging and time-consuming. There are two main reasons for this situation. First and foremost, the global explosion in scientific output is overwhelming researchers with an unsustainable torrent of publications. The exponential growth of the number of submissions to arXiv is a very compelling illustration of that issue (see https://arxiv.org/stats/monthly_submissions).

The second reason is that the current practice in indexing scientific articles typically relies on a small fraction of their content (title and abstract), which makes it ineffective [10]. This is due not only to licensing issues, in the case of commercial publishers, but also to resource limitations, in the case of public and preprint repositories [6]. One straightforward solution to alleviate this problem is to supplement paper indexing with keyphrases, that is, single or multi-word lexical units that capture the main topics of a document [1, 4, 11]. Most documents, however, do not come with associated keyphrases, and manual annotation is simply not a feasible option [7]. Needless to say, there is an acute need for automated keyphrase assignment models, a task to which researchers in both information retrieval and natural language processing are now devoting their efforts.

Despite the higher performance brought by the use of neural network architectures, current models for identifying keyphrases still achieve fairly low precision scores [8, 12]. The primary reason behind this is their inability to accurately produce keyphrases that do not occur in the documents. These so-called absent keyphrases are especially valuable for indexing documents, accounting for about half of the manually assigned keyphrases [5]. Overlooking these absent keyphrases ultimately results in relevant documents that are not retrieved, thus preventing a thorough exploration of scientific knowledge. The goal of the DELICES project is to provide a solution to this critical problem by exploiting the relationships between scientific articles to improve and enrich indexing. More precisely, as depicted in Figure 1, we will rely on automatically detected, semantically similar documents to both enhance the weighting scheme for keyphrases that occur in the documents and extend the index with new terms borrowed from similar documents.

[Figure 1: Overall architecture of the indexing model. Elements shown: document collection; new document; find semantically similar documents; graph-based representation (modelization); enhance; expand; ranking; index.]

The unique nature of the DELICES project comes from the underlying graph-based ranking model that sits at its core, which allows for the flexible integration of the prior domain knowledge required to accurately predict absent keyphrases. This is motivated by the fact that current neural models not only have difficulties harnessing such knowledge, but also exhibit poor generalization performance across domains, which severely limits their use in practice [2]. Another important aspect of this project is the automatic acquisition of the prior domain knowledge through the compilation of sets of semantically related documents. These sets will enable us to enrich the graph representations of the documents, facilitating the overall ranking process [9] while allowing the prediction of absent keyphrases that only occur in related documents. The rest of this paper discusses the two main challenges we aim to address in DELICES, and concludes with a summary of the work accomplished since the project started.

===2 Challenges===
====2.1 Improving keyphrase ranking====
The content of scientific papers alone is often not sufficient to effectively rank keyphrases [5, 9], an issue that is exacerbated by the limited availability of full-text articles. It is thus essential that external knowledge be made accessible to the model so that it can make better predictions. That being said, selecting the appropriate amount of knowledge and integrating it into the keyphrase generation model is not straightforward, as this process can easily cause the indexing to drift away from the original content of the documents if not done carefully. To avoid this pitfall, we will seek to identify precisely where, in the graph representation, more information is needed, and devise a fine-grained enrichment mechanism that can efficiently and reliably improve the overall ranking of keyphrases. A first step in this direction would be to act on the weakly connected components of the graph by introducing and strengthening edges between keyphrase candidates that co-occur within the related documents.

====2.2 Generating absent keyphrases====
As stated earlier, absent keyphrases are of utmost importance when indexing scientific articles [5, 8]: they act as a means of expanding documents, and thus alleviate the "vocabulary mismatch" problem between query terms and relevant documents. Predicting appropriate absent keyphrases is a very challenging task because of the unrestricted search space: every possible term can be considered as a keyphrase. It therefore comes as no surprise that recent neural keyphrase generation models still achieve fairly low accuracies on that task despite being trained on large amounts of annotated data [8, 12]. Furthermore, their predictions are bound to a fixed-size output vocabulary built from the set of gold standard keyphrases, which means that they can only produce keyphrases that were already assigned to other documents. With this in mind, we are looking to move away from these data-hungry models, and study how absent keyphrases could be inferred from related documents present in the entire collection. Here we argue that carefully selected sets of semantically related documents are the right place to mine new indexing terms that are likely to be relevant. To verify this claim, a straightforward approach would be to expand the document representation with these new terms in order for the model to predict absent keyphrases. The main difficulty with this approach would be to precisely control how present and absent keyphrases are interleaved in the overall ranking. We believe that jointly encoding the document and domain knowledge into a multi-graph data structure would allow for a finer-grained weighting of the keyphrases.

===3 Preliminary Results===
One critical issue with the current literature on keyphrase assignment is the lack of a unified evaluation methodology, which makes direct comparison between previous models impossible. In an attempt to solve that issue, we conducted the first large-scale analysis of state-of-the-art keyphrase extraction models involving multiple benchmark datasets from various sources and domains [3]. Our main results reveal that existing models are still challenged by simple baselines on some datasets, and yield insights about the negative impact of using author-assigned keyphrases as a proxy for a gold standard. We also provide specific recommendations on which baselines and datasets should be included in future work.

Neural models for keyphrase generation do not generalize well across domains, but the extent to which their performance degrades is not clearly understood, as only one sufficiently large training dataset was available [8]. To fill that gap, we collected a new large-scale dataset, namely KPTimes, composed of news texts paired with editor-curated keyphrases. Using it, we provided an in-depth analysis of the performance of state-of-the-art neural models and investigated their transferability to the news domain as well as the impact of domain shift [2].

===Acknowledgments===
We thank the anonymous reviewers for their valuable comments. This work was supported by the French National Research Agency (ANR) through the DELICES project (ANR-19-CE38-0005-01).

===References===
[1] Frances H Barker, Douglas C Veal, and Barry K Wyatt. 1972. Comparative efficiency of searching titles, abstracts, and index terms in a free-text data base. Journal of Documentation 28, 1 (1972), 22–36.

[2] Ygor Gallina, Florian Boudin, and Béatrice Daille. 2019. KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents. In Proceedings of INLG. 130–135.

[3] Ygor Gallina, Florian Boudin, and Béatrice Daille. 2020. Large-Scale Evaluation of Keyphrase Extraction Models. In Proceedings of JCDL.

[4] Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. 1999. Improving Browsing in Digital Libraries with Keyphrase Indexes. Decis. Support Syst. 27, 1-2 (Nov. 1999), 81–104.

[5] Kazi Saidul Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of ACL. 1262–1273.

[6] Chien-yu Huang, Arlene Casey, Dorota Głowacka, and Alan Medlar. 2019. Holes in the Outline: Subject-Dependent Abstract Quality and Its Implications for Scientific Literature Search. In Proceedings of CHIIR. 289–293.

[7] Yuqing Mao and Zhiyong Lu. 2017. MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. Journal of Biomedical Semantics 8, 1 (2017).

[8] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep Keyphrase Generation. In Proceedings of ACL. 582–592.

[9] Xiaojun Wan and Jianguo Xiao. 2008. Single Document Keyphrase Extraction Using Neighborhood Knowledge. In Proceedings of AAAI. 855–860.

[10] F. Xia, W. Wang, T. M. Bekele, and H. Liu. 2017. Big Scholarly Data: A Survey. IEEE Transactions on Big Data 3, 1 (March 2017), 18–35.

[11] Chengxiang Zhai. 1997. Fast Statistical Parsing of Noun Phrases for Document Indexing. In Proceedings of ANLP. 312–319.

[12] Jing Zhao and Yuxiang Zhang. 2019. Incorporating Linguistic Constraints into Keyphrase Generation. In Proceedings of ACL. 5224–5233.
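
As an illustration of the enrichment idea discussed in Sections 2.1 and 2.2, the following is a minimal sketch and not the project's implementation. It assumes each document has already been reduced to an ordered list of keyphrase candidates (candidate extraction and the search for semantically similar documents are not shown), and it swaps in a TextRank-style weighted PageRank as the ranking function, since the paper does not specify one. The function names, the co-occurrence window, and the `related_weight` value of 0.5 are hypothetical choices made for the example.

```python
from collections import defaultdict


def cooccurrence_pairs(candidates, window=2):
    """Yield pairs of distinct candidates at most `window` positions apart."""
    for i, left in enumerate(candidates):
        for right in candidates[i + 1:i + 1 + window]:
            if left != right:
                yield left, right


def build_graph(document, related_documents, related_weight=0.5, window=2):
    """Build an undirected weighted candidate graph for `document`, then
    enrich it with co-occurrence edges observed in the related documents.

    Assumption: edges from related documents count less (related_weight)
    than edges from the document itself, to limit drift from its content.
    """
    graph = defaultdict(lambda: defaultdict(float))
    sources = ((1.0, [document]), (related_weight, related_documents))
    for weight, docs in sources:
        for doc in docs:
            for left, right in cooccurrence_pairs(doc, window):
                # New edges are introduced, existing ones are strengthened.
                graph[left][right] += weight
                graph[right][left] += weight
    return graph


def rank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (a stand-in ranking function)."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    total_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / len(nodes)
            + damping * sum(
                scores[neighbor] * weight / total_weight[neighbor]
                for neighbor, weight in graph[node].items()
            )
            for node in nodes
        }
    return scores


# Toy data (hypothetical): one document and two related documents.
document = ["keyphrase extraction", "graph ranking", "scientific articles"]
related = [
    ["keyphrase extraction", "document indexing", "graph ranking"],
    ["document indexing", "scientific articles", "digital libraries"],
]

graph = build_graph(document, related)
scores = rank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)

# Candidates that never occur in the document are the "absent" keyphrases.
absent = [candidate for candidate in ranking if candidate not in set(document)]
print(ranking)
print(absent)
```

Because present and absent candidates live in the same graph, they end up in a single ranking; lowering `related_weight` is one simple way to control how strongly terms borrowed from related documents compete with terms from the document itself.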