=Paper= {{Paper |id=Vol-2621/CIRCLE20_27 |storemode=property |title=The DELICES Project: Indexing Scientific Literature Through Semantic Expansion |pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_27.pdf |volume=Vol-2621 |authors=Florian Boudin,Béatrice Daille,Evelyne Jacquey,Jian-Yun Nie |dblpUrl=https://dblp.org/rec/conf/circle/BoudinDJN20 }} ==The DELICES Project: Indexing Scientific Literature Through Semantic Expansion== https://ceur-ws.org/Vol-2621/CIRCLE20_27.pdf
      The DELICES project: Indexing scientific literature through
                        semantic expansion
                                Florian Boudin                                                                Béatrice Daille
                         LS2N, Université de Nantes                                                  LS2N, Université de Nantes
                                Nantes, France                                                              Nantes, France
                       florian.boudin@univ-nantes.fr                                                beatrice.daille@univ-nantes.fr

                               Evelyne Jacquey                                                                  Jian-Yun Nie
                        CNRS, Université de Lorraine                                                RALI, Université de Montréal
                               Nancy, France                                                             Montréal, Canada
                          evelyne.jacquey@atilf.fr                                                     nie@iro.umontreal.ca

ABSTRACT
Scientific digital libraries play a critical role in the development and dissemination of scientific literature. Despite dedicated search engines, retrieving relevant publications from the ever-growing body of scientific literature remains challenging and time-consuming. Indexing scientific articles is indeed a difficult matter, and current models rely solely on a small portion of the articles (title and abstract) and on author-assigned keyphrases when available. This results in frustratingly limited access to scientific knowledge. The goal of the DELICES project is to address this pitfall by exploiting semantic relations between scientific articles to both improve and enrich indexing. To this end, we will rely on the latest advances in semantic representations to both increase the relevance of keyphrases extracted from the documents, and extend indexing to new terms borrowed from semantically similar documents.

CCS CONCEPTS
• Information systems → Digital libraries and archives; Information retrieval.

1    INTRODUCTION
Scientific digital libraries (e.g. arXiv, ACM Digital Library) play a critical role in the development and dissemination of scientific literature. They provide researchers with access to millions of scientific articles, as well as an effective way to communicate their findings. The picture, however, is not so rosy when it comes to searching and navigating through this ever-growing body of scientific literature. Indeed, even with dedicated search engines, retrieving relevant publications is becoming increasingly challenging and time-consuming. There are two main reasons for this situation. First and foremost, the global explosion of the amount of scientific output is overwhelming researchers with an unsustainable torrent of publications. The exponential growth of the number of submissions in arXiv is a very compelling illustration of that issue.¹

The second reason is that the current practice in indexing scientific articles typically relies on a small fraction of their content (title and abstract), which makes it ineffective [10]. This is not only due to licensing issues, in the case of commercial publishers, but also to resource limitations in the case of public and preprint repositories [6]. One straightforward solution to alleviate this problem is to supplement paper indexing with keyphrases, that is, single or multi-word lexical units that capture the main topics of a document [1, 4, 11]. Most documents, however, do not come with associated keyphrases, and manual annotation is simply not a feasible option [7]. Needless to say, there is an acute need for automated keyphrase assignment models, a task to which researchers in both information retrieval and natural language processing are now devoting their efforts.

Despite the higher performance brought by the use of neural network architectures, current models for identifying keyphrases still achieve fairly low precision scores [8, 12]. The primary reason behind this is their inability to accurately produce keyphrases that do not occur in the documents. Those so-called absent keyphrases are especially valuable for indexing documents, accounting for about half of the manually assigned keyphrases [5]. Overlooking these absent keyphrases ultimately results in relevant documents that are not retrieved, thus preventing a thorough exploration of the scientific knowledge. The goal of the DELICES project is to provide a solution to this critical problem by exploiting the relationships between scientific articles to improve and enrich indexing. More precisely, as depicted in Figure 1, we will rely on automatically detected, semantically similar documents to both enhance the weighting scheme for keyphrases that occur in the documents and extend the index with new terms borrowed from similar documents.

¹ https://arxiv.org/stats/monthly_submissions

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1 diagram; recoverable labels: document collection, find semantically similar documents, new document, modelization, graph-based representation, enhance, expand, ranking, index]
Figure 1: Overall architecture of the indexing model.
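The steps named in Figure 1 (find similar documents, enhance the graph, expand it with new terms, rank) can be illustrated with a minimal Python sketch. It is not the DELICES model: the Jaccard token overlap used as a similarity measure, the sliding-window co-occurrence graph, the weighted PageRank scoring, and all function names and parameters (cooccurrence_edges, build_graph, rank, k, window, damping) are assumptions chosen for illustration, loosely in the spirit of the neighborhood-based extraction cited as [9]. Single words stand in for keyphrase candidates and no stopword filtering is applied.

```python
from collections import defaultdict


def cooccurrence_edges(tokens, window=3):
    """Yield unordered word pairs that share a sliding window of `window` tokens."""
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                yield tuple(sorted((left, right)))


def jaccard(a, b):
    """Toy token-overlap similarity (assumed stand-in for a semantic model)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_graph(doc, collection, k=2, window=3):
    """Weighted word graph of `doc`, enriched with its k most similar documents."""
    weights = defaultdict(float)
    for edge in cooccurrence_edges(doc, window):
        weights[edge] += 1.0  # co-occurrences in the document count fully
    neighbours = sorted(collection, key=lambda d: jaccard(doc, d), reverse=True)[:k]
    for other in neighbours:
        sim = jaccard(doc, other)
        if sim == 0.0:
            continue  # unrelated documents contribute nothing
        for edge in cooccurrence_edges(other, window):
            # Existing edges are strengthened ("enhance"); edges to words
            # unseen in `doc` are introduced ("expand").
            weights[edge] += sim
    return weights


def rank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on an undirected graph."""
    adjacency = defaultdict(dict)
    for (u, v), w in weights.items():
        adjacency[u][v] = w
        adjacency[v][u] = w
    nodes = list(adjacency)
    score = {n: 1.0 / len(nodes) for n in nodes}
    strength = {n: sum(adjacency[n].values()) for n in nodes}
    for _ in range(iterations):
        score = {
            n: (1 - damping) / len(nodes)
            + damping * sum(score[m] * w / strength[m] for m, w in adjacency[n].items())
            for n in nodes
        }
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one new document and a three-document collection.
doc = "graph based ranking of keyphrase candidates".split()
collection = [
    "graph based ranking for keyphrase extraction".split(),
    "neural keyphrase generation".split(),
    "protein folding simulation".split(),
]
ranking = rank(build_graph(doc, collection))
present = [word for word, _ in ranking if word in doc]
absent = [word for word, _ in ranking if word not in doc]
```

Because present and absent candidates are scored in the same graph, they end up in a single ranking; how much weight the neighbouring documents receive (here, their raw similarity) is the knob that controls how the two kinds of candidates are interleaved.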
The unique nature of the DELICES project comes from the underlying graph-based ranking model that sits at its core, which allows for the flexible integration of prior domain knowledge that is required to accurately predict absent keyphrases. This is motivated by the fact that current neural-based models not only have difficulties harnessing such knowledge, but also exhibit poor generalization performance across domains, which severely limits their use in practice [2]. Another important aspect of this project is the automatic acquisition of the prior domain knowledge through the compilation of sets of semantically related documents. These sets will enable us to enrich the graph representations of the documents, facilitating the overall ranking process [9] while allowing the prediction of absent keyphrases that only occur in related documents. The rest of this paper discusses the two main challenges we aim to address in DELICES, and concludes with a summary of the work already accomplished since the project started.

2 CHALLENGES
2.1 Improving keyphrase ranking
The content of scientific papers alone is often not sufficient to effectively rank keyphrases [5, 9] – an issue that is exacerbated by the limited availability of full-text articles. It is thus essential that external knowledge be made accessible to the model so that it can make better predictions. That being said, selecting the appropriate amount of knowledge and integrating it into the keyphrase generation model is not straightforward, as this process can easily cause the indexing to drift away from the original content of the documents if not done carefully. To avoid this pitfall, we will seek to identify precisely where, in the graph representation, more information is needed, and devise a fine-grained enrichment mechanism that can efficiently and reliably improve the overall ranking of keyphrases. A first step in this direction would be to act on the weakly connected components of the graph by introducing and strengthening edges between keyphrase candidates that co-occur within the related documents.

2.2    Generating absent keyphrases
As stated earlier, absent keyphrases are of utmost importance when indexing scientific articles [5, 8] – they act as a means for expanding documents, and thus alleviate the “vocabulary mismatch” problem between query terms and relevant documents. Predicting appropriate absent keyphrases is obviously a very challenging task because of the unrestricted search space: every possible term can be considered as a keyphrase. It therefore comes as no surprise that recent neural keyphrase generation models still achieve fairly low accuracies on that task despite being trained on large amounts of annotated data [8, 12]. Furthermore, their predictions are bound to a fixed-size output vocabulary built from the set of gold standard keyphrases, which means that they can only produce keyphrases that were already assigned to other documents. With this in mind, we are looking to move away from these data-hungry models, and study how absent keyphrases could be inferred from related documents present in the entire collection. Here we argue that carefully selected sets of semantically related documents are the right spot for mining new, yet likely to be relevant, indexing terms. To verify this claim, a straightforward approach would be to expand the document representation with these new terms in order for the model to predict absent keyphrases. The main difficulty with this approach would be to precisely control how present and absent keyphrases are interleaved in the overall ranking. We believe that jointly encoding the document and domain knowledge into a multigraph data structure would allow for a finer-grained weighting of the keyphrases.

3     PRELIMINARY RESULTS
One critical issue with the current literature on keyphrase assignment is the lack of a unified evaluation methodology, which makes direct comparison between previous models impossible. In an attempt to solve that issue, we conducted the first large-scale analysis of state-of-the-art keyphrase extraction models involving multiple benchmark datasets from various sources and domains [3]. Our main results reveal that existing models are still challenged by simple baselines on some datasets, and yield insights about the negative impact of using author-assigned keyphrases as a proxy for gold standard. We also provide specific recommendations on which baselines and datasets should be included in future work.

Neural models for keyphrase generation do not generalize well across domains, but the extent to which their performance is degraded is not clearly understood, as only one sufficiently large training dataset was available [8]. To fill that gap, we collected a new large-scale dataset, namely KPTimes, composed of news texts paired with editor-curated keyphrases. Using it, we provided an in-depth analysis of the performance of state-of-the-art neural models and investigated their transferability to the news domain as well as the impact of domain shift [2].

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their valuable comments. This work was supported by the French National Research Agency (ANR) through the DELICES project (ANR-19-CE38-0005-01).

REFERENCES
 [1] Frances H Barker, Douglas C Veal, and Barry K Wyatt. 1972. Comparative efficiency of searching titles, abstracts, and index terms in a free-text data base. Journal of Documentation 28, 1 (1972), 22–36.
 [2] Ygor Gallina, Florian Boudin, and Béatrice Daille. 2019. KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents. In Proceedings of INLG. 130–135.
 [3] Ygor Gallina, Florian Boudin, and Béatrice Daille. 2020. Large-Scale Evaluation of Keyphrase Extraction Models. In Proceedings of JCDL.
 [4] Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. 1999. Improving Browsing in Digital Libraries with Keyphrase Indexes. Decis. Support Syst. 27, 1-2 (Nov. 1999), 81–104.
 [5] Kazi Saidul Hasan and Vincent Ng. 2014. Automatic Keyphrase Extraction: A Survey of the State of the Art. In Proceedings of ACL. 1262–1273.
 [6] Chien-yu Huang, Arlene Casey, Dorota Głowacka, and Alan Medlar. 2019. Holes in the Outline: Subject-Dependent Abstract Quality and Its Implications for Scientific Literature Search. In Proceedings of CHIIR. 289–293.
 [7] Yuqing Mao and Zhiyong Lu. 2017. MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. Journal of Biomedical Semantics 8, 1 (2017).
 [8] Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep Keyphrase Generation. In Proceedings of ACL. 582–592.
 [9] Xiaojun Wan and Jianguo Xiao. 2008. Single Document Keyphrase Extraction Using Neighborhood Knowledge. In Proceedings of AAAI. 855–860.
[10] F. Xia, W. Wang, T. M. Bekele, and H. Liu. 2017. Big Scholarly Data: A Survey. IEEE Transactions on Big Data 3, 1 (March 2017), 18–35.
[11] Chengxiang Zhai. 1997. Fast Statistical Parsing of Noun Phrases for Document Indexing. In Proceedings of ANLP. 312–319.
[12] Jing Zhao and Yuxiang Zhang. 2019. Incorporating Linguistic Constraints into Keyphrase Generation. In Proceedings of ACL. 5224–5233.