LDKP - A Dataset for Identifying Keyphrases from Long Scientific Documents

Debanjan Mahata 1,2,*,†, Navneet Agarwal 2,†, Dibya Gautam 2,†, Amardeep Kumar 3,†, Swapnil Parekh 4, Yaman Kumar Singla 5,2, Anish Acharya 6 and Rajiv Ratn Shah 2

1 Moody's Analytics, USA
2 MIDAS Labs, IIIT-Delhi, India
3 Instabase, India
4 New York University, USA
5 Adobe Media and Data Science Research (MDSR), India
6 University of Texas at Austin, USA

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Debanjan Mahata participated in this work as an Adjunct Faculty at IIIT-Delhi.
† These authors contributed equally.
$ debanjanmahata85@gmail.com (D. Mahata)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (≈ 8 sentences). This presents three challenges for real-world applications: i) human-written summaries are unavailable for most documents, ii) a vast majority of the documents are long, and iii) a high percentage of KPs are directly found beyond the limited context of the title and the abstract. Therefore, we release two extensive corpora mapping the KPs of ≈ 1.3M and ≈ 100K scientific articles to their fully extracted text and additional metadata, including publication venue, year, authors, field of study, and citations, to facilitate research on this real-world problem. Additionally, we benchmark and report the performance of different unsupervised as well as supervised algorithms for keyphrase extraction on long scientific documents. Our experiments show that formulating keyphrase extraction as a sequence tagging task with modern transformer language models capable of processing long text sequences, such as Longformer, has advantages over the traditional algorithms: it not only results in better performance in terms of F1 metrics but also learns to extract an optimal number of keyphrases from the input documents.

Keywords
keyphrase extraction, keyphrase generation, keyphrasification, automatic identification of keyphrases, long documents, longformer, language models

1. Introduction and Background

Identifying keyphrases (KPs) is a form of extreme summarization: given an input document, the task is to find a set of representative phrases that can effectively summarize it [1]. Over the last decade, we have seen an exponential increase in the velocity at which unstructured text is produced on the web, with the vast majority of it untagged or poorly tagged. KPs provide an effective way to search, summarize, tag, and manage these documents. Identifying KPs has proved useful as a preprocessing, pre-training [2], or supplementary task for search [3, 4, 5], recommendation systems [6], advertising [7], summarization [8], and opinion mining [9], to name a few. This has motivated researchers to explore machine learning algorithms for automatically mapping documents to a set of keyphrases, commonly referred to as the keyphrase extraction (KPE) task [10, 6] for extractive approaches and the keyphrase generation (KPG) task [11, 12] for generative approaches. Recently, it has also been referred to as keyphrasification [1].

Various algorithms have been proposed over time to solve the problem of identifying keyphrases from text documents; they can primarily be categorized into supervised and unsupervised approaches [18]. The majority of these approaches take an abstract (a summary) of a text document as the input and produce keyphrases as output.
However, in industrial applications across different domains such as advertising [19], search and indexing [20], finance [21], law [22], and many other real-world use cases, document summaries are not readily available. Moreover, most of the documents encountered in these applications are longer than 8 sentences (the average length of abstracts in KP datasets, see Table 1). We also find that a significant percentage of keyphrases (>18%) are directly found beyond the limited context of a document's title and abstract/summary. These constraints limit the potential of currently developed KPE and KPG algorithms to only theoretical pursuits.

Table 1: Characteristics of the proposed datasets compared to the existing datasets.

Dataset                   Size (no. of docs)   Long Documents   Avg no. of sentences   Avg no. of words   Present KPs   Absent KPs
SemEval 2017 [6]          0.5K                 ×                7.36                   176.13             42.01%        57.69%
KDD [13]                  0.75K                ×                8.05                   188.43             45.99%        54.01%
Inspec [14]               2K                   ×                5.45                   130.57             55.69%        44.31%
KP20K [11]                568K                 ×                7.42                   188.47             57.4%         42.6%
OAGKx [15]                22M                  ×                8.87                   228.50             52.7%         47.3%
NUS [16]                  0.21K                ✓                375.93                 7644.43            67.75%        32.25%
SemEval 2010 [10]         0.24K                ✓                319.32                 7434.52            42.01%        57.99%
Krapivin [17]             2.3K                 ✓                370.48                 8420.76            44.74%        52.26%
LDKP3K (S2ORC ← KP20K)    100K                 ✓                280.67                 6027.10            76.11%        23.89%
LDKP10K (S2ORC ← OAGKx)   1.3M                 ✓                194.76                 4384.58            63.65%        36.35%

Many previous studies have pointed out the constraints imposed on KPE algorithms by the short inputs and artificial nature of the available datasets [23, 24, 25, 26, 27]. In particular, Cano and Bojar [25], while explaining the limitations of their proposed algorithms, note that the title and the abstract may not carry sufficient topical information about the article, even when joined together. While most datasets in the domain of KPE consist of titles and abstracts [15], there have been some attempts at providing long document KP datasets as well (Table 1). Krapivin et al. [17] released 2,000 full-length scientific papers from the computer science domain. Kim et al. [10] in a SemEval-2010 challenge released a dataset containing 244 full scientific articles along with their author- and reader-assigned keyphrases. Nguyen and Kan [16] released 211 full-length scientific documents with multiple annotated keyphrases. All of these datasets were released more than a decade ago and were more suitable for the machine-learning models available back then. With today's deep learning paradigms like un/semi-supervised learning requiring Wikipedia-sized corpora (>6M articles), it becomes imperative to update the KPE and KPG tasks with similarly sized corpora.

In this work, we develop two large datasets (LDKP - Long Document Keyphrase) comprising 100K and 1.3M documents for identifying keyphrases from full-length scientific articles, along with their metadata information such as venue, year of publication, author information, inbound and outbound citations, and citation contexts, among others. We achieve this by mapping the existing KP20K [11] and OAGKx [15] corpora to the documents available in the S2ORC dataset [28]. We make the datasets publicly available on the Huggingface hub (Section 2.2) and also integrate their processing with the datasets¹ and transformerkp² libraries. We hope that researchers working in this area would acknowledge the shortcomings of the popularly used datasets and methods in KPE and KPG and devise exciting new approaches for overcoming the challenges related to identifying keyphrases from long documents and contexts beyond summaries. This would make the models more useful in practical real-world settings. We think that LDKP can also complement recent efforts towards creating suitable benchmarks [29] for evaluating methods being developed to understand and process long text sequences.

1 https://github.com/huggingface/datasets
2 transformerkp - a transformer based deep learning library for training and evaluating keyphrase extraction and generation algorithms, https://github.com/Deep-Learning-for-Keyphrase/transformerkp

2. Dataset

We propose two datasets resulting from the mapping of S2ORC with the KP20K and OAGKx corpora, respectively. Lo et al. [28] publicly released S2ORC as a huge corpus of 8.1M scientific documents. While it has full text and metadata (see Table 2), the corpus does not contain keyphrases. We took this as an opportunity to create a new corpus for identifying keyphrases from full-length scientific articles. Therefore, we took the KP20K and OAGKx scientific corpora, for which keyphrases were already available, and mapped them to their corresponding documents in S2ORC.

This is the first time in the keyphrase community that such a large number of full-length documents with comprehensive metadata information have been made publicly available for academic use. Here, we want to acknowledge another concurrent work [30] that looks at the task of keyphrase generation from a newly constructed corpus of long documents - FULLTEXTKP. However, they do not make the corpus publicly available, and the corpus is significantly smaller than ours, containing only ≈ 142K documents.

Table 2: Information available in the metadata of each scientific paper in the LDKP corpus.

Paper details     Paper Identifier   Citations and References
Paper ID          ArXiv ID           Outbound Citations
Title             ACL ID             Inbound Citations
Authors           PMC ID             Bibliography
Year              PubMed ID          References
Venue             MAG ID
Journal           DOI
Field of Study    S2 URL
We release two datasets, LDKP3K and LDKP10K, corresponding to KP20K and OAGKx, respectively. The first corpus consists of ≈ 100K long documents with keyphrases obtained by mapping KP20K to S2ORC. The KP20K corpus mainly contains the title, abstract and keyphrases of computer science research articles from online digital libraries like ACM Digital Library, ScienceDirect, and Wiley. Using S2ORC documents, we increase the average length of the documents in KP20K from 7.42 sentences to 280.67 sentences. This also increases the percentage of present keyphrases in the input text by 18.7%.

The second corpus, corresponding to OAGKx, consists of 1.3M full scientific articles from various domains with their corresponding keyphrases collected from academic graphs [31, 32]. The resulting corpus contains 194.7 sentences on average (up from 8.87 sentences), with a 10.95% increase in present keyphrases. The increase in the percentage of present keyphrases in both corpora when expanded to full-length articles clearly indicates that a significant chunk of the keyphrases occurs beyond the abstract. Since both datasets consist of a large number of documents, we present three versions of each dataset with the training data split into small, medium and large sizes, as given in Table 3. This was done to give researchers and practitioners with limited computing resources an opportunity to evaluate the performance of their methods on a smaller dataset.

Table 3: LDKP datasets with their train, validation and test dataset distributions.

                      LDKP3K (no. of docs)   LDKP10K (no. of docs)
Train    Small        20,000                 20,000
         Medium       50,000                 50,000
         Large        90,019                 1,296,613
Test                  3,413                  10,000
Validation            3,339                  10,000

2.1. Dataset Preparation

In the absence of any unique identifier shared across the datasets, we used the paper title to map documents in S2ORC to KP20K/OAGKx. This had its own set of challenges. For example, some papers in KP20K and OAGKx had unigram titles like "Editorial" or "Preface", and multiple papers can be found with the same title. We ignored all the papers with unigram and bigram titles and resolved the remaining title conflicts through manual verification. We also found that some of the keyphrases in the OAGKx and KP20K datasets were parsed incorrectly. Keyphrases that contain delimiters such as a comma (which is also used as the separator for the keyphrase list) had been broken down into two or more keyphrases, e.g., the keyphrase '2,4-dichlorophenoxyacetic acid' had been broken down into ['2', '4-dichlorophenoxyacetic acid']. In some cases, the publication year, page numbers, or DOI, e.g., 1999:14:555-558, were inaccurately added to the list of keyphrases. To solve this, we filtered out all the keyphrases that did not have any alphabetical characters in them.
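As a minimal sketch of this filtering heuristic (the function name and regular expression are ours, not taken from the released preprocessing code), the rule amounts to keeping only keyphrases that contain at least one alphabetical character:

```python
import re

def filter_keyphrases(keyphrases):
    """Keep only keyphrases containing at least one alphabetical character,
    dropping artifacts such as stray page ranges (e.g. '1999:14:555-558')."""
    return [kp.strip() for kp in keyphrases if re.search(r"[A-Za-z]", kp)]

# The page-range artifact is removed; genuine keyphrases survive.
print(filter_keyphrases(["keyphrase extraction", "1999:14:555-558",
                         "4-dichlorophenoxyacetic acid"]))
```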
Next, in order to facilitate the usage of particular sections in KPE algorithms, we standardized the section names across all the papers. The section names varied across different papers in the S2ORC dataset. For example, some papers have a section named "Introduction" while others have it as "1.Introduction", "I. Introduction", "I Introduction", etc. To deal with this problem, we replaced the unique section names with a common generic section name, like "introduction", across all the papers. We did this for the common sections, which include introduction, related work, conclusion, methodology, results and analysis.

In order to make the dataset useful for training a sequence tagging model, we also provide token-level tags in B-I-O format, as previously done in [33]. We marked all the words in the document belonging to the keyphrases as 'B' or 'I' depending on whether they are the first word of the keyphrase or not. Every other word, which is not part of a keyphrase, was tagged as 'O'. The ground truth keyphrases associated with the documents were identified by searching for the same string pattern in the document's text. The text is tokenized using a whitespace tokenizer, and a mapping between each token and its corresponding tag is provided as shown in Figure 1.

Figure 1: B-I-O tagged tokens from a random sample in the LDKP dataset, where 'B' marks the start of a keyphrase span, 'I' a token inside a keyphrase span, and 'O' a token outside any keyphrase span.
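The following is a minimal sketch of this tagging scheme, assuming whitespace tokenization and exact (lower-cased) string matching; the helper name is ours, and the released data may differ in normalization details:

```python
def bio_tag(text, keyphrases):
    """Assign B-I-O tags to whitespace tokens by exact matching of keyphrases."""
    tokens = text.split()
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for kp in keyphrases:
        kp_tokens = kp.lower().split()
        n = len(kp_tokens)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == kp_tokens:
                tags[i] = "B"                          # first token of the span
                tags[i + 1:i + n] = ["I"] * (n - 1)    # remaining tokens of the span
    return list(zip(tokens, tags))

print(bio_tag("We study keyphrase extraction from long documents",
              ["keyphrase extraction", "long documents"]))
# [('We','O'), ('study','O'), ('keyphrase','B'), ('extraction','I'),
#  ('from','O'), ('long','B'), ('documents','I')]
```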
The proposed datasets LDKP3k and LDKP10k are further divided into train, test and validation splits as shown in Table 3. For LDKP3k, these splits are based on the original KP20K dataset. For LDKP10k, we resorted to random sampling to create these splits, since OAGKx, the keyphrase dataset corresponding to LDKP10k, wasn't originally divided into train, test and validation splits. Figures 2 and 3 show the distribution of papers in terms of field of study across all the splits of the LDKP3k and LDKP10k datasets, respectively.

Figure 2: Distribution of fields of study for the train, test and validation splits of the LDKP3k dataset.

Figure 3: Distribution of fields of study for the train, test and validation splits of the LDKP10k dataset.

2.2. Dataset Usage

We make all the datasets publicly available on the Huggingface hub and enable programmatic access to the data using the datasets library. For example, Figure 4 shows sample code for downloading the LDKP3K dataset with the 'small' training data split. Similarly, the other configurations like 'medium' and 'large' can also be downloaded, each having a different size of training data but the same validation and test datasets. Figure 4 also shows how each split of the dataset can be accessed. Please refer to the Huggingface hub pages for LDKP3k and LDKP10k for detailed information about downloading and using the datasets.

Figure 4: Sample code for downloading the 'small' split of the LDKP3K dataset.

1. LDKP3K - https://huggingface.co/datasets/midas/ldkp3k
2. LDKP10K - https://huggingface.co/datasets/midas/ldkp10k

We also enable access to the datasets using the transformerkp library, which abstracts away the preprocessing steps and makes the data splits readily available to the user for the tasks of keyphrase extraction using sequence tagging and keyphrase generation using seq2seq methods with different transformer-based language models. Details on downloading and using the datasets with transformerkp for keyphrase extraction and generation can be found at https://deep-learning-for-keyphrase.github.io/transformerkp/how-to-guides/keyphrase-data/
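Along the lines of Figure 4, a minimal loading sketch with the Huggingface datasets library looks as follows (configuration names follow Table 3; the printed field names are whatever the hub version of the dataset exposes):

```python
from datasets import load_dataset

# Load LDKP3K with the 'small' training configuration; 'medium' and 'large'
# change only the size of the training split, not validation or test.
ldkp3k = load_dataset("midas/ldkp3k", "small")

print(ldkp3k)                      # DatasetDict with train/validation/test splits
print(ldkp3k["train"][0].keys())   # fields available for a single document
```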
3. Experiments

In this section, we evaluate several popular keyphrase extraction algorithms on the proposed LDKP3K and LDKP10K datasets, along with three other existing, smaller datasets in the scientific domain comprising full-length documents - Krapivin, SemEval-2010, and NUS. A majority of the previous works have reported scores for Krapivin, SemEval-2010, and NUS by only considering the title and abstract as the input. We report benchmark results and also discuss the comparative advantages of different algorithms to provide future research directions.

3.1. Unsupervised Methods

There are multiple unsupervised methods for extracting keyphrases from a document. We used the following popular statistical models: TF-IDF, KPMiner [34] and YAKE [35], and the following graph-based algorithms: TextRank [36], PositionRank [37], SingleRank [38], TopicRank [39], MultipartiteRank [40] and SGRank [41]. All the implementations were taken from the PKE toolkit [42], except SGRank, for which we used the implementation available in the textacy³ library. These algorithms first identify candidate keyphrases using lexical rules and then rank the candidates using either a statistical or a graph-based approach [1]. We directly report the performance scores of these methods on the test datasets (Tables 4 and 5).

3 https://github.com/chartbeat-labs/textacy

Table 4: Results on long document datasets using unsupervised graph-based models.

Method             Krapivin          NUS               SemEval-2010      LDKP3k            LDKP10k
                   F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10
PositionRank       0.042   0.052     0.060   0.086     0.074   0.098     0.059   0.062     0.052   0.061
TextRank           0.036   0.047     0.071   0.090     0.085   0.117     0.082   0.094     0.068   0.074
TopicRank          0.071   0.080     0.130   0.152     0.111   0.132     0.108   0.110     0.098   0.102
SingleRank         0.001   0.003     0.005   0.008     0.009   0.010     0.016   0.025     0.011   0.014
MultipartiteRank   0.103   0.107     0.150   0.193     0.116   0.145     0.129   0.110     0.104   0.106
TopicalPageRank    0.009   0.012     0.046   0.059     0.014   0.024     0.019   0.027     0.020   0.031
SGRank             0.140   0.131     0.195   0.203     0.177   0.201     0.138   0.128     0.136   0.132

Table 5: Results on long document datasets using unsupervised statistical models.

Method     Krapivin          NUS               SemEval-2010      LDKP3k            LDKP10k
           F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10
TF-IDF     0.033   0.052     0.063   0.111     0.062   0.070     0.093   0.099     0.072   0.080
KPMiner    0.125   0.151     0.169   0.212     0.155   0.181     0.164   0.152     0.151   0.142
YAKE       0.105   0.107     0.177   0.235     0.088   0.129     0.140   0.132     0.114   0.114
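As an illustration of how these extractors are typically driven, the sketch below shows the standard pke workflow on a single long document; it is not the exact evaluation script behind Tables 4 and 5, and the input path is a placeholder:

```python
import pke

# Read the full text of one long document (placeholder path).
full_text = open("paper_fulltext.txt", encoding="utf-8").read()

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input=full_text, language="en")
extractor.candidate_selection()      # lexical rules propose candidate phrases
extractor.candidate_weighting()      # graph-based ranking of the candidates
for keyphrase, score in extractor.get_n_best(n=10):
    print(keyphrase, score)
```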
3.2. Supervised Methods

For supervised keyphrase extraction, we report results for two traditional models, namely KEA [43] and WINGNUS [23], which treat keyphrase extraction as a binary classification task. A recent trend is to treat keyphrase extraction as a sequence tagging task [33, 2, 1]. Transformer-based language models like BERT [44], RoBERTa [45] and KBIR [2] have already been shown to achieve SOTA results on keyphrase extraction when only the title and abstract are taken as the input. However, all these models are limited to processing only 512 sub-word tokens. This led us to try Longformer [46], which can handle long sequences of text of up to 4,096 sub-word tokens. We acknowledge that there are several other recent models, such as [47, 48], which could also have been tried. We are interested in trying them in future work, along with training larger models on the large LDKP corpus.

Table 6: Results on long document datasets for supervised models.

Method                 Krapivin          NUS               SemEval-2010      LDKP3k            LDKP10k
                       F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10
KEA                    0.041   0.063     0.069   0.134     0.077   0.090     0.109   0.118     0.087   0.096
WINGNUS                0.059   0.151     0.057   0.085     0.059   0.152     0.099   0.109     0.093   0.102
longformer-base-4096   0.229   0.232     0.253   0.284     0.203   0.219     0.240   0.216     0.236   0.212
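A minimal sketch of the sequence-tagging formulation with Longformer using the Huggingface transformers library is shown below (three labels for B, I and O). It mirrors the setup described above rather than the exact training configuration behind Table 6; the freshly initialized classification head still needs to be fine-tuned on the LDKP training splits.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=3  # B, I, O
)

text = "We study keyphrase extraction from long scientific documents ..."
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, 3): per-token label scores
```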
3.3. Evaluation Metrics

We used F1@5 and F1@10 as our evaluation metrics [10]. Equations 1, 2 and 3 show how F1@k is calculated. Before evaluating, we lower-cased, stemmed, and removed punctuation from the ground truth as well as the predicted keyphrases, and used exact matching. Let Y denote the ground truth keyphrases and Ȳ = (ȳ_1, ȳ_2, ..., ȳ_m) denote the predicted keyphrases ordered by their quality of prediction. Then we can define the metrics as follows:

Precision@k = |Y ∩ Ȳ_k| / min{|Ȳ_k|, k}                                  (1)

Recall@k = |Y ∩ Ȳ_k| / |Y|                                               (2)

F1@k = (2 · Precision@k · Recall@k) / (Precision@k + Recall@k)           (3)

where Ȳ_k denotes the top k elements of the set Ȳ.
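A small sketch of Equations 1-3 in code (for brevity, normalization here is only lower-casing, whereas the evaluation above additionally stems and strips punctuation):

```python
def f1_at_k(gold, predicted, k):
    """F1@k with exact match between normalized gold keyphrases and the top-k predictions."""
    gold_set = {g.lower() for g in gold}
    top_k = [p.lower() for p in predicted[:k]]
    matches = len(gold_set & set(top_k))
    precision = matches / min(len(top_k), k) if top_k else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(["keyphrase extraction", "long documents"],
              ["long documents", "transformers", "keyphrase extraction",
               "datasets", "benchmarks"], k=5))  # ≈ 0.571
```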
3.4. Results

Unsupervised algorithms did not perform better than their supervised counterparts on long documents, as shown in Tables 4, 5, and 6. Among the unsupervised approaches, SGRank and KPMiner outperformed every other algorithm in the graph-based and statistical categories, respectively. One possible reason for the low performance of the other unsupervised techniques could be that, during the candidate generation and ranking phases, these models had to deal with more noise than what they have been tuned for. Table 7 shows the number of candidates generated by the strategies used by each of these algorithms. We can easily observe that most of the techniques generated a huge number of candidate keyphrases, which might have made the downstream ranking process challenging. On the other hand, we can see that both SGRank and KPMiner had strategies that significantly reduced the number of generated candidates and came up with a better set of keyphrases. The other algorithms might benefit from revisiting their pipelines, making the necessary changes for processing long documents, and tuning their heuristics to generate better quality candidates, which are then ranked to identify the keyphrases.

Table 7: Average number of candidate keyphrases generated by the supervised and unsupervised algorithms on the LDKP3K and LDKP10K datasets.

Algorithm          LDKP3K     LDKP10K
SGRank             86.96      85.56
TopicRank          636.02     520.81
PositionRank       678.65     547.66
TopicalPageRank    709.51     574.50
SingleRank         773.11     624.25
TextRank           773.11     624.25
MultipartiteRank   636.02     520.78
YAKE               2475.20    1965.73
TF-IDF             6472.93    4922.29
KPMiner            79.51      74.81
WINGNUS            659.47     544.91
KEA                2534.71    2032.84

For the supervised approaches, using Longformer in a sequence tagging setup proved to be the most promising technique, as shown by the performance reported in Table 6. Treating keyphrase extraction as a sequence tagging problem also automatically learns the optimal number of keyphrases to be predicted and helps to overcome the challenges faced by other strategies that have to deal with a large number of candidates, as discussed above. The Longformer model on average predicted 6.25 and 6.08 keyphrases for the LDKP10k and LDKP3k test sets, respectively.

4. Conclusion

In this work, we identified the shortage of corpora comprising long documents for training and evaluating keyphrase extraction and generation models. We created two very large corpora - LDKP3K and LDKP10K - comprising ≈ 100K and ≈ 1.3M documents, and made them publicly available. The results of keyphrase extraction on long documents with some of the existing unsupervised and supervised models clearly depict the challenging nature of the problem. We hope this would encourage researchers to innovate and propose new models capable of identifying high quality keyphrases from long multi-page documents.

References

[1] R. Meng, D. Mahata, F. Boudin, From fundamentals to recent advances: A tutorial on keyphrasification, in: European Conference on Information Retrieval, Springer, 2022, pp. 582–588.
[2] M. Kulkarni, D. Mahata, R. Arora, R. Bhowmik, Learning rich representation of keyphrases from text, arXiv preprint arXiv:2112.08547 (2021).
[3] D. K. Sanyal, P. K. Bhowmick, P. P. Das, S. Chattopadhyay, T. Santosh, Enhancing access to scholarly publications with surrogate resources, Scientometrics 121 (2019) 1129–1164.
[4] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E. Frank, Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (1999) 81–104.
[5] I. Y. Song, R. B. Allen, Z. Obradovic, M. Song, Keyphrase extraction-based query expansion in digital libraries, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'06), IEEE, 2006, pp. 202–209.
[6] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. McCallum, SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications, arXiv preprint arXiv:1704.02853 (2017).
[7] W.-t. Yih, J. Goodman, V. R. Carvalho, Finding advertising keywords on web pages, in: Proceedings of the 15th International Conference on World Wide Web, 2006, pp. 213–222.
[8] V. Qazvinian, D. Radev, A. Özgür, Citation summarization through keyphrase extraction, in: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 2010, pp. 895–903.
[9] G. Berend, Opinion expression mining by exploiting keyphrase extraction (2011).
[10] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, Automatic keyphrase extraction from scientific articles, Language Resources and Evaluation 47 (2013) 723–742.
[11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, Y. Chi, Deep keyphrase generation, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 582–592. URL: https://www.aclweb.org/anthology/P17-1054. doi:10.18653/v1/P17-1054.
[12] A. Swaminathan, H. Zhang, D. Mahata, R. Gosangi, R. Shah, A. Stent, A preliminary exploration of GANs for keyphrase generation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8021–8030.
[13] C. Caragea, F. A. Bulgarov, A. Godea, S. D. Gollapalli, Citation-enhanced keyphrase extraction from research papers: A supervised approach, in: EMNLP, 2014.
[14] A. Hulth, Improved automatic keyword extraction given more linguistic knowledge, in: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, Association for Computational Linguistics, USA, 2003, pp. 216–223. URL: https://doi.org/10.3115/1119355.1119383. doi:10.3115/1119355.1119383.
[15] E. Çano, OAGKX keyword generation dataset, 2019. URL: http://hdl.handle.net/11234/1-3062, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[16] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: D. H.-L. Goh, T. H. Cao, I. T. Sølvberg, E. Rasmussen (Eds.), Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 317–326.
[17] M. Krapivin, A. Autayeu, M. Marchese, E. Blanzieri, N. Segata, Keyphrases extraction from scientific documents: Improving machine learning approaches with natural language processing, volume 6102, 2010, pp. 102–111. doi:10.1007/978-3-642-13654-2_12.
[18] E. Papagiannopoulou, G. Tsoumakas, A review of keyphrase extraction, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2020) e1339.
[19] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1705–1715.
[20] W. Magdy, K. Darwish, Book search: indexing the valuable parts, in: Proceedings of the 2008 ACM Workshop on Research Advances in Large Digital Book Repositories, 2008, pp. 53–56.
[21] A. Gupta, V. Dengre, H. A. Kheruwala, M. Shah, Comprehensive review of text-mining applications in finance, Financial Innovation 6 (2020) 1–25.
[22] R. Bhargava, S. Nigwekar, Y. Sharma, Catchphrase extraction from legal documents using LSTM networks, in: FIRE (Working Notes), 2017, pp. 72–73.
[23] T. D. Nguyen, M.-T. Luong, WINGNUS: Keyphrase extraction utilizing document logical structure, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 166–169.
[24] K. S. Hasan, V. Ng, Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1262–1273.
[25] E. Cano, O. Bojar, Keyphrase generation: A text summarization struggle, arXiv preprint arXiv:1904.00110 (2019).
[26] Y. Gallina, F. Boudin, B. Daille, Large-scale evaluation of keyphrase extraction models, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, pp. 271–278.
[27] C. G. Kontoulis, E. Papagiannopoulou, G. Tsoumakas, Keyphrase extraction from scientific articles via extractive summarization, in: Proceedings of the Second Workshop on Scholarly Document Processing, 2021, pp. 49–55.
[28] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. S. Weld, S2ORC: The Semantic Scholar Open Research Corpus, arXiv preprint arXiv:1911.02782 (2019).
[29] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, et al., SCROLLS: Standardized comparison over long language sequences, arXiv preprint arXiv:2201.03533 (2022).
[30] K. Garg, J. R. Chowdhury, C. Caragea, Keyphrase generation beyond the boundaries of title and abstract, arXiv preprint arXiv:2112.06776 (2021).
[31] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. Hsu, K. Wang, An overview of Microsoft Academic Service (MAS) and applications, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 243–246.
[32] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.
[33] D. Sahrawat, D. Mahata, H. Zhang, M. Kulkarni, A. Sharma, R. Gosangi, A. Stent, Y. Kumar, R. R. Shah, R. Zimmermann, Keyphrase extraction as sequence labeling using contextualized embeddings, Advances in Information Retrieval 12036 (2020) 328.
[34] S. R. El-Beltagy, A. Rafea, KP-Miner: A keyphrase extraction system for English and Arabic documents, Information Systems 34 (2009) 132–144.
[35] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, A. Jatowt, YAKE! Keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257–289.
[36] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411.
[37] C. Florescu, C. Caragea, PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1105–1115.
[38] X. Wan, J. Xiao, CollabRank: towards a collaborative approach to single-document keyphrase extraction, in: Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 2008, pp. 969–976.
[39] A. Bougouin, F. Boudin, B. Daille, TopicRank: Graph-based topic ranking for keyphrase extraction, in: International Joint Conference on Natural Language Processing (IJCNLP), 2013, pp. 543–551.
[40] F. Boudin, Unsupervised keyphrase extraction with multipartite graphs, arXiv preprint arXiv:1803.08721 (2018).
[41] S. Danesh, T. Sumner, J. H. Martin, SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction, in: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, 2015, pp. 117–126.
[42] F. Boudin, pke: an open source python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 69–73.
[43] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, KEA: Practical automated keyphrase extraction, in: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, IGI Global, 2005, pp. 129–152.
[44] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[46] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, CoRR abs/2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150. arXiv:2004.05150.
[47] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
[48] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=rkgNKkHtvB.