<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LDKP - A Dataset for Identifying Keyphrases from Long Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debanjan Mahata</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Navneet Agarwal</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dibya Gautam</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amardeep Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swapnil Parekh</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaman Kumar Singla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anish Acharya</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajiv Ratn Shah</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adobe Media and Data Science Research (MDSR)</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instabase</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MIDAS Labs</institution>
          ,
          <addr-line>IIIT-Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Moody's Analytics</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>New York University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2022</year>
      </pub-date>
      <abstract>
<p>Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (≈ 8 sentences). This presents three challenges for real-world applications: i) human-written summaries are unavailable for most documents, ii) a vast majority of the documents are long, and iii) a high percentage of KPs are found only beyond the limited context of the title and the abstract. Therefore, we release two extensive corpora mapping KPs of ≈ 1.3M and ≈ 100K scientific articles to their fully extracted text and additional metadata, including publication venue, year, author, field of study, and citations, to facilitate research on this real-world problem. Additionally, we benchmark and report the performance of different unsupervised as well as supervised algorithms for keyphrase extraction on long scientific documents. Our experiments show that formulating keyphrase extraction as a sequence tagging task with modern transformer language models capable of processing long text sequences, such as Longformer, has advantages over the traditional algorithms, not only resulting in better performance in terms of F1 metrics but also in learning to extract the optimal number of keyphrases from the input documents.</p>
      </abstract>
      <kwd-group>
<kwd>keyphrase extraction</kwd>
        <kwd>keyphrase generation</kwd>
        <kwd>keyphrasification</kwd>
        <kwd>automatic identification of keyphrases</kwd>
        <kwd>long documents</kwd>
        <kwd>longformer</kwd>
        <kwd>language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
<p>Identifying keyphrases (KPs) is a form of extreme summarization: given an input document, the task is to find a set of representative phrases that can effectively summarize it [1]. Over the last decade, we have seen an exponential increase in the velocity at which unstructured text is produced on the web, with the vast majority of it untagged or poorly tagged. KPs provide an effective way to search, summarize, tag, and manage these documents. Identifying KPs has proved to be useful as a preprocessing, pre-training [2], or supplementary task in other tasks such as search [3, 4, 5], recommender systems [6], advertising [7], summarization [8], and opinion mining [9], to name a few. This has motivated researchers to explore machine learning algorithms for automatically mapping documents to a set of keyphrases, commonly referred to as the keyphrase extraction (KPE) task [10, 6] for extractive approaches and the keyphrase generation (KPG) task [11, 12] for generative approaches; recently, this has also been referred to as keyphrasification [1].</p>
      <p>Various algorithms have been proposed over time to solve the problem of identifying keyphrases from text documents; they can primarily be categorized into supervised and unsupervised approaches [18]. The majority of these approaches take an abstract (a summary) of a text document as input and produce keyphrases as output. However, in industrial applications across different domains such as advertising [19], search and indexing [20], finance [21], law [22], and many other real-world use cases, document summaries are not readily available. Moreover, most of the documents encountered in these applications are longer than 8 sentences (the average length of abstracts in KP datasets; see Table 1). We also find that a significant percentage of keyphrases (&gt;18%) are found only beyond the limited context of a document's title and abstract/summary. These constraints limit the potential of currently developed KPE and KPG algorithms to theoretical pursuits.</p>
      <p>DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. * Debanjan Mahata participated in this work as an Adjunct Faculty at IIIT-Delhi. † These authors contributed equally. $ debanjanmahata85@gmail.com (D. Mahata). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
<p>[Table 1: Comparison of keyphrase datasets. SemEval 2017 [6], KDD [13], Inspec [14], KP20K [11], and OAGKx [15] provide only titles and abstracts, whereas NUS [16], SemEval 2010 [10], Krapivin [17], LDKP3K (S2ORC ← KP20K, ≈ 100K documents), and LDKP10K (S2ORC ← OAGKx, ≈ 1.3M documents) contain long documents.]</p>
      <sec id="sec-1-3">
<title>Long Documents</title>
        <p>[Table 1, continued: average document lengths of 280.67 and 194.76 sentences (6027.10 and 4384.58 words) and present-keyphrase shares of 76.11% and 63.65% (absent: 23.89% and 36.35%) for LDKP3K and LDKP10K, respectively.]</p>
<p>Many previous studies have pointed out the constraints imposed on KPE algorithms by the short inputs and artificial nature of the available datasets [23, 24, 25, 26, 27]. In particular, Cano and Bojar [25], while explaining the limitations of their proposed algorithms, note that the title and the abstract may not carry sufficient topical information about the article, even when joined together. While most datasets in the domain of KPE consist of titles and abstracts [15], there have been some attempts at providing long-document KP datasets as well (Table 1).</p>
<p>Krapivin et al. [17] released 2,000 full-length scientific papers from the computer science domain. Kim et al. [10], in a SemEval-2010 challenge, released a dataset containing 244 full scientific articles along with their author- and reader-assigned keyphrases. Nguyen and Kan [16] released 211 full-length scientific documents with multiple annotated keyphrases. All of these datasets were released more than a decade ago and were more suitable for the machine-learning models available back then. With today's deep learning paradigms, such as un/semi-supervised learning, requiring Wikipedia-sized corpora (&gt;6M articles), it becomes imperative to update the KPE and KPG tasks with similarly sized corpora.</p>
<p>In this work, we develop two large datasets (LDKP: Long Document Keyphrase), comprising 100K and 1.3M documents, for identifying keyphrases from full-length scientific articles, along with metadata such as venue, year of publication, author information, inbound and outbound citations, and citation contexts, among others. We achieve this by mapping the existing KP20K [11] and OAGKx [15] corpora to the documents available in the S2ORC dataset [28]. We make the datasets publicly available on the Huggingface hub (Section 2.2) and also integrate the processing of these datasets with the datasets¹ and transformerkp² libraries. We hope that researchers working in this area will acknowledge the shortcomings of the popularly used datasets and methods in KPE and KPG and devise exciting new approaches for overcoming the challenges related to identifying keyphrases from long documents and contexts beyond summaries; this would make the models more useful in practical real-world settings. We think that LDKP can also complement recent efforts towards creating suitable benchmarks [29] for evaluating methods being developed to understand and process long text sequences.</p>
      <p>2. Dataset</p>
      <p>We propose two datasets resulting from mapping S2ORC to the KP20K and OAGKx corpora, respectively. Lo et al. [28] publicly released S2ORC as a huge corpus of 8.1M scientific documents. While it has full text and metadata (see Table 2), the corpus does not contain keyphrases. We took this as an opportunity to create a new corpus for identifying keyphrases from full-length scientific articles: we took the KP20K and OAGKx corpora, for which keyphrases were already available, and mapped them to their corresponding documents in S2ORC.</p>
      <p>This is the first time in the keyphrase community that such a large number of full-length documents with comprehensive metadata have been made publicly available for academic use. Here, we want to acknowledge a concurrent work [30] that looks at the task of keyphrase generation from a newly constructed corpus of long documents, FULLTEXTKP; however, they do not make their corpus publicly available, and it is significantly smaller than ours, containing only ≈ 142K documents.</p>
        <sec id="sec-1-3-1">
<title>2.1. Dataset Preparation</title>
<p>¹ https://github.com/huggingface/datasets. ² transformerkp is a transformer-based deep learning library for training and evaluating keyphrase extraction and generation algorithms: https://github.com/Deep-Learning-for-Keyphrase/transformerkp</p>
          <p>[Table 2: Information available in the metadata of each scientific paper in the LDKP corpus.]</p>
          <p>In the absence of any unique identifier shared across the datasets, we used the paper title to map documents in S2ORC to KP20K/OAGKx. This had its own set of challenges. For example, some papers in KP20K and OAGKx had unigram titles like "Editorial" or "Preface", and multiple papers can share the same title. We therefore ignored all papers with unigram and bigram titles and resolved the remaining title conflicts through manual verification. We also found that some of the keyphrases in the OAGKx and KP20K datasets were parsed incorrectly. Keyphrases that contain delimiters such as the comma (which is also used as the separator for the keyphrase list) had been broken down into two or more keyphrases; e.g., the keyphrase '2,4-dichlorophenoxyacetic acid' became ['2', '4-dichlorophenoxyacetic acid']. In some cases the publication year, page numbers, or DOI (e.g., 1999:14:555-558) were inaccurately added to the list of keyphrases. To solve this, we filtered out all keyphrases that did not contain any alphabetical characters.</p>
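<p>The filtering step described above amounts to a one-line check; the sketch below illustrates the stated rule (keep only keyphrases containing at least one alphabetical character) and is not the authors' exact cleaning script.</p>
<preformat>
```python
import re

def clean_keyphrases(raw_list):
    """Drop mis-parsed 'keyphrases' with no alphabetic character,
    e.g. stray years, page ranges, or numeric fragments produced by
    splitting on the comma delimiter."""
    return [kp.strip() for kp in raw_list if re.search(r"[a-zA-Z]", kp)]
```
</preformat>
<p>For example, the mis-parsed list ['2', '4- dichlorophenoxyacetic acid', '1999:14:555-558'] is reduced to the single genuine keyphrase.</p>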
<p>We release two datasets, LDKP3K and LDKP10K, corresponding to KP20K and OAGKx, respectively. The first corpus consists of ≈ 100K long documents with keyphrases, obtained by mapping KP20K to S2ORC. The KP20K corpus mainly contains the title, abstract, and keyphrases of computer science research articles from online digital libraries such as the ACM Digital Library, ScienceDirect, and Wiley. Using S2ORC documents, we increase the average length of the documents in KP20K from 7.42 sentences to 280.67 sentences. This also increased the percentage of present keyphrases in the input text by 18.7%.</p>
          <p>Next, in order to facilitate the usage of particular sections in KPE algorithms, we standardized the section names across all the papers. The section names varied across different papers in the S2ORC dataset. For example, some papers have a section named "Introduction" while others have "1.Introduction", "I. Introduction", "I Introduction", etc. To deal with this, we replaced these variant section names with a common generic name, such as "introduction", across all the papers. We did this for the common sections: introduction, related work, conclusion, methodology, and results and analysis.</p>
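<p>The section-name standardization can be sketched as stripping leading arabic or roman numbering and lower-casing; the regular expression below is a hypothetical approximation of the rule described, not the published implementation.</p>
<preformat>
```python
import re

# Strip leading numbering such as "1.", "I.", or a bare roman numeral
# followed by whitespace, then lower-case the remainder.
_NUMBERING = re.compile(r"^(?:\d+[.)]?|[IVX]+[.)])\s*|^[IVX]+\s+")

def normalize_section_name(raw):
    return _NUMBERING.sub("", raw).strip().lower()
```
</preformat>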
<p>The second corpus, corresponding to OAGKx, consists of 1.3M full scientific articles from various domains with their corresponding keyphrases, collected from academic graphs [31, 32]. The resulting corpus contains 194.7 sentences per document on average (up from 8.87 sentences), with a 10.95% increase in present keyphrases. The increase in the percentage of present keyphrases in both corpora when expanded to full-length articles clearly indicates that a significant share of the keyphrases occurs beyond the abstract. Since both datasets consist of a large number of documents, we present three versions of each dataset, with the training data split into small, medium, and large sizes, as given in Table 3. This was done in order to give researchers and practitioners with scarce computing resources an opportunity to evaluate their models.</p>
          <p>In order to make the dataset useful for training a sequence tagging model, we also provide token-level tags in B-I-O format, as previously done in [33]. We marked all the words in the document belonging to keyphrases as 'B' or 'I', depending on whether a word is the first word of the keyphrase or not; every other word was tagged as 'O'. The ground-truth keyphrases associated with a document were identified by searching for the same string pattern in the document's text. The text is tokenized using a whitespace tokenizer, and a mapping between each token and its corresponding tag is provided, as shown in Figure 1.</p>
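<p>The B-I-O tagging described above amounts to whitespace tokenization followed by exact string matching of each keyphrase; a minimal sketch (the function name and the case-insensitive matching choice are illustrative).</p>
<preformat>
```python
def bio_tags(document, keyphrases):
    """Whitespace-tokenize the document and tag each token 'B', 'I',
    or 'O' for every exact occurrence of a ground-truth keyphrase."""
    tokens = document.split()
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for kp in keyphrases:
        kp_toks = kp.lower().split()
        n = len(kp_toks)
        for i in range(len(tokens) - n + 1):
            # only tag spans that are still untagged, so matches never overlap
            if lowered[i:i + n] == kp_toks and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return list(zip(tokens, tags))
```
</preformat>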
<p>[Table 3: Sizes of the small, medium, and large training splits, along with the test and validation splits, for LDKP3K and LDKP10K.]</p>
<p>The proposed datasets LDKP3k and LDKP10k are further divided into train, test, and validation splits, as shown in Table 3. For LDKP3k, these splits are based on the original KP20K dataset. For LDKP10k, we resorted to random sampling to create the splits, since OAGKx, the keyphrase dataset corresponding to LDKP10k, was not originally divided into train, test, and validation splits. Figures 2 and 3 show the distribution of papers in terms of field of study across all the splits of the LDKP3k and LDKP10k datasets, respectively.</p>
          <p>2.2. Dataset Usage</p>
          <p>We make all the datasets publicly available on the Huggingface hub and enable programmatic access to the data using the datasets library. For example, Figure 4 shows sample code for downloading the LDKP3K dataset with the 'small' training data split. The other configurations, 'medium' and 'large', can be downloaded similarly; each has a different amount of training data but the same validation and test sets. Figure 4 also shows how each split of the dataset can be accessed.</p>
<p>Please refer to the Huggingface hub pages for LDKP3k and LDKP10k for detailed information about downloading and using the datasets.</p>
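<p>Since Figure 4 is not reproduced here, the loading step can be sketched with the datasets library as follows; the repository and configuration names are those listed on the hub pages below, and the import is deferred so that the library is only required when the function is actually called.</p>
<preformat>
```python
def load_ldkp3k(config="small"):
    """Download midas/ldkp3k; 'small', 'medium', and 'large' select the
    training-split size, while validation and test stay the same."""
    from datasets import load_dataset  # requires `pip install datasets`
    ds = load_dataset("midas/ldkp3k", config)
    return ds["train"], ds["validation"], ds["test"]
```
</preformat>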
        </sec>
        <sec id="sec-1-3-2">
<title>1. LDKP3K</title>
          <p>https://huggingface.co/datasets/midas/ldkp3k</p>
        </sec>
        <sec id="sec-1-3-3">
<title>2. LDKP10K</title>
          <p>https://huggingface.co/datasets/midas/ldkp10k</p>
<p>We also enable access to the datasets through the transformerkp library, which abstracts away the preprocessing steps and makes the data splits readily available for the tasks of keyphrase extraction using sequence tagging and keyphrase generation using seq2seq methods, with different transformer-based language models. Details on downloading and using the datasets with transformerkp for keyphrase extraction and generation can be found at https://deep-learning-for-keyphrase.github.io/transformerkp/how-to-guides/keyphrase-data/</p>
          <p>[Table 4 row labels: PositionRank, TextRank, TopicRank, SingleRank, MultipartiteRank, TopicalPageRank, SGRank.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
<p>In this section, we evaluate several popular keyphrase extraction algorithms on the proposed LDKP3K and LDKP10K datasets, along with three of the existing smaller datasets in the scientific domain that comprise full-length documents: Krapivin, SemEval-2010, and NUS. A majority of previous works have reported scores for Krapivin, SemEval-2010, and NUS by considering only the title and abstract as the input. We further report benchmark results and discuss the comparative advantages of the different algorithms to provide future research directions.</p>
      <p>3.1. Unsupervised Methods</p>
      <p>There are multiple unsupervised methods for extracting keyphrases from a document. We used the following popular statistical models: TfIdf, KPMiner [34], and YAKE [35], and the following graph-based algorithms: TextRank [36], PositionRank [37], SingleRank [38], TopicRank [39], MultipartiteRank [40], and SGRank [41]. All implementations were taken from the PKE toolkit [42], except SGRank, for which we used the implementation available in the textacy³ library. These algorithms first identify candidate keyphrases using lexical rules and then rank the candidates using either a statistical or a graph-based approach [1]. We report the performance scores of these methods on the test datasets (Table 4).</p>
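<p>A single run of one of these models can be sketched with PKE's standard pipeline (candidate selection, candidate weighting, top-n ranking); TopicRank is used here purely for illustration, and the import is deferred because PKE also requires a spaCy English model to be installed.</p>
<preformat>
```python
def extract_topicrank(text, n=10):
    """Unsupervised extraction with PKE's TopicRank, following the
    toolkit's load -> select -> weight -> rank pipeline."""
    import pke  # requires `pip install pke` plus a spaCy English model
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language="en")
    extractor.candidate_selection()   # lexical rules propose candidate phrases
    extractor.candidate_weighting()   # graph-based ranking of the candidates
    return extractor.get_n_best(n=n)  # list of (keyphrase, score) pairs
```
</preformat>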
      <sec id="sec-2-1">
<title>3.2. Supervised Methods</title>
        <p>³ https://github.com/chartbeat-labs/textacy</p>
        <p>For supervised keyphrase extraction, we report results for two traditional models, KEA [43] and WINGNUS [23], which treat keyphrase extraction as a binary classification task. A more recent trend is to treat keyphrase extraction as a sequence tagging task [33, 2, 1]. Transformer-based language models like BERT [44], RoBERTa [45], and KBIR [2] have already been shown to achieve state-of-the-art results on keyphrase extraction when only the title and abstract are taken as input. However, all these models are limited to processing 512 sub-word tokens. This led us to try Longformer [46], which can handle long sequences of up to 4,096 sub-word tokens. We acknowledge that several other recent models, such as [47, 48], could also have been tried; we intend to explore them in future work and to train larger models on the large LDKP corpus.</p>
        <p>3.3. Evaluation Metrics</p>
        <p>We used F1@5 and F1@10 as our evaluation metrics [10]. Equations (1), (2), and (3) show how F1@k is calculated. Before evaluating, we lower-cased, stemmed, and removed punctuation from both the ground-truth and the predicted keyphrases, and used exact matching. Let Y denote the set of ground-truth keyphrases and Ŷ = (ŷ1, ŷ2, ..., ŷm) denote the predicted keyphrases, ordered by their quality of prediction, with Ŷk denoting the top k elements of Ŷ. Then the metrics are defined as follows:</p>
        <p>P@k = |Y ∩ Ŷk| / min{|Ŷ|, k}   (1)</p>
        <p>R@k = |Y ∩ Ŷk| / |Y|   (2)</p>
        <p>F1@k = (2 · P@k · R@k) / (P@k + R@k)   (3)</p>
<p>3.4. Results</p>
<p>[Tables 4 and 5: F1@5 and F1@10 of the evaluated algorithms (SGRank, TopicRank, PositionRank, TopicalPageRank, SingleRank, TextRank, MultipartiteRank, YAKE, TfIdf, KPMiner, WINGNUS, and KEA) on the test datasets.]</p>
        <p>The other algorithms might benefit from revisiting their pipelines, making the necessary changes for processing long documents, and tuning their heuristics to generate better-quality candidates for the subsequent ranking stage that identifies the keyphrases.</p>
<p>Among the supervised approaches, using Longformer in a sequence tagging setup proved to be the most promising technique, as shown by the performance reported in Table 6. Treating keyphrase extraction as a sequence tagging problem also lets the model automatically learn the optimal number of keyphrases to predict, which helps overcome the challenge, faced by the other strategies discussed above, of dealing with a large number of candidates. The Longformer model on average predicted 6.25 and 6.08 keyphrases for the LDKP10k and LDKP3k test sets, respectively.</p>
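<p>A Longformer sequence tagging setup of this kind can be sketched with the transformers library; the checkpoint name and three-label B/I/O head below are illustrative assumptions, not the authors' exact training configuration, and the import is deferred so the sketch only needs the library when called.</p>
<preformat>
```python
def build_bio_tagger(model_name="allenai/longformer-base-4096"):
    """Hypothetical sketch: a Longformer token-classification head with
    three labels (B, I, O) via the transformers AutoModel API."""
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=3,  # B, I, O
        id2label={0: "B", 1: "I", 2: "O"},
        label2id={"B": 0, "I": 1, "O": 2},
    )
    return tokenizer, model
```
</preformat>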
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
<p>[Table 6: Results of the supervised models KEA, WINGNUS, and longformer-base-4096.]</p>
<p>In this work, we identified the shortage of corpora comprising long documents for training and evaluating keyphrase extraction and generation models. We created two very large corpora, LDKP3K and LDKP10K, comprising ≈ 100K and ≈ 1.3M documents, and made them publicly available. The results of keyphrase extraction on long documents with some of the existing unsupervised and supervised models clearly depict the challenging nature of the problem. We hope this will encourage researchers to innovate and propose new models capable of identifying high-quality keyphrases from long multi-page documents.</p>
      <p>[Table 7: Average number of candidate keyphrases generated by the supervised and unsupervised algorithms on the LDKP3K and LDKP10K datasets.]</p>
<p>Unsupervised algorithms did not show better performance than their supervised counterparts on long documents, as shown in Tables 4, 5, and 6. Among the unsupervised approaches, SGRank and KPMiner outperformed every other algorithm in the graph-based and statistical categories, respectively. One possible reason for the low performance of the other unsupervised techniques could be that, during the candidate generation and ranking phases, these models had to deal with more noise than they were tuned for. Table 7 shows the number of candidates generated by the strategy used by each of these algorithms. We can observe that most of the techniques generated a huge number of candidate keyphrases, which might have made the downstream ranking process challenging. On the other hand, both SGRank and KPMiner had strategies that significantly reduced the number of generated candidates and came up with a better set of candidate keyphrases.</p>
      <p>References</p>
      <p>[1] R. Meng, D. Mahata, F. Boudin, From fundamentals to recent advances: A tutorial on keyphrasification, in: European Conference on Information Retrieval, Springer, 2022, pp. 582–588.</p>
      <p>[2] M. Kulkarni, D. Mahata, R. Arora, R. Bhowmik, Learning rich representation of keyphrases from text, arXiv preprint arXiv:2112.08547 (2021).</p>
      <p>[3] D. K. Sanyal, P. K. Bhowmick, P. P. Das, S. Chattopadhyay, T. Santosh, Enhancing access to scholarly publications with surrogate resources, Scientometrics 121 (2019) 1129–1164.</p>
      <p>[4] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E. Frank, Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (1999) 81–104.</p>
      <p>[5] I. Y. Song, R. B. Allen, Z. Obradovic, M. Song, Keyphrase extraction-based query expansion in digital libraries, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'06), IEEE, 2006, pp. 202–209.</p>
      <p>[6] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. McCallum, SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications, arXiv preprint arXiv:1704.02853 (2017).</p>
      <p>[7] W.-t. Yih, J. Goodman, V. R. Carvalho, Finding advertising keywords on web pages, in: Proceedings of the 15th International Conference on World Wide Web, 2006, pp. 213–222.</p>
      <p>[8] V. Qazvinian, D. Radev, A. Özgür, Citation summarization through keyphrase extraction, in: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 2010, pp. 895–903.</p>
      <p>[9] G. Berend, Opinion expression mining by exploiting keyphrase extraction (2011).</p>
      <p>[10] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, Automatic keyphrase extraction from scientific articles, Language Resources and Evaluation 47 (2013) 723–742.</p>
      <p>[11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, Y. Chi, Deep keyphrase generation, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 582–592. URL: https://www.aclweb.org/anthology/P17-1054. doi:10.18653/v1/P17-1054.</p>
      <p>[12] A. Swaminathan, H. Zhang, D. Mahata, R. Gosangi, R. Shah, A. Stent, A preliminary exploration of GANs for keyphrase generation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8021–8030.</p>
      <p>[13] C. Caragea, F. A. Bulgarov, A. Godea, S. D. Gollapalli, Citation-enhanced keyphrase extraction from research papers: A supervised approach, in: EMNLP, 2014.</p>
      <p>[14] A. Hulth, Improved automatic keyword extraction given more linguistic knowledge, in: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, Association for Computational Linguistics, USA, 2003, pp. 216–223. URL: https://doi.org/10.3115/1119355.1119383. doi:10.3115/1119355.1119383.</p>
      <p>[15] E. Çano, OAGKX keyword generation dataset, 2019. URL: http://hdl.handle.net/11234/1-3062, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.</p>
      <p>[16] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: D. H.-L. Goh, T. H. Cao, I. T. Sølvberg, E. Rasmussen (Eds.), Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 317–326.</p>
      <p>[17] M. Krapivin, A. Autayeu, M. Marchese, E. Blanzieri, N. Segata, Keyphrases extraction from scientific documents: Improving machine learning approaches with natural language processing, volume 6102, 2010, pp. 102–111. doi:10.1007/978-3-642-13654-2_12.</p>
      <p>[18] E. Papagiannopoulou, G. Tsoumakas, A review of keyphrase extraction, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2020) e1339.</p>
      <p>[19] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1705–1715.</p>
      <p>[20] W. Magdy, K. Darwish, Book search: indexing the valuable parts, in: Proceedings of the 2008 ACM Workshop on Research Advances in Large Digital Book Repositories, 2008, pp. 53–56.</p>
      <p>[21] A. Gupta, V. Dengre, H. A. Kheruwala, M. Shah, Comprehensive review of text-mining applications in finance, Financial Innovation 6 (2020) 1–25.</p>
      <p>[22] R. Bhargava, S. Nigwekar, Y. Sharma, Catchphrase extraction from legal documents using LSTM networks, in: FIRE (Working Notes), 2017, pp. 72–73.</p>
      <p>[23] T. D. Nguyen, M.-T. Luong, WINGNUS: Keyphrase extraction utilizing document logical structure, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 166–169.</p>
      <p>[24] K. S. Hasan, V. Ng, Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1262–1273.</p>
      <p>[25] E. Cano, O. Bojar, Keyphrase generation: A text summarization struggle, arXiv preprint arXiv:1904.00110 (2019).</p>
      <p>[26] Y. Gallina, F. Boudin, B. Daille, Large-scale evaluation of keyphrase extraction models, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, pp. 271–278.</p>
      <p>[27] C. G. Kontoulis, E. Papagiannopoulou, G. Tsoumakas, Keyphrase extraction from scientific articles via extractive summarization, in: Proceedings of the Second Workshop on Scholarly Document Processing, 2021, pp. 49–55.</p>
      <p>[28] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. S. Weld, S2ORC: The Semantic Scholar open research corpus, arXiv preprint arXiv:1911.02782 (2019).</p>
      <p>[29] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, et al., SCROLLS: Standardized comparison over long language sequences, arXiv preprint arXiv:2201.03533 (2022).</p>
      <p>[30] K. Garg, J. R. Chowdhury, C. Caragea, Keyphrase generation beyond the boundaries of title and abstract, arXiv preprint arXiv:2112.06776 (2021).</p>
      <p>[31] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. Hsu, K. Wang, An overview of Microsoft Academic Service (MAS) and applications, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 243–246.</p>
      <p>[32] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.</p>
      <p>[33] D. Sahrawat, D. Mahata, H. Zhang, M. Kulkarni, A. Sharma, R. Gosangi, A. Stent, Y. Kumar, R. R. Shah, R. Zimmermann, Keyphrase extraction as sequence labeling using contextualized embeddings, Advances in Information Retrieval 12036 (2020) 328.</p>
      <p>[34] S. R. El-Beltagy, A. Rafea, KP-Miner: A keyphrase extraction system for English and Arabic documents, Information Systems 34 (2009) 132–144.</p>
      <p>[35] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, A. Jatowt, YAKE! Keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257–289.</p>
      <p>[36] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411.</p>
      <p>[37] C. Florescu, C. Caragea, PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1105–1115.</p>
      <p>[38] X. Wan, J. Xiao, CollabRank: towards a collaborative approach to single-document keyphrase extraction, in: Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 2008, pp. 969–976.</p>
      <p>[39] A. Bougouin, F. Boudin, B. Daille, TopicRank: Graph-based topic ranking for keyphrase extraction, in: International Joint Conference on Natural Language Processing (IJCNLP), 2013, pp. 543–551.</p>
      <p>[40] F. Boudin, Unsupervised keyphrase extraction with multipartite graphs, arXiv preprint arXiv:1803.08721 (2018).</p>
      <p>[41] S. Danesh, T. Sumner, J. H. Martin, SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction, in: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, 2015, pp. 117–126.</p>
      <p>[42] F. Boudin, pke: an open source Python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 69–73.</p>
      <p>[43] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, KEA: Practical automated keyphrase extraction, in: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, IGI Global, 2005, pp. 129–152.</p>
      <p>[44] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[46] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, CoRR abs/2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150. arXiv:2004.05150.</p>
      <p>[47] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.</p>
      <p>[48] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=rkgNKkHtvB.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>