LDKP - A Dataset for Identifying Keyphrases from Long Scientific Documents

Debanjan Mahata 1,2,*,†, Navneet Agarwal 2,†, Dibya Gautam 2,†, Amardeep Kumar 3,†, Swapnil Parekh 4, Yaman Kumar Singla 5,2, Anish Acharya 6 and Rajiv Ratn Shah 2

1 Moody's Analytics, USA
2 MIDAS Labs, IIIT-Delhi, India
3 Instabase, India
4 New York University, USA
5 Adobe Media and Data Science Research (MDSR), India
6 University of Texas at Austin, USA

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Debanjan Mahata participated in this work as an Adjunct Faculty at IIIT-Delhi.
† These authors contributed equally.
$ debanjanmahata85@gmail.com (D. Mahata)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (≈ 8 sentences). This presents three challenges for real-world applications: i) human-written summaries are unavailable for most documents, ii) a vast majority of the documents are long, and iii) a high percentage of KPs are directly found beyond the limited context of the title and the abstract. Therefore, we release two extensive corpora mapping the KPs of ≈ 1.3M and ≈ 100K scientific articles to their fully extracted text and additional metadata, including publication venue, year, authors, field of study, and citations, to facilitate research on this real-world problem. Additionally, we benchmark and report the performance of different unsupervised as well as supervised algorithms for keyphrase extraction on long scientific documents. Our experiments show that formulating keyphrase extraction as a sequence tagging task with modern transformer language models capable of processing long text sequences, such as Longformer, has advantages over the traditional algorithms: it not only results in better performance in terms of F1 metrics but also learns to extract an optimal number of keyphrases from the input documents.

Keywords
keyphrase extraction, keyphrase generation, keyphrasification, automatic identification of keyphrases, long documents, longformer, language models

1. Introduction and Background

Identifying keyphrases (KPs) is a form of extreme summarization: given an input document, the task is to find a set of representative phrases that can effectively summarize it [1]. Over the last decade, we have seen an exponential increase in the velocity at which unstructured text is produced on the web, with the vast majority of it untagged or poorly tagged. KPs provide an effective way to search, summarize, tag, and manage these documents. Identifying KPs has proved useful as a preprocessing, pre-training [2], or supplementary task for search [3, 4, 5], recommendation systems [6], advertising [7], summarization [8], and opinion mining [9], to name a few. This has motivated researchers to explore machine learning algorithms for automatically mapping documents to a set of keyphrases, commonly referred to as the keyphrase extraction (KPE) task [10, 6] for extractive approaches and the keyphrase generation (KPG) task [11, 12] for generative approaches. Recently, it has also been referred to as keyphrasification [1].

Various algorithms have been proposed over time to solve the problem of identifying keyphrases from text documents; they can primarily be categorized into supervised and unsupervised approaches [18]. The majority of these approaches take an abstract (a summary) of a text document as the input and produce keyphrases as output.
However, in industrial applications across different domains such as advertising [19], search and indexing [20], finance [21], law [22], and many other real-world use cases, document summaries are not readily available. Moreover, most of the documents encountered in these applications are longer than 8 sentences (the average length of abstracts in KP datasets, see Table 1). We also find that a significant percentage of keyphrases (>18%) are directly found beyond the limited context of a document's title and abstract/summary. These constraints limit the potential of currently developed KPE and KPG algorithms to only theoretical pursuits.

Table 1: Characteristics of the proposed datasets compared to the existing datasets.

Dataset                   Size (no. of docs)   Long Documents   Avg no. of sentences   Avg no. of words   Present KPs   Absent KPs
SemEval 2017 [6]          0.5K                 ×                7.36                   176.13             42.01%        57.69%
KDD [13]                  0.75K                ×                8.05                   188.43             45.99%        54.01%
Inspec [14]               2K                   ×                5.45                   130.57             55.69%        44.31%
KP20K [11]                568K                 ×                7.42                   188.47             57.4%         42.6%
OAGKx [15]                22M                  ×                8.87                   228.50             52.7%         47.3%
NUS [16]                  0.21K                ✓                375.93                 7644.43            67.75%        32.25%
SemEval 2010 [10]         0.24K                ✓                319.32                 7434.52            42.01%        57.99%
Krapivin [17]             2.3K                 ✓                370.48                 8420.76            44.74%        52.26%
LDKP3K (S2ORC ← KP20K)    100K                 ✓                280.67                 6027.10            76.11%        23.89%
LDKP10K (S2ORC ← OAGKx)   1.3M                 ✓                194.76                 4384.58            63.65%        36.35%

Many previous studies have pointed out the constraints imposed on KPE algorithms by the short inputs and artificial nature of the available datasets [23, 24, 25, 26, 27]. In particular, Cano and Bojar [25], while explaining the limitations of their proposed algorithms, note that the title and the abstract may not carry sufficient topical information about the article, even when joined together. While most datasets in the domain of KPE consist of titles and abstracts [15], there have been some attempts at providing long document KP datasets as well (Table 1). Krapivin et al. [17] released 2,000 full-length scientific papers from the computer science domain. Kim et al. [10] in a SemEval-2010 challenge released a dataset containing 244 full scientific articles along with their author- and reader-assigned keyphrases. Nguyen and Kan [16] released 211 full-length scientific documents with multiple annotated keyphrases. All of these datasets were released more than a decade ago and were more suitable for the machine-learning models available back then. With today's deep learning paradigms like un/semi-supervised learning requiring Wikipedia-sized corpora (>6M articles), it becomes imperative to update the KPE and KPG tasks with similarly sized corpora.

In this work, we develop two large datasets (LDKP - Long Document Keyphrase) comprising 100K and 1.3M documents for identifying keyphrases from full-length scientific articles, along with their metadata information such as venue, year of publication, author information, inbound and outbound citations, and citation contexts, among others. We achieve this by mapping the existing KP20K [11] and OAGKx [15] corpora to the documents available in the S2ORC dataset [28]. We make the datasets publicly available on the Huggingface hub (Section 2.2) and also integrate their processing with the datasets¹ and transformerkp² libraries. We hope that researchers working in this area would acknowledge the shortcomings of the popularly used datasets and methods in KPE and KPG and devise exciting new approaches for overcoming the challenges related to identifying keyphrases from long documents and contexts beyond summaries. This would make the models more useful in practical real-world settings. We think that LDKP can also complement recent efforts towards creating suitable benchmarks [29] for evaluating methods being developed to understand and process long text sequences.

1 https://github.com/huggingface/datasets
2 transformerkp - a transformer based deep learning library for training and evaluating keyphrase extraction and generation algorithms, https://github.com/Deep-Learning-for-Keyphrase/transformerkp

2. Dataset

We propose two datasets resulting from the mapping of S2ORC with the KP20K and OAGKx corpora, respectively. Lo et al. [28] publicly released S2ORC as a huge corpus of 8.1M scientific documents. While it has full text and metadata (see Table 2), the corpus does not contain keyphrases. We took this as an opportunity to create a new corpus for identifying keyphrases from full-length scientific articles. Therefore, we took the KP20K and OAGKx scientific corpora, for which keyphrases were already available, and mapped them to their corresponding documents in S2ORC.

This is the first time in the keyphrase community that such a large number of full-length documents with comprehensive metadata information have been made publicly available for academic use. Here, we want to acknowledge another concurrent work [30] that looks at the task of keyphrase generation from a newly constructed corpus of long documents - FULLTEXTKP. However, they do not make the corpus publicly available, and the corpus is significantly smaller than ours, containing only ≈ 142K documents.

Table 2: Information available in the metadata of each scientific paper in the LDKP corpus.

Paper details     Paper Identifier   Citations and References
Paper ID          ArXiv ID           Outbound Citations
Title             ACL ID             Inbound Citations
Authors           PMC ID             Bibliography
Year              PubMed ID          References
Venue             MAG ID
Journal           DOI
Field of Study    S2 URL
We release two datasets, LDKP3K and LDKP10K, corresponding to KP20K and OAGKx, respectively. The first corpus consists of ≈ 100K long documents with keyphrases obtained by mapping KP20K to S2ORC. The KP20K corpus mainly contains the title, abstract and keyphrases of computer science research articles from online digital libraries like ACM Digital Library, ScienceDirect, and Wiley. Using S2ORC documents, we increase the average length of the documents in KP20K from 7.42 sentences to 280.67 sentences. This also increases the percentage of present keyphrases in the input text by 18.7%.

The second corpus, corresponding to OAGKx, consists of 1.3M full scientific articles from various domains with their corresponding keyphrases collected from academic graphs [31, 32]. The resulting corpus contains 194.7 sentences on average (up from 8.87 sentences), with a 10.95% increase in present keyphrases. The increase in the percentage of present keyphrases in both corpora when expanded to full-length articles clearly indicates that a significant chunk of the keyphrases occurs beyond the abstract. Since both datasets consist of a large number of documents, we present three versions of each dataset with the training data split into small, medium and large sizes, as given in Table 3. This was done to give researchers and practitioners with limited computing resources an opportunity to evaluate the performance of their methods on a smaller dataset.

Table 3: LDKP datasets with their train, validation and test dataset distributions.

                      LDKP3K (no. of docs)   LDKP10K (no. of docs)
Train    Small        20,000                 20,000
         Medium       50,000                 50,000
         Large        90,019                 1,296,613
Test                  3,413                  10,000
Validation            3,339                  10,000

2.1. Dataset Preparation

In the absence of any unique identifier shared across the datasets, we used the paper title to map documents in S2ORC to KP20K/OAGKx. This had its own set of challenges. For example, some papers in KP20K and OAGKx had unigram titles like "Editorial" or "Preface", and multiple papers can be found with the same title. We ignored all the papers with unigram and bigram titles and resolved the remaining title conflicts through manual verification. We also found that some of the keyphrases in the OAGKx and KP20K datasets were parsed incorrectly. Keyphrases that contain delimiters such as a comma (which is also used as the separator for the keyphrase list) had been broken down into two or more keyphrases, e.g., the keyphrase '2,4-dichlorophenoxyacetic acid' had been broken down into ['2', '4-dichlorophenoxyacetic acid']. In some cases, the publication year, page numbers, or DOI, e.g., 1999:14:555-558, were inaccurately added to the list of keyphrases. To solve this, we filtered out all the keyphrases that did not have any alphabetical characters in them.
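As a minimal sketch of this filtering heuristic (the function name and regular expression are ours, not taken from the released preprocessing code), the rule amounts to keeping only keyphrases that contain at least one alphabetical character:

```python
import re

def filter_keyphrases(keyphrases):
    """Keep only keyphrases containing at least one alphabetical character,
    dropping artifacts such as stray page ranges (e.g. '1999:14:555-558')."""
    return [kp.strip() for kp in keyphrases if re.search(r"[A-Za-z]", kp)]

# The page-range artifact is removed; genuine keyphrases survive.
print(filter_keyphrases(["keyphrase extraction", "1999:14:555-558",
                         "4-dichlorophenoxyacetic acid"]))
```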
Next, in order to facilitate the usage of particular sections in KPE algorithms, we standardized the section names across all the papers. The section names varied across different papers in the S2ORC dataset. For example, some papers have a section named "Introduction" while others have it as "1.Introduction", "I. Introduction", "I Introduction", etc. To deal with this problem, we replaced the unique section names with a common generic section name, like "introduction", across all the papers. We did this for the common sections, which include introduction, related work, conclusion, methodology, results and analysis.

In order to make the dataset useful for training a sequence tagging model, we also provide token-level tags in B-I-O format, as previously done in [33]. We marked all the words in the document belonging to the keyphrases as 'B' or 'I' depending on whether they are the first word of the keyphrase or not. Every other word, which is not part of a keyphrase, was tagged as 'O'. The ground truth keyphrases associated with the documents were identified by searching for the same string pattern in the document's text. The text is tokenized using a whitespace tokenizer, and a mapping between each token and its corresponding tag is provided as shown in Figure 1.

Figure 1: B-I-O tagged tokens from a random sample in the LDKP dataset, where 'B' marks the start of a keyphrase span, 'I' a token inside a keyphrase span, and 'O' a token outside any keyphrase span.
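The following is a minimal sketch of this tagging scheme, assuming whitespace tokenization and exact (lower-cased) string matching; the helper name is ours, and the released data may differ in normalization details:

```python
def bio_tag(text, keyphrases):
    """Assign B-I-O tags to whitespace tokens by exact matching of keyphrases."""
    tokens = text.split()
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for kp in keyphrases:
        kp_tokens = kp.lower().split()
        n = len(kp_tokens)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == kp_tokens:
                tags[i] = "B"                          # first token of the span
                tags[i + 1:i + n] = ["I"] * (n - 1)    # remaining tokens of the span
    return list(zip(tokens, tags))

print(bio_tag("We study keyphrase extraction from long documents",
              ["keyphrase extraction", "long documents"]))
# [('We','O'), ('study','O'), ('keyphrase','B'), ('extraction','I'),
#  ('from','O'), ('long','B'), ('documents','I')]
```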
The proposed datasets LDKP3k and LDKP10k are further divided into train, test and validation splits as shown in Table 3. For LDKP3k, these splits are based on the original KP20K dataset. For LDKP10k, we resorted to random sampling to create these splits, since OAGKx, the keyphrase dataset corresponding to LDKP10k, wasn't originally divided into train, test and validation splits. Figures 2 and 3 show the distribution of papers in terms of field of study across all the splits of the LDKP3k and LDKP10k datasets, respectively.

Figure 2: Distribution of fields of study for the train, test and validation splits of the LDKP3k dataset.

Figure 3: Distribution of fields of study for the train, test and validation splits of the LDKP10k dataset.

2.2. Dataset Usage

We make all the datasets publicly available on the Huggingface hub and enable programmatic access to the data using the datasets library. For example, Figure 4 shows sample code for downloading the LDKP3K dataset with the 'small' training data split. Similarly, the other configurations like 'medium' and 'large' can also be downloaded, each having a different size of training data but the same validation and test datasets. Figure 4 also shows how each split of the dataset can be accessed. Please refer to the Huggingface hub pages for LDKP3k and LDKP10k for detailed information about downloading and using the datasets.

Figure 4: Sample code for downloading the 'small' split of the LDKP3K dataset.

1. LDKP3K - https://huggingface.co/datasets/midas/ldkp3k
2. LDKP10K - https://huggingface.co/datasets/midas/ldkp10k

We also enable access to the datasets using the transformerkp library, which abstracts away the preprocessing steps and makes the data splits readily available to the user for the tasks of keyphrase extraction using sequence tagging and keyphrase generation using seq2seq methods with different transformer-based language models. Details on downloading and using the datasets with transformerkp for keyphrase extraction and generation can be found at https://deep-learning-for-keyphrase.github.io/transformerkp/how-to-guides/keyphrase-data/
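Along the lines of Figure 4, a minimal loading sketch with the Huggingface datasets library looks as follows (configuration names follow Table 3; the printed field names are whatever the hub version of the dataset exposes):

```python
from datasets import load_dataset

# Load LDKP3K with the 'small' training configuration; 'medium' and 'large'
# change only the size of the training split, not validation or test.
ldkp3k = load_dataset("midas/ldkp3k", "small")

print(ldkp3k)                      # DatasetDict with train/validation/test splits
print(ldkp3k["train"][0].keys())   # fields available for a single document
```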
3. Experiments

In this section, we evaluate several popular keyphrase extraction algorithms on the proposed LDKP3K and LDKP10K datasets, along with three other existing, smaller datasets in the scientific domain comprising full-length documents - Krapivin, SemEval-2010, and NUS. A majority of the previous works have reported scores for Krapivin, SemEval-2010, and NUS by only considering the title and abstract as the input. We report benchmark results and also discuss the comparative advantages of different algorithms to provide future research directions.

3.1. Unsupervised Methods

There are multiple unsupervised methods for extracting keyphrases from a document. We used the following popular statistical models: TF-IDF, KPMiner [34] and YAKE [35], and the following graph-based algorithms: TextRank [36], PositionRank [37], SingleRank [38], TopicRank [39], MultipartiteRank [40] and SGRank [41]. All the implementations were taken from the PKE toolkit [42], except SGRank, for which we used the implementation available in the textacy³ library. These algorithms first identify candidate keyphrases using lexical rules and then rank the candidates using either a statistical or a graph-based approach [1]. We directly report the performance scores of these methods on the test datasets (Tables 4 and 5).

3 https://github.com/chartbeat-labs/textacy

Table 4: Results on long document datasets using unsupervised graph-based models.

Method             Krapivin          NUS               SemEval-2010      LDKP3k            LDKP10k
                   F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10
PositionRank       0.042   0.052     0.060   0.086     0.074   0.098     0.059   0.062     0.052   0.061
TextRank           0.036   0.047     0.071   0.090     0.085   0.117     0.082   0.094     0.068   0.074
TopicRank          0.071   0.080     0.130   0.152     0.111   0.132     0.108   0.110     0.098   0.102
SingleRank         0.001   0.003     0.005   0.008     0.009   0.010     0.016   0.025     0.011   0.014
MultipartiteRank   0.103   0.107     0.150   0.193     0.116   0.145     0.129   0.110     0.104   0.106
TopicalPageRank    0.009   0.012     0.046   0.059     0.014   0.024     0.019   0.027     0.020   0.031
SGRank             0.140   0.131     0.195   0.203     0.177   0.201     0.138   0.128     0.136   0.132

Table 5: Results on long document datasets using unsupervised statistical models.

Method     Krapivin          NUS               SemEval-2010      LDKP3k            LDKP10k
           F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10
TF-IDF     0.033   0.052     0.063   0.111     0.062   0.070     0.093   0.099     0.072   0.080
KPMiner    0.125   0.151     0.169   0.212     0.155   0.181     0.164   0.152     0.151   0.142
YAKE       0.105   0.107     0.177   0.235     0.088   0.129     0.140   0.132     0.114   0.114
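As an illustration of how these extractors are typically driven, the sketch below shows the standard pke workflow on a single long document; it is not the exact evaluation script behind Tables 4 and 5, and the input path is a placeholder:

```python
import pke

# Read the full text of one long document (placeholder path).
full_text = open("paper_fulltext.txt", encoding="utf-8").read()

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input=full_text, language="en")
extractor.candidate_selection()      # lexical rules propose candidate phrases
extractor.candidate_weighting()      # graph-based ranking of the candidates
for keyphrase, score in extractor.get_n_best(n=10):
    print(keyphrase, score)
```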
3.2. Supervised Methods

For supervised keyphrase extraction, we report results for two traditional models, namely KEA [43] and WINGNUS [23], which treat keyphrase extraction as a binary classification task. A recent trend is to treat keyphrase extraction as a sequence tagging task [33, 2, 1]. Transformer-based language models like BERT [44], RoBERTa [45] and KBIR [2] have already been shown to achieve SOTA results on keyphrase extraction when only the title and abstract are taken as the input. However, all these models are limited to processing only 512 sub-word tokens. This led us to try Longformer [46], which can handle long sequences of text of up to 4,096 sub-word tokens. We acknowledge that there are several other recent models, such as [47, 48], which could also have been tried. We are interested in trying them in future work, along with training larger models on the large LDKP corpus.

Table 6: Results on long document datasets for supervised models.

Method                 Krapivin          NUS               SemEval-2010      LDKP3k            LDKP10k
                       F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10     F1@5    F1@10
KEA                    0.041   0.063     0.069   0.134     0.077   0.090     0.109   0.118     0.087   0.096
WINGNUS                0.059   0.151     0.057   0.085     0.059   0.152     0.099   0.109     0.093   0.102
longformer-base-4096   0.229   0.232     0.253   0.284     0.203   0.219     0.240   0.216     0.236   0.212
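A minimal sketch of the sequence-tagging formulation with Longformer using the Huggingface transformers library is shown below (three labels for B, I and O). It mirrors the setup described above rather than the exact training configuration behind Table 6; the freshly initialized classification head still needs to be fine-tuned on the LDKP training splits.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=3  # B, I, O
)

text = "We study keyphrase extraction from long scientific documents ..."
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, 3): per-token label scores
```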
3.3. Evaluation Metrics

We used F1@5 and F1@10 as our evaluation metrics [10]. Equations 1, 2 and 3 show how F1@k is calculated. Before evaluating, we lower-cased, stemmed, and removed punctuation from the ground truth as well as the predicted keyphrases, and used exact matching. Let Y denote the ground truth keyphrases and Ȳ = (ȳ_1, ȳ_2, ..., ȳ_m) denote the predicted keyphrases ordered by their quality of prediction. Then we can define the metrics as follows:

Precision@k = |Y ∩ Ȳ_k| / min{|Ȳ_k|, k}                                  (1)

Recall@k = |Y ∩ Ȳ_k| / |Y|                                               (2)

F1@k = (2 · Precision@k · Recall@k) / (Precision@k + Recall@k)           (3)

where Ȳ_k denotes the top k elements of the set Ȳ.
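A small sketch of Equations 1-3 in code (for brevity, normalization here is only lower-casing, whereas the evaluation above additionally stems and strips punctuation):

```python
def f1_at_k(gold, predicted, k):
    """F1@k with exact match between normalized gold keyphrases and the top-k predictions."""
    gold_set = {g.lower() for g in gold}
    top_k = [p.lower() for p in predicted[:k]]
    matches = len(gold_set & set(top_k))
    precision = matches / min(len(top_k), k) if top_k else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(["keyphrase extraction", "long documents"],
              ["long documents", "transformers", "keyphrase extraction",
               "datasets", "benchmarks"], k=5))  # ≈ 0.571
```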
3.4. Results

Unsupervised algorithms did not perform better than their supervised counterparts on long documents, as shown in Tables 4, 5, and 6. Among the unsupervised approaches, SGRank and KPMiner outperformed every other algorithm in the graph-based and statistical categories, respectively. One possible reason for the low performance of the other unsupervised techniques could be that, during the candidate generation and ranking phases, these models had to deal with more noise than what they have been tuned for. Table 7 shows the number of candidates generated by the strategies used by each of these algorithms. We can easily observe that most of the techniques generated a huge number of candidate keyphrases, which might have made the downstream ranking process challenging. On the other hand, we can see that both SGRank and KPMiner had strategies that significantly reduced the number of generated candidates and came up with a better set of keyphrases. The other algorithms might benefit from revisiting their pipelines, making the necessary changes for processing long documents, and tuning their heuristics to generate better quality candidates, which are then ranked to identify the keyphrases.

Table 7: Average number of candidate keyphrases generated by the supervised and unsupervised algorithms on the LDKP3K and LDKP10K datasets.

Algorithm          LDKP3K     LDKP10K
SGRank             86.96      85.56
TopicRank          636.02     520.81
PositionRank       678.65     547.66
TopicalPageRank    709.51     574.50
SingleRank         773.11     624.25
TextRank           773.11     624.25
MultipartiteRank   636.02     520.78
YAKE               2475.20    1965.73
TF-IDF             6472.93    4922.29
KPMiner            79.51      74.81
WINGNUS            659.47     544.91
KEA                2534.71    2032.84

For the supervised approaches, using Longformer in a sequence tagging setup proved to be the most promising technique, as shown by the performance reported in Table 6. Treating keyphrase extraction as a sequence tagging problem also automatically learns the optimal number of keyphrases to be predicted and helps to overcome the challenges faced by other strategies that have to deal with a large number of candidates, as discussed above. The Longformer model on average predicted 6.25 and 6.08 keyphrases for the LDKP10k and LDKP3k test sets, respectively.

4. Conclusion

In this work, we identified the shortage of corpora comprising long documents for training and evaluating keyphrase extraction and generation models. We created two very large corpora - LDKP3K and LDKP10K - comprising ≈ 100K and ≈ 1.3M documents, and made them publicly available. The results of keyphrase extraction on long documents with some of the existing unsupervised and supervised models clearly depict the challenging nature of the problem. We hope this would encourage researchers to innovate and propose new models capable of identifying high quality keyphrases from long multi-page documents.

References

[1] R. Meng, D. Mahata, F. Boudin, From fundamentals to recent advances: A tutorial on keyphrasification, in: European Conference on Information Retrieval, Springer, 2022, pp. 582–588.
[2] M. Kulkarni, D. Mahata, R. Arora, R. Bhowmik, Learning rich representation of keyphrases from text, arXiv preprint arXiv:2112.08547 (2021).
[3] D. K. Sanyal, P. K. Bhowmick, P. P. Das, S. Chattopadhyay, T. Santosh, Enhancing access to scholarly publications with surrogate resources, Scientometrics 121 (2019) 1129–1164.
[4] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E. Frank, Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (1999) 81–104.
[5] I. Y. Song, R. B. Allen, Z. Obradovic, M. Song, Keyphrase extraction-based query expansion in digital libraries, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'06), IEEE, 2006, pp. 202–209.
[6] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. McCallum, SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications, arXiv preprint arXiv:1704.02853 (2017).
[7] W.-t. Yih, J. Goodman, V. R. Carvalho, Finding advertising keywords on web pages, in: Proceedings of the 15th International Conference on World Wide Web, 2006, pp. 213–222.
[8] V. Qazvinian, D. Radev, A. Özgür, Citation summarization through keyphrase extraction, in: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 2010, pp. 895–903.
[9] G. Berend, Opinion expression mining by exploiting keyphrase extraction (2011).
[10] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, Automatic keyphrase extraction from scientific articles, Language Resources and Evaluation 47 (2013) 723–742.
[11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, Y. Chi, Deep keyphrase generation, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 582–592. URL: https://www.aclweb.org/anthology/P17-1054. doi:10.18653/v1/P17-1054.
[12] A. Swaminathan, H. Zhang, D. Mahata, R. Gosangi, R. Shah, A. Stent, A preliminary exploration of GANs for keyphrase generation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8021–8030.
[13] C. Caragea, F. A. Bulgarov, A. Godea, S. D. Gollapalli, Citation-enhanced keyphrase extraction from research papers: A supervised approach, in: EMNLP, 2014.
[14] A. Hulth, Improved automatic keyword extraction given more linguistic knowledge, in: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, Association for Computational Linguistics, USA, 2003, pp. 216–223. URL: https://doi.org/10.3115/1119355.1119383. doi:10.3115/1119355.1119383.
[15] E. Çano, OAGKX keyword generation dataset, 2019. URL: http://hdl.handle.net/11234/1-3062, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[16] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: D. H.-L. Goh, T. H. Cao, I. T. Sølvberg, E. Rasmussen (Eds.), Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 317–326.
[17] M. Krapivin, A. Autayeu, M. Marchese, E. Blanzieri, N. Segata, Keyphrases extraction from scientific documents: Improving machine learning approaches with natural language processing, volume 6102, 2010, pp. 102–111. doi:10.1007/978-3-642-13654-2_12.
[18] E. Papagiannopoulou, G. Tsoumakas, A review of keyphrase extraction, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2020) e1339.
[19] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1705–1715.
[20] W. Magdy, K. Darwish, Book search: indexing the valuable parts, in: Proceedings of the 2008 ACM Workshop on Research Advances in Large Digital Book Repositories, 2008, pp. 53–56.
[21] A. Gupta, V. Dengre, H. A. Kheruwala, M. Shah, Comprehensive review of text-mining applications in finance, Financial Innovation 6 (2020) 1–25.
[22] R. Bhargava, S. Nigwekar, Y. Sharma, Catchphrase extraction from legal documents using LSTM networks, in: FIRE (Working Notes), 2017, pp. 72–73.
[23] T. D. Nguyen, M.-T. Luong, WINGNUS: Keyphrase extraction utilizing document logical structure, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 166–169.
[24] K. S. Hasan, V. Ng, Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1262–1273.
[25] E. Cano, O. Bojar, Keyphrase generation: A text summarization struggle, arXiv preprint arXiv:1904.00110 (2019).
[26] Y. Gallina, F. Boudin, B. Daille, Large-scale evaluation of keyphrase extraction models, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, pp. 271–278.
[27] C. G. Kontoulis, E. Papagiannopoulou, G. Tsoumakas, Keyphrase extraction from scientific articles via extractive summarization, in: Proceedings of the Second Workshop on Scholarly Document Processing, 2021, pp. 49–55.
[28] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. S. Weld, S2ORC: The Semantic Scholar Open Research Corpus, arXiv preprint arXiv:1911.02782 (2019).
[29] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, et al., SCROLLS: Standardized comparison over long language sequences, arXiv preprint arXiv:2201.03533 (2022).
[30] K. Garg, J. R. Chowdhury, C. Caragea, Keyphrase generation beyond the boundaries of title and abstract, arXiv preprint arXiv:2112.06776 (2021).
[31] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. Hsu, K. Wang, An overview of Microsoft Academic Service (MAS) and applications, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 243–246.
[32] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.
[33] D. Sahrawat, D. Mahata, H. Zhang, M. Kulkarni, A. Sharma, R. Gosangi, A. Stent, Y. Kumar, R. R. Shah, R. Zimmermann, Keyphrase extraction as sequence labeling using contextualized embeddings, Advances in Information Retrieval 12036 (2020) 328.
[34] S. R. El-Beltagy, A. Rafea, KP-Miner: A keyphrase extraction system for English and Arabic documents, Information Systems 34 (2009) 132–144.
[35] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, A. Jatowt, YAKE! Keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257–289.
[36] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411.
[37] C. Florescu, C. Caragea, PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1105–1115.
[38] X. Wan, J. Xiao, CollabRank: towards a collaborative approach to single-document keyphrase extraction, in: Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 2008, pp. 969–976.
[39] A. Bougouin, F. Boudin, B. Daille, TopicRank: Graph-based topic ranking for keyphrase extraction, in: International Joint Conference on Natural Language Processing (IJCNLP), 2013, pp. 543–551.
[40] F. Boudin, Unsupervised keyphrase extraction with multipartite graphs, arXiv preprint arXiv:1803.08721 (2018).
[41] S. Danesh, T. Sumner, J. H. Martin, SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction, in: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, 2015, pp. 117–126.
[42] F. Boudin, pke: an open source python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 69–73.
[43] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, KEA: Practical automated keyphrase extraction, in: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, IGI Global, 2005, pp. 129–152.
[44] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[46] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, CoRR abs/2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150. arXiv:2004.05150.
[47] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
[48] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=rkgNKkHtvB.