<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LDKP - A Dataset for Identifying Keyphrases from Long Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debanjan Mahata</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Navneet Agarwal</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dibya Gautam</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amardeep Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swapnil Parekh</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaman Kumar Singla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anish Acharya</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajiv Ratn Shah</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adobe Media and Data Science Research (MDSR)</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instabase</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MIDAS Labs</institution>
          ,
          <addr-line>IIIT-Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Moody's Analytics</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>New York University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Texas at Austin</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2022</year>
      </pub-date>
      <abstract>
<p>Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (≈ 8 sentences). This presents three challenges for real-world applications: i) human-written summaries are unavailable for most documents, ii) a vast majority of the documents are long, and iii) a high percentage of KPs are found only beyond the limited context of the title and the abstract. Therefore, we release two extensive corpora mapping KPs of ≈ 1.3M and ≈ 100K scientific articles to their fully extracted text and additional metadata, including publication venue, year, author, field of study, and citations, to facilitate research on this real-world problem. Additionally, we benchmark and report the performance of different unsupervised as well as supervised algorithms for keyphrase extraction on long scientific documents. Our experiments show that formulating keyphrase extraction as a sequence tagging task with modern transformer language models capable of processing long text sequences, such as Longformer, has advantages over the traditional algorithms, not only resulting in better performance in terms of F1 metrics but also in learning to extract the optimal number of keyphrases from the input documents.</p>
      </abstract>
      <kwd-group>
<kwd>keyphrase extraction</kwd>
        <kwd>keyphrase generation</kwd>
        <kwd>keyphrasification</kwd>
        <kwd>automatic identification of keyphrases</kwd>
        <kwd>long documents</kwd>
        <kwd>longformer</kwd>
        <kwd>language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
<p>Identifying keyphrases (KPs) is a form of extreme summarization: given an input document, the task is to find a set of representative phrases that can effectively summarize it [1]. Over the last decade, we have seen an exponential increase in the velocity at which unstructured text is produced on the web, with the vast majority of it untagged or poorly tagged. KPs provide an effective way to search, summarize, tag, and manage these documents. Identifying KPs has proved to be useful as a preprocessing, pre-training [2], or supplementary task in other tasks such as search [3, 4, 5], recommender systems [6], advertising [7], summarization [8], and opinion mining [9], to name a few. This has motivated researchers to explore machine learning algorithms for automatically mapping documents to a set of keyphrases, commonly referred to as the keyphrase extraction (KPE) task [10, 6] for extractive approaches and the keyphrase generation (KPG) task [11, 12] for generative approaches; recently, this has also been referred to as keyphrasification [1].</p>
      <p>Various algorithms have been proposed over time to solve the problem of identifying keyphrases from text documents; they can primarily be categorized into supervised and unsupervised approaches [18]. The majority of these approaches take an abstract (a summary) of a text document as input and produce keyphrases as output. However, in industrial applications across different domains such as advertising [19], search and indexing [20], finance [21], law [22], and many other real-world use cases, document summaries are not readily available. Moreover, most of the documents encountered in these applications are longer than 8 sentences (the average length of abstracts in KP datasets; see Table 1). We also find that a significant percentage of keyphrases (&gt;18%) are found only beyond the limited context of a document's title and abstract/summary. These constraints limit the potential of currently developed KPE and KPG algorithms to theoretical pursuits.</p>
      <p>DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. * Debanjan Mahata participated in this work as an Adjunct Faculty at IIIT-Delhi. † These authors contributed equally. $ debanjanmahata85@gmail.com (D. Mahata). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
<p>[Table 1: Comparison of keyphrase datasets. SemEval 2017 [6], KDD [13], Inspec [14], KP20K [11], and OAGKx [15] provide only titles and abstracts, whereas NUS [16], SemEval 2010 [10], Krapivin [17], LDKP3K (S2ORC ← KP20K, ≈ 100K documents), and LDKP10K (S2ORC ← OAGKx, ≈ 1.3M documents) contain long documents.]</p>
      <sec id="sec-1-3">
<title>Long Documents</title>
        <p>[Table 1, continued: average document lengths of 280.67 and 194.76 sentences (6027.10 and 4384.58 words) and present-keyphrase shares of 76.11% and 63.65% (absent: 23.89% and 36.35%) for LDKP3K and LDKP10K, respectively.]</p>
<p>Many previous studies have pointed out the constraints imposed on KPE algorithms by the short inputs and artificial nature of the available datasets [23, 24, 25, 26, 27]. In particular, Cano and Bojar [25], while explaining the limitations of their proposed algorithms, note that the title and the abstract may not carry sufficient topical information about the article, even when joined together. While most datasets in the domain of KPE consist of titles and abstracts [15], there have been some attempts at providing long-document KP datasets as well (Table 1).</p>
<p>Krapivin et al. [17] released 2,000 full-length scientific papers from the computer science domain. Kim et al. [10], in a SemEval-2010 challenge, released a dataset containing 244 full scientific articles along with their author- and reader-assigned keyphrases. Nguyen and Kan [16] released 211 full-length scientific documents with multiple annotated keyphrases. All of these datasets were released more than a decade ago and were more suitable for the machine-learning models available back then. With today's deep learning paradigms, such as un/semi-supervised learning, requiring Wikipedia-sized corpora (&gt;6M articles), it becomes imperative to update the KPE and KPG tasks with similarly sized corpora.</p>
<p>In this work, we develop two large datasets (LDKP: Long Document Keyphrase), comprising 100K and 1.3M documents, for identifying keyphrases from full-length scientific articles, along with metadata such as venue, year of publication, author information, inbound and outbound citations, and citation contexts, among others. We achieve this by mapping the existing KP20K [11] and OAGKx [15] corpora to the documents available in the S2ORC dataset [28]. We make the datasets publicly available on the Huggingface hub (Section 2.2) and also integrate the processing of these datasets with the datasets¹ and transformerkp² libraries. We hope that researchers working in this area will acknowledge the shortcomings of the popularly used datasets and methods in KPE and KPG and devise exciting new approaches for overcoming the challenges related to identifying keyphrases from long documents and contexts beyond summaries; this would make the models more useful in practical real-world settings. We think that LDKP can also complement recent efforts towards creating suitable benchmarks [29] for evaluating methods being developed to understand and process long text sequences.</p>
      <p>2. Dataset</p>
      <p>We propose two datasets resulting from mapping S2ORC to the KP20K and OAGKx corpora, respectively. Lo et al. [28] publicly released S2ORC as a huge corpus of 8.1M scientific documents. While it has full text and metadata (see Table 2), the corpus does not contain keyphrases. We took this as an opportunity to create a new corpus for identifying keyphrases from full-length scientific articles: we took the KP20K and OAGKx corpora, for which keyphrases were already available, and mapped them to their corresponding documents in S2ORC.</p>
      <p>This is the first time in the keyphrase community that such a large number of full-length documents with comprehensive metadata have been made publicly available for academic use. Here, we want to acknowledge a concurrent work [30] that looks at the task of keyphrase generation from a newly constructed corpus of long documents, FULLTEXTKP; however, they do not make their corpus publicly available, and it is significantly smaller than ours, containing only ≈ 142K documents.</p>
        <sec id="sec-1-3-1">
<title>2.1. Dataset Preparation</title>
<p>¹ https://github.com/huggingface/datasets. ² transformerkp is a transformer-based deep learning library for training and evaluating keyphrase extraction and generation algorithms: https://github.com/Deep-Learning-for-Keyphrase/transformerkp</p>
          <p>[Table 2: Information available in the metadata of each scientific paper in the LDKP corpus.]</p>
          <p>In the absence of any unique identifier shared across the datasets, we used the paper title to map documents in S2ORC to KP20K/OAGKx. This had its own set of challenges. For example, some papers in KP20K and OAGKx had unigram titles like "Editorial" or "Preface", and multiple papers can share the same title. We therefore ignored all papers with unigram and bigram titles and resolved the remaining title conflicts through manual verification. We also found that some of the keyphrases in the OAGKx and KP20K datasets were parsed incorrectly. Keyphrases that contain delimiters such as the comma (which is also used as the separator for the keyphrase list) had been broken down into two or more keyphrases; e.g., the keyphrase '2,4-dichlorophenoxyacetic acid' became ['2', '4-dichlorophenoxyacetic acid']. In some cases the publication year, page numbers, or DOI (e.g., 1999:14:555-558) were inaccurately added to the list of keyphrases. To solve this, we filtered out all keyphrases that did not contain any alphabetical characters.</p>
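<p>The filtering step described above amounts to a one-line check; the sketch below illustrates the stated rule (keep only keyphrases containing at least one alphabetical character) and is not the authors' exact cleaning script.</p>
<preformat>
```python
import re

def clean_keyphrases(raw_list):
    """Drop mis-parsed 'keyphrases' with no alphabetic character,
    e.g. stray years, page ranges, or numeric fragments produced by
    splitting on the comma delimiter."""
    return [kp.strip() for kp in raw_list if re.search(r"[a-zA-Z]", kp)]
```
</preformat>
<p>For example, the mis-parsed list ['2', '4- dichlorophenoxyacetic acid', '1999:14:555-558'] is reduced to the single genuine keyphrase.</p>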
<p>We release two datasets, LDKP3K and LDKP10K, corresponding to KP20K and OAGKx, respectively. The first corpus consists of ≈ 100K long documents with keyphrases, obtained by mapping KP20K to S2ORC. The KP20K corpus mainly contains the title, abstract, and keyphrases of computer science research articles from online digital libraries such as the ACM Digital Library, ScienceDirect, and Wiley. Using S2ORC documents, we increase the average length of the documents in KP20K from 7.42 sentences to 280.67 sentences. This also increased the percentage of present keyphrases in the input text by 18.7%.</p>
          <p>Next, in order to facilitate the usage of particular sections in KPE algorithms, we standardized the section names across all the papers. The section names varied across different papers in the S2ORC dataset. For example, some papers have a section named "Introduction" while others have "1.Introduction", "I. Introduction", "I Introduction", etc. To deal with this, we replaced these variant section names with a common generic name, such as "introduction", across all the papers. We did this for the common sections: introduction, related work, conclusion, methodology, and results and analysis.</p>
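<p>The section-name standardization can be sketched as stripping leading arabic or roman numbering and lower-casing; the regular expression below is a hypothetical approximation of the rule described, not the published implementation.</p>
<preformat>
```python
import re

# Strip leading numbering such as "1.", "I.", or a bare roman numeral
# followed by whitespace, then lower-case the remainder.
_NUMBERING = re.compile(r"^(?:\d+[.)]?|[IVX]+[.)])\s*|^[IVX]+\s+")

def normalize_section_name(raw):
    return _NUMBERING.sub("", raw).strip().lower()
```
</preformat>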
<p>The second corpus, corresponding to OAGKx, consists of 1.3M full scientific articles from various domains with their corresponding keyphrases, collected from academic graphs [31, 32]. The resulting corpus contains 194.7 sentences per document on average (up from 8.87 sentences), with a 10.95% increase in present keyphrases. The increase in the percentage of present keyphrases in both corpora when expanded to full-length articles clearly indicates that a significant share of the keyphrases occurs beyond the abstract. Since both datasets consist of a large number of documents, we present three versions of each dataset, with the training data split into small, medium, and large sizes, as given in Table 3. This was done in order to give researchers and practitioners with scarce computing resources an opportunity to evaluate their models.</p>
          <p>In order to make the dataset useful for training a sequence tagging model, we also provide token-level tags in B-I-O format, as previously done in [33]. We marked all the words in the document belonging to keyphrases as 'B' or 'I', depending on whether a word is the first word of the keyphrase or not; every other word was tagged as 'O'. The ground-truth keyphrases associated with a document were identified by searching for the same string pattern in the document's text. The text is tokenized using a whitespace tokenizer, and a mapping between each token and its corresponding tag is provided, as shown in Figure 1.</p>
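<p>The B-I-O tagging described above amounts to whitespace tokenization followed by exact string matching of each keyphrase; a minimal sketch (the function name and the case-insensitive matching choice are illustrative).</p>
<preformat>
```python
def bio_tags(document, keyphrases):
    """Whitespace-tokenize the document and tag each token 'B', 'I',
    or 'O' for every exact occurrence of a ground-truth keyphrase."""
    tokens = document.split()
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for kp in keyphrases:
        kp_toks = kp.lower().split()
        n = len(kp_toks)
        for i in range(len(tokens) - n + 1):
            # only tag spans that are still untagged, so matches never overlap
            if lowered[i:i + n] == kp_toks and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return list(zip(tokens, tags))
```
</preformat>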
<p>[Table 3: Sizes of the small, medium, and large training splits, along with the test and validation splits, for LDKP3K and LDKP10K.]</p>
<p>The proposed datasets LDKP3k and LDKP10k are further divided into train, test, and validation splits, as shown in Table 3. For LDKP3k, these splits are based on the original KP20K dataset. For LDKP10k, we resorted to random sampling to create the splits, since OAGKx, the keyphrase dataset corresponding to LDKP10k, was not originally divided into train, test, and validation splits. Figures 2 and 3 show the distribution of papers in terms of field of study across all the splits of the LDKP3k and LDKP10k datasets, respectively.</p>
          <p>2.2. Dataset Usage</p>
          <p>We make all the datasets publicly available on the Huggingface hub and enable programmatic access to the data using the datasets library. For example, Figure 4 shows sample code for downloading the LDKP3K dataset with the 'small' training data split. The other configurations, 'medium' and 'large', can be downloaded similarly; each has a different amount of training data but the same validation and test sets. Figure 4 also shows how each split of the dataset can be accessed.</p>
<p>Please refer to the Huggingface hub pages for LDKP3k and LDKP10k for detailed information about downloading and using the datasets.</p>
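<p>Since Figure 4 is not reproduced here, the loading step can be sketched with the datasets library as follows; the repository and configuration names are those listed on the hub pages below, and the import is deferred so that the library is only required when the function is actually called.</p>
<preformat>
```python
def load_ldkp3k(config="small"):
    """Download midas/ldkp3k; 'small', 'medium', and 'large' select the
    training-split size, while validation and test stay the same."""
    from datasets import load_dataset  # requires `pip install datasets`
    ds = load_dataset("midas/ldkp3k", config)
    return ds["train"], ds["validation"], ds["test"]
```
</preformat>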
        </sec>
        <sec id="sec-1-3-2">
<title>1. LDKP3K</title>
          <p>https://huggingface.co/datasets/midas/ldkp3k</p>
        </sec>
        <sec id="sec-1-3-3">
<title>2. LDKP10K</title>
          <p>https://huggingface.co/datasets/midas/ldkp10k</p>
<p>We also enable access to the datasets through the transformerkp library, which abstracts away the preprocessing steps and makes the data splits readily available for the tasks of keyphrase extraction using sequence tagging and keyphrase generation using seq2seq methods, with different transformer-based language models. Details on downloading and using the datasets with transformerkp for keyphrase extraction and generation can be found at https://deep-learning-for-keyphrase.github.io/transformerkp/how-to-guides/keyphrase-data/</p>
          <p>[Table 4 row labels: PositionRank, TextRank, TopicRank, SingleRank, MultipartiteRank, TopicalPageRank, SGRank.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
<p>In this section, we evaluate several popular keyphrase extraction algorithms on the proposed LDKP3K and LDKP10K datasets, along with three of the existing smaller datasets in the scientific domain that comprise full-length documents: Krapivin, SemEval-2010, and NUS. A majority of previous works have reported scores for Krapivin, SemEval-2010, and NUS by considering only the title and abstract as the input. We further report benchmark results and discuss the comparative advantages of the different algorithms to provide future research directions.</p>
      <p>3.1. Unsupervised Methods</p>
      <p>There are multiple unsupervised methods for extracting keyphrases from a document. We used the following popular statistical models: TfIdf, KPMiner [34], and YAKE [35], and the following graph-based algorithms: TextRank [36], PositionRank [37], SingleRank [38], TopicRank [39], MultipartiteRank [40], and SGRank [41]. All implementations were taken from the PKE toolkit [42], except SGRank, for which we used the implementation available in the textacy³ library. These algorithms first identify candidate keyphrases using lexical rules and then rank the candidates using either a statistical or a graph-based approach [1]. We report the performance scores of these methods on the test datasets (Table 4).</p>
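<p>A single run of one of these models can be sketched with PKE's standard pipeline (candidate selection, candidate weighting, top-n ranking); TopicRank is used here purely for illustration, and the import is deferred because PKE also requires a spaCy English model to be installed.</p>
<preformat>
```python
def extract_topicrank(text, n=10):
    """Unsupervised extraction with PKE's TopicRank, following the
    toolkit's load -> select -> weight -> rank pipeline."""
    import pke  # requires `pip install pke` plus a spaCy English model
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language="en")
    extractor.candidate_selection()   # lexical rules propose candidate phrases
    extractor.candidate_weighting()   # graph-based ranking of the candidates
    return extractor.get_n_best(n=n)  # list of (keyphrase, score) pairs
```
</preformat>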
      <sec id="sec-2-1">
<title>3.2. Supervised Methods</title>
        <p>³ https://github.com/chartbeat-labs/textacy</p>
        <p>For supervised keyphrase extraction, we report results for two traditional models, KEA [43] and WINGNUS [23], which treat keyphrase extraction as a binary classification task. A more recent trend is to treat keyphrase extraction as a sequence tagging task [33, 2, 1]. Transformer-based language models like BERT [44], RoBERTa [45], and KBIR [2] have already been shown to achieve state-of-the-art results on keyphrase extraction when only the title and abstract are taken as input. However, all these models are limited to processing 512 sub-word tokens. This led us to try Longformer [46], which can handle long sequences of up to 4,096 sub-word tokens. We acknowledge that several other recent models, such as [47, 48], could also have been tried; we intend to explore them in future work and to train larger models on the large LDKP corpus.</p>
        <p>3.3. Evaluation Metrics</p>
        <p>We used F1@5 and F1@10 as our evaluation metrics [10]. Equations (1), (2), and (3) show how F1@k is calculated. Before evaluating, we lower-cased, stemmed, and removed punctuation from both the ground-truth and the predicted keyphrases, and used exact matching. Let Y denote the set of ground-truth keyphrases and Ŷ = (ŷ1, ŷ2, ..., ŷm) denote the predicted keyphrases, ordered by their quality of prediction, with Ŷk denoting the top k elements of Ŷ. Then the metrics are defined as follows:</p>
        <p>P@k = |Y ∩ Ŷk| / min{|Ŷ|, k}   (1)</p>
        <p>R@k = |Y ∩ Ŷk| / |Y|   (2)</p>
        <p>F1@k = (2 · P@k · R@k) / (P@k + R@k)   (3)</p>
<p>3.4. Results</p>
<p>[Tables 4 and 5: F1@5 and F1@10 of the evaluated algorithms (SGRank, TopicRank, PositionRank, TopicalPageRank, SingleRank, TextRank, MultipartiteRank, YAKE, TfIdf, KPMiner, WINGNUS, and KEA) on the test datasets.]</p>
        <p>The other algorithms might benefit from revisiting their pipelines, making the necessary changes for processing long documents, and tuning their heuristics to generate better-quality candidates for the subsequent ranking stage that identifies the keyphrases.</p>
<p>Among the supervised approaches, using Longformer in a sequence tagging setup proved to be the most promising technique, as shown by the performance reported in Table 6. Treating keyphrase extraction as a sequence tagging problem also lets the model automatically learn the optimal number of keyphrases to predict, which helps overcome the challenge, faced by the other strategies discussed above, of dealing with a large number of candidates. The Longformer model on average predicted 6.25 and 6.08 keyphrases for the LDKP10k and LDKP3k test sets, respectively.</p>
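<p>A Longformer sequence tagging setup of this kind can be sketched with the transformers library; the checkpoint name and three-label B/I/O head below are illustrative assumptions, not the authors' exact training configuration, and the import is deferred so the sketch only needs the library when called.</p>
<preformat>
```python
def build_bio_tagger(model_name="allenai/longformer-base-4096"):
    """Hypothetical sketch: a Longformer token-classification head with
    three labels (B, I, O) via the transformers AutoModel API."""
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=3,  # B, I, O
        id2label={0: "B", 1: "I", 2: "O"},
        label2id={"B": 0, "I": 1, "O": 2},
    )
    return tokenizer, model
```
</preformat>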
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
<p>[Table 6: Results of the supervised models KEA, WINGNUS, and longformer-base-4096.]</p>
<p>In this work, we identified the shortage of corpora comprising long documents for training and evaluating keyphrase extraction and generation models. We created two very large corpora, LDKP3K and LDKP10K, comprising ≈ 100K and ≈ 1.3M documents, and made them publicly available. The results of keyphrase extraction on long documents with some of the existing unsupervised and supervised models clearly depict the challenging nature of the problem. We hope this will encourage researchers to innovate and propose new models capable of identifying high-quality keyphrases from long multi-page documents.</p>
      <p>[Table 7: Average number of candidate keyphrases generated by the supervised and unsupervised algorithms on the LDKP3K and LDKP10K datasets.]</p>
<p>Unsupervised algorithms did not show better performance than their supervised counterparts on long documents, as shown in Tables 4, 5, and 6. Among the unsupervised approaches, SGRank and KPMiner outperformed every other algorithm in the graph-based and statistical categories, respectively. One possible reason for the low performance of the other unsupervised techniques could be that, during the candidate generation and ranking phases, these models had to deal with more noise than they were tuned for. Table 7 shows the number of candidates generated by the strategy used by each of these algorithms. We can observe that most of the techniques generated a huge number of candidate keyphrases, which might have made the downstream ranking process challenging. On the other hand, both SGRank and KPMiner had strategies that significantly reduced the number of generated candidates and came up with a better set of candidate keyphrases.</p>
      <p>References</p>
      <p>[1] R. Meng, D. Mahata, F. Boudin, From fundamentals to recent advances: A tutorial on keyphrasification, in: European Conference on Information Retrieval, Springer, 2022, pp. 582–588.</p>
      <p>[2] M. Kulkarni, D. Mahata, R. Arora, R. Bhowmik, Learning rich representation of keyphrases from text, arXiv preprint arXiv:2112.08547 (2021).</p>
      <p>[3] D. K. Sanyal, P. K. Bhowmick, P. P. Das, S. Chattopadhyay, T. Santosh, Enhancing access to scholarly publications with surrogate resources, Scientometrics 121 (2019) 1129–1164.</p>
      <p>[4] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E. Frank, Improving browsing in digital libraries with keyphrase indexes, Decision Support Systems 27 (1999) 81–104.</p>
      <p>[5] I. Y. Song, R. B. Allen, Z. Obradovic, M. Song, Keyphrase extraction-based query expansion in digital libraries, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'06), IEEE, 2006, pp. 202–209.</p>
      <p>[6] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. McCallum, SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications, arXiv preprint arXiv:1704.02853 (2017).</p>
      <p>[7] W.-t. Yih, J. Goodman, V. R. Carvalho, Finding advertising keywords on web pages, in: Proceedings of the 15th International Conference on World Wide Web, 2006, pp. 213–222.</p>
      <p>[8] V. Qazvinian, D. Radev, A. Özgür, Citation summarization through keyphrase extraction, in: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 2010, pp. 895–903.</p>
      <p>[9] G. Berend, Opinion expression mining by exploiting keyphrase extraction (2011).</p>
      <p>[10] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, Automatic keyphrase extraction from scientific articles, Language Resources and Evaluation 47 (2013) 723–742.</p>
      <p>[11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, Y. Chi, Deep keyphrase generation, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 582–592. URL: https://www.aclweb.org/anthology/P17-1054. doi:10.18653/v1/P17-1054.</p>
      <p>[12] A. Swaminathan, H. Zhang, D. Mahata, R. Gosangi, R. Shah, A. Stent, A preliminary exploration of GANs for keyphrase generation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8021–8030.</p>
      <p>[13] C. Caragea, F. A. Bulgarov, A. Godea, S. D. Gollapalli, Citation-enhanced keyphrase extraction from research papers: A supervised approach, in: EMNLP, 2014.</p>
      <p>[14] A. Hulth, Improved automatic keyword extraction given more linguistic knowledge, in: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, Association for Computational Linguistics, USA, 2003, pp. 216–223. URL: https://doi.org/10.3115/1119355.1119383. doi:10.3115/1119355.1119383.</p>
      <p>[15] E. Çano, OAGKX keyword generation dataset, 2019. URL: http://hdl.handle.net/11234/1-3062, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.</p>
      <p>[16] T. D. Nguyen, M.-Y. Kan, Keyphrase extraction in scientific publications, in: D. H.-L. Goh, T. H. Cao, I. T. Sølvberg, E. Rasmussen (Eds.), Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 317–326.</p>
      <p>[17] M. Krapivin, A. Autayeu, M. Marchese, E. Blanzieri, N. Segata, Keyphrases extraction from scientific documents: Improving machine learning approaches with natural language processing, volume 6102, 2010, pp. 102–111. doi:10.1007/978-3-642-13654-2_12.</p>
      <p>[18] E. Papagiannopoulou, G. Tsoumakas, A review of keyphrase extraction, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (2020) e1339.</p>
      <p>[19] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1705–1715.</p>
      <p>[20] W. Magdy, K. Darwish, Book search: indexing the valuable parts, in: Proceedings of the 2008 ACM Workshop on Research Advances in Large Digital Book Repositories, 2008, pp. 53–56.</p>
      <p>[21] A. Gupta, V. Dengre, H. A. Kheruwala, M. Shah, Comprehensive review of text-mining applications in finance, Financial Innovation 6 (2020) 1–25.</p>
      <p>[22] R. Bhargava, S. Nigwekar, Y. Sharma, Catchphrase extraction from legal documents using LSTM networks, in: FIRE (Working Notes), 2017, pp. 72–73.</p>
      <p>[23] T. D. Nguyen, M.-T. Luong, WINGNUS: Keyphrase extraction utilizing document logical structure, in: Proceedings of the 5th International Workshop on Semantic Evaluation, 2010, pp. 166–169.</p>
      <p>[24] K. S. Hasan, V. Ng, Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1262–1273.</p>
      <p>[25] E. Cano, O. Bojar, Keyphrase generation: A text summarization struggle, arXiv preprint arXiv:1904.00110 (2019).</p>
      <p>[26] Y. Gallina, F. Boudin, B. Daille, Large-scale evaluation of keyphrase extraction models, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, 2020, pp. 271–278.</p>
      <p>[27] C. G. Kontoulis, E. Papagiannopoulou, G. Tsoumakas, Keyphrase extraction from scientific articles via extractive summarization, in: Proceedings of the Second Workshop on Scholarly Document Processing, 2021, pp. 49–55.</p>
      <p>[28] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. S. Weld, S2ORC: The Semantic Scholar open research corpus, arXiv preprint arXiv:1911.02782 (2019).</p>
      <p>[29] U. Shaham, E. Segal, M. Ivgi, A. Efrat, O. Yoran, A. Haviv, A. Gupta, W. Xiong, M. Geva, J. Berant, et al., SCROLLS: Standardized comparison over long language sequences, arXiv preprint arXiv:2201.03533 (2022).</p>
      <p>[30] K. Garg, J. R. Chowdhury, C. Caragea, Keyphrase generation beyond the boundaries of title and abstract, arXiv preprint arXiv:2112.06776 (2021).</p>
      <p>[31] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. Hsu, K. Wang, An overview of Microsoft Academic Service (MAS) and applications, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 243–246.</p>
      <p>[32] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.</p>
      <p>[33] D. Sahrawat, D. Mahata, H. Zhang, M. Kulkarni, A. Sharma, R. Gosangi, A. Stent, Y. Kumar, R. R. Shah, R. Zimmermann, Keyphrase extraction as sequence labeling using contextualized embeddings, Advances in Information Retrieval 12036 (2020) 328.</p>
      <p>[34] S. R. El-Beltagy, A. Rafea, KP-Miner: A keyphrase extraction system for English and Arabic documents, Information Systems 34 (2009) 132–144.</p>
      <p>[35] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, A. Jatowt, YAKE! Keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257–289.</p>
      <p>[36] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411.</p>
      <p>[37] C. Florescu, C. Caragea, PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1105–1115.</p>
      <p>[38] X. Wan, J. Xiao, CollabRank: towards a collaborative approach to single-document keyphrase extraction, in: Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), 2008, pp. 969–976.</p>
      <p>[39] A. Bougouin, F. Boudin, B. Daille, TopicRank: Graph-based topic ranking for keyphrase extraction, in: International Joint Conference on Natural Language Processing (IJCNLP), 2013, pp. 543–551.</p>
      <p>[40] F. Boudin, Unsupervised keyphrase extraction with multipartite graphs, arXiv preprint arXiv:1803.08721 (2018).</p>
      <p>[41] S. Danesh, T. Sumner, J. H. Martin, SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction, in: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, 2015, pp. 117–126.</p>
      <p>[42] F. Boudin, pke: an open source Python-based keyphrase extraction toolkit, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 2016, pp. 69–73.</p>
      <p>[43] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, KEA: Practical automated keyphrase extraction, in: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, IGI Global, 2005, pp. 129–152.</p>
      <p>[44] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[46] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, CoRR abs/2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150. arXiv:2004.05150.</p>
      <p>[47] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.</p>
      <p>[48] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=rkgNKkHtvB.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>