      Efficient Keyphrase Generation with GANs

Giuseppe Lancioni[0000−0001−6211−9195] , Saida S. Mohamed[0000−0002−2552−3356] ,
Beatrice Portelli[0000−0001−8887−616X] , Giuseppe Serra[0000−0002−4269−4501] , and
                        Carlo Tasso[0000−0001−5162−185X]

                     Università degli Studi di Udine, Udine, Italy
                                 http://ailab.uniud.it/
                  {lancioni.giuseppe,mahmoud.saidasaadmohamed,
                        portelli.beatrice}@spes.uniud.it
                     {giuseppe.serra,carlo.tasso}@uniud.it



        Abstract. Keyphrase Generation is the task of predicting keyphrases:
        short text sequences that convey the main semantic meaning of a doc-
        ument. In this paper, we introduce a keyphrase generation approach
        that makes use of a Generative Adversarial Networks (GANs) architec-
        ture. In our system, the Generator produces a sequence of keyphrases
        for an input document. The Discriminator, in turn, tries to distinguish
        between machine generated and human curated keyphrases. We propose
        a novel Discriminator architecture based on a BERT pretrained model
        fine-tuned for Sequence Classification. We train our proposed architec-
        ture using only a small subset of the standard available training dataset,
        amounting to less than 1% of the total, achieving a great level of data effi-
        ciency. The resulting model is evaluated on five public datasets, obtaining
        competitive and promising results with respect to four state-of-the-art
        generative models.

        Keywords: Keyphrase Generation · GAN · Reinforcement Learning.


1     Introduction

A keyphrase is a sequence of words that summarizes the content of a whole
document and expresses its core concepts. High quality keyphrases (KPs) can
facilitate the understanding of a document and they are used to provide and
retrieve information about the whole document at a high level. The worldwide
growth of digital libraries has made the task of automatic KP prediction
both useful and necessary. There are many areas in which this ability can be
applied, such as text summarization [34], opinion mining [1], document
clustering [9], information retrieval [12] and text categorization [11].
    KPs can be either present or absent. Present KPs are exact substrings of the
document and can be extracted from its text, while absent KPs are sequences of
words which do not exist in the text, but can be abstracted from its contents.
They are also referred to as extractive and abstractive KPs, respectively. The
research community has made great efforts in the task of predicting KPs so
far, and the proposed solutions all rely on the following two main approaches:
1. extraction of sequences of words from the document (automatic extractive
methods); 2. generation of words and phrases related to the document (automatic
abstractive methods).


   Extractive approaches are only able to deal with present KPs [29, 18, 33]: their
greatest drawback is that the predicted KPs are tied to the literal wording of the
source document, rather than to its semantic meaning.


    The other main approach is the abstractive one, which has been introduced
to address the limitations of the extractive approaches and to better mimic the
way humans assign KPs to a given document. Although it is a more recent
line of research, a good number of studies tackle this problem [19,
3, 4]. Abstractive methods are designed to produce sets of KPs which are not
strictly tied to the words of the source text. In principle, this approach can be
used to predict both absent and present KPs from a given source text. Generative
models are best suited for abstractive approaches, and many examples can
be found in the literature in which such models are used, mainly leveraging
the Encoder-Decoder framework [19, 3]. This architecture works by compressing
the contents of the input (e.g. the text document) into a hidden representation
using an Encoder module. The same representation is then decompressed using
the Decoder module, which returns the desired output (e.g. a sequence of KPs).
Recently, Generative Adversarial Networks (GANs) have been introduced in the text
generation task [31], and in particular in keyphrase generation [24]. GANs are
based on an architecture that simultaneously trains two models: a generative
model that captures the data distribution, and a discriminative model that es-
timates the probability that a sample came from the (real) training data rather
than from the generator. The aforementioned approaches rely on large
datasets for training, with a high consumption of computational
resources.


    In this paper we introduce a new GAN architecture for keyphrase generation
with a focus on data efficiency: the aim is to only use a small subset of the
training data and still achieve reasonably good results. The main contribution of
our approach is the introduction of a novel Discriminator model that is able to
distinguish between human-curated and machine-generated keyphrases by
leveraging the powerful language model of BERT. A Reinforcement Learning
strategy is used in our architecture to overcome the problems arising from the
direct application of GANs to text generation. Compared to previous approaches,
our architecture achieves competitive results using less than 1% of the available
training samples.
2     Related Work
2.1   Automatic Keyphrase Extraction
Many different extractive approaches have been proposed in the literature, but
most of them consist of two steps. First, a number of KP candidates is
extracted; this number usually exceeds the number of correct KPs and is
selected using heuristic methods. Second, a ranking algorithm is used to
score each candidate based on the source text. The whole process can be
performed in either a supervised or an unsupervised fashion. Supervised
methods treat this task as a binary
classification task [26, 21], assigning positive labels to the correct candidates in
the list. Unsupervised methods aim to find central nodes of the text graph [20],
or detect phrases from topical clusters [16].
    There are also other studies that differ from the previously described pipeline.
For example, the authors of [33] applied an alignment model to learn the conversion
from the source text to the target KPs. Recurrent neural networks have also
been used to build sequence labeling models that extract KPs from tweets [33].

2.2   Automatic Keyphrase Generation
Abstractive methods represent an important approach that is gaining growing
attention, as they allow generating results that are more in line with human
expectations. Sequence-to-sequence (Seq2seq) models have shown great success
in keyphrase generation and can generate human-like results. They are based
on the Encoder-Decoder framework, where the Encoder generates a semantic
representation of the source text and the Decoder is responsible for generat-
ing target texts based on that semantic representation. CopyRNN [19] was the
first Encoder-Decoder model specifically applied to keyphrase generation;
it incorporates an attention mechanism. CorrRNN [3] was introduced
later and focused on capturing the correlation between KPs. TG-Net [5] exploits
the information given by the title to learn a better representation for the input
documents. Chen et al. [4] leveraged extractive models to improve the perfor-
mance of the (abstractive) keyphrase generation one. Ye et al. [30] proposed a
semi-supervised approach considering a limited training dataset to improve the
performance. All the previous approaches used the beam search algorithm to
generate a large number of KPs from which to choose the k-best ones as final
predictions. CatSeq and CatSeqD [32] were the first two recurrent generative
models able to predict the appropriate number of KPs for each document
(instead of predicting a fixed number of KPs for every sample). These models
introduced several novelties: first, an orthogonal regularization module that
prevents the model from predicting the same word again after generating the KP
separator token; second, semantic coverage, a self-supervised technique aimed
at enhancing the semantic content of the predictions.
    Reinforcement Learning has been used in a wide range of text generation
tasks [28, 22]. The generative models CatSeq, CatSeqD, CorrRNN and TG-Net
have been enhanced by applying a Reinforcement Learning (RL) approach
with adaptive rewards, producing the improved versions catSeq-2RF1, catSeqD-2RF1,
catSeqCorr-2RF1 and catSeqTG-2RF1 [2]. In [24], the authors propose a
keyphrase generation approach using Generative Adversarial Networks (GAN)
conditioned on scientific documents. The architecture is composed of a CatSeq
model as Generator and a hierarchical attention-based model as Discriminator.
This was the first attempt to apply GANs to the keyphrase generation task. The
approach showed improvements in the generation of abstractive KPs,
but no significant improvements for extractive KPs.


3     The proposed approach
The novelty of the approach presented in this paper is two-fold: first, we in-
troduce a BERT Discriminator as part of a Generative Adversarial Networks
(GAN) architecture for the keyphrase generation task; and second, we train our
system with only a small amount of the available data to pursue data efficiency.
    A general overview of the implemented system is given in Figure 1. It is
based on two main components: a state-of-the-art Generator that relies on the
Encoder-Decoder model and is able to generate a list of KPs for a given input
text, and the new BERT-based Discriminator that is trained to separate the true
KPs from the fake ones by giving them a score: the higher the score, the more
likely the keyphrase list is to be real.
    To overcome the well-known problems of differentiability that arise when
employing GAN architectures for text generation, the Reinforcement Learning (RL)
paradigm is adopted for training the system [31].

3.1   Formal Problem Definition
A source document x and the related list of M ground-truth keyphrases y =
(y_1, y_2, . . . , y_M) (True KPs) are represented by the pair (x, y). Both x and y_i
are sequences of words:

                                    x  = x_1, x_2, . . . , x_L
                                    y_i = y_1^i, y_2^i, . . . , y_{K_i}^i

where L and K_i are the number of words of x and of its i-th KP, respectively.
     A keyphrase generation model predicts a set of keyphrases ŷ =
(ŷ_1, ŷ_2, . . . , ŷ_N) (Fake KPs) with the aim of reproducing the true ones, so that
ŷ ≡ y.

3.2   Details of the System
Generator The task of the Generator G is to take a source document x and
generate a sequence of predicted KPs ŷ. For our system we chose catSeq [32],
which is based on CopyRNN [19], a generative model optimized for KP generation.
It introduces the ability to predict a single sequence of KPs, obtained by
concatenating the target KPs separated by a special token. In this way the
training schema moves from one-to-many to one-to-seq, and the system can be
trained to generate a variable number of KPs. It also employs the Copy
Mechanism [8] to deal with long-tail words, i.e. the less frequent words in the
vocabulary of the input samples. These words are removed from the vocabulary
to gain efficiency during training, but since they are often very specific to the
topic of the document, they may be part of KPs. The Copy Mechanism employs a
positional attention to score the words surrounding the removed ones, recovering
the best-scoring ones. The implementation relies on a bidirectional Gated Recurrent
Unit (GRU) [6] for the encoder, and a forward GRU for the decoder.
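
For illustration only, the following minimal PyTorch sketch shows the encoder-decoder
skeleton described above (bidirectional GRU encoder, forward GRU decoder); the
attention and copy mechanisms of the actual catSeq implementation are omitted, and
all sizes are hypothetical.

    import torch
    import torch.nn as nn

    class Seq2SeqGenerator(nn.Module):
        def __init__(self, vocab_size=50000, emb_dim=100, hid_dim=150):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            # Bidirectional GRU encoder reads the source document x.
            self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True,
                                  bidirectional=True)
            # Forward GRU decoder emits the concatenated keyphrase sequence.
            self.decoder = nn.GRU(emb_dim, 2 * hid_dim, batch_first=True)
            self.out = nn.Linear(2 * hid_dim, vocab_size)

        def forward(self, src_ids, tgt_ids):
            _, enc_h = self.encoder(self.embedding(src_ids))
            # Concatenate the last forward/backward states to initialize the decoder.
            dec_h0 = torch.cat([enc_h[0], enc_h[1]], dim=-1).unsqueeze(0)
            dec_out, _ = self.decoder(self.embedding(tgt_ids), dec_h0)
            return self.out(dec_out)   # next-token logits at each decoding step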


Fig. 1. Left. Schema of the proposed approach. Right. Detailed schema of the imple-
mented BERT Discriminator.




Discriminator The Discriminator D is basically a binary classifier whose aim
is to separate the true samples (x, y) from the fake ones (x, ŷ). It performs this
task by computing a regression score for each sample, giving a high value to the
samples it deems real and a low value to the others.
    We introduce a novel BERT-based model for our Discriminator. BERT [7]
is based on the Transformer architecture, and since its introduction it has achieved
state-of-the-art results in many Natural Language Processing tasks. Our im-
plementation is based on a BERT pretrained model, fine-tuned for Sequence
Classification. The input samples are processed in four steps (see Figure 1):
    1. Input Preparation. Input pairs (x, y) are first lower-cased and tokenized,
then the tokens are concatenated together in the form

                     [CLS] doc [SEP] kp_1 <;> ... <;> kp_M [SEP]

    where doc and kp_i are the sequences of tokens of the input document and of
the i-th KP; [CLS] is the BERT special token marking the start of the sequence;
[SEP] is the BERT special token marking the end of the sequence, also used
to separate the input document from the list of related KPs; and the semicolon
<;> is the KP separator.
    2. BERT Modelling. The prepared input sequence is passed through the 12
consecutive Encoder blocks of the pretrained BERT model. Pretrained weights
act as initialization, and are optimized during training. Since BERT processing
is positional, each input token is mapped to its corresponding output.
    3. Output Aggregation. Output tokens are averaged together to obtain an em-
bedding of the whole input sequence. Note that when using a BERT-based model,
the output of the [CLS] token is usually taken as the sentence embedding.
Nevertheless, we average all the output tokens, as this aggregated value proved
to be a better estimate of the semantic content of the input (see also [7]).
    4. Regression Layer. The sentence embedding is passed through a dense layer
that computes a regression score. This score is used to classify the input samples:
the higher the score, the more probable it is that the sample is real. The same
score is also used as the reward given by the Discriminator to the Generator in
the Reinforcement Learning schema.
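
As an illustration of the four steps above, here is a minimal sketch of the
Discriminator, assuming the huggingface BertModel and BertTokenizer; the plain
semicolon stands in for the KP separator, and the sketch is a simplified view
rather than the exact implementation (Section 4.2 reports the actual
fine-tuning setup).

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def prepare_input(document, keyphrases, max_len=384):
        # Step 1: [CLS] doc [SEP] kp_1 ; kp_2 ; ... ; kp_M [SEP]
        return tokenizer(document, " ; ".join(keyphrases), truncation=True,
                         max_length=max_len, return_tensors="pt")

    class BertDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")   # step 2
            self.regressor = nn.Linear(self.bert.config.hidden_size, 1)  # step 4

        def forward(self, input_ids, attention_mask, token_type_ids=None):
            hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                               token_type_ids=token_type_ids).last_hidden_state
            # Step 3: average all output tokens (not only [CLS]), ignoring padding.
            mask = attention_mask.unsqueeze(-1).float()
            embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            return self.regressor(embedding).squeeze(-1)  # high = real, low = fake

    # Example: score one (document, keyphrase list) pair.
    enc = prepare_input("we study keyphrase generation ...",
                        ["keyphrase generation", "sequence to sequence"])
    score = BertDiscriminator()(**enc)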


Reinforcement Learning with Policy Gradient We follow the Reinforce-
ment Learning paradigm to train the system, as proposed in [31, 24].
    In detail, we consider the Generator G as an agent whose action a at step t is
to generate a word ŷt which is part of the set ŷ of predicted KPs for the document
x. Action a is performed following the policy π(ŷt |st , x, θ) that represents the
probability of sampling ŷt given the state st = (ŷ1 , . . . , ŷt−1 ), the sequence of
words generated up to step t − 1. The policy is differentiable with respect to
the parameters θ of G. As the agent G generates the predicted list of KPs, the
Discriminator D, that plays the role of the environment, evaluates them and
gives back a reward:

                             R(ŷ) = r(ŷ_T | s_T) = D(ŷ | x)                          (1)
    where r(ŷ_t | s_t) is the expected accumulative reward at step t and T denotes
the number of steps needed to generate the whole prediction ŷ. The aim of the agent
G is to maximize the function J(θ), defined as the expected value of the final reward
under the probability distribution given by the policy π:
                       J(θ) = E_π[R(ŷ)] = Σ_ŷ r(ŷ | s) · π(ŷ | s, x, θ)          (2)

   The gradient of J(θ) is evaluated by means of the policy gradient theorem
and the REINFORCE algorithm [25]:
                            " T                                        #
                             X                                       
               ∇J(θ) = Eπ        r(ŷt |st ) · ∇log π(ŷt |st , x, θ)     (3)
                                  t=1

    The expectation E_π in Equation 3 can be approximated by sampling ŷ ∼
π(· | x, θ). Then, defining the loss function of G as L(θ) = −J(θ), an estimator of
its gradient is:
                  ∇L(θ) ≈ − Σ_t ( r(ŷ_t | s_t) − b_t ) · ∇ log π(ŷ_t | s_t, x, θ)          (4)
    A regularization term b_t has been introduced as the expected accumulative
reward evaluated on a greedily decoded sequence of predictions, as suggested in
[23]. Its aim is two-fold: to lower the variance of the process, and to favor
predictions whose reward is higher than that of the greedily decoded sequence.
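
The following is a minimal sketch of the resulting self-critical policy-gradient
update of Equation 4; the tensor names and values are hypothetical, and in practice
the log-probabilities come from the Generator while the rewards come from the
Discriminator.

    import torch

    def policy_gradient_loss(log_probs, sampled_reward, greedy_reward):
        # log_probs: (T,) values of log π(ŷ_t | s_t, x, θ) for the sampled sequence
        # sampled_reward: D(ŷ | x) for the sampled sequence
        # greedy_reward: reward of the greedily decoded sequence (the baseline b_t)
        advantage = sampled_reward - greedy_reward
        return -(advantage * log_probs).sum()     # estimator of L(θ) = -J(θ)

    # Toy usage with made-up values.
    log_probs = torch.tensor([-0.9, -0.4, -1.6], requires_grad=True)
    loss = policy_gradient_loss(log_probs, sampled_reward=1.5, greedy_reward=0.3)
    loss.backward()                               # gradients flow into the policy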


GAN Training Training is an iterative process in which G and D are trained
separately, see Algorithm 1.
    At the first step, an initial version G0 of the Generator is trained using
Maximum Likelihood Estimation (MLE) loss. Its predictions (x, ŷ) together with
ground truth (x, y) are used to train the first version of the Discriminator D0
with Mean Squared Error (MSE) loss. The regression scores evaluated by D0 are
then employed in the training of the next Generator G1, using the RL optimization
as defined in Section 3.2. All the subsequent generators Gi are trained in the same
way by means of the rewards given by Di−1 . Discriminators are always trained
with MSE loss. g-steps and d-steps refer to the updating iterations during G and
D training.
    Algorithm 1: GAN training
     Data: Samples (x, y)
     Pre-train G0 with MLE loss; generate ŷ0 = G0 (x);
     Pre-train D0 with MSE loss; evaluate D0 (y) and D0 (ŷ0 );
     while Di (ŷ) ≪ Di (y) do
        i=i+1;
        for g-steps do
            Generate predictions: ŷ = Gi (x);
            Evaluate rewards: R = Di−1 (ŷ);
            Update Gi with Policy Gradient RL maximizing R;
        end
        for d-steps do
            Generate predictions: ŷ = Gi (x);
            Evaluate Di (y) and Di (ŷ);
            Update Di with MSE loss;
        end
     end
     G is evaluated on test datasets


4     Experiments and Results

4.1     Datasets

Five well known datasets largely used in literature have been considered in this
work:
    KP20k [19] 567,830 titles and abstracts from papers about computer science;
of them, 20,000 samples are usually employed for testing, another 20,000 for
validation, and the remaining 527,830 samples for training. This is the only
dataset used for training. In our data-efficient approach we only
use 2,000 out of the >500,000 training samples.
    Inspec [10] 2,000 abstracts from the disciplines Computers and Control, and
Information Technology. Only 500 samples are used for testing.
    Krapivin [15] 500 articles about computer science. Since no hint is given by
the authors on how to split testing data, the first 400 samples in alphabetical
order are taken for testing.
    NUS [21] 211 papers selected from scientific publications; used for testing.
    Semeval2010 [13] 288 articles from the ACL Computer Library, of which
100 are used for testing.
    Some statistics about the test samples are given in Table 1. Procedures that are
standard protocol in KP generation are applied to the data (see for example [2]):
all duplicate documents are removed from the training set; for each document the
sequence of KPs is given in order of appearance; digits are replaced with the
<digit> token; out-of-vocabulary words are replaced with the <unk> token.
    The vocabulary of the generator V_G consists of the 50,000 most frequent
words in the training dataset. The vocabulary of the discriminator V_D is the one
of the pretrained BERT base uncased model (English version): it contains 30,522
words and chunks of words (called wordpieces) that can be combined to compose all
possible word forms (e.g. 'hexahedral' is tokenized as 'he', '##xa', '##hedral').
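
For instance, with the huggingface tokenizer (the exact split depends on the
pretrained vocabulary):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.vocab_size)               # 30522 for bert-base-uncased
    print(tokenizer.tokenize("hexahedral"))   # e.g. ['he', '##xa', '##hedral']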


                Table 1. Statistics on test samples for the five datasets.

                KP20k          Inspec     Krapivin        NUS      Semeval2010
               #       %      #     %      #     %      #     %      #    %
 Present KPs 66,267 62.91 3,602 73.59 1,297 55.57 1,191 52.26 612         42.41
 Absent KPs 39,076 37.09 1,293 26.41 1,037 44.43 1,088 47.74 831          57.59
 Total KPs   105,343 100.00 4,895 100.00 2,334 100.00 2,279 100.00 1,443 100.00
 Test samples       20,000          500           400           211          100




4.2    Details of Implementation

Optimization of the generator G is performed with Adam [14]. G0 is trained with
MLE loss and a batch size of 12; the subsequent Gi are trained with RL optimization
and a batch size of 32. Optimization of the discriminator D is performed with the
AdamW optimizer [17]; it is trained with MSE loss and a batch size of 3. For the
Discriminator, we rely on the BERT implementation provided by huggingface
[27]1 . The model is bert-base-uncased, with 12 layers, 12 attention heads, and
a hidden size of 768. Input sequences are trimmed to 384 tokens. The model is
fine-tuned for Sequence Classification with one label (regression). Training and
testing run on a PC with a Titan RTX GPU (24 GiB).
1
    https://github.com/huggingface/transformers
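
A sketch of how this Discriminator setup can be instantiated with the huggingface
transformers library; the learning rate and the toy batch are assumptions, while the
remaining settings follow the description above.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    # bert-base-uncased: 12 layers, 12 attention heads, hidden size 768.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=1)            # one label -> regression head
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # lr is an assumption

    batch = tokenizer(["document text ..."], ["keyphrase one ; keyphrase two"],
                      truncation=True, max_length=384, padding=True,
                      return_tensors="pt")
    scores = model(**batch).logits.squeeze(-1)        # one regression score per sample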
4.3   Comparative Results

The system has been trained with 2,000 samples randomly extracted from the
KP20k training dataset, and then evaluated on the five datasets
described in Section 4.1. A comparison has been carried out with four state-of-
the-art approaches, namely catSeqD [32], catSeqCorr-2RF1 and catSeqTG-2RF1
[2], and GAN [24]. Results are reported in terms of F1 score: F1@5 is evaluated
over the top 5 highest-scoring KPs, while F1@M takes into account all the predic-
tions. Results are shown in Table 2.
    Our approach achieves competitive results with respect to the above-mentioned
models. In particular, it is by far the best on Inspec, both in F1@M and
F1@5 scores. It also performs very well on Semeval2010, where we match the
best F1@M score and are close to the best F1@5. Note that Semeval2010 is
the smallest of the five test datasets and contains the smallest number of
target KPs, which makes it a very difficult test set to perform well on.
    We also point out that our approach shows good performance in the F1@5 score.
Since the F1@5 score is evaluated on the best 5 predicted KPs, we argue
that our approach is able to generate good-quality KPs.
    A final consideration has to be made about Equation 3: the expectation
of the policy function is evaluated using only complete sequences ŷ, and this
causes relatively large oscillations in ∇J, inducing instability in the
training process [31]. Our system has proven effective in dealing with
this problem, even in a training scenario characterized by the scarce availability
of data. In fact, we observed quick convergence of the training, obtaining
the best results at the second iteration of the Generator (G2). We attribute
this quick convergence to the strength of the language model embedded in the
architecture.


               Table 2. Results of present keyphrases for five datasets.

Model                 KP20k         Inspec        Krapivin      NUS           Semeval2010
                      F1@M  F1@5    F1@M  F1@5    F1@M  F1@5    F1@M  F1@5    F1@M  F1@5
catSeqD [32]          -     0.348   -     0.276   -     0.325   -     0.374   -     0.327
catSeqCorr-2RF1 [2]   0.382 0.308   0.291 0.240   0.369 0.286   0.414 0.349   0.322 0.278
catSeqTG-2RF1 [2]     0.386 0.321   0.301 0.253   0.369 0.300   0.433 0.375   0.329 0.287
GAN [24]              0.381 0.300   0.297 0.248   0.370 0.286   0.430 0.368   -     -
Our approach          0.318 0.309   0.383 0.356   0.332 0.317   0.388 0.366   0.329 0.319




Ranking Analysis In order to better analyze our method of data-efficient
training, we performed a comparison in terms of ranking metrics between our
system (trained on 2,000 samples) and the original catSeq model (trained on the
whole training set). Two evaluation measures have been used: the Mean Average
Precision (MAP) and the normalized Discounted Cumulative Gain (nDCG).
Table 3. Ranking measures for present KPs. Comparison of our approach (trained on
2,000 samples) and catSeq (trained on the whole dataset), on the five test datasets.

Model          KP20k         Inspec        Krapivin      NUS           Semeval2010
               MAP   nDCG    MAP   nDCG    MAP   nDCG    MAP   nDCG    MAP   nDCG
catSeq         0.305 0.585   0.164 0.570   0.303 0.576   0.285 0.740   0.189 0.663
Our approach   0.300 0.560   0.268 0.720   0.308 0.576   0.290 0.736   0.228 0.683



    MAP is defined as the mean of the average precision (AP) scores evaluated for
each set of predicted KPs. It measures the proportion of relevant KPs
among the predicted ones. nDCG measures the usefulness, or gain, of a
document based on its position, or rank, in the result list. It is widely used in
information retrieval, specifically in web search and related tasks. For both
metrics, the higher the scores, the better the agreement between the predicted
KPs and the real ones.
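
For reference, minimal sketches of the two measures under a binary-relevance
assumption (a predicted KP counts as relevant if it matches a ground-truth KP);
the reported MAP and nDCG are their averages over all test documents.

    import math

    def average_precision(predicted, true):
        hits, score = 0, 0.0
        for rank, kp in enumerate(predicted, start=1):
            if kp in true:
                hits += 1
                score += hits / rank              # precision at this rank
        return score / max(len(true), 1)

    def ndcg(predicted, true):
        dcg = sum(1.0 / math.log2(r + 1)
                  for r, kp in enumerate(predicted, start=1) if kp in true)
        ideal = sum(1.0 / math.log2(r + 1)
                    for r in range(1, min(len(true), len(predicted)) + 1))
        return dcg / ideal if ideal > 0 else 0.0

    predicted = ["neural networks", "keyphrase generation", "databases"]
    true = {"keyphrase generation", "neural networks"}
    print(average_precision(predicted, true), ndcg(predicted, true))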
    Results are shown in Table 3. For four out of five datasets and for both
measures, our approach achieves results that are better than or nearly equal to
those of the baseline catSeq, clearly showing the strength of our method.

5    Conclusion
In this paper we presented a system for Keyphrase Generation using a GAN
architecture with Reinforcement Learning. Thanks to the characteristics of our
approach, we have been able to train the system in a data-efficient way using
only a small fraction of the available data. We tested it on five benchmark datasets,
achieving results that are competitive with some state-of-the-art generative mod-
els. To the best of our knowledge, this is the first attempt to train such a complex
architecture for the demanding task of Keyphrase Generation in a scenario in
which only a small amount of data is available.


References
 1. Berend, G.: Opinion Expression Mining by Exploiting Keyphrase Extraction. In:
    IJCNLP (2011)
 2. Chan, H.P., Chen, W., Wang, L., King, I.: Neural Keyphrase Generation via Re-
    inforcement Learning with Adaptive Rewards. In: ACL (2019)
 3. Chen, J., Zhang, X., Wu, Y., Yan, Z., Li, Z.: Keyphrase Generation with Correla-
    tion Constraints. In: EMNLP (2018)
 4. Chen, W., Chan, H.P., Li, P., Bing, L., King, I.: An Integrated Approach for
    Keyphrase Generation via Exploring the Power of Retrieval and Extraction. In:
    NAACL-HLT (2019)
 5. Chen, W., Gao, Y., Zhang, J., King, I., Lyu, M.R.: Title-Guided Encoding for
    Keyphrase Generation. In: AAAI (2019)
 6. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
    H., Bengio, Y.: Learning Phrase Representations using RNN Encoder–Decoder for
    Statistical Machine Translation. In: EMNLP (2014)
 7. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidi-
    rectional Transformers for Language Understanding. In: NAACL-HLT (2018)
 8. Gu, J., Lu, Z., Li, H., Li, V.O.K.: Incorporating Copying Mechanism in Sequence-
    to-Sequence Learning. In: ACL (2016)
 9. Hammouda, K.M., Matute, D.N., Kamel, M.S.: CorePhrase: Keyphrase Extraction
    for Document Clustering. In: MLDM (2005)
10. Hulth, A.: Improved Automatic Keyword Extraction Given More Linguistic Knowl-
    edge. In: EMNLP (2003)
11. Hulth, A., Megyesi, B.: A Study on Automatically Extracted Keywords in Text
    Categorization. In: ACL (2006)
12. Jones, S., Staveley, M.S.: Phrasier: A System for Interactive Document Retrieval
    Using Keyphrases. In: SIGIR (1999)
13. Kim, S.N., Medelyan, O., Kan, M.Y., Baldwin, T.: SemEval-2010 Task 5 : Auto-
    matic Keyphrase Extraction from Scientific Articles. In: Workshop on Semantic
    Evaluation (2010)
14. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. In: ICLR
    (2015)
15. Krapivin, M., Autaeu, A., Marchese, M.: Large Dataset for Keyphrases Extraction.
    Technical Report DISI-09-055, University of Trento (2009)
16. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to Find Exemplar Terms for
    Keyphrase Extraction. In: EMNLP (2009)
17. Loshchilov, I., Hutter, F.: Fixing Weight Decay Regularization in Adam. ICLR
    (2017)
18. Luan, Y., Ostendorf, M., Hajishirzi, H.: Scientific Information Extraction with
    Semi-supervised Neural Tagging. In: EMNLP (2017)
19. Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., Chi, Y.: Deep Keyphrase
    Generation. In: ACL (2017)
20. Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Text. In: EMNLP (2004)
21. Nguyen, T.D., Kan, M.: Keyphrase Extraction in Scientific Publications. In:
    ICADL (2007)
22. Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence Level Training with
    Recurrent Neural Networks. In: ICLR (2016)
23. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-Critical Sequence
    Training for Image Captioning. In: CVPR (2017)
24. Swaminathan, A., Gupta, R.K., Zhang, H., Mahata, D., Gosangi, R., Shah, R.R.:
    Keyphrase Generation for Scientific Articles using GANs. In: AAAI (2019)
25. Williams, R.J.: Simple Statistical Gradient-Following Algorithms for Connectionist
    Reinforcement Learning. Machine Learning (1992)
26. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA:
    Practical Automatic Keyphrase Extraction. In: ACM (1999)
27. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P.,
    Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace’s Transformers: State-
    of-the-art Natural Language Processing. ArXiv:abs/1910.03771 (2019)
28. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
    M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X.,
    Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G.,
    Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O.,
    Corrado, G., Hughes, M., Dean, J.: Google’s Neural Machine Translation System:
    Bridging the Gap between Human and Machine Translation. CoRR (2016)
29. Ye, H., Wang, L.: Semi-Supervised Learning for Neural Keyphrase Generation. In:
    EMNLP (2018)
30. Ye, H., Wang, L.: Semi-Supervised Learning for Neural Keyphrase Generation.
    arXiv preprint arXiv:1808.06773 (2018)
31. Yu, L., Zhang, W., Wang, J., Yu, Y.: SeqGAN: Sequence Generative Adversarial
    Nets with Policy Gradient. In: AAAI (2016)
32. Yuan, X., Wang, T., Meng, R., Thaker, K., He, D., Trischler, A.: Generating Di-
    verse Numbers of Diverse Keyphrases. ArXiv:abs/1810.05241 (2018)
33. Zhang, Q., Wang, Y., Gong, Y., Huang, X.: Keyphrase Extraction Using Deep
    Recurrent Neural Networks on Twitter. In: EMNLP (2016)
34. Zhang, Y., Zincir-Heywood, A.N., Milios, E.E.: World Wide Web site summariza-
    tion. Web Intelligence and Agent Systems (2004)