<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Changing the Narrative Perspective: From Ranking to Prompt-Based Generation of Entity Mentions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mike Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Razvan Bunescu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of North Carolina at Charlotte</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Electrical Engineering and Computer Science, Ohio University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Changing the point of view of a character in a story alters the reading experience by reshaping the reader's involvement and identification with the character's thoughts and feelings. An effective NLP solution to the task of changing the narrative perspective requires the capability to select mention strings that refer to the character in a natural and non-ambiguous manner. In this paper, we introduce and evaluate three mention selection architectures: LSTMs with attention over frozen BERT embeddings, fine-tuned BERT with coreference-modulated self-attention, and prompt-based tuning over either frozen or fine-tuned T5. Experimental evaluations show that the prompt-tuning approach over frozen T5 obtains the best performance, also outperforming the previous state-of-the-art on this task.</p>
      </abstract>
      <kwd-group>
        <kwd>narrative perspective</kwd>
        <kwd>mention selection</kwd>
        <kwd>ranking</kwd>
        <kwd>prompt-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The narrative point of view (PoV), or perspective, is the position from which the events in a
story is observed and communicated. There are three types of narrative perspective: first, second,
and third person. In fiction, characters are most commonly developed using the first or third
person perspective. When employing the first person mode of storytelling, the narrator is
usually a character inside the story, recounting events from their own point of view. Conversely,
in the third person point of view, the narrator places themselves outside the events in the
story. The second person point of view is more common in poetry, how-to guides, technical
writing, and self-help texts. In this type of narrative perspective, the reader becomes a character
who is addressed by the writer using second person pronouns. Motivated by potential
style-transfer applications in fiction writing, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we introduced the task of changing the narrative
perspective and described an end-to-end NLP pipeline for shifting the PoV from deictic (1st or
2nd person) to anaphoric (3rd person), as shown in the example below taken from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
For a second, [I] → [ ] had the idea of getting up, slapping the [Self-Taught Man]
on the shoulder and starting a conversation with [him] . But just at that moment [he] → [
] caught [my] → [ ] look.
      </p>
      <p>
        Table 1, reproduced from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], shows potential applications of the PoV change task in the
automatic generation of biographical text, self-diagnosis material, or related work paragraphs,
starting from auto-biographical text, educational material, or abstract paragraphs, respectively.
While changing the narrative perspective bears similarities with other NLP tasks, such as
paraphrasing, referential expression generation, and style transfer, it has unique aspects that
require customized, if not entirely novel solutions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Following the end-to-end pipeline described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a coreference resolution system is used
first to identify the entity mentions for all characters in the story. The example above contains
two discourse entities: the narrator Jean-Paul, using a 1st person PoV, and the Self-Taught Man,
using a 3rd person PoV. Given a focus entity e = Jean-Paul that is described in the 1st person
PoV, the task is to change its mention strings to reflect a 3rd person PoV. This is done by a
mention selection model that is tasked with choosing among strings in a set S(e) that contains
names, nouns, as well as suitable 3rd person pronouns. This set of candidate strings is created
automatically by a separate component of the pipeline. Another pipeline module is tasked with
changing verb conjugations from 1st to 3rd person whenever the focus mention is the subject of
a verb in the present tense singular. The mention selection task is non-trivial: there may already
be other confounding entities, e.g. the Self-Taught Man, that are mentioned using the same third
person pronouns, in which case their mention strings might also need to be changed in order to avoid
referential ambiguity. At the same time, repeated uses of names should be avoided, to maintain
naturalness. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we describe a ranking approach to mention selection that processes the text
auto-regressively using LSTMs on top of BERT embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In this paper, we show that adding self-attention to the LSTM architecture (Section 2.1)
improves both mention selection and end-to-end performance. Furthermore, we introduce two
new architectures: a coreference-augmented self-attention model for BERT (Section 2.2) that
eliminates the LSTM layer, and a prompt-tuning approach (Section 3) for the T5 text-to-text
Transformer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Overall, prompt-tuning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] over T5 with frozen parameters performs the best,
with further gains observed when fine-tuned on the PoV dataset in a 2-fold evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Mention Selection as Ranking</title>
      <p>
Given a set of candidate strings S(e) that can be used for referring to an entity e, mention
selection is about determining the most appropriate string to use in a given textual context
C. For a confounding entity, this set is determined from all unique strings that are used in its
coreference chain, e.g. S(e) = {the Self-Taught Man, the man, he, his, ...} in the example above.
For the focus entity, which originally is in a 1st or 2nd person PoV, we use 3rd person pronouns
that agree in number and gender with the given name, as well as noun phrases extracted from
the document using the methods described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A scoring function f(s|C) is trained to
capture how appropriate a string s ∈ S(e) is as a mention of an entity e in context C.
Given s, t ∈ S(e), we use ⟨s ≻ t | C⟩ to denote that s is more appropriate than t in the context C.
Correspondingly, the ranking system is trained to compute f(s|C) &gt; f(t|C) + μ, where
μ is a margin hyper-parameter, which results in the margin-based ranking loss shown below:

L = ∑_{⟨s ≻ t | C⟩} max{0, μ − f(s|C) + f(t|C)}

At training time, we use the observed mention string ŝ ∈ S(e) to create ranking pairs ⟨ŝ ≻ t | C⟩
for all t ∈ S(e), t ≠ ŝ, as training examples.
      </p>
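      <p>A minimal PyTorch sketch of this margin-based objective is shown below; the scoring network f, the candidate set, and the margin value are illustrative placeholders rather than the exact implementation used in our experiments.</p>
      <preformat>
import torch

def ranking_loss(f, context, observed, candidates, margin=1.0):
    """Margin-based ranking loss: the observed mention string should score
    higher than every other candidate string by at least the margin."""
    score_observed = f(observed, context)
    losses = []
    for s in candidates:
        if s == observed:
            continue
        # hinge term: max{0, margin - f(observed|C) + f(s|C)}
        losses.append(torch.clamp(margin - score_observed + f(s, context), min=0.0))
    return torch.stack(losses).sum()
      </preformat>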
      <sec id="sec-2-1">
        <title>2.1. LSTMs over Tokens and Mentions</title>
        <p>
The best performing approach in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is composed of four LSTM models. For any given entity
mention, the text is split into a left context that ends with the current mention and a
right context, which together form the full context C. Mention strings in C corresponding to focus
and confounding entities are replaced with a special placeholder token, to ensure models do not use
information unavailable at test time. The left context is processed sequentially, at token level by
one LSTM and at mention level by another, producing two final hidden states.
A similar processing is done for the right context, producing two more final states. All
LSTMs are run on top of contextual embeddings produced by a frozen BERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The four final states
are then concatenated and used as input to a fully connected network with one hidden layer
and a linear output node that computes the ranking score f(s|C).
        </p>
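        <p>The sketch below illustrates the ranking head in PyTorch, assuming the four final LSTM states have already been computed on top of the frozen BERT embeddings; names and dimensions are illustrative.</p>
        <preformat>
import torch
import torch.nn as nn

class MentionRankingHead(nn.Module):
    """Concatenates the final states of the four LSTMs (left/right context,
    token/mention level) and maps them to a scalar ranking score f(s|C)
    through one hidden layer and a linear output node."""
    def __init__(self, lstm_dim, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(4 * lstm_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h_left_tok, h_left_men, h_right_tok, h_right_men):
        h = torch.cat([h_left_tok, h_left_men, h_right_tok, h_right_men], dim=-1)
        return self.scorer(h).squeeze(-1)
        </preformat>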
        <p>
          Given the significant gains in performance brought about by attention when used with LSTMs
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] or within the Transformer [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], we added an attention mechanism to each of the four LSTM
models. Using the concatenation formulation of Luong et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we compute a context vector
for the last token of the current mention, where the attention weights span all the tokens to
the left or to the right of the current mention, depending on whether the left or right LSTM is
used. For the token-level LSTMs, the two context vectors are concatenated and used as input
to a fully connected network with one hidden layer to compute a token-level attention score. Similarly,
a mention-level attention score is computed using the two context vectors from the mention-level LSTMs.
Finally, the two attention-based scores are added to the original LSTM-based ranking score
described above in order to compute the final ranking score f(s|C). The overall architecture
containing the original LSTMs and the new attention mechanisms is illustrated in Figure 1.
        </p>
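        <p>A schematic PyTorch version of this concat attention mechanism is shown below, assuming a single mention vector and a matrix of context token states; the layer sizes and names are illustrative.</p>
        <preformat>
import torch
import torch.nn as nn

class ConcatAttention(nn.Module):
    """Luong-style 'concat' attention: scores each context token against the
    vector of the current mention's last token, then builds a context vector."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, mention_vec, context_states):
        # mention_vec: (dim,)   context_states: (num_tokens, dim)
        expanded = mention_vec.unsqueeze(0).expand_as(context_states)
        scores = self.v(torch.tanh(self.W(torch.cat([expanded, context_states], dim=-1))))
        weights = torch.softmax(scores.squeeze(-1), dim=0)
        # weighted sum of the context token states = context vector
        return torch.sum(weights.unsqueeze(-1) * context_states, dim=0)
        </preformat>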
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Coreference-Modulated Self-Attention</title>
        <p>
The LSTM-based architecture uses as input contextual embeddings computed by BERT. Inspired
by the concept of pseudo self-attention [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we developed a new approach that does away with
the LSTM layers, and instead adapts the BERT model itself to compute a vector representation
of the candidate mention string s in context. To incorporate coreference information in the
underlying Transformer model, we first introduce a square matrix C ∈ ℤ^{N×N} that represents
the coreference information in the N input tokens, where C[i, j] = 1 if and only if the tokens at
positions i and j belong to mentions that are coreferent, otherwise C[i, j] = 0. Correspondingly,
the diagonal vector d of the coreference matrix C has d[i] = 1 if the token i belongs to a person
entity mention, otherwise d[i] = 0. If we update each row in C as C[i, :] = 2 × d − C[i, :], then
C[i, j] = 0 if the tokens at positions i and j do not belong to any entity mentions; C[i, j] = 1 if
the two tokens belong to mentions that are coreferent; and C[i, j] = 2 if token j belongs to an
entity mention that does not corefer with the entity mention to which token i belongs, if any.
Thus, for each token position i, the corresponding row C[i, :] will contain one of the numbers 0,
1, and 2, distinguishing among the three situations. We map each of the three numbers to their own
trainable embedding of size d_e, and then transform the row vector C[i, :] into a coreference
embedding matrix E_c[i] ∈ ℝ^{N×d_e} by replacing the numbers 0, 1, and 2 with their corresponding
embeddings. When the matrices E_c[i] are stacked over all token positions i in the input, they
create a 3-dimensional coreference embedding tensor E_c ∈ ℝ^{N×N×d_e}. Let P ∈ ℝ^{N×N×d_e}
be the 3-dimensional tensor of relative positional embeddings [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Then, for each attention
layer in BERT, we define an additional attention mechanism where the unnormalized attention
weights e[i, j] are computed using the input embedding x at that layer, the coreference embedding
E_c[i, j], and the relative positional embedding P[i, j] between positions i and j. The corresponding
vectorization is then done using an Einstein summation operator ⋆ as shown below, where the
input embedding matrix X is broadcast over the first dimension of E_c:

e[i, j] = ((x_i W′_Q + E_c[i, j]) (x_j W′_K + P[i, j])ᵀ) / √d_e ;
z = softmax( ((X W′_Q + E_c) ⋆ (X W′_K + P)) / √d_e ) (X W′_V)
        </p>
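        <p>The sketch below illustrates one such attention layer in PyTorch under simplifying assumptions (a single attention head, a shared embedding size, illustrative parameter names); in the full model these parameters are instantiated for every BERT layer and the result is added to the frozen self-attention output.</p>
        <preformat>
import torch
import torch.nn as nn

def coreference_codes(coref_matrix):
    """Maps the binary coreference matrix C onto the 0/1/2 codes described above,
    using the row-wise update C[i, :] = 2 * d - C[i, :]."""
    d = torch.diagonal(coref_matrix)      # d[j] = 1 if token j is inside a person mention
    return 2 * d.unsqueeze(0) - coref_matrix

class CorefModulatedAttention(nn.Module):
    """Single-head sketch of the additional attention mechanism, combining the layer
    input x with coreference embeddings E_c and relative positional embeddings P."""
    def __init__(self, dim):
        super().__init__()
        self.code_emb = nn.Embedding(3, dim)        # trainable embeddings for codes 0, 1, 2
        self.W_q = nn.Linear(dim, dim, bias=False)  # W'_Q
        self.W_k = nn.Linear(dim, dim, bias=False)  # W'_K
        self.W_v = nn.Linear(dim, dim, bias=False)  # W'_V
        self.scale = dim ** 0.5

    def forward(self, x, coref_matrix, rel_pos):
        # x: (N, dim), coref_matrix: (N, N) binary, rel_pos: (N, N, dim)
        E_c = self.code_emb(coreference_codes(coref_matrix).long())   # (N, N, dim)
        q = self.W_q(x).unsqueeze(1) + E_c        # x_i broadcast along the j dimension
        k = self.W_k(x).unsqueeze(0) + rel_pos    # x_j broadcast along the i dimension
        # Einstein summation over the embedding dimension (the star operator above)
        e = (q * k).sum(dim=-1) / self.scale      # unnormalized attention weights e[i, j]
        return torch.softmax(e, dim=-1) @ self.W_v(x)
        </preformat>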
        <p>
At each layer, the coreference-based embeddings computed above are added to the
original self-attention embeddings computed by the frozen BERT model, resulting in a layer of
coreference-modulated embeddings. The new set of parameters W′_Q, W′_K, and W′_V, which mirror
the original BERT parameters W_Q, W_K, and W_V, are instantiated and trained for each layer in
the Transformer. The coreference-modulated embedding computed in the final layer for the
last token of the current mention is concatenated with the binary features introduced in
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and used as input for a fully connected network with one hidden layer, followed by a linear
node that outputs the final ranking score f(s|C).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Mention Selection as Prompt-Based Generation</title>
      <p>
The coreference-modulated self-attention approach, while keeping the original BERT parameters
frozen, still requires training the new sets of attention parameter matrices W′_Q, W′_K, and W′_V for each
Transformer block. For the particular BERT model used in the experiments, this means over 8.5
million new parameters need to be trained from scratch, which is time consuming
and liable to lead to overfitting. In this section we describe an alternative, much less
parameter-intensive method of utilizing coreference information, based on P*-tuning. This general class of
techniques subsumes methods such as soft prompt-tuning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], p-tuning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and prefix-tuning
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] in which the encoder of the language model (LM) is run on an input composed of two parts:
(1) the usual textual input, possibly augmented with additional tokens that convey information
about the task; and (2) a set of token embeddings that are trained from scratch. These continuous
embeddings, also called soft tokens, do not have to correspond to actual language tokens and
are meant to help the LM adapt its output for the target task. Of the three P*-tuning approaches
listed above we chose to use the soft prompt-tuning approach [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which is the simplest of the three.
For the LM, we use the T5 text-to-text Transformer of Raffel et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in the span corruption +
sentinel mode. Figure 2 illustrates the prompt-based approach where a frozen T5 is run on an
input that contains two prompts:
1. A soft prompt P composed of 20 continuous embeddings that are initialized with random
token embeddings from the T5 vocabulary.
      </p>
      <p>2. A text prompt T composed of the left and right context words around the current entity
mention, augmented with the strings in S(e).</p>
      <p>In order for the T5 model to know what the candidate mention strings are, the strings in S(e) are
included at the end of the text prompt, separated by a dedicated separator tag. The current mention is indicated
by the sentinel token ⟨extra_id_0⟩, which during T5's pre-training was used to indicate the span
of text that needs to be generated by the decoder. Furthermore, focus mentions are delimited
by opening and closing focus tags, whereas confounding mentions are enclosed between opening and closing
confound tags. A separate unknown tag is used to specify unknown strings for future mentions, i.e. mentions of the focus
or confounding entities that appear to the right of the current mention. An example prompt is
shown in Figure 2. The decoder is then tasked with generating an output sequence containing the
correct mention string his, formatted using the T5 sentinels. While T5's parameters are kept
fixed, we expect the tuning of the soft prompt to enable it to learn to generate the correct
mention string by copying it from the input text prompt sequence.</p>
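      <p>A sketch of how such a text prompt could be assembled and decoded with a HuggingFace T5 model is shown below; the special tag strings, the example contexts, and the candidate list are hypothetical and only illustrate the prompt format described above.</p>
      <preformat>
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Hypothetical tag strings; the exact special tags used in the paper may differ.
FOCUS, FOCUS_END, CONF, CONF_END, SEP = '[F]', '[/F]', '[C]', '[/C]', '[CAND]'
tokenizer.add_tokens([FOCUS, FOCUS_END, CONF, CONF_END, SEP])
model.resize_token_embeddings(len(tokenizer))

def build_text_prompt(left_context, right_context, candidates):
    """Left context, a T5 sentinel in place of the current mention, right context,
    then the candidate mention strings appended at the end."""
    sentinel = '&lt;extra_id_0&gt;'
    tail = (' ' + SEP + ' ').join(candidates)
    return f'{left_context} {sentinel} {right_context} {tail}'

prompt = build_text_prompt('For a second,', 'had the idea of getting up,', ['he', 'Jean-Paul'])
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
      </preformat>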
      <p>The resulting prompt-based model is trained under two scenarios: (1) prompt-tuning with
pre-trained T5 and (2) prompt-tuning with fine-tuned T5. In the first scenario, the T5 parameters
are frozen and the only parameters that are updated are the soft-prompt embeddings in P and
the embeddings for the special tags introduced above. In the second scenario, the T5 parameters
are allowed to change too during backpropagation of the loss.</p>
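      <p>The first scenario can be sketched as follows, assuming a HuggingFace T5 model; the soft-prompt length and initialization follow the description above, while the additional training of the special-tag embeddings is omitted for brevity.</p>
      <preformat>
import torch
import torch.nn as nn

class SoftPromptT5(nn.Module):
    """Prompt-tuning sketch: trainable soft-prompt embeddings are prepended to the
    embedded text prompt, while the T5 parameters stay frozen (scenario 1)."""
    def __init__(self, t5, prompt_length=20):
        super().__init__()
        self.t5 = t5
        for p in self.t5.parameters():
            p.requires_grad = False
        embed = t5.get_input_embeddings()
        vocab_size = embed.weight.size(0)
        init_ids = torch.randint(0, vocab_size, (prompt_length,))
        # initialize the soft prompt with random token embeddings from the T5 vocabulary
        self.soft_prompt = nn.Parameter(embed.weight[init_ids].detach().clone())

    def forward(self, input_ids, attention_mask, labels):
        tok_emb = self.t5.get_input_embeddings()(input_ids)
        batch = tok_emb.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        prompt_mask = torch.ones(batch, prompt.size(1),
                                 dtype=attention_mask.dtype, device=attention_mask.device)
        full_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.t5(inputs_embeds=inputs_embeds, attention_mask=full_mask, labels=labels)
      </preformat>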
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results and Error Analysis</title>
      <p>
        All mention selection models are trained on the training portion of the CoNLL-2012 dataset
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. For each document in the corpus, we use one 3rd person coreference chain at a time
as a focus entity, which is assumed to have been transformed from 1st to 3rd person PoV,
whereas the remaining chains that agree in number and gender are used as confounding entities.
Early stopping and hyper-parameter tuning are done on the development portion of CoNLL.
The trained models are then evaluated within-distribution for mention selection on the test
portion of CoNLL, and out-of-distribution on the PoV dataset introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that consists of 21
documents covering a wide array of types of narratives, where 300 entities are mentioned 8,682
times in total. To evaluate end-to-end performance on changing the narrative perspective, we
plug each mention selection model into the PoV change pipeline that also performs coreference
resolution, extraction of candidate mention strings S(e), syntactic parsing, and verb conjugation
change. Additionally, for each prompt-tuning model we also evaluate its within-distribution
performance on the PoV dataset. This is done in a 2-fold evaluation scenario where the PoV
dataset is first partitioned at random into 2 folds: fold 1 containing 11 documents and fold 2
containing the remaining 10 documents. In the first evaluation step, the prompt-based models
that were trained on CoNLL are further fine-tuned on fold 1 and tested on fold 2; in the second
step, the roles of the two folds are swapped, and the prompt-based models are fine-tuned on
fold 2 and tested on fold 1. The test results are then pooled over the 2 folds in order to compute
the overall within-distribution performance on the PoV dataset.
      </p>
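      <p>Schematically, the pooled 2-fold evaluation can be written as below, where the fine-tuning and evaluation routines are placeholders for the procedures described above.</p>
      <preformat>
def two_fold_pov_accuracy(conll_model_factory, fold1, fold2, finetune, evaluate):
    """Fine-tune the CoNLL-trained model on one fold, test on the other,
    then pool the correct/total counts over the two folds."""
    correct, total = 0, 0
    for train_fold, test_fold in [(fold1, fold2), (fold2, fold1)]:
        model = finetune(conll_model_factory(), train_fold)
        c, t = evaluate(model, test_fold)
        correct += c
        total += t
    return correct / total
      </preformat>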
      <p>The overall results are shown in Table 2, using accuracy for mention selection and precision
(P), recall (R), and F1-measure (F1) for the end-to-end performance on the PoV dataset. The
results show that adding attention to the original LSTM model improves performance across
all evaluations. The coreference-augmented self-attention model, while matching LSTMs
with attention on CoNLL, underperforms when tested on the PoV dataset, which could
be explained by overfitting to CoNLL. The best performance in terms of out-of-distribution
generalization to the PoV dataset is obtained by prompt-tuning using the frozen T5 model, with
an F1 measure of 75.7%. When fine-tuned on the PoV dataset in the 2-fold evaluation setting, F1
measure is further increased to 77.3%. Compared to the other approaches, prompt-tuning is
overall simpler, is faster at training due to the much smaller number of trainable parameters,
and does not use engineered features, i.e. the binary features used in the other models.</p>
      <p>
        It is important to note here that the results are likely to be much better when the system
outputs are evaluated by human readers, as there may be multiple good solutions for choosing
mention strings that achieve felicitous, non-ambiguous reference while also maintaining the
naturalness of a text. This was verified in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for the LSTM over BERT model, where Amazon
Mechanical Turk workers were observed to give referential and naturalness scores to the system
output that were not very far from the scores given to manual annotations. Indeed, upon doing
error analysis on the output of the prompt-tuning model, we found many instances like the ones
below, where the mention string chosen by the model (shown in light red) was comparable in
naturalness and referential clarity with the annotated string (shown in light gray):
1. It was eight in the morning and [Katz] looked very happy. [He] was always happy when
[he] was drunk, and [he] was always drunk. Two weeks after that, [I] → [ ] later
heard, police found [him] → [ ] in an upended car in a field outside the little town
of Mingo, hanging upside down by his seatbelt.
2. [I] → [ ] found [myself] → [ ] , six days later, standing at [our] → [ ]
local airport watching a tin commuter plane containing [Katz] touch down ... For the past
three years [Katz] had devoted [himself] to rectitude and – [I] → [ ] instantly saw
now as [he] stooped out the door of the plane – growing a stomach. [Katz] → [ ]
was arrestingly larger than when [I] → [ ] had last seen [him] .
3. [Both boys] had closed [their] dictionaries. [The brown haired one] → [ ]
was not talking, [his] face, stamped with deference and interest, ...
4. As it had for many of the guides [I] → [ ] had met, the mystical experience [Fritz]
had on psychedelics launched [him] on a decades long spiritual quest that eventually “blew
my linear, empirical mind”, opening [him] up to the possibility of past lives, telepathy,
precognition, and “synchronicities” that defy our conceptions of space and time. [He] →
[ ] spent time on an ashram in India, where [he] witnessed specific scenes that
had been prefigured in [his] psychedelic journeys.
      </p>
      <p>In the first example, the system generates proper names instead of pronouns, which improves
referential clarity, with perhaps a slight decrease in naturalness. The second example illustrates
the opposite behavior, where the system generates a pronoun instead of the manually annotated
proper name, which actually makes the text sound more natural while maintaining referential
clarity. In the third example, the noun phrase chosen by the system is equally good in terms of
referential clarity, albeit its syntactic head "boy" may sound repetitive and hence less natural
than the annotated "one". In the last example, instead of a pronoun, the system selects a nominal
mention string, which appears to be as appropriate in the context as the pronoun.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We introduced new mention selection models targeted at the task of changing the narrative
perspective from deictic (1st or 2nd person) to anaphoric (3rd person). Adding an attention
mechanism to a previous state-of-the-art LSTM model that is trained on top of frozen BERT
embeddings was shown to improve its performance. We also introduced a new BERT model
with coreference-modulated self-attention, and a soft prompt-tuning approach for the T5
text-to-text Transformer, with the latter shown to significantly improve both the within- and
out-of-distribution generalization performance. Code, hyper-parameter settings, and pre-trained
models are made publicly available¹. More general models that can also modify the text between
entity mentions are planned for future work.</p>
      <p>¹ https://github.com/chenmike1986/change_pov</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bunescu</surname>
          </string-name>
          ,
          <article-title>Changing the narrative perspective: From deictic to anaphoric point of view</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          102559. doi:10.1016/j.ipm.2021.102559.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Sartre</surname>
          </string-name>
          , Nausea, New Directions,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of NAACL</source>
          <year>2019</year>
          , ACL, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurosawa</surname>
          </string-name>
          , Akira Kurosawa:
          <article-title>Something Like an Autobiography</article-title>
          , Vintage,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Mondimore</surname>
          </string-name>
          , Bipolar Disorder:
          <article-title>A Guide for Patients and Families</article-title>
          , Johns Hopkins University Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Gast</surname>
          </string-name>
          , D.-C. Tomozei, J.-Y. Le Boudec,
          <article-title>Optimal generation and storage scheduling in the presence of renewable forecast uncertainties</article-title>
          (
          <year>2013</year>
          ) 11. URL: http://infoscience.epfl.ch/record/183046.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <article-title>The power of scale for parameter-efficient prompt tuning</article-title>
          ,
          <source>in: Proceedings of EMNLP</source>
          <year>2021</year>
          , ACL, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3045</fpage>
          -
          <lpage>3059</lpage>
          . doi:10.18653/v1/2021.emnlp-main.243.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          , in: ICLR,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          ,
          <source>in: Proceedings of EMNLP</source>
          <year>2015</year>
          , ACL, Lisbon, Portugal,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Melas-Kyriazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Encoder-agnostic adaptation for conditional language generation</article-title>
          ,
          <year>2019</year>
          . arXiv:1908.06938.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <article-title>Self-attention with relative position representations</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (Short Papers),
          <source>Association for Computational Linguistics</source>
          , New Orleans, Louisiana,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Prefix-tuning: Optimizing continuous prompts for generation</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Online,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , GPT understands, too,
          <year>2021</year>
          . arXiv:2103.10385.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pradhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moschitti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Uryupina</surname>
          </string-name>
          , Y. Zhang,
          <article-title>CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes</article-title>
          ,
          <source>in: Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL</source>
          <year>2012</year>
          ), Jeju, Korea,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>