<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Masked prompt learning for formal analogies beyond words</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liyan Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yves Lepage</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
          ,
          <addr-line>2-7 Hibikino, Kitakyushu, 808-0135</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Prompt learning, a recent thread in few-shot learning for pre-trained language models (PLMs), has been explored for completing word analogies in an extractive way. In this paper, we reformulate the analogy task as a masked analogy completion task and use prompting to derive a generative model for analogies beyond words. We introduce a simple prompt-based fine-tuning paradigm for language modeling on answered prompts of analogies in the sequence-to-sequence framework. To convert the discrete terms of analogies into linear sequences, we present a symbolic prompt template. The sequence-to-sequence model is fine-tuned to fill in the missing span of masked prompts derived from different masking schemes on phrase analogies extracted from a small corpus. We analyze the out-of-distribution performance on sentence analogies, which are unseen cases. Our experiments demonstrate that prompt-based fine-tuning with the objective of language modeling enables models to achieve significantly better performance on in-distribution cases than PLMs. Masked prompt learning with one-term masking exhibits the best out-of-distribution generalization on sentence analogies, with a difference of only 3 characters from the references.</p>
      </abstract>
      <kwd-group>
        <kwd>Prompt learning</kwd>
        <kwd>masked analogy completion</kwd>
        <kwd>analogies beyond words</kwd>
        <kwd>fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Analogy, a cognitive mechanism that relies on relational similarity, is growing in prominence
in the field of artificial intelligence [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In general, it encapsulates a quadruplet relationship
between terms of the same type. For example, the famous analogy between the words king : queen ::
man : woman implies that king is to queen as man is to woman in terms of gender transition.
Strikingly, analogical quadruples form geometric parallelograms in pre-trained embedding
spaces learnt by the skip-gram model with negative sampling [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This parallelogram structure has
drawn attention in research on the linear algebraic properties of vector analogies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
simple arithmetic approach, however, was questioned as being applicable only to completing
analogies with respect to certain clearly defined relations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Recent efforts [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] have examined
the potential of machine learning techniques to learn analogies in word embedding spaces.
      </p>
      <p>
        By formulating tasks in the manner of analogical reasoning, sentence analogy has
demonstrated its versatility in various tasks in the area of natural language processing (NLP), such
as machine translation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], text summarization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and question answering [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In contrast
to word-level analogies, analogies that go beyond words may encompass manifold challenges
stemming from the inherent complexity of language. A recent work [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has demonstrated that
sentence embedding models struggle to capture analogical regularities in terms of geometric
parallelism: the vector offsets may not be preserved within sentence embeddings. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the
postulate of exchange of the means has been debated for classifying positive and negative
analogies between sentences.
      </p>
      <p>
        Some work on completing sentence analogies has focused on extractive approaches, i.e.,
identifying the optimum solution from candidates (i.e., from a finite answer pool) [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. However,
the extractive mechanism is not geared to text generation. It is relatively expensive to define
candidate sets for analogies that capture complex relations between long sequences. In addition,
hand-crafted candidates may limit linguistic creativity. This highlights
the necessity of a generative model that can automatically produce the missing term in analogy
questions, which will contribute to analogy completion in NLP scenarios.
      </p>
      <p>
        Prompt learning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a fruitful learning paradigm in recent work on the adaptive
performance of PLMs in the few-shot setting. Downstream tasks are reformulated as Cloze-style
problems by converting inputs into natural language prompts with task-specific descriptions,
which allows PLMs to predict target outputs conditioned on prompts [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Following [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a
filled prompt refers to a prompt whose mask slot is filled with any answer, while an answered
prompt refers to a prompt filled with the correct answer. Based on prompt learning, recent works
have investigated the few-shot [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and zero-shot [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] performance of PLMs in identifying
word analogies from candidates by language model (LM) scoring on filled prompts.
      </p>
      <p>In this paper, we present a preliminary study of generative completion of analogies beyond
words by using LMs in conjunction with prompting. To learn analogical regularities, we
introduce a novel prompt-based fine-tuning method that extracts sequential features of answered
prompts under the symbolic template by masked sequence-to-sequence learning. We conduct
lightweight fine-tuning on phrase analogies with different masking schemes. By language
modeling on answered prompts, fine-tuned models achieve over 97% accuracy in solving phrase
analogies. The fine-tuned sequence-to-sequence LMs show more promising out-of-distribution
generalization on sentence analogies than autoregressive LMs. In particular, one-term masking
is more robust at extracting analogical regularities, which helps adapt effectively to
unseen analogies.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Based on prompting, recent works have formulated the quadruplet problem as language
modeling. The GPT-3 work [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] explored the adaptation of a very large LM (175B
parameters, over 100 times larger than GPT-2) to the Scholastic Aptitude Test (SAT)
analogy task in the few-shot setting with no gradient updates. The discrete texts in analogies are
mapped into natural sentences with a textual prompt template. The pre-trained GPT-3 model
learns within a context consisting of a few answered prompts and the query, to infer the answer
with the maximum LM likelihood among the given candidates. This exhibits the potential
of GPT-3 for extractive analogy completion at the word level, and it gave rise to the exploration
of the analogical capabilities of LMs with the help of prompts. A recent work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] examined the
adaptability of PLMs to recognize word analogies with different levels of complexity in the
zero-shot setting. The authors conduct ablation experiments on several aspects, including prompt
engineering, architecture engineering, and scoring engineering. With the appropriate choices,
PLMs can achieve meaningful zero-shot performance on analogy identification.
      </p>
      <p>
        Recent efforts on prompt learning found that prompt-based fine-tuning of PLMs can improve
effective learning on downstream tasks. These works have the advantage of being specific to the
task by tuning the LM parameters entirely [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or partially [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] on a small number of examples.
Typically, prompt-based fine-tuning is adopted on masked language models (MLMs) to predict
the mask token that points to the target label in prompts. Here, we graft prompt-based
fine-tuning onto a sequence-to-sequence LM with different masking schemes and perform
masked sequence-to-sequence learning on a large number of examples to extensively model
prompt sequences. The analogy task is reformulated as masked analogy completion, where
sequence answers are generated by a fine-tuned LM to fill in the mask token in prompts for
analogy questions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>In this section, we introduce a generative method for completing analogies beyond words by
using prompts with a symbolic template (Section 3.1). To adapt to the analogy task, we propose
a novel prompt-based fine-tuning paradigm (Section 3.3) for PLMs to reconstruct the missing
span in prompts processed by masking schemes (Section 3.2).</p>
      <sec id="sec-3-1">
        <title>3.1. Symbolic Prompt Template</title>
        <p>
          Prompt design is crucial in prompt learning. For analogies beyond words, it is easy to get the
tokens of analogical words mixed up with the context tokens of textual templates, as used
in [
          <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
          ]. We introduce a clear prompt in which the contents of the four terms are easily
parsed out from sequences.
        </p>
        <p>Symbolic prompts for analogy are formally identical to analogical equations A : B :: C : D,
where the ratio (:) and proportion (::) characters are two symbolic tokens in the prompt template.
We employ the Unicode characters U+2236 and U+2237 for ratio and proportion in sequences.
Let t (t ∈ {A, B, C, D}) denote one of the four terms in an analogy. The length of an answered
prompt can be calculated as |A| + 1 + |B| + 1 + |C| + 1 + |D| = ∑_{t ∈ {A, B, C, D}} |t| + 3.</p>
        <p>Compared to textual tokens, symbolic tokens make the distinction between the four
terms straightforward. Based on their ordering, the four terms can be directly delimited by
detecting the order of symbolic tokens in sequences. For example, : and :: delimit the
term B, whereas :: and :, in that order, delimit the term C.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Masking Schemes</title>
        <p>In light of the usual notation for analogies A : B :: C : x, it is natural to consider the expected
solution x as missing text that can be predicted from the left context. To learn sequential
information, we explore three patterns of masking for answered prompts. To exemplify the
masking schemes, we present examples of masked sequences that result from masking the
following sentence analogy.</p>
        <p>he will come tomorrow. : he will come. :: i have no time tomorrow. : i have no time.</p>
        <p>
          Arbitrary Masking (Mask = any span) Like the document corruption method introduced
in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we randomly mask consecutive tokens whose lengths follow a Poisson distribution
(λ = 3). This masking strategy does not take the structure of analogies into account.
The starting position of each masking span is selected at random, and a masking span can consist
of a symbolic token and parts of adjacent terms. In the following resulting sequence, parts of
the first two terms are masked out along with the left ratio token.
        </p>
        <p>he will [mask]will come. :: i have no time tomorrow. : i have no time.</p>
        <p>One-term Masking (Mask = any term) A masking span is a whole term of the analogy,
selected from the quadruplet (A, B, C, D). Because of the binding relationship between the four terms
of an analogy, each term can be derived from the other three. This scheme randomly masks one
of the terms, which allows models to capture analogical regularities more comprehensively. The
term C is masked in the following sequence.</p>
        <p>he will come tomorrow. : he will come. :: [mask] : i have no time.</p>
        <p>Target-oriented Masking (Mask = term D) In this setting, we regard the target prediction
as the masking span in analogy prompts. To follow the standard notation for analogies, i.e., the
format A : B :: C : x, we specifically mask the fourth term in answered prompts, as shown
below.</p>
        <p>he will come tomorrow. : he will come. :: i have no time tomorrow. : [mask]</p>
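        <p>The three schemes above can be sketched as follows (a minimal illustration, not the authors' implementation; the Poisson(3) span-length sampling is replaced by fixed arguments for determinism):</p>

```python
import random

MASK = "[mask]"

def mask_any_term(terms, idx=None):
    """One-term masking: replace one of (A, B, C, D) with the mask token."""
    idx = random.randrange(4) if idx is None else idx
    masked = list(terms)
    target = masked[idx]
    masked[idx] = MASK
    return masked, target

def mask_term_d(terms):
    """Target-oriented masking: always mask the fourth term D."""
    return mask_any_term(terms, idx=3)

def mask_any_span(tokens, start, length):
    """Arbitrary masking: replace `length` consecutive tokens from `start`
    (the paper samples the length from Poisson(3); fixed here)."""
    target = " ".join(tokens[start:start + length])
    masked = tokens[:start] + [MASK] + tokens[start + length:]
    return masked, target

terms = ["he will come tomorrow.", "he will come.",
         "i have no time tomorrow.", "i have no time."]
masked, target = mask_term_d(terms)
assert masked[3] == MASK
assert target == "i have no time."
```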
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Masked Prompt Learning</title>
        <p>
          In general, the paradigm of prompt-based fine-tuning reformulates downstream tasks as masked
language modeling on prompts, with the goal of optimizing the prediction of the mask token
specified as target outputs [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ]. For text generation, we introduce a novel prompt-based
fine-tuning paradigm to extract sequential information of answered prompts by masked
sequence-to-sequence learning like [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. On this basis, the analogy task is formulated as masked analogy
completion, which aims to generate unknown terms through a sequence-to-sequence model
trained to reconstruct the masked span in prompts.
        </p>
        <p>Consider a pair of sequences (x, y), where x is the sequence of a masked prompt (including a
single mask token) obtained by applying a masking scheme to the answered prompt, and y is the
target sequence of the masking span. As in regular sequence-to-sequence learning, we tune the
model parameters Θ = (θ_enc, θ_dec) to estimate the conditional probabilities of target tokens
given masked prompts. The masked prompt is encoded bidirectionally. Each token in the target
sequence is predicted by the autoregressive decoder by maximizing its conditional probability
given the input sequence x and the preceding target tokens y_&lt;t:</p>
        <p>P(y | x; Θ) = ∏_{t=1}^{|y|} P(y_t | y_&lt;t, x; Θ)</p>
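        <p>The factorized objective can be illustrated numerically: assuming toy per-token conditional probabilities from a decoder, the sequence likelihood is their product, and the training loss is the sum of negative log-probabilities (our sketch, not the training code):</p>

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence, given the per-token
    conditional probabilities P(y_t | y_<t, x) produced by the decoder."""
    return -sum(math.log(p) for p in token_probs)

# Toy decoder outputs for a 3-token masking span.
probs = [0.9, 0.8, 0.95]
likelihood = math.prod(probs)  # P(y | x) = product of the conditionals
nll = sequence_nll(probs)
assert abs(likelihood - 0.684) < 1e-9
assert abs(nll + math.log(likelihood)) < 1e-12
```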
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Formal Analogy between Phrases</title>
      <p>Note that finding sentence analogies in a corpus is difficult unless the corpus is dense.
It is easier to find analogies between small chunks of a corpus than between entire sentences.
Sentence constituents (i.e., phrases) are structural chunks that play grammatical roles in
sentences. We focus on finding formal analogies between phrases, which can indirectly reflect the
linguistic regularities contained in the given corpus.</p>
      <p>
        We build analogies from the English part of a parallel corpus1 released for the news translation
task at the Workshop on Machine Translation (WMT20). The data we used is made up of 3,003
sentences with an average length of 25 words. In order to detect phrases, we use Berkeley
Neural Parser2[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to parse sentences into constituency trees. In each sentence, word sequences
between 2 and 6 in length are collected into a phrase pool from which analogies are identified.
After traversing all sentences, we obtain 25,310 phrases.
      </p>
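      <p>As a simplified stand-in for this extraction step (the paper detects constituents with the Berkeley Neural Parser; the sketch below simply enumerates all contiguous word sequences of length 2 to 6):</p>

```python
def collect_phrases(sentences, min_len=2, max_len=6):
    """Collect word sequences between 2 and 6 words into a phrase pool."""
    pool = set()
    for sentence in sentences:
        words = sentence.split()
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                pool.add(" ".join(words[i:i + n]))
    return pool

pool = collect_phrases(["he will come tomorrow"])
assert "he will" in pool and "he will come tomorrow" in pool
assert "he" not in pool  # single words are below the minimum length
```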
      <p>
        Starting from the set of phrases, we apply some functions from the Nlg tool3 introduced in
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] to extract analogies. In this work, we explore analogical relations at the formal level. Each
phrase is represented as a bag-of-word vector using Lines2Vectors; the dimension of the vectors is
the number of word types in the phrase set. We then run Vectors2Clusters to find analogical
clusters pertaining to the differences between phrase vectors. Each cluster is made up of pairs
of phrases (i.e., ratios) that share the same syntactic transformation, so that any two ratios in
a cluster make a syntactic analogy. Note that cluster sizes may vary greatly depending on the
frequencies of the phrase structures contained in the corpus. To alleviate the imbalance of possible
analogies, we set the maximum cluster size to 10, where the minimum size defaults to 2.
      </p>
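      <p>A minimal sketch of the clustering idea (our illustration of the Lines2Vectors/Vectors2Clusters pipeline, not the Nlg tool itself): phrases become bag-of-word count vectors, and phrase pairs sharing the same difference vector fall into the same analogical cluster.</p>

```python
from collections import Counter
from itertools import combinations

def diff_key(p1, p2):
    """Signature of the formal transformation from phrase p1 to p2,
    computed on bag-of-word counts (zero entries dropped)."""
    c = Counter(p1.split())
    c.subtract(Counter(p2.split()))
    return tuple(sorted((w, n) for w, n in c.items() if n != 0))

def cluster_ratios(phrases):
    """Group ordered phrase pairs (ratios) by their difference vector."""
    clusters = {}
    for p1, p2 in combinations(phrases, 2):
        clusters.setdefault(diff_key(p1, p2), []).append((p1, p2))
    return clusters

phrases = ["to say", "want to say", "to go out", "want to go out"]
clusters = cluster_ratios(phrases)
# Ratios sharing the transformation "prepend want" land in one cluster,
# yielding the analogy to say : want to say :: to go out : want to go out.
key = diff_key("to say", "want to say")
assert ("to go out", "want to go out") in clusters[key]
```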
      <p>In our settings, over 1.5 million analogical clusters are extracted, where each cluster contains
two phrase ratios on average. By combining every two ratios in a cluster, we are able to enumerate
1,524,293 analogies that capture formal similarities between four distinct phrases. The analogy
data includes 17,480 types of phrases with an average length of 3 words (20 characters). The
average length of analogy prompts can thus be roughly estimated as 3 × 4 + 3 in words
(resp. 20 × 4 + 3 in characters). For example, the collected phrase analogy to say : want to say ::
to go out : want to go out indicates a verb phrase attachment with the modifier of the verb
want; its prompt consists of 15 words, including the three symbolic tokens of the template.</p>
      <p>1 It can be downloaded from http://www.statmt.org/wmt20/translation-task.html
2 https://github.com/nikitakit/self-attentive-parser
3 http://lepage-lab.ips.waseda.ac.jp/en/projects/kakenhi-15k00317/ → Tools - Nlg Module</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>We fine-tune LMs on prompts of phrase analogies. To prevent phrase ratios of the test data
from also appearing in the training set, we split the cluster data before making analogies. We take
1,000 clusters for building the analogies for testing and another 1,000 for validation, while
the remaining ones are used for training. For each cluster set, we assemble every two ratios of the
same cluster into an analogy. The training, validation and test sets consist of approximately 1.5
million, 1,000 and 1,000 phrase analogies respectively.</p>
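        <p>Assembling every two ratios of a cluster into an analogy can be sketched as (function name illustrative):</p>

```python
from itertools import combinations

def analogies_from_cluster(cluster):
    """Combine every two ratios (A, B) and (C, D) of a cluster
    into a quadruplet A : B :: C : D."""
    return [r1 + r2 for r1, r2 in combinations(cluster, 2)]

cluster = [("to say", "want to say"), ("to go out", "want to go out"),
           ("to leave", "want to leave")]
quads = analogies_from_cluster(cluster)
# A cluster of k ratios yields k * (k - 1) / 2 analogies.
assert len(quads) == 3
assert ("to say", "want to say", "to go out", "want to go out") in quads
```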
        <p>
          In addition to the phrase analogy test set, we also explore the performance of Transformer
models on solving sentence analogies, which are unseen analogies with a different distribution
from the training data. We sample sentence analogies from the English part of the bilingual
analogies4 used in an example-based machine translation (EBMT) by analogy setting [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. This set of analogies comprises formal-level
analogical relationships between sentences from the Tatoeba corpus, where the length of
sentences varies from 2 to 10.
        </p>
        <p>Even accounting for swaps in the ordering of the four phrases, each analogy appears only once
across all datasets, with no duplicates. The length statistics of the analogies are shown in Table 1,
including the lengths of terms and answered prompts. Analogies in the test sets are processed as
prefix prompts, in which the mask token is the last token of the sequence. Each model is tested by
infilling the unknown term of masked prompts with the default format A : B :: C : [mask].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Training Details</title>
        <p>
          In terms of the sequence-to-sequence Transformer architecture, we experiment with a
pre-trained BART [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. For computational efficiency, we use the base-size model consisting of a
6-layer bidirectional encoder and a 6-layer autoregressive decoder. The pre-training paradigm
of BART is to perform denoising learning on corrupted text, reconstructing the entire original
sequences. To avoid redundantly reconstructing the known tokens of masked sequences,
we fine-tune BART through masked prompt learning, which reconstructs only the sequences of
the masking fragments.
        </p>
        <p>
          We made some modifications to the BART fine-tuning procedure provided by the Transformers
library [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. To save computational memory, we conducted partial freezing on the pre-trained
BART model and fine-tuned only four of the twelve layers, namely the bottom two layers of
the encoder and the top two layers of the decoder. We analyze the discrepancies between fine-tuning
strategies in Section 5.5. To alleviate overfitting, we stop the training if there is no
improvement on the metric of the validation set after two consecutive epochs. We then save
the model with the best performance among the checkpoints.
        </p>
        <p>4 The bilingual set of sentence analogies is the 3rd resource of experimental results at
http://lepage-lab.ips.waseda.ac.jp/en/projects/kakenhi-kiban-c-18k11447/.</p>
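        <p>The partial freezing can be sketched as a predicate over parameter names (our sketch, assuming BART-style parameter names such as `model.encoder.layers.0...`; not the authors' exact code):</p>

```python
def is_trainable(param_name, enc_layers=(0, 1), dec_layers=(4, 5)):
    """Train only the bottom two encoder layers and the top two decoder
    layers of a 6+6-layer BART-base model; freeze everything else."""
    for side, kept in (("encoder", enc_layers), ("decoder", dec_layers)):
        prefix = f"model.{side}.layers."
        if param_name.startswith(prefix):
            layer = int(param_name[len(prefix):].split(".")[0])
            return layer in kept
    return False  # embeddings and all other parameters stay frozen

assert is_trainable("model.encoder.layers.0.self_attn.k_proj.weight")
assert not is_trainable("model.encoder.layers.5.fc1.weight")
assert is_trainable("model.decoder.layers.5.fc2.weight")
assert not is_trainable("model.shared.weight")
```

With the Transformers library, one would apply this by setting `p.requires_grad = is_trainable(name)` over `model.named_parameters()`.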
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Main Results</title>
        <p>As for the autoregressive baseline, we explore the performance of the distilled GPT-2 model
consisting of 6 layers. The entire pre-trained GPT-2 is fine-tuned on answered prompts of phrase
analogies to generate the last term given the preceding context. To compare the different methods,
we employ the accuracy metric to measure the percentage of generations that exactly match the
references. In addition, we also compute the Levenshtein distance (including spaces) to measure
differences at the character level. Table 3 shows the performance of the different fine-tuned models
on completing phrase analogies and sentence analogies.</p>
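        <p>The two evaluation metrics can be sketched as follows (our implementation, not the authors'):</p>

```python
def levenshtein(a, b):
    """Character-level edit distance, spaces included."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def accuracy(outputs, references):
    """Percentage of generations exactly matching the references."""
    hits = sum(o == r for o, r in zip(outputs, references))
    return 100.0 * hits / len(references)

# Two character edits separate these strings: substitute and insert.
assert levenshtein("he's intelligent.", "he is intelligent.") == 2
assert accuracy(["i have no time."], ["i have no time."]) == 100.0
```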
        <p>Model Architecture Prompting, together with fine-tuning, enables PLMs to accomplish
effective analogy completion on in-distribution cases (phrase analogies). As far as the model
architecture is concerned, the autoregressive LM excels at inferring answers to phrase analogy
questions where the last term (D) is missing. By learning on answered prompts of phrase
analogies in a feed-forward fashion, GPT-2 achieves the best accuracy of 99.7% on phrase analogies.
However, the sequence-to-sequence model fine-tuned for infilling the term D performs
noticeably worse, only reaching 50.9% accuracy. Except for the setting of target-oriented masking
(term D), BART fine-tuned with masked prompt learning achieves performance competitive with
GPT-2.</p>
        <p>Masking Scheme The masking deployment of prompts has a large impact on capturing
analogical regularities in masked prompt learning for the sequence-to-sequence model. The
target-oriented masking scheme (term D), like the mechanism of few-shot prompt-based
fine-tuning in MLMs, performs the worst on phrase analogies. We posit that it makes BART overly
imitate the surface features of the fourth terms of analogies rather than comprehend analogical
relationships. The arbitrary masking (any span) and the one-term masking (any term) are optimal for
modeling the sequential information of prompts for phrase analogies, which enables the model
to perform well in generating the last phrase of analogies, with only a 0.05 character difference.</p>
        <p>Out-of-distribution Generalization As shown in the results on SA in Table 3, the
autoregressive LM exhibits poor out-of-distribution generalization capability, although it achieves
excellent performance on phrase analogies. GPT-2 can only correctly answer 4.2% of unseen
sentence analogies, whose sequences are longer than the training data. This suggests that the
autoregressive LM may overfit to the narrow distribution of phrase analogies, which leads to
failure in solving analogies between longer sequences. In comparison, the sequence-to-sequence
models exhibit better out-of-distribution generalization. In particular, BART with a similar
fine-tuning procedure, predicting the term D conditioned on previous tokens, is approximately three
times more accurate than GPT-2.</p>
        <p>It is noticeable that masked prompt learning with one-term masking (any term)
has the best generalization on sentence analogies. It is 4 times more accurate than any-span
masking, while remaining competitive on phrase analogies. It enables BART to generate
sequences that very closely match the reference sentences, differing by only 3 characters on
average, about 3/20 of the reference length. Table 5 presents some example errors. The
fine-tuned BART model, profiting from bidirectional learning on previous and future tokens, can
adapt effectively to unseen sentence analogies, achieving an order
of magnitude greater accuracy than GPT-2.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Fine-grained Exploration of Analogical Ability</title>
        <p>To further explore the analogical capabilities of LMs fine-tuned by masked prompt learning, we
conduct a fine-grained probing on analogy questions with different formats. For each test
analogy, we replace one of the four terms with the mask token to enumerate four analogy
questions with different masking spans. We use an individual accuracy Acc_t (where t ∈
{A, B, C, D}) to indicate the performance in solving analogies in a specific format; for example,
Acc_D measures the accuracy on questions where the term D is masked.</p>
        <p>Table: Average accuracy results of fine-tuned BART models with different masking schemes
in solving analogy questions in different formats. We report both the individual accuracies
and the universal accuracy Acc_all introduced in Subsection 5.4.</p>
        <p>[Table: individual accuracies Acc_t (t ∈ {A, B, C, D}) of BART fine-tuned with the masking
schemes any span, any term, and term D, on PA (phrase analogies) and SA (sentence analogies).]</p>
        <p>Among the mask settings that take analogical terms into account, the masking scheme of any term
achieves competitive performance, with superior accuracy on each individual question, while
target-oriented masking enables the model to answer only half of the questions where the term D is
masked. On sentence analogies, fine-tuning with any-term masking substantially outperforms
the other strategies in terms of both individual accuracies and universal accuracy.
Concretely, the fine-tuned model performs relatively well on completing sentence analogies
where one of the last three terms is missing. It is no surprise that BART with the target-oriented
masking (term D) fails to predict terms other than the term D: during the fine-tuning procedure,
the mask token never appears in any position other than the last term.</p>
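        <p>The probing setup can be sketched as: for each test analogy, build four questions, one per masked term (our illustration):</p>

```python
RATIO, PROPORTION, MASK = "\u2236", "\u2237", "[mask]"

def four_questions(terms):
    """For an analogy (A, B, C, D), build four masked prompts,
    each with one term replaced by the mask token."""
    questions = []
    for i, answer in enumerate(terms):
        masked = list(terms)
        masked[i] = MASK
        prompt = (f"{masked[0]} {RATIO} {masked[1]} "
                  f"{PROPORTION} {masked[2]} {RATIO} {masked[3]}")
        questions.append((prompt, answer))
    return questions

qs = four_questions(("A", "B", "C", "D"))
assert len(qs) == 4
# The fourth question is the default prefix format A : B :: C : [mask].
assert qs[3][0].endswith(MASK) and qs[3][1] == "D"
```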
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Ablation Studies</title>
        <p>The masking scheme of any term is the most suitable for solving analogies in masked prompt
learning. In this subsection, we use one-term masking (any term) and ablate fine-tuning
strategies in terms of tuned layers and prompt templates.</p>
      </sec>
      <sec id="sec-5-6">
        <title>Fine-tuned Layers</title>
        <sec id="sec-5-6-1">
          <title>Comparison of Fine-tuning Strategies</title>
          <p>As shown in Table 6, we compare various fine-tuning strategies with the off-the-shelf
baseline. In the zero-shot setting, the pre-trained model is not able to generate a reliable answer
for analogies beyond words. Fine-tuning BART makes a significant impact on understanding the
analogical relationships between sequences. We can see that updating the
entire model helps model phrase analogies, with a significant gain of 94.2 points in predicting
the last phrase in analogy questions.</p>
          <p>Table 5 (example errors on sentence analogies):</p>
          <p>Input: he is getting better. : tom is getting better. :: he doesn ’t like to lose. : [mask]
Output: tom doesn ’t like to lose
Reference: tom doesn ’t like to lose.</p>
          <p>Input: i know your name. : i believe you. :: i don ’t know your name. : [mask]
Output: don ’t believe you.
Reference: i don ’t believe you.</p>
          <p>Input: what is going on here? : he is intelligent. :: what’s going on here? : [mask]
Output: he is intelligent.
Reference: he’s intelligent.</p>
          <p>Input: how did you do this? : what did you say? :: how do you do this? : [mask]
Output: what did you say?
Reference: what do you say?</p>
          <p>Input: he will come tomorrow. : he will come. :: i have no time tomorrow. : [mask]
Output: he will have no time tomorrow.
Reference: i have no time.</p>
          <p>However, it is imprudent to update the entire model. Full-scale fine-tuning makes the
model specialized to the training distribution, achieving only 11.7% accuracy in completing
sentence analogies. Freezing the entire encoder or decoder degrades the performance on
phrase analogies, while it increases the performance by at least 12 points over the unfrozen
BART when generalizing to out-of-distribution analogies. In particular, tuning only the decoder
while freezing the encoder is useful for learning phrase distributions, whereas fine-tuning the
encoder while freezing the decoder performs relatively better at capturing analogical regularities.</p>
          <p>By contrast, lightweight joint fine-tuning of both the encoder and the decoder performs well on
both analogy test sets. Fine-tuning the bottleneck layers (the top two layers of the encoder and the
bottom two layers of the decoder), which most directly updates the encoder-decoder attention, achieves
accuracy scores of 95.8% and 42.0% on phrase and sentence analogies respectively. Our strategy
of fine-tuning only the bottom two layers of the encoder and the top two layers of the decoder
exhibits the best performance, yielding slight gains of about 2 points over fine-tuning the
bottleneck layers of BART.</p>
          <p>
            Prompt Templates In Table 7, we list results of GPT-2 and BART learned on prompts with
different manually written templates, including the symbolic prompt, a textual prompt and a null
prompt.5 In line with findings in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], different manually written templates lead to similar in-distribution
accuracy. However, prompt templates behave differently on out-of-distribution cases. We can
observe that models fine-tuned on the simple concatenation of four analogical phrases (null prompt)
are not able to answer sentence analogies, although they achieve 99.9% accuracy on phrase
analogies. The accuracy of the non-null prompts (textual and symbolic templates) is higher by
at least 37.9 points on sentence analogies.
          </p>
          <p>5 We also performed experiments on pre-trained GPT-2 and BART. Regardless of the template,
almost none of the analogies can be answered correctly by PLMs in the zero-shot setting.</p>
          <p>
            Despite a slight drop (1.4 points) on the phrase analogy test, our prompt contributes to better
adaptation to unseen analogies than the textual prompt used in [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], attaining an improvement
of 4.5 points for BART (0.6 points for GPT-2). This shows the necessity of clear prompt semantics,
which allows models to better learn the analogical relations encapsulated in prompts.
          </p>
          <p>Regardless of the template, GPT-2 models struggle to complete sentence analogies, although
they excel at completing phrase analogies. In contrast, the BART model is over 10 times more
accurate than GPT-2 on sentence analogies with the help of non-null prompts.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our work demonstrated the potential of LMs to complete analogies beyond words. We
introduced a simple but effective prompt-based fine-tuning paradigm for solving analogies beyond
words by masked sequence-to-sequence learning on answered prompts with different masking
schemes. To extract useful information about analogical regularities, we proposed three patterns
of masking on answered prompts. We found that fine-tuning with a language modeling objective
on answered prompts is effective for adapting generative analogy completion to
phrase analogies, except for the sequence-to-sequence framework with target-oriented masking,
which overfits to narrow features in the training data. Compared to the autoregressive
framework, masked prompt learning is beneficial for out-of-distribution generalization to
sentence analogies. Lightweight fine-tuning in masked prompt learning with one-term masking
has the best potential for learning robust analogical capabilities. In the future, we intend to
refine the fine-tuning paradigm to enhance out-of-distribution performance in the few-shot
scenario. We also hope to apply our approach to other languages and build a multilingual generator
for analogies beyond words.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the China Scholarship Council (CSC) under CSC Grant No.
202008050136.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] H. Prade, G. Richard, Analogical proportions: Why they are useful in AI, in: Z.-H. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, <year>2021</year>, pp. <fpage>4568</fpage>-<lpage>4576</lpage>. doi:10.24963/ijcai.2021/621. Survey Track.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, <year>2013</year>, pp. <fpage>3111</fpage>-<lpage>3119</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] O. Levy, Y. Goldberg, Linguistic regularities in sparse and explicit word representations, in: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, Ann Arbor, Michigan, <year>2014</year>, pp. <fpage>171</fpage>-<lpage>180</lpage>. doi:10.3115/v1/W14-1618.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Z. Bouraoui, S. Jameel, S. Schockaert, Relation induction in word embeddings revisited, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, <year>2018</year>, pp. <fpage>1627</fpage>-<lpage>1637</lpage>. URL: https://aclanthology.org/C18-1138.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] S. Lim, H. Prade, G. Richard, Classifying and completing word analogies by machine learning, International Journal of Approximate Reasoning <volume>132</volume> (<year>2021</year>) <fpage>1</fpage>-<lpage>25</lpage>. doi:10.1016/j.ijar.2021.02.002.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] S. Alsaidi, A. Decker, P. Lay, E. Marquer, P.-A. Murena, M. Couceiro, A neural approach for detecting morphological analogies, 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA) (<year>2021</year>) <fpage>1</fpage>-<lpage>10</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Y. Lepage, E. Denoual, Purest ever example-based machine translation: Detailed presentation and assessment, Machine Translation <volume>19</volume> (<year>2005</year>) <fpage>251</fpage>-<lpage>282</lpage>. doi:10.1007/s10590-006-9010-x.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] B. Elayeb, A. Chouigui, M. Bounhas, O. B. Khiroun, Automatic Arabic text summarization using analogical proportions, Cognitive Computation <volume>12</volume> (<year>2020</year>) <fpage>1043</fpage>-<lpage>1069</lpage>. doi:10.1007/s12559-020-09748-y.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] A. Diallo, M. Zopf, J. Fürnkranz, Learning analogy-preserving sentence embeddings for answer selection, in: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics, Hong Kong, China, <year>2019</year>, pp. <fpage>910</fpage>-<lpage>919</lpage>. doi:10.18653/v1/K19-1085.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] X. Zhu, G. de Melo, Sentence analogies: Linguistic regularities in sentence embeddings, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), <year>2020</year>, pp. <fpage>3389</fpage>-<lpage>3400</lpage>. doi:10.18653/v1/2020.coling-main.300.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] S. Afantenos, T. Kunze, S. Lim, H. Prade, G. Richard, Analogies between sentences: Theoretical aspects - preliminary experiments, in: J. Vejnarová, N. Wilson (Eds.), Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Springer International Publishing, Cham, <year>2021</year>, pp. <fpage>3</fpage>-<lpage>18</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume <volume>33</volume>, Curran Associates, Inc., <year>2020</year>, pp. <fpage>1877</fpage>-<lpage>1901</lpage>. URL: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, arXiv preprint arXiv:2107.13586 (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] A. Ushio, L. Espinosa Anke, S. Schockaert, J. Camacho-Collados, BERT is to NLP what AlexNet is to CV: Can pre-trained language models identify analogies?, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, <year>2021</year>, pp. <fpage>3609</fpage>-<lpage>3624</lpage>. doi:10.18653/v1/2021.acl-long.280.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, <year>2021</year>, pp. <fpage>3816</fpage>-<lpage>3830</lpage>. doi:10.18653/v1/2021.acl-long.295.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] R. Logan IV, I. Balazevic, E. Wallace, F. Petroni, S. Singh, S. Riedel, Cutting down on prompts and parameters: Simple few-shot learning with language models, in: Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, <year>2022</year>, pp. <fpage>2824</fpage>-<lpage>2835</lpage>. doi:10.18653/v1/2022.findings-acl.222.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, <year>2020</year>, pp. <fpage>7871</fpage>-<lpage>7880</lpage>. doi:10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MASS: Masked sequence to sequence pre-training for language generation, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume <volume>97</volume> of Proceedings of Machine Learning Research, PMLR, <year>2019</year>, pp. <fpage>5926</fpage>-<lpage>5936</lpage>. URL: https://proceedings.mlr.press/v97/song19d.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] N. Kitaev, D. Klein, Constituency parsing with a self-attentive encoder, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, <year>2018</year>, pp. <fpage>2676</fpage>-<lpage>2686</lpage>. doi:10.18653/v1/P18-1249.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] R. Fam, Y. Lepage, Tools for the production of analogical grids and a resource of ngram analogical grids in 11 languages, in: LREC <year>2018</year>, Miyazaki, Japan, 2018. URL: https://www.aclweb.org/anthology/L18-1171.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] V. Taillandier, L. Wang, Y. Lepage, Réseaux de neurones pour la résolution d'analogies entre phrases en traduction automatique par l'exemple (Neural networks for the resolution of analogies between sentences in EBMT), in: Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume <volume>2</volume>: Traitement Automatique des Langues Naturelles, ATALA et AFCP, Nancy, France, <year>2020</year>, pp. <fpage>108</fpage>-<lpage>121</lpage>. URL: https://aclanthology.org/2020.jeptalnrecital-taln.9.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, <year>2020</year>, pp. <fpage>38</fpage>-<lpage>45</lpage>. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>